Re: Poll: Largest SolrCloud out there?

2013-02-28 Thread Otis Gospodnetic
I'd love to know, too.
What we observed at Sematext was that 4.0 SolrCloud was very buggy and
difficult, so I suspect there aren't many big Solr 4.0 based clusters out
there.  4.1 is much better (thanks Mark & Co.) and I'm looking forward to
4.2 in March.

Also, based on the stats we have access to via SPM ( see
http://sematext.com/spm/index.html ) I can tell you that ElasticSearch
clusters are, on average, quite a bit bigger than Solr clusters in terms of
nodes, which I find interesting, but not surprising -- if you look at
http://blog.sematext.com/2013/02/25/poll-solr-cloud-or-not/ you'll see fewer
than 40% of Solr users are SolrCloud users, which kind of explains it.

Otis
--
Solr & ElasticSearch Support
http://sematext.com/





On Tue, Feb 26, 2013 at 9:41 PM, Vaillancourt, Tim wrote:

> Hey guys,
>
> I wanted to see who's running SolrCloud out there, and at what scales?
>
> I'd start the thread off but I am merely at the R&D phases.
>
> Cheers!
>
> Tim
>


Re: Solr 3.6.1 Query large field

2013-02-28 Thread Otis Gospodnetic
Mark,

Look at
http://svn.apache.org/repos/asf/lucene/dev/trunk/solr/example/solr/collection1/conf/solrconfig.xml:

  <!-- maxFieldLength was removed in 4.0. To get similar behavior, include a
       LimitTokenCountFilterFactory in your fieldType definition. E.g.
       <filter class="solr.LimitTokenCountFilterFactory" maxTokenCount="10000"/> -->


Otis
--
Solr & ElasticSearch Support
http://sematext.com/
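Since the question below is about Solr 3.6.1, where the pre-4.0 setting still
exists, here is a minimal sketch of the 3.x-era knob in solrconfig.xml; the
25000 value is an assumption, sized to cover the ~19,000-word page described
below:

  <!-- Solr 3.x only (removed in 4.0): tokens past this cap are silently
       dropped at index time, which matches the "only the top ~30% of the
       page is searchable" symptom. -->
  <indexDefaults>
    <maxFieldLength>25000</maxFieldLength>
  </indexDefaults>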





On Wed, Feb 27, 2013 at 11:08 AM, Mark Wilson  wrote:

> Hi
>
> I am using Nutch to crawl a site, and post it in Solr 3.6.1. The page is
> very large.
>
> When I query the index, using the Solr Admin query page, it only finds the
> result if it is in the top X% of the page, probably about 30%.
>
> The page is about 79Kb, and consists of 19,067 words.
>
> Is there a setting somewhere that sets the maxFieldSize? Or maxTokenSize?
>
> I set the field content to be displayed on the result page, and it displays
> all the data correctly; I can see the tokens there, but searching for them
> returns no results.
>
> I can't split the page up, as it is auto-generated from a database.
>
> Any help gratefully received.
>
> Thanks Mark
>
>
>
> --
>  The Wellcome Trust Sanger Institute is operated by Genome Research
>  Limited, a charity registered in England with number 1021457 and a
>  company registered in England with number 2742969, whose registered
>  office is 215 Euston Road, London, NW1 2BE.
>


Re: Dropping slow queries

2013-02-28 Thread adm1n
Thanks, but it's not exactly what I need. According to the documentation, "this
value *only* applies to the search and *not to requests* in general."

What I need is to affect the request - to drop it. To tell the cloud to
drop all requests which take more than x msec. No matter why - slow search
in the shard, network issues between the shards, etc.

thanks
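For reference, the parameter being quoted is presumably timeAllowed, which can
also be set as a handler default in solrconfig.xml - a minimal sketch; as the
documentation says, it bounds only the search phase, not the request as a
whole:

  <requestHandler name="/select" class="solr.SearchHandler">
    <lst name="defaults">
      <!-- abort the search phase after 1000 ms; partial results may be returned -->
      <int name="timeAllowed">1000</int>
    </lst>
  </requestHandler>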



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Dropping-slow-queries-tp4043074p4043592.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: index storage on other server

2013-02-28 Thread Gora Mohanty
On 28 February 2013 14:55, Jonty Rhods  wrote:
> Hi All,
>
> I need to store the index folder on another server.
> My local system has little space, so I want to run the application server
> (Tomcat) on the local machine but keep the index folder on another machine.
>
> Is it possible?

I don't think so, but you could use some kind of shared
directory, e.g., with NFS.

Regards,
Gora
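If you go the shared-directory route, the index location can be pointed at the
mount via dataDir in solrconfig.xml - a minimal sketch, using a hypothetical
NFS mount point; note that Lucene indexes over NFS are known to be slow and
need care with lock files:

  <!-- /mnt/solr-data is a hypothetical NFS mount exported by the machine
       that has the disk space -->
  <dataDir>/mnt/solr-data</dataDir>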


geodist() spatial sorting: sort param could not be parsed as a query, and is not a field that exists in the index: geodist()

2013-02-28 Thread PeterKerk
I want to sort the results of my query on distance.

But I get this error:
sort param could not be parsed as a query, and is not a field that exists in
the index: geodist()

On this query:

http://localhost:8983/solr/tt/select/?indent=on&facet=true&fq=countryid:1&fq={!geofilt}&pt=51.8425,5.85278%20sfield=geolocation%20d=20&q=*:*&start=0&rows=10&fl=id,title,city&facet.mincount=1&sort=geodist()%20asc

I also tried:
http://localhost:8983/solr/tt/select/?indent=on&facet=true&fq=countryid:1&fq={!geofilt&pt=51.8425,5.85278%20sfield=geolocation%20d=20}&q=*:*&start=0&rows=10&fl=id,title,city&facet.mincount=1&sort=geodist()%20asc

Here's what I have in my schema.xml:

  





I've been checking this page: http://wiki.apache.org/solr/SpatialSearch
But that does not mention my error.
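For what it's worth, this error usually means geodist() cannot resolve its
sfield/pt parameters: in both URLs above, pt, sfield and d are glued into a
single parameter value with %20 instead of being passed as separate
&-delimited parameters (or as local params, e.g.
fq={!geofilt sfield=geolocation pt=51.8425,5.85278 d=20}). The schema side,
per the SpatialSearch wiki, would look roughly like the sketch below - field
names taken from the query above, and a tdouble type is assumed to exist:

  <fieldType name="location" class="solr.LatLonType" subFieldSuffix="_coordinate"/>
  <field name="geolocation" type="location" indexed="true" stored="true"/>
  <dynamicField name="*_coordinate" type="tdouble" indexed="true" stored="false"/>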



--
View this message in context: 
http://lucene.472066.n3.nabble.com/geodist-spatial-sorting-sort-param-could-not-be-parsed-as-a-query-and-is-not-a-field-that-exists-in--tp4043603.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr3.5 Vs Solr4.1 - Help please

2013-02-28 Thread Michael Della Bitta
We're noticing a performance delta between 3.6 and 4.1 too. We're
transitioning some textual fields from indexed only to stored as
well.

One of the reasons why we're testing 4.1 in this particular case is
more efficient storage use, which would eliminate some iowait behavior
we were noticing. That seems to have worked, but still the 4.1 tests
seem to be slower than we would expect. It's a little hard to say
though, because we have two independent variables (3.6 -> 4.1,
stored="false" -> stored="true").

If it were field compression, we'd expect to see some CPUs pegged, but
it doesn't look that way to us. No iowait (once the index is in
cache), less than 50% CPU usage.

Michael Della Bitta


Appinions
18 East 41st Street, 2nd Floor
New York, NY 10017-6271

www.appinions.com

Where Influence Isn’t a Game


On Thu, Feb 28, 2013 at 2:37 AM, Otis Gospodnetic
 wrote:
> Hi,
>
> Could it be field compression that, I believe, is on by default?
>
> Otis
> --
> SOLR Performance Monitoring - http://sematext.com/spm/index.html
>
>
>
>
>
>
> On Wed, Feb 27, 2013 at 9:51 PM, adityab  wrote:
>
>> Hi,
>> A little history before I tell the actual issue. Please bear with me.
>> In Dev lab with a single VM (2vCPU, 4gb RAM, 3Gb to JVM, JBoss5.1,
>> JDK1.6.0.30) we used Solr 4.1 and indexed 250K documents with each document
>> of avg size 22Kb - Total index size is 3.6GB.
>> Every thing works good. The only query that will be hitting this cluster is
>> with *q=masterId:XXX* where masterId is our unique key defined in schema.
>> All our fields are stored
>>
>> Now for performance testing we requested our perf. lab to upgrade their
>> Solr
>> 3.5 to Solr 4.1. The performance lab has 1 master and 2 slaves. After
>> upgrading to Solr 4.1 the first thing they observed is that the index size
>> shrank to 1/2 of Solr 3.5's (this is good). They have the same machine spec
>> as Dev.
>>
>> As soon as we started hitting the slaves with queries we saw CPU spike up
>> to 100% within 2 mins at 15 QPS and the application becomes sluggish to
>> unresponsive.
>> 1. The same machine with Solr 3.5 earlier worked like a charm and we took
>> the QPS up to 150. But with 4.1 we are stuck below 15.
>> 2. Even after we stop the load runner we still see CPU 100% for long time.
>> 3. We also performed same query test by reverting 1 slave to Solr3.5 and
>> second slave to 4.1. 3.5 had no issue and 4.1 was causing CPU 100% (Master
>> was turned off so no replication)
>> 4. Took thread dumps to check what's consuming CPU. There were no
>> deadlocks and all threads are in blocked state, of which most look like the
>> one below this message.
>>
>> After burning several hours we still couldn't identify why 3.5 performs
>> better than 4.1, whereas the only differences are Solr itself and the
>> related changes required for 4.1 in the schema and config file. Also, we
>> use MMapDirectory for both.
>>
>> Any advice or direction would be great.
>>
>> thanks
>>
>>
>> Thread 32403: (state = BLOCKED)
>>  - java.util.AbstractList$Itr.hasNext() @bci=8, line=339 (Compiled frame;
>> information may be imprecise)
>>  - org.apache.lucene.document.Document.getFields(java.lang.String) @bci=19,
>> line=175 (Compiled frame)
>>  - org.apache.lucene.document.LazyDocument$LazyField.stringValue() @bci=38,
>> line=107 (Compiled frame)
>>  -
>>
>> org.apache.solr.schema.FieldType.toExternal(org.apache.lucene.index.IndexableField)
>> @bci=1, line=330 (Compiled frame)
>>  -
>>
>> org.apache.solr.schema.FieldType.toObject(org.apache.lucene.index.IndexableField)
>> @bci=2, line=339 (Compiled frame)
>>  -
>>
>> org.apache.solr.response.BinaryResponseWriter$Resolver.getValue(org.apache.solr.schema.SchemaField,
>> org.apache.lucene.index.IndexableField) @bci=120, line=223 (Compiled frame)
>>  -
>>
>> org.apache.solr.response.BinaryResponseWriter$Resolver.getDoc(org.apache.lucene.document.Document)
>> @bci=76, line=186 (Compiled frame)
>>  -
>>
>> org.apache.solr.response.BinaryResponseWriter$Resolver.writeResultsBody(org.apache.solr.response.ResultContext,
>> org.apache.solr.common.util.JavaBinCodec) @bci=205, line=147 (Compiled
>> frame)
>>  -
>>
>> org.apache.solr.response.BinaryResponseWriter$Resolver.writeResults(org.apache.solr.response.ResultContext,
>> org.apache.solr.common.util.JavaBinCodec) @bci=126, line=173 (Compiled
>> frame)
>>  -
>>
>> org.apache.solr.response.BinaryResponseWriter$Resolver.resolve(java.lang.Object,
>> org.apache.solr.common.util.JavaBinCodec) @bci=13, line=86 (Compiled frame)
>>  - org.apache.solr.common.util.JavaBinCodec.writeVal(java.lang.Object)
>> @bci=24, line=154 (Compiled frame)
>>  -
>>
>> org.apache.solr.common.util.JavaBinCodec.writeNamedList(org.apache.solr.common.util.NamedList)
>> @bci=53, line=144 (Compiled frame)
>>  -
>> org.apache.solr.common.util.JavaBinCodec.writeKnownType(java.lang.Object)
>> @bci=22, line=234 (Compiled frame)
>>  - org.apache.solr.common.util.JavaBinCodec.writeVal(java.lang.Object)
>> @bci=2, line=149

solr 4.1 - trying to create 2 collection with 2 different sets of configurations

2013-02-28 Thread adfel70
solr 4.1 - trying to create 2 collections with 2 different sets of
configurations.
Anyone accomplished this?

If I run bootstrap twice on different conf dirs, I get both of them in
ZooKeeper, but using the collections API to create a collection with
collection.configName=seconfConf doesn't work.

any idea?

thanks.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/solr-4-1-trying-to-create-2-collection-with-2-different-sets-of-configurations-tp4043609.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: solr4.1.0 how to config field length

2013-02-28 Thread Erick Erickson
I'd try using Solr in isolation first; you have a couple of other products
in there and you could be having this issue anywhere along the chain. What
is your evidence that you're not getting the limit you expect?

Best would be to provide a test case illustrating the problem; it should be
pretty easy to modify one in the stock release...

Best
Erick


On Fri, Feb 22, 2013 at 4:53 AM, sely  wrote:

> I am using Solr4.1.0(using original solrconfig.xml) + Lily1.3 + CDH4
> From solrconfig.xml, it says:
> maxFieldLength was removed in 4.0. To get similar behavior,
> include a LimitTokenCountFilterFactory in your fieldType definition.
> E.g. <filter class="solr.LimitTokenCountFilterFactory" maxTokenCount="10000"/>
>
> But it doesn't work, whatever value of maxTokenCount I set in
> schema.xml.
> The searchable part of the document is only max. 500KB (field type is
> TextField).
> From a Solr query, it returns the max field length as 500KB.
>
> Is there any other option that I should set up too?
>
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/solr4-1-0-how-to-config-field-length-tp4042188.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>
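A minimal sketch of where the filter from the E.g. above has to go - inside
the index-time analyzer chain of the fieldType that is actually searched (the
type name and the other analysis components here are assumptions):

  <fieldType name="text_limited" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <!-- raise maxTokenCount well past the document size being indexed -->
      <filter class="solr.LimitTokenCountFilterFactory" maxTokenCount="2000000"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>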


Re: Backtick character in field values, and results

2013-02-28 Thread Erick Erickson
ICUFoldingFilterFactory is "folding" the backtick (grave accent).

See admin/analysis page, it's a lifesaver in these situations!

Best
Erick


On Fri, Feb 22, 2013 at 3:46 PM, Neelesh  wrote:

> With a text_unbroken field:
>
> <fieldType name="text_unbroken" class="solr.TextField"
>     omitTermFreqAndPositions="true">
>   <analyzer>
>     <tokenizer class="solr.KeywordTokenizerFactory"/>
>     <filter class="solr.ICUFoldingFilterFactory"/>
>   </analyzer>
> </fieldType>
>
> A query like
> field:Hello` matches both "Hello" and "Hello`". This does not happen with
> something like +. That is,
> field:Hello+ does not match "Hello", but only matches "Hello+"
> Is there something special about backticks? Are there more such really
> special characters?
>
> Thanks!
> -neelesh
>


Re: solr 4.1 - trying to create 2 collection with 2 different sets of configurations

2013-02-28 Thread Rafał Kuć
Hello!

You can try doing the following:

1. Run Solr with no collection and no cores, just an empty solr.xml

2. If you don't have a ZooKeeper run Solr with -DzkRun

3. Upload your configurations to ZooKeeper, by running

cloud-scripts/zkcli.sh -cmd upconfig -zkhost localhost:9983 -confdir
CONFIGURATION_1_DIR -confname COLLECTION_1_NAME

and

cloud-scripts/zkcli.sh -cmd upconfig -zkhost localhost:9983 -confdir
CONFIGURATION_2_DIR -confname COLLECTION_2_NAME

4. Create those two collections:

curl 
'http://localhost:8983/solr/admin/collections?action=CREATE&name=COLLECTION_1_NAME&numShards=2&replicationFactor=0'

and

curl 
'http://localhost:8983/solr/admin/collections?action=CREATE&name=COLLECTION_2_NAME&numShards=2&replicationFactor=0'

Of course CONFIGURATION_1_DIR and CONFIGURATION_2_DIR are the
directories where your configurations are stored, and
COLLECTION_1_NAME and COLLECTION_2_NAME are your collection names.

Also adjust the numShards and replicationFactor to your needs.
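For step 1, a minimal "empty" solr.xml (no predefined cores) might look like
the sketch below, with the attributes lifted from the stock 4.1 example:

  <?xml version="1.0" encoding="UTF-8" ?>
  <solr persistent="true">
    <cores adminPath="/admin/cores" host="${host:}" hostPort="${jetty.port:}"
           hostContext="${hostContext:}" zkClientTimeout="${zkClientTimeout:15000}"/>
  </solr>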

-- 
Regards,
 Rafał Kuć
 Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch - ElasticSearch

> solr 4.1 - trying to create 2 collections with 2 different sets of
> configurations.
> Anyone accomplished this?

> If I run bootstrap twice on different conf dirs, I get both of them in
> ZooKeeper, but using the collections API to create a collection with
> collection.configName=seconfConf doesn't work.

> any idea?

> thanks.



> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/solr-4-1-trying-to-create-2-collection-with-2-different-sets-of-configurations-tp4043609.html
> Sent from the Solr - User mailing list archive at Nabble.com.



Re: A few operations questions about the tlog (UpdateLog)

2013-02-28 Thread Erick Erickson
Sure, putting the tlog on a different disk can reduce disk I/O contention,
of course you don't need to bother unless you can demonstrate that your
Solr app is I/O bound. If it's not, you won't see much benefit

Don't know about the compression. Note that tlogs are only guaranteed (at
present) to retain the last 100 docs, so we're probably not talking about
huge numbers here. If you're seeing very large tlogs, you probably should
re-visit your commit strategy.

Best
Erick
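If the tlog does need to live elsewhere, the location can be overridden where
the update log is configured in solrconfig.xml - a minimal sketch, reusing the
/disk2/tlog layout from the question:

  <updateHandler class="solr.DirectUpdateHandler2">
    <updateLog>
      <!-- point the transaction log at the second disk -->
      <str name="dir">/disk2/tlog</str>
    </updateLog>
  </updateHandler>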


On Sun, Feb 24, 2013 at 12:23 PM, Timothy Potter wrote:

> I'm wondering if it makes sense to have the tlog on a separate disk
> from the index, ie. something like:
>
> data
> |__index -> /disk1/index
> |__tlog -> /disk2/tlog
>
> Also, are large documents compressed in the tlog?
>
> Thanks.
> Tim
>


Re: Solr3.5 Vs Solr4.1 - Help please

2013-02-28 Thread adityab
Thanks guys ..

Well I did another test. Copied the index files from the perf lab to a Dev
machine which has Solr 4.1.
Now ran solrmeter to generate load on the Dev server. We were able to drive
the QPS up to 150 with CPU on avg 35%, but the same index is generating 100%
CPU at 1 QPS in the perf lab.

On a side note, does the "fl" parameter have anything to do with this?
Because with my test using solrmeter I used fl=*,score and q=masterId:.
Whereas in the perf lab they have fixed 10 field names in "fl".
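One thing fl does interact with is lazy field loading - the
LazyDocument/LazyField frames in the thread dump earlier in this thread come
from exactly that code path. The relevant solrconfig.xml knob, enabled in the
stock config, is below; whether it explains the perf-lab gap is only a guess:

  <!-- with this on, fields outside fl are wrapped as lazy fields and only
       materialized when accessed -->
  <enableLazyFieldLoading>true</enableLazyFieldLoading>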



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr3-5-Vs-Solr4-1-Help-please-tp4043543p4043614.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr cloud deployment on tomcat in prod

2013-02-28 Thread Erick Erickson
Anyone can edit the Wiki, contributions welcome!

Best
Erick


On Mon, Feb 25, 2013 at 5:50 PM, varun srivastava wrote:

> Hi,
>  Is there any official documentation around deployment of solr cloud in
> production on tomcat ?
>
> I am looking for anything as detailed as following one .. It will be good
> if someone can take the following tutorial and get it on official solrcloud
> wiki after reviewing each step.
>
>
> http://www.myjeeva.com/2012/10/solrcloud-cluster-single-collection-deployment/
>
>
> Thanks
> Varun
>


Re: solr 4.1 - trying to create 2 collection with 2 different sets of configurations

2013-02-28 Thread adfel70
Thanks
I'm going to try this. 

Have you tried it yourself?



--
View this message in context: 
http://lucene.472066.n3.nabble.com/solr-4-1-trying-to-create-2-collection-with-2-different-sets-of-configurations-tp4043609p4043617.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: A few operations questions about the tlog (UpdateLog)

2013-02-28 Thread adityab
What's the life cycle of a tlog file? Is it purged after commit (even with
soft commit)?
I posted 100 docs to Solr (standalone) and did a hard commit. Observed that a
new tlog file is created.
Re-posted the same 100 docs and did a hard commit. Observed that a new tlog
file is created. The old one still exists.

When do they get purged? The concern is we have at least 20K docs published
every 2 hrs, so we need to understand if it's safe to put them in a different
location where we can have a script to purge old files at a regular interval.

thanks
 



--
View this message in context: 
http://lucene.472066.n3.nabble.com/A-few-operations-questions-about-the-tlog-UpdateLog-tp4042560p4043618.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: solr 4.1 - trying to create 2 collection with 2 different sets of configurations

2013-02-28 Thread Rafał Kuć
Hello!

Yes I did :)

-- 
Regards,
 Rafał Kuć
 Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch - ElasticSearch

> Thanks
> I'm going to try this. 

> Have you tried it yourself?



> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/solr-4-1-trying-to-create-2-collection-with-2-different-sets-of-configurations-tp4043609p4043617.html
> Sent from the Solr - User mailing list archive at Nabble.com.



Re: AW: 170G index, 1.5 billion documents, out of memory on query

2013-02-28 Thread Erick Erickson
Personally I've never seen any single node support 1.5B documents. I advise
biting the bullet and sharding. Even if you do get the simple keyword
search working, the first time you sort I expect it to blow up. Then you'll
try to facet and it'll blow up. Then you'll start using filter queries and
it'll blow up. For instance, each filter query that gets cached requires
maxdoc/8 bytes for the cache. Or 190M or so. And the default limit is 512
in the cache (which you can change, BTW) so your filter cache _alone_ could
require 96G of memory unless you are very careful.
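Spelling out that arithmetic for 1.5B documents:

  $1.5 \times 10^{9}\ \text{docs} \div 8 \approx 187.5\ \text{MB per cached filter}, \qquad 512 \times 187.5\ \text{MB} \approx 96\ \text{GB}$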

Actually, I'd advise indexing a subset of your data on a target machine
while firing queries at it, then increasing your # of documents until you
find your limits. See:
http://searchhub.org/dev/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/

Best
Erick


On Tue, Feb 26, 2013 at 12:37 PM, Michael McCandless <
luc...@mikemccandless.com> wrote:

> It really should be unlimited: this setting has nothing to do with how
> much RAM is on the computer.
>
> See
> http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
> On Tue, Feb 26, 2013 at 12:18 PM, zqzuk  wrote:
> > Hi
> > sorry I couldn't do this directly... the way I do this is by subscribing
> to a
> > cluster of computers in our organisation and send the job with required
> > memory. It gets randomly allocated to a node (one single server in the
> > cluster) once executed and it is not possible to connect to that specific
> > node to check.
> >
> > But I'm pretty sure it won't be "unlimited" but matching the figure I
> > required, which was 40G (the max memory on a single node is 48G anyway).
> So
> > Solr only gets maximum of 40G memory for this index.
> >
> >
> >
> >
> >
> > --
> > View this message in context:
> http://lucene.472066.n3.nabble.com/170G-index-1-5-billion-documents-out-of-memory-on-query-tp4042696p4043110.html
> > Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: POI error while extracting docx document

2013-02-28 Thread Erick Erickson
I'd guess you have old Tika jars in your classpath.

Best
Erick


On Tue, Feb 26, 2013 at 12:40 PM, Carlos Alexandro Becker <
caarl...@gmail.com> wrote:

> sorry:
>
> http://stackoverflow.com/questions/15095202/extracting-docx-files-with-tika-in-apache-solr-gives-nosuchmethod-error
>
>
> On Tue, Feb 26, 2013 at 2:40 PM, Carlos Alexandro Becker <
> caarl...@gmail.com
> > wrote:
>
> > I've composed a stackoverflow question also..
> >
> >
> >
> > On Tue, Feb 26, 2013 at 2:23 PM, Carlos Alexandro Becker <
> > caarl...@gmail.com> wrote:
> >
> >> Any ideas?
> >>
> >>
> >> On Tue, Feb 26, 2013 at 2:15 PM, Carlos Alexandro Becker <
> >> caarl...@gmail.com> wrote:
> >>
> >>> 4.0.0 and 1.0-beta
> >>>
> >>>
> >>> On Tue, Feb 26, 2013 at 2:12 PM, Swati Swoboda <
> >>> sswob...@igloosoftware.com> wrote:
> >>>
>  Hey Carlos,
> 
>  What version of Solr are you running and what version of openxml4j did
>  you import?
> 
>  Swati
> 
>  -Original Message-
>  From: Carlos Alexandro Becker [mailto:caarl...@gmail.com]
>  Sent: Tuesday, February 26, 2013 12:04 PM
>  To: solr-user
>  Subject: Re: POI error while extracting docx document
> 
>  I've added the openxml4j jar to the project, and it still doesn't work. Which
> is
>  the correct version?
> 
> 
>  On Tue, Feb 26, 2013 at 11:23 AM, Carlos Alexandro Becker <
>  caarl...@gmail.com> wrote:
> 
>  > I made solr extract the files content. That's ok, but some files
> (like
>  > .docx files) give me errors, while .pdf files index as expected.
>  >
>  > The error is:
>  >
>  >
>  > 14:20:29,714 ERROR [org.apache.solr.servlet.SolrDispatchFilter]
>  > (http--0.0.0.0-8080-4) null:java.lang.RuntimeException:
>  > java.lang.NoSuchMethodError:
>  >
> org.apache.poi.openxml4j.opc.PackagePart.getRelatedPart(Lorg/apache/po
>  >
> i/openxml4j/opc/PackageRelationship;)Lorg/apache/poi/openxml4j/opc/Pac
>  > kagePart;
>  >  at
>  >
> org.apache.solr.servlet.SolrDispatchFilter.sendError(SolrDispatchFilte
>  > r.java:469)
>  > at
>  >
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter
>  > .java:297)
>  >  at
>  >
> org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(Appli
>  > cationFilterChain.java:280)
>  > at
>  >
> org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFi
>  > lterChain.java:248)
>  >  at
>  >
> org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperVa
>  > lve.java:275)
>  > at
>  >
> org.apache.catalina.core.StandardContextValve.invoke(StandardContextVa
>  > lve.java:161)
>  >  at
>  >
> org.jboss.as.web.security.SecurityContextAssociationValve.invoke(Secur
>  > ityContextAssociationValve.java:153)
>  > at
>  >
> org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.ja
>  > va:155)
>  >  at
>  >
> org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.ja
>  > va:102)
>  > at
>  >
> org.apache.catalina.authenticator.SingleSignOn.invoke(SingleSignOn.jav
>  > a:397)
>  >  at
>  >
> org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValv
>  > e.java:109)
>  > at
>  >
> org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java
>  > :368)
>  >  at
>  >
> org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:
>  > 877)
>  > at
>  >
> org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.proces
>  > s(Http11Protocol.java:671)
>  >  at
>  >
> org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:930
>  > ) at java.lang.Thread.run(Thread.java:722)
>  > Caused by: java.lang.NoSuchMethodError:
>  >
> org.apache.poi.openxml4j.opc.PackagePart.getRelatedPart(Lorg/apache/po
>  >
> i/openxml4j/opc/PackageRelationship;)Lorg/apache/poi/openxml4j/opc/Pac
>  > kagePart;
>  > at
>  >
> org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.handleEm
>  > beddedParts(AbstractOOXMLExtractor.java:121)
>  >  at
>  >
> org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML
>  > (AbstractOOXMLExtractor.java:107)
>  > at
>  >
> org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOX
>  > MLExtractorFactory.java:112)
>  >  at
>  >
> org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.j
>  > ava:82) at
>  >
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>  >  at
>  >
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>  > at
>  >
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:12
>  > 0)
>  >  at
>  >
> org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(Extra
>  > ctingDocumentLoader.java:219)
>  > at
>  >

Re: Consistent relevance tie-breaking across clusters?

2013-02-28 Thread Erick Erickson
bq: we don't want to use either the primary key or the record's
update date as the tie-breaker, as it may introduce a new bias into the
ranking algorithm

Are you thinking of adding something to your main clause to force this?
If so, why not just use sorting by adding a sort clause like:

&sort=score desc, datefield desc

I think you'll get what you want... Or I'm misunderstanding...

Best
Erick


On Tue, Feb 26, 2013 at 3:22 PM, Gregg Donovan  wrote:

> We're running into an issue when comparing Solr results for the same
> query across different clusters. When we don't sort by anything other than
> "score desc" we'll see inconsistent tie-breaking across different clusters.
> When we add an explicit secondary sort by the primary key, results are the
> same across clusters. I believe this is to be expected and that Solr/Lucene
> will revert to sorting by doc_id in the case of tied scores if no secondary
> sort is specified. Our indexing process is random enough in how it feeds
> documents to the indexer that we see different primary key ordering in the
> index in different clusters.
>
> In our case, we don't want to use either the primary key or the record's
> update date as the tie-breaker, as it may introduce a new bias into the
> ranking algorithm. I'm considering adding a secondary sort by a hash of the
> primary key, as this will be consistent across clusters but randomly
> distributed.
>
> Questions:
>
> --Has anyone else encountered this problem? If so, how have you solved it?
>
> --What's the best way to hook this in? A copyField plus a custom FieldType?
>
> Thanks.
>
> --Gregg
>
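A sketch of the copyField-plus-custom-FieldType idea from the question.
com.example.HashedStringField is hypothetical - a custom FieldType whose
indexed value is a stable hash of the input - not a stock Solr class:

  <!-- hypothetical custom type: indexes hash(value) instead of the raw value -->
  <fieldType name="pk_hash" class="com.example.HashedStringField"/>
  <field name="id_hash" type="pk_hash" indexed="true" stored="false"/>
  <copyField source="id" dest="id_hash"/>

Queries would then use sort=score desc, id_hash asc, which is consistent
across clusters but uncorrelated with insertion order.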


Re: Stored values and date math query

2013-02-28 Thread Erick Erickson
Just to check, your order_prep_time is _indexed_ too, right? It's
a bit confusing but anything you use in your function queries will
be from indexed terms, not stored ones

Best
Erick


On Tue, Feb 26, 2013 at 4:05 PM, Indika Tantrigoda wrote:

> Hi All,
>
> I am trying to use a stored value in the index and add it to a date
> component as follows
>
> sessionAvailableNowQuery = {!edismax}(start_time:[* TO
> 1970-01-01T12:37:030Z] AND end_time:[1970-01-01T12:37:030Z +
> (_val_:order_prep_time)MINUTES TO *] AND consumers:[1 TO *] AND
> session_time_range_available:true)
>
> (order_prep_time is the stored value)
>
> However, when doing so causes the query to return incorrect results. The
> query is part of a if/else query present in the fields (fl) list
>
> _session_available_now_:if(exists(query($sessionAvailableNowQuery)), 100,
> 0)
>
> Is it possible to retrive an integer value from the index and pass it on it
> a date math query ? Is there anything else that needs to be in the query ?
>
> Thanks in advance.
>


Re: solr/admin java.io.IOException: A file, file system or message queue is no longer available.

2013-02-28 Thread Erick Erickson
Very strange. I'm assuming you've re-started Solr after
the directories were removed? I'd expect to see a message
in the log (WARNING) indicating the index dirs were being
created.

Otherwise, permissions errors can manifest themselves in tricky
ways.

not much help...
Erick


On Tue, Feb 26, 2013 at 4:21 PM, Lapera-Valenzuela, Elizabeth [PRI-1PP] <
elizabeth.lap...@primerica.com> wrote:

> Here is the full error:
>
>
> [2/26/13 15:08:17:762 EST] 002e SolrDispatchF I
> org.apache.solr.servlet.SolrDispatchFilter init SolrDispatchFilter.init()
>
> [2/26/13 15:08:17:861 EST] 002e SolrResourceL I
> org.apache.solr.core.SolrResourceLoader locateSolrHome Using JNDI
> solr.home: /usr/local/pfs/conf/solr
>
> [2/26/13 15:08:17:908 EST] 002e CoreContainer I
> org.apache.solr.core.CoreContainer$Initializer initialize looking for
> solr.xml: /usr/local/pfs/conf/solr/solr.xml
>
> [2/26/13 15:08:17:915 EST] 002e CoreContainer I
> org.apache.solr.core.CoreContainer  New CoreContainer 788803332
>
> [2/26/13 15:08:17:923 EST] 002e CoreContainer I
> org.apache.solr.core.CoreContainer load Loading CoreContainer using Solr
> Home: '/usr/local/pfs/conf/solr/'
>
> [2/26/13 15:08:17:927 EST] 002e SolrResourceL I
> org.apache.solr.core.SolrResourceLoader  new SolrResourceLoader for
> directory: '/usr/local/pfs/conf/solr/'
>
> [2/26/13 15:08:18:515 EST] 002e HttpShardHand I
> org.apache.solr.handler.component.HttpShardHandlerFactory getParameter
> Setting socketTimeout to: 0
>
> [2/26/13 15:08:18:520 EST] 002e HttpShardHand I
> org.apache.solr.handler.component.HttpShardHandlerFactory getParameter
> Setting urlScheme to: http://
>
> [2/26/13 15:08:18:522 EST] 002e HttpShardHand I
> org.apache.solr.handler.component.HttpShardHandlerFactory getParameter
> Setting connTimeout to: 0
>
> [2/26/13 15:08:18:525 EST] 002e HttpShardHand I
> org.apache.solr.handler.component.HttpShardHandlerFactory getParameter
> Setting maxConnectionsPerHost to: 20
>
> [2/26/13 15:08:18:528 EST] 002e HttpShardHand I
> org.apache.solr.handler.component.HttpShardHandlerFactory getParameter
> Setting corePoolSize to: 0
>
> [2/26/13 15:08:18:530 EST] 002e HttpShardHand I
> org.apache.solr.handler.component.HttpShardHandlerFactory getParameter
> Setting maximumPoolSize to: 2147483647
>
> [2/26/13 15:08:18:533 EST] 002e HttpShardHand I
> org.apache.solr.handler.component.HttpShardHandlerFactory getParameter
> Setting maxThreadIdleTime to: 5
>
> [2/26/13 15:08:18:535 EST] 002e HttpShardHand I
> org.apache.solr.handler.component.HttpShardHandlerFactory getParameter
> Setting sizeOfQueue to: -1
>
> [2/26/13 15:08:18:538 EST] 002e HttpShardHand I
> org.apache.solr.handler.component.HttpShardHandlerFactory getParameter
> Setting fairnessPolicy to: false
>
> [2/26/13 15:08:18:614 EST] 002e HttpClientUti I
> org.apache.solr.client.solrj.impl.HttpClientUtil createClient Creating new
> http client,
> config:maxConnectionsPerHost=20&maxConnections=1&socketTimeout=0&connTimeout=0&retry=false
> 
>
> [2/26/13 15:08:18:942 EST] 002e CoreContainer I
> org.apache.solr.core.CoreContainer load Registering Log Listener
>
> [2/26/13 15:08:18:966 EST] 002e CoreContainer I
> org.apache.solr.core.CoreContainer load loading shared library:
> /usr/local/pfs/conf/solr/lib
>
> [2/26/13 15:08:19:035 EST] 0035 CoreContainer I
> org.apache.solr.core.CoreContainer create Creating SolrCore 'client' using
> instanceDir: /usr/local/pfs/conf/solr/client
>
> [2/26/13 15:08:19:051 EST] 0035 SolrResourceL I
> org.apache.solr.core.SolrResourceLoader  new SolrResourceLoader for
> directory: '/usr/local/pfs/conf/solr/client/'
>
> [2/26/13 15:08:19:116 EST] 0035 SolrConfigI
> org.apache.solr.core.SolrConfig initLibs Adding specified lib dirs to
> ClassLoader
>
> [2/26/13 15:08:19:215 EST] 0035 SolrConfigI
> org.apache.solr.core.SolrConfig  Using Lucene MatchVersion: LUCENE_40
> 
>
> [2/26/13 15:08:19:493 EST] 0035 ConfigI
> org.apache.solr.core.SolrConfig  Loaded SolrConfig: solrconfig.xml
>
> [2/26/13 15:08:19:527 EST] 0035 IndexSchema   I
> org.apache.solr.schema.IndexSchema readSchema Reading Solr Schema
>
> [2/26/13 15:08:19:555 EST] 0035 IndexSchema   I
> org.apache.solr.schema.IndexSchema readSchema Schema name=client
>
> [2/26/13 15:08:20:588 EST] 0035 IndexSchema   I
> org.apache.solr.schema.IndexSchema readSchema unique key field: universalId
> 
>
> [2/26/13 15:08:20:916 EST] 0035 SolrCore  I
> org.apache.solr.core.SolrCore  [client] Opening new SolrCore at
> /usr/local/pfs/conf/solr/client/, dataDir=/usr/local/pfs/solr/client/
>
> [2/26/13 15:08:20:920 EST] 0035 SolrCore  I
> org.apache.solr.core.SolrCore  JMX monitoring not detected for core:
> client
>
> [2/26/13 15:08:20:939 EST] 0035 SolrCore  I
> org.apache.solr

Re: Role of zookeeper at runtime

2013-02-28 Thread Erick Erickson
To update, at least one node must be up for each shard,
otherwise updates fail.

Solr replication works fine in 4.x; in fact it's used to synchronize
when bulk updates happen (say you bring up a new node).
The transaction logs are only used to store at least the last 100
documents for synchronizing.

I haven't personally tried it, but I'd guess it's possible to set up dc2 NOT
as part of a cluster (i.e. not ZK aware) and just have it use old-style
replication. But why do this? "avoiding indexing in both DCs" strikes
me as a false savings. Just set up two independent Solr clusters, one
in each DC and send the docs to each DC. Only go to more complex
solutions if you can demonstrate that this doesn't work; that would be my
first approach.

Best
Erick


On Tue, Feb 26, 2013 at 6:49 PM, varun srivastava wrote:

> So does it mean that while doing a "document add" the state of the cluster
> is fetched from zookeeper and then depending upon the hash of the docid the
> target shard is decided?
>
> Assume we have 3 shards (with no replicas) in which 1 went down while
> indexing, so will all the documents be routed to the remaining 2 shards,
> or will only 2/3 of the documents be indexed? If the answer is that the
> remaining 2 shards will get all the documents, then if the 3rd shard later
> comes back online will SolrCloud do rebalancing?
>
> Do we store anywhere in zookeeper the range of docids stored in each shard,
> or any other information about the actual docs? We have 2 datacentres (dc1 and
> dc2) which need to be indexed with exactly same data and we update index
> only once a day. Both dc1 and dc2 have exact same solrcloud config and
> machines.
>
>  Can we populate dc2 by just copying all the index binaries from
> solr-cores/core0/data of dc1, to the machines in dc2 ( to avoid indexing
> same documents on dc2). I guess solr replication API doesn't work in
> solrcloud, hence looking for a workaround.
>
> Thanks
> Varun
>
> On Tue, Feb 26, 2013 at 3:34 PM, Mark Miller 
> wrote:
>
> > ZooKeeper
> > /
> >  /clusterstate.json - info about the layout and state of the cluster -
> > collections, shards, urls, etc
> >  /collections - config to use for the collection, shard leader voting zk
> > nodes
> >  /configs - sets of config files
> >  /live_nodes - ephemeral nodes, one per Solr node
> >  /overseer - work queue for update clusterstate.json, creating new
> > collections, etc
> >  /overseer_elect - overseer voting zk nodes
> >
> > - Mark
> >
> > On Feb 26, 2013, at 6:18 PM, varun srivastava 
> > wrote:
> >
> > > Hi Mark,
> > > One more question
> > >
> > > While doing solr doc update/add what information is required from
> > zookeeper
> > > ? Can you tell what all information is stored in zookeeper other than
> the
> > > startup configs.
> > >
> > > Thanks
> > > Varun
> > >
> > > On Tue, Feb 26, 2013 at 3:09 PM, Mark Miller 
> > wrote:
> > >
> > >>
> > >> On Feb 26, 2013, at 5:25 PM, varun srivastava  >
> > >> wrote:
> > >>
> > >>> Hi All,
> > >>> I have some questions regarding role of zookeeper in solrcloud
> runtime,
> > >>> while processing the queries .
> > >>>
> > >>> 1) Is zookeeper cluster referred by solr shards for processing every
> > >>> request, or its only used to copy config on startup time ?
> > >>
> > >> No, it's not used per request. Solr talks to ZooKeeper on SolrCore
> > startup
> > >> - to get configs and set itself up. Then it only talks to ZooKeeper
> > when a
> > >> cluster state change happens - in that case, ZooKeeper pings Solr and
> > Solr
> > >> will get an update view of the cluster. That view is cached and used
> for
> > >> requests. In a stable state, Solr is not talking to ZooKeeper other
> than
> > >> the heartbeat they keep to know a node is up.
> > >>
> > >>> 2) How loadbalancing is done between replicas ? Is traffic stat
> shared
> > >>> through zookeeper ?
> > >>
> > >> Basic round robin. Traffic stats are not currently in Zk.
> > >>
> > >>> 3) If for any reason zookeeper cluster goes offline for sometime,
> does
> > >> solr
> > >>> cloud will not be able to server any traffic ?
> > >>
> > >> It will stop allowing updates, but continue serving searches.
> > >>
> > >> - Mark
> > >>
> > >>>
> > >>>
> > >>> Thanks
> > >>> Varun
> > >>
> > >>
> >
> >
>


Re: solrCloud insert data steps

2013-02-28 Thread Erick Erickson
Why do you want to know? General issues or are you having
a specific problem?

But here's the flow as I understand it.

Let's say a leader receives a doc
  - if the doc is for this shard, forward the doc to all replicas and
    collect the results before responding
  - if the doc is for a different shard, forward the request to the leader
    of that shard and wait for the results before responding

Let's say a replica receives a doc
  - forward it to the leader for the correct shard and wait for the response
    before responding to the client. The leader does as above.

Hope that helps,
Erick



On Wed, Feb 27, 2013 at 3:03 AM, rulinma  wrote:

> numshards=2
> for example ids(1,4) will go to same shards
> I have shard1(ip1,ip2,ip3) shard2(ip4,ip5,ip6) that each shard has 3 nodes,
> and ip1 as the lead of shard1.
>
> If i insert ids(1,4), I think basic step is:
>
> 1 search lead, and insert ids(1,4) to the lead locally
> 2 at the lead node, search follow and send addDocs request to (ip2,ip3)
> 3 ip2 and ip3 receive reqeust from the ip1 and insert ids(1,4) async
> 4 lead collect response and return
>
> I don't know if this is correct. If not, give me some suggestions.
>
> Thanks.
>
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/solrCloud-insert-data-steps-tp4043315.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: query builder for solr UI?

2013-02-28 Thread eShard
sorry,
The easiest way to describe it is that we want a "google-like"
experience: if the end user types in a phrase, quotes, or +, - (for AND,
NOT), etc., the UI should be flexible enough to build the correct Solr
query syntax.

How will edismax help?

And I tried simplifying queries by using the copyfield command to copy all
of the metadata to the text field.
So now the only field we have to query is the text field but I doubt that is
going to be a panacea.

Does that make sense?

Thanks,



--
View this message in context: 
http://lucene.472066.n3.nabble.com/query-builder-for-solr-UI-tp4043481p4043643.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: query builder for solr UI?

2013-02-28 Thread Jan Høydahl
Hi,

Have you tried edismax across your original (not the text copyfield) fields?
If not, try it. If yes, which of your expectations did it not satisfy?

Why would you want to "build" a query yourself, when Solr's queryParser is made 
to do just that for you from the input query string?

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Solr Training - www.solrtraining.com

On 28 Feb 2013, at 14:39, eShard wrote:

> sorry,
> The easiest way to describe it is that we want a "google-like"
> experience: if the end user types in a phrase, quotes, or +, - (for AND,
> NOT), etc., the UI should be flexible enough to build the correct Solr
> query syntax.
> 
> How will edismax help?
> 
> And I tried simplifying queries by using the copyfield command to copy all
> of the metadata to the text field.
> So now the only field we have to query is the text field but I doubt that is
> going to be a panacea.
> 
> Does that make sense?
> 
> Thanks,
> 
> 
> 
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/query-builder-for-solr-UI-tp4043481p4043643.html
> Sent from the Solr - User mailing list archive at Nabble.com.



Re: Role of zookeeper at runtime

2013-02-28 Thread Mark Miller

On Feb 26, 2013, at 6:49 PM, varun srivastava  wrote:

> So does it mean that while doing a "document add" the state of the cluster
> is fetched from zookeeper and then depending upon the hash of the docid the
> target shard is decided?

We keep the ZooKeeper info cached locally. We only update it when ZooKeeper
tells us it has changed.

> 
> Assume we have 3 shards (with no replicas) in which 1 went down while
> indexing, so will all the documents be routed to the remaining 2 shards,
> or will only 2/3 of the documents be indexed? If the answer is that the
> remaining 2 shards will get all the documents, then if the 3rd shard later
> comes back online will SolrCloud do rebalancing?

All of the updates that hash to the third shard will fail. That is why we have 
replicas - if you have a replica, it will take over as the leader.

> 
> Do we store anywhere in zookeeper the range of docids stored in each shard,
> or any other information about the actual docs?

The range of hashes are stored for each shard in zk.

> We have 2 datacentres (dc1 and
> dc2) which need to be indexed with exactly same data and we update index
> only once a day. Both dc1 and dc2 have exact same solrcloud config and
> machines.
> 
> Can we populate dc2 by just copying all the index binaries from
> solr-cores/core0/data of dc1, to the machines in dc2 ( to avoid indexing
> same documents on dc2). I guess solr replication API doesn't work in
> solrcloud, hence looking for a workaround.
> 
> Thanks
> Varun
> 
> On Tue, Feb 26, 2013 at 3:34 PM, Mark Miller  wrote:
> 
>> ZooKeeper
>> /
>> /clusterstate.json - info about the layout and state of the cluster -
>> collections, shards, urls, etc
>> /collections - config to use for the collection, shard leader voting zk
>> nodes
>> /configs - sets of config files
>> /live_nodes - ephemeral nodes, one per Solr node
>> /overseer - work queue for update clusterstate.json, creating new
>> collections, etc
>> /overseer_elect - overseer voting zk nodes
>> 
>> - Mark
>> 
>> On Feb 26, 2013, at 6:18 PM, varun srivastava 
>> wrote:
>> 
>>> Hi Mark,
>>> One more question
>>> 
>>> While doing solr doc update/add what information is required from
>> zookeeper
>>> ? Can you tell what all information is stored in zookeeper other than the
>>> startup configs.
>>> 
>>> Thanks
>>> Varun
>>> 
>>> On Tue, Feb 26, 2013 at 3:09 PM, Mark Miller 
>> wrote:
>>> 
 
 On Feb 26, 2013, at 5:25 PM, varun srivastava 
 wrote:
 
> Hi All,
> I have some questions regarding role of zookeeper in solrcloud runtime,
> while processing the queries .
> 
> 1) Is zookeeper cluster referred by solr shards for processing every
> request, or its only used to copy config on startup time ?
 
 No, it's not used per request. Solr talks to ZooKeeper on SolrCore
>> startup
 - to get configs and set itself up. Then it only talks to ZooKeeper
>> when a
 cluster state change happens - in that case, ZooKeeper pings Solr and
>> Solr
 will get an update view of the cluster. That view is cached and used for
 requests. In a stable state, Solr is not talking to ZooKeeper other than
 the heartbeat they keep to know a node is up.
 
> 2) How loadbalancing is done between replicas ? Is traffic stat shared
> through zookeeper ?
 
 Basic round robin. Traffic stats are not currently in Zk.
 
> 3) If for any reason zookeeper cluster goes offline for sometime, does
 solr
> cloud will not be able to server any traffic ?
 
 It will stop allowing updates, but continue serving searches.
 
 - Mark
 
> 
> 
> Thanks
> Varun
 
 
>> 
>> 



Re: solr 4.1 - trying to create 2 collection with 2 different sets of configurations

2013-02-28 Thread Shankar Sundararaju
You may also have to link the config name with the collection name in
ZooKeeper. Here's the command to do it:

cloud-scripts/zkcli.sh -cmd linkconfig -zkhost localhost:9983 -collection
COLLECTION_1_NAME -confname CONF_1_NAME

Rafał, did having the config name the same as the collection name allow you
to create collections without having to link the corresponding config names?
I did not try this myself.

Thanks
-Shankar


On Thu, Feb 28, 2013 at 4:41 AM, Rafał Kuć  wrote:

> Hello!
>
> You can try doing the following:
>
> 1. Run Solr with no collection and no cores, just an empty solr.xml
>
> 2. If you don't have a ZooKeeper run Solr with -DzkRun
>
> 3. Upload your configurations to ZooKeeper, by running
>
> cloud-scripts/zkcli.sh -cmd upconfig -zkhost localhost:9983 -confdir
> CONFIGURATION_1_DIR -confname COLLECTION_1_NAME
>
> and
>
> cloud-scripts/zkcli.sh -cmd upconfig -zkhost localhost:9983 -confdir
> CONFIGURATION_2_DIR -confname COLLECTION_2_NAME
>
> 4. Create those two collections:
>
> curl '
> http://localhost:8983/solr/admin/collections?action=CREATE&name=COLLECTION_1_NAME&numShards=2&replicationFactor=0
> '
>
> and
>
> curl '
> http://localhost:8983/solr/admin/collections?action=CREATE&name=COLLECTION_2_NAME&numShards=2&replicationFactor=0
> '
>
> Of course CONFIGURATION_1_DIR and CONFIGURATION_2_DIR are the
> directories where your configurations are stored, and
> COLLECTION_1_NAME and COLLECTION_2_NAME are your collection names.
>
> Also adjust the numShards and replicationFactor to your needs.
>
> --
> Regards,
>  Rafał Kuć
>  Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch - ElasticSearch
>
> > solr 4.1 - trying to create 2 collections with 2 different sets of
> > configurations.
> > Anyone accomplished this?
>
> > If I run bootstrap twice on different conf dirs, I get both of them in
> > ZooKeeper, but using the collections API to create a collection with
> > collection.configName=seconfConf doesn't work.
>
> > any idea?
>
> > thanks.
>
>
>
> > --
> > View this message in context:
> >
> http://lucene.472066.n3.nabble.com/solr-4-1-trying-to-create-2-collection-with-2-different-sets-of-configurations-tp4043609.html
> > Sent from the Solr - User mailing list archive at Nabble.com.
>
>


-- 
Regards,
Shankar Sundararaju
Sr. Software Architect
ebrary, a ProQuest company
410 Cambridge Avenue, Palo Alto, CA 94306 USA
shan...@ebrary.com | www.ebrary.com | 650-475-8776 (w) | 408-426-3057 (c)


Re: query builder for solr UI?

2013-02-28 Thread eShard
Good question,
if the user types in special characters like the dash -
How will I know to treat it like a dash or the NOT operator? The first one
will need to be URL encoded, the second one won't, resulting in very
different queries.

So I apologize for not being more clear; what I'm really after is making it
easy for the user to communicate exactly what they are looking for, and
URL encoding their input correctly. That's what I meant by "query building".

Thanks,





--
View this message in context: 
http://lucene.472066.n3.nabble.com/query-builder-for-solr-UI-tp4043481p4043659.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: filter query on multi-Valued field

2013-02-28 Thread Jack Krupansky
First, what is your unique key field? If it is "id", then only one of these 
documents will be stored since they have the same id values.


Please provide the exact request URL so we can see exactly what the q and fq 
parameters look like. Your fq looks malformed, but it's hard to say for sure 
without the exact literal URL.


It appears that your multi-valued "initials" field actually has only a 
single value which is a string that contains keywords delimited by blanks. 
You have two choices: 1) make "initials" a tokenized/text field, or 2) be 
sure to add the second set of initials as a separate value, such as: 
"initials":["MKN", "JRT"]


-- Jack Krupansky

-Original Message- 
From: Deepak

Sent: Thursday, February 28, 2013 8:05 AM
To: solr-user@lucene.apache.org
Subject: filter query on multi-Valued field

Hi,

How can I filter results using a filter query on a multi-valued field? Here
are two sample records:

{ "sub_count":8, "long_name":"Mike", "first_name":"John", "id":45949,
  "sym":"TEST", "type":"T", "last_name":"Account", "person_id":"3613",
  "short_name":"Molly", "initials":["ABC XYZ"],
  "timestamp":"2013-02-28T02:44:02.235Z" }

{ "sub_count":8, "long_name":"Mike", "first_name":"John", "id":45949,
  "sym":"TEST", "type":"T", "last_name":"Account", "person_id":"3613",
  "short_name":"Molly", "initials":["MKN JRT"],
  "timestamp":"2013-02-28T02:44:02.235Z" }

Note, all values are the same except the initials field value.

Solr query criteria: q:sym:TEST fq:initials[MKN]

This doesn't return the second record. What am I missing here?
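For option 2), the schema side is just a multiValued field - a sketch assuming
the stock "string" type exists. Each set of initials then has to be sent as
its own array element ("initials":["MKN","JRT"]), after which fq=initials:MKN
matches the second record:

  <field name="initials" type="string" indexed="true" stored="true"
         multiValued="true"/>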



--
View this message in context: 
http://lucene.472066.n3.nabble.com/filter-query-on-multi-Valued-field-tp4043621.html
Sent from the Solr - User mailing list archive at Nabble.com. 



Solr 3.6 - Out Of Memory Exception

2013-02-28 Thread Manivannan Selvadurai
Hi,

  I'm using Solr 3.6 on Tomcat 6, Xmx is set to 4096m.

I have indexed about 61075834 documents using a shingle filter with max
shingle size 3. Basically I have a lot of terms. Whenever I send 3-4
queries at a time to get the term vector component, I get the following
exception.

SEVERE: java.lang.OutOfMemoryError: Java heap space
at
org.apache.lucene.search.HitQueue.getSentinelObject(HitQueue.java:76)
at
org.apache.lucene.search.HitQueue.getSentinelObject(HitQueue.java:22)
at
org.apache.lucene.util.PriorityQueue.initialize(PriorityQueue.java:116)
at org.apache.lucene.search.HitQueue.(HitQueue.java:67)
at
org.apache.lucene.search.TopScoreDocCollector.(TopScoreDocCollector.java:275)
at
org.apache.lucene.search.TopScoreDocCollector.(TopScoreDocCollector.java:37)
at
org.apache.lucene.search.TopScoreDocCollector$InOrderTopScoreDocCollector.(TopScoreDocCollector.java:42)
at
org.apache.lucene.search.TopScoreDocCollector$InOrderTopScoreDocCollector.(TopScoreDocCollector.java:40)
at
org.apache.lucene.search.TopScoreDocCollector.create(TopScoreDocCollector.java:258)
at
org.apache.lucene.search.TopScoreDocCollector.create(TopScoreDocCollector.java:238)
at
org.apache.solr.search.SolrIndexSearcher.getDocListNC(SolrIndexSearcher.java:1285)
at
org.apache.solr.search.SolrIndexSearcher.getDocListC(SolrIndexSearcher.java:1178)
at
org.apache.solr.search.SolrIndexSearcher.search(SolrIndexSearcher.java:377)
at
org.apache.solr.handler.component.QueryComponent.process(QueryComponent.java:394)
at
org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:186)
at
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1376)
at
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:365)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:260)
at
org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
at
org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
at
org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
at
org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
at
org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
at
org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
at
org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
at
org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:298)
at
org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:859)
at
org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:588)
at
org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:489)
at java.lang.Thread.run(Thread.java:662)


   Even if the query returns data it takes a lot of time, around 230
sec (Qtime= 23). Is there any way to optimize my index?




-- 
With Thanks,
Manivannan


Re: Custom filter for document permissions

2013-02-28 Thread Colin Hebert
Thank you Timothy,

With the indication you gave me (and the help of this article
http://searchhub.org/2012/02/22/custom-security-filtering-in-solr/ ) I
managed to draft my own filter, but it seems that it doesn't work
quite as I expected.

Here is what I've done so far:
https://github.com/ColinHebert/Sakai-Solr/tree/permission/permission/solr/src/main/java/org/sakaiproject/search/solr/permission/filter

But it seems that the filter is applied on every document matched by a
query (rather than doing that on the range of documents I searched
for).

I've done some tests with 10k+ documents and the query
/select?q=*%3A*&fq={!sakai%20userId=admin}&tv=false&start=0&rows=1
takes ages to execute (and in my application I can see that solr is
trying to apply the filter on absolutely every document).

Cheers,
Colin
Colin Hebert
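One thing to double-check: a filter only runs as a PostFilter when the query
asks for it, i.e. cache=false and cost >= 100, e.g.
fq={!sakai userId=admin cache=false cost=200}. And even then it is consulted
for every document matched by the main query and cheaper filters - numFound
must still be correct - so it cannot be restricted to the rows window. For
completeness, the parser registration in solrconfig.xml would look roughly
like this (the class name is a guess based on the repository layout above):

  <!-- class name guessed from the linked repository's package path -->
  <queryParser name="sakai"
               class="org.sakaiproject.search.solr.permission.filter.SakaiPermissionQParserPlugin"/>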


On 26 February 2013 15:30, Timothy Potter  wrote:
> Hi Colin,
>
> I think a filter is definitely the way to go. Moreover, you should
> look into Solr's PostFilter concept which is intended to work with
> "expensive" filters. Have a look at Yonik's blog post on this topic:
> http://yonik.com/posts/advanced-filter-caching-in-solr/
>
> Cheers,
> Tim
>
> On Tue, Feb 26, 2013 at 7:24 AM, Colin Hebert  wrote:
>> Hi,
>>
>> I have some troubles to figure out the right thing when it comes to
>> filtering results for security reasons.
>>
>> I work on this application that contains documents that are not
>> accessible to everyone, so I want to filter the search results, based
>> on the right to read each document for the user making the search
>> query.
>> To do that, right now, I have a filter on the application side that
>> checks for each document returned by a search query, if it is
>> accessible by the current user, and removes it from the result list if
>> it isn't.
>>
>> That isn't really optimal as you might get a result page with 7
>> results instead of 10 because some results were removed (and if you're
>> smart enough you can figure out the content of those hidden documents
>> by doing many search queries).
>>
>> So I can think of two solutions, either I code a paging system in my
>> application that will take care of those holes in the result list, but
>> it adds quite a lot of work that could be useless if solr can take
>> care of that.
>> The second solution is having solr filtering those results before
>> sending them back.
>>
>> The second solution seems a bit more clean to me, but I'm not sure if
>> it is a good practice or not.
>>
>> The permission system in the application is a bit 'wild', some
>> permissions are based on the day of the week, others on the existence
>> or not of another document, so I can't really get out of this
>> situation by storing more information in the index and using standard
>> filters.
>> If creating a custom filter in Solr isn't too bad, what I was thinking
>> of would require the solr server making a request to the application
>> to check if the user (given as a parameter in the query) can access
>> the document (and that should be done on each document).
>> Note that I will have to do that security check anyways, so the time
>> to do a security check isn't (at least shouldn't) be relevant to the
>> performances of a solution over the other.
>> What will have an impact though is the fact that the solr server has
>> to do a request to the application (network connection) for each
>> document.
>>
>> Colin Hebert


Re: query builder for solr UI?

2013-02-28 Thread Jan Høydahl
Again - what problems did you face when attempting this with the eDismax parser?
Are you saying you are unhappy with the way eDisMax interprets -foo as NOT foo?
A dash on its own - is treated like a dash.

Your JavaScript code would anyway need to handle URL encoding properly so that 
a query input for +foo is sent to Solr as q=%2Bfoo, since the plus otherwise 
would be a space :) So simply urlencode the whole user input when constructing 
your URL.

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Solr Training - www.solrtraining.com

On 28 Feb 2013, at 15:46, eShard wrote:

> Good question,
> if the user types in special characters like the dash -
> How will I know to treat it like a dash or the NOT operator? The first one
> will need to be URL encoded, the second one won't, resulting in very
> different queries.
>
> So I apologize for not being more clear; what I'm really after is making it
> easy for the user to communicate exactly what they are looking for, and
> URL encoding their input correctly. That's what I meant by "query building".
> 
> Thanks,
> 
> 
> 
> 
> 
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/query-builder-for-solr-UI-tp4043481p4043659.html
> Sent from the Solr - User mailing list archive at Nabble.com.



Re: solr 4.1 - trying to create 2 collection with 2 different sets of configurations

2013-02-28 Thread Mark Miller

On Feb 28, 2013, at 9:41 AM, Shankar Sundararaju  wrote:

> did having the config name the same as the collection name allow you to
> create collections without having to link the corresponding config names? I
> did not try this myself.

It should work that way or there is a bug.
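
(For reference, uploading a config whose name matches the collection would look something like this with the zkcli tool that ships in example/cloud-scripts:

sh example/cloud-scripts/zkcli.sh -cmd upconfig -zkhost localhost:2181 -confdir ./conf -confname test

where "test" is also the collection name.)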

- Mark



Re: Solr 3.6 - Out Of Memory Exception

2013-02-28 Thread Jan Høydahl
How much memory on the server in total? For such a large index you should leave 
PLENTY of free memory for the OS to cache your index efficiently.
A quick thing to try is to upgrade to Solr 4.1, as the index size itself will 
shrink dramatically and you will get better utilization of whatever memory you 
have. Also, you should read this blog to optimize your HW resources: 
http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html

My gut feel is that you still need to allocate more than 4G for Solr, until you 
get rid of all OOMs.
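
With Tomcat that would be something along the lines of (a starting point, not a tuned value):

export CATALINA_OPTS="-Xms5g -Xmx5g"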

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Solr Training - www.solrtraining.com

On 28 Feb 2013, at 16:08, Manivannan Selvadurai wrote:

> Hi,
> 
>  I'm using Solr 3.6 on Tomcat 6; Xmx is set to 4096m.
> 
>I have indexed about 61075834 documents using the shingle filter with max
> shingle size 3. Basically I have a lot of terms. Whenever I issue 3-4
> queries at a time to get the term vector component, I get the following
> exception.
> 
> SEVERE: java.lang.OutOfMemoryError: Java heap space
>at
> org.apache.lucene.search.HitQueue.getSentinelObject(HitQueue.java:76)
>at
> org.apache.lucene.search.HitQueue.getSentinelObject(HitQueue.java:22)
>at
> org.apache.lucene.util.PriorityQueue.initialize(PriorityQueue.java:116)
>at org.apache.lucene.search.HitQueue.<init>(HitQueue.java:67)
>at
> org.apache.lucene.search.TopScoreDocCollector.<init>(TopScoreDocCollector.java:275)
>at
> org.apache.lucene.search.TopScoreDocCollector.<init>(TopScoreDocCollector.java:37)
>at
> org.apache.lucene.search.TopScoreDocCollector$InOrderTopScoreDocCollector.<init>(TopScoreDocCollector.java:42)
>at
> org.apache.lucene.search.TopScoreDocCollector$InOrderTopScoreDocCollector.<init>(TopScoreDocCollector.java:40)
>at
> org.apache.lucene.search.TopScoreDocCollector.create(TopScoreDocCollector.java:258)
>at
> org.apache.lucene.search.TopScoreDocCollector.create(TopScoreDocCollector.java:238)
>at
> org.apache.solr.search.SolrIndexSearcher.getDocListNC(SolrIndexSearcher.java:1285)
>at
> org.apache.solr.search.SolrIndexSearcher.getDocListC(SolrIndexSearcher.java:1178)
>at
> org.apache.solr.search.SolrIndexSearcher.search(SolrIndexSearcher.java:377)
>at
> org.apache.solr.handler.component.QueryComponent.process(QueryComponent.java:394)
>at
> org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:186)
>at
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
>at org.apache.solr.core.SolrCore.execute(SolrCore.java:1376)
>at
> org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:365)
>at
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:260)
>at
> org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
>at
> org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
>at
> org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
>at
> org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
>at
> org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
>at
> org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
>at
> org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
>at
> org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:298)
>at
> org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:859)
>at
> org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:588)
>at
> org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:489)
>at java.lang.Thread.run(Thread.java:662)
> 
> 
>   Even when the query returns data, it takes a lot of time, around 230
> sec (Qtime= 23). Is there any way to optimize my index?
> 
> 
> 
> 
> -- 
> *With Thanks,*
> *Manivannan *



Re: solr 4.1 - trying to create 2 collections with 2 different sets of configurations

2013-02-28 Thread Anirudha Jadhav
1. empty ZooKeeper
2. empty index directories for Solr
3. empty solr.xml
3.1 upload / link cfg in ZooKeeper for the test collection
4. start 4 Solr servers on different machines
5. access a server: I see ... that's OK

6. CREATE the collection:
http://hostname:15000/solr/admin/collections?action=CREATE&name=test&numShards=1&replicationFactor=4

this creates one core on each server with one shard named
- test_shard1_replica1
- test_shard1_replica2
- test_shard1_replica3
- test_shard1_replica4
and persists it in solr.xml on each server.

---
REPEAT steps 3-6 to create new collections. We use this a lot with 4.1.
Note: each collection needs a different solr home dir with its own solr.xml.


On Thu, Feb 28, 2013 at 10:25 AM, Mark Miller  wrote:

>
> On Feb 28, 2013, at 9:41 AM, Shankar Sundararaju 
> wrote:
>
> > did having the config name same as collection name allow you to
> > create collections without having to link the corresponding config
> names? I
> > did not try this myself.
>
> It should work that way or there is a bug.
>
> - Mark
>
>


-- 
Anirudha P. Jadhav


Re: geodist() spatial sorting: sort param could not be parsed as a query, and is not a field that exists in the index: geodist()

2013-02-28 Thread David Smiley (@MITRE.org)
Strange.  The code in Solr that produces that error string passes along an
additional exception that has its own, more detailed error message; you'll
see that in the stack trace in the Solr logs, and perhaps in your error
response too, but I'm not sure.

If you remove the sorting, are the search results otherwise right?  I'm
looking at your query and some things look wrong.  Notably, where you refer
to "sfield", this is supposed to be either a request parameter or a
local-param, but you've concatenated it onto the point that precedes it with
an encoded space (%20). Use an ampersand in place of that %20 and see if it
starts working.
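
In other words, the first request (with the %20 separators replaced by &) should look something like:

http://localhost:8983/solr/tt/select/?indent=on&facet=true&fq=countryid:1&fq={!geofilt}&pt=51.8425,5.85278&sfield=geolocation&d=20&q=*:*&start=0&rows=10&fl=id,title,city&facet.mincount=1&sort=geodist()%20asc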

~ David


PeterKerk wrote
> I want to sort the results of my query on distance.
> 
> But I get this error:
> sort param could not be parsed as a query, and is not a field that exists
> in the index: geodist()
> 
> On this query:
> 
> http://localhost:8983/solr/tt/select/?indent=on&facet=true&fq=countryid:1&fq={!geofilt}&pt=51.8425,5.85278%20sfield=geolocation%20d=20&q=*:*&start=0&rows=10&fl=id,title,city&facet.mincount=1&sort=geodist()%20asc
> 
> I also tried:
> http://localhost:8983/solr/tt/select/?indent=on&facet=true&fq=countryid:1&fq={!geofilt&pt=51.8425,5.85278%20sfield=geolocation%20d=20}&q=*:*&start=0&rows=10&fl=id,title,city&facet.mincount=1&sort=geodist()%20asc
> 
> Here's what I have in my schema.xml:
> <fieldType ... class="solr.LatLonType" subFieldSuffix="_coordinate"/>
> <field name="geolocation" ... />
> <dynamicField name="*_coordinate" ... stored="false"/>
> I've been checking this page: http://wiki.apache.org/solr/SpatialSearch
> But that does not mention my error.





-
 Author: http://www.packtpub.com/apache-solr-3-enterprise-search-server/book
--
View this message in context: 
http://lucene.472066.n3.nabble.com/geodist-spatial-sorting-sort-param-could-not-be-parsed-as-a-query-and-is-not-a-field-that-exists-in--tp4043603p4043679.html
Sent from the Solr - User mailing list archive at Nabble.com.


Search in String and Text_en fields simultaneously with edismax

2013-02-28 Thread Burgmans, Tom
I have a field "valueadd" of type String and field "body" of type text_en (with 
tokenization and linguistic processing).

When I search with edismax against field valueadd like this:
q=valueadd:(test . test2)
I see that the parsed query is
(valueadd:test valueadd:. valueadd:test2)~3

Why not (valueadd:test . test2) ? It looks like the query is tokenized while 
field type String doesn't have a tokenizer configured.

I know I could construct my query as:
q=valueadd:"test . test2"
in which case the phrase is searched as a whole against valueadd. But why 
doesn't that happen without quotes?


The reason I ask:
For a simultaneous search in multiple fields I'd like to include field valueadd 
in the qf parameter, which then contains both String and text_en fields, like:
&qf=valueadd body

How can I search both fields simultaneously without duplicating search terms, 
while the query is (whitespace) tokenized for "body" but search as a phrase for 
"valueadd"?

Thanks,
Tom Burgmans



Re: Solr 3.6 - Out Of Memory Exception

2013-02-28 Thread Manivannan Selvadurai
Hi,

Thanks for the quick reply.

Total memory in the server is around 7.5 G. Even though there are around
61075834 docs, the index size is around 44G. I tried changing the
directoryFactory to MMapDirectory; it didn't help. Previously we used
Lucene to query for term vectors using TermVectorMapper, and we didn't
face any issues other than large response times (40 sec).

As our index grew we decided to move to Solr, but now we are facing these
issues: OOM, large QTime, etc.
Is this normal for our index, or are we missing something?

With Thanks
Manivannan.



On Thu, Feb 28, 2013 at 9:01 PM, Jan Høydahl  wrote:

> How much memory on the server in total? For such a large index you should
> leave PLENTY of free memory for the OS to cache your index efficiently.
> A quick thing to try is to upgrade to Solr4.1, as the index size itself
> will shrink dramatically and you will get better utilization of whatever
> memory you have. Also, you should read this blog to try optimize your HW
> resources
> http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html
>
> My gut feel is that you still need to allocate more than 4G for Solr,
> until you get rid of all OOMs.
>
> --
> Jan Høydahl, search solution architect
> Cominvent AS - www.cominvent.com
> Solr Training - www.solrtraining.com
>
> On 28 Feb 2013, at 16:08, Manivannan Selvadurai <manivan...@unmetric.com> wrote:
>
> > Hi,
> >
> >  Im using Solr 3.6 on Tomcat 6, Xmx is set to 4096m.
> >
> >I have indexed about 61075834 documents using shingle filter  with max
> > shingle size 3. Basically i have a lot of terms. Whenever i request 3-4
> > queries at a time to to get the termvector component, I get the following
> > exception.
> >
> > SEVERE: java.lang.OutOfMemoryError: Java heap space
> >at
> > org.apache.lucene.search.HitQueue.getSentinelObject(HitQueue.java:76)
> >at
> > org.apache.lucene.search.HitQueue.getSentinelObject(HitQueue.java:22)
> >at
> > org.apache.lucene.util.PriorityQueue.initialize(PriorityQueue.java:116)
> >at org.apache.lucene.search.HitQueue.<init>(HitQueue.java:67)
> >at
> > org.apache.lucene.search.TopScoreDocCollector.<init>(TopScoreDocCollector.java:275)
> >at
> > org.apache.lucene.search.TopScoreDocCollector.<init>(TopScoreDocCollector.java:37)
> >at
> > org.apache.lucene.search.TopScoreDocCollector$InOrderTopScoreDocCollector.<init>(TopScoreDocCollector.java:42)
> >at
> > org.apache.lucene.search.TopScoreDocCollector$InOrderTopScoreDocCollector.<init>(TopScoreDocCollector.java:40)
> >at
> >
> org.apache.lucene.search.TopScoreDocCollector.create(TopScoreDocCollector.java:258)
> >at
> >
> org.apache.lucene.search.TopScoreDocCollector.create(TopScoreDocCollector.java:238)
> >at
> >
> org.apache.solr.search.SolrIndexSearcher.getDocListNC(SolrIndexSearcher.java:1285)
> >at
> >
> org.apache.solr.search.SolrIndexSearcher.getDocListC(SolrIndexSearcher.java:1178)
> >at
> >
> org.apache.solr.search.SolrIndexSearcher.search(SolrIndexSearcher.java:377)
> >at
> >
> org.apache.solr.handler.component.QueryComponent.process(QueryComponent.java:394)
> >at
> >
> org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:186)
> >at
> >
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
> >at org.apache.solr.core.SolrCore.execute(SolrCore.java:1376)
> >at
> >
> org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:365)
> >at
> >
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:260)
> >at
> >
> org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
> >at
> >
> org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
> >at
> >
> org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
> >at
> >
> org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
> >at
> >
> org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
> >at
> >
> org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
> >at
> >
> org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
> >at
> >
> org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:298)
> >at
> >
> org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:859)
> >at
> >
> org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:588)
> >at
> > org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:489)
> >at java.lang.Thread.run(Thread.java:662)
> >
> >
> >   Even when the query returns data, it takes a lot of time, around 230
> > sec (Qtime= 23). Is there any way to optimize my index?
> >
> >
> >
> 

Re: geodist() spatial sorting: sort param could not be parsed as a query, and is not a field that exists in the index: geodist()

2013-02-28 Thread PeterKerk
You were right, sloppy on my side. I replaced the %20 with & (in more than 1
place) and now it does work. Thanks.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/geodist-spatial-sorting-sort-param-could-not-be-parsed-as-a-query-and-is-not-a-field-that-exists-in--tp4043603p4043686.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Search in String and Text_en fields simultaneously with edismax

2013-02-28 Thread Jack Krupansky
Query text is always "tokenized" (more properly, "parsed"), unless the text 
is enclosed in quotes or spaces are escaped with backslash. Try:


q=valueadd:"test . test2"

or

q=valueadd:test\ .\ test2

Parentheses simply provide grouping, either to control boolean operator 
evaluation order or to apply a field name to a sequence of query tokens (as 
you have written.)


The analyzer or field type is only consulted when the query is generated, 
not while it is being parsed. The same identical parsing rules apply to both 
tokenized and non-tokenized fields. What a field type's analyzer does with 
its value is irrelevant to query parsing.


-- Jack Krupansky

-Original Message- 
From: Burgmans, Tom

Sent: Thursday, February 28, 2013 10:48 AM
To: solr-user@lucene.apache.org
Subject: Search in String and Text_en fields simultaneously with edismax

I have a field "valueadd" of type String and field "body" of type text_en 
(with tokenization and linguistic processing).


When I search with edismax against field valueadd like this:
q=valueadd:(test . test2)
I see that the parsed query is
(valueadd:test valueadd:. valueadd:test2)~3

Why not (valueadd:test . test2) ? It looks like the query is tokenized while 
field type String doesn't have a tokenizer configured.


I know I could construct my query as:
q=valueadd:"test . test2"
in which case the phrase is searched as a whole against valueadd. But why 
doesn't that happen without quotes?



The reason I ask:
For a simultaneous search in multiple fields I like to include field 
valueadd in the qf parameter which contains String and text_en fields, like:

&qf=valueadd body

How can I search both fields simultaneously without duplicating search 
terms, while the query is (whitespace) tokenized for "body" but search as a 
phrase for "valueadd"?


Thanks,
Tom Burgmans

This email and any attachments may contain confidential or privileged 
information
and is intended for the addressee only. If you are not the intended 
recipient, please
immediately notify us by email or telephone and delete the original email 
and attachments
without using, disseminating or reproducing its contents to anyone other 
than the intended
recipient. Wolters Kluwer shall not be liable for the incorrect or 
incomplete transmission of

of this email or any attachments, nor for unauthorized use by its employees.

Wolters Kluwer nv has its registered address in Alphen aan den Rijn, The 
Netherlands, and is registered
with the Trade Registry of the Dutch Chamber of Commerce under number 
33202517. 



Re: Solr 3.6 - Out Of Memory Exception

2013-02-28 Thread Jack Krupansky

As a general guide: use the following process.

1. Set your JVM heap to a fairly large size.
2. Load Solr.
3. Do a bunch of common queries that cover the range of what production will 
see. Be sure to use the most expensive operations you expect, such as facets 
and filters, and all of the fields that might be referenced. The idea is to 
force Lucene and Solr to load various caches.

4. Check the available JVM heap memory (one quick way to do this is shown 
after this list).
5. Reset the JVM heap limit to that number plus a reasonable margin, such as 
at least 250M. The goal is to have enough for Solr to work, but not so much 
that tons of Java garbage can accumulate, which will eventually cause very 
slow garbage collections.

6. Restart Solr.
7. Re-execute the test queries.
8. Verify that the available JVM heap is still reasonable, like 250M.
9. Make sure that available OS system memory outside of the JVM is at least 
half of your index, at a minimum. Caching the full index in OS system memory 
is preferable.
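
For step 4, one quick way to check heap usage after warming is the JDK's 
jstat tool, e.g. (the pid placeholder is hypothetical):

jstat -gcutil <solr-jvm-pid> 1000

or simply watch the JVM memory display on the Solr admin page.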


In short, it sounds like your system is woefully underconfigured for your 
index. If it happens to run some of the time, consider yourself very lucky. 
If it doesn't run reasonably well, which seems to be the case, start by 
properly configuring it with sufficient OS system memory and enough but not 
too much Java JVM heap memory.


-- Jack Krupansky

-Original Message- 
From: Manivannan Selvadurai

Sent: Thursday, February 28, 2013 10:58 AM
To: solr-user
Subject: Re: Solr 3.6 - Out Of Memory Exception

hi,

Thanks for the quick reply,

Total memory in the server is around 7.5 G, Even though there are around
61075834 docs the index size is around
44G. I tried changing the directoryFactory to MMapDirectory, it didnt help.
Previously we used Lucene to query for term vectors using
TermVectorMapper, We didnt face any issues other than large response
time(40 sec).

As our index grew we decided to move to Solr, but now we are facing these
issues, OOM, large QTime etc.
Is this normal for our index or are we missing some thing?

With Thanks
Manivannan.



On Thu, Feb 28, 2013 at 9:01 PM, Jan Høydahl  wrote:


How much memory on the server in total? For such a large index you should
leave PLENTY of free memory for the OS to cache your index efficiently.
A quick thing to try is to upgrade to Solr4.1, as the index size itself
will shrink dramatically and you will get better utilization of whatever
memory you have. Also, you should read this blog to try optimize your HW
resources
http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html

My gut feel is that you still need to allocate more than 4G for Solr,
until you get rid of all OOMs.

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Solr Training - www.solrtraining.com

On 28 Feb 2013, at 16:08, Manivannan Selvadurai <manivan...@unmetric.com> wrote:

> Hi,
>
>  Im using Solr 3.6 on Tomcat 6, Xmx is set to 4096m.
>
>I have indexed about 61075834 documents using shingle filter  with 
> max

> shingle size 3. Basically i have a lot of terms. Whenever i request 3-4
> queries at a time to to get the termvector component, I get the 
> following

> exception.
>
> SEVERE: java.lang.OutOfMemoryError: Java heap space
>at
> org.apache.lucene.search.HitQueue.getSentinelObject(HitQueue.java:76)
>at
> org.apache.lucene.search.HitQueue.getSentinelObject(HitQueue.java:22)
>at
> org.apache.lucene.util.PriorityQueue.initialize(PriorityQueue.java:116)
>at org.apache.lucene.search.HitQueue.<init>(HitQueue.java:67)
>at org.apache.lucene.search.TopScoreDocCollector.<init>(TopScoreDocCollector.java:275)
>at org.apache.lucene.search.TopScoreDocCollector.<init>(TopScoreDocCollector.java:37)
>at org.apache.lucene.search.TopScoreDocCollector$InOrderTopScoreDocCollector.<init>(TopScoreDocCollector.java:42)
>at org.apache.lucene.search.TopScoreDocCollector$InOrderTopScoreDocCollector.<init>(TopScoreDocCollector.java:40)
>at
>
org.apache.lucene.search.TopScoreDocCollector.create(TopScoreDocCollector.java:258)
>at
>
org.apache.lucene.search.TopScoreDocCollector.create(TopScoreDocCollector.java:238)
>at
>
org.apache.solr.search.SolrIndexSearcher.getDocListNC(SolrIndexSearcher.java:1285)
>at
>
org.apache.solr.search.SolrIndexSearcher.getDocListC(SolrIndexSearcher.java:1178)
>at
>
org.apache.solr.search.SolrIndexSearcher.search(SolrIndexSearcher.java:377)
>at
>
org.apache.solr.handler.component.QueryComponent.process(QueryComponent.java:394)
>at
>
org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:186)
>at
>
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
>at org.apache.solr.core.SolrCore.execute(SolrCore.java:1376)
>at
>
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:365)
>at
>
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:260)
> 

RE: Search in String and Text_en fields simultaneously with edismax

2013-02-28 Thread Burgmans, Tom
Ah OK. I didn't have a good view of query parsing vs query generation. Thanks 
for clearing this up.

So it means that searching a tokenized and a non-tokenized field
simultaneously is not possible when I want
- the expression treated as a phrase for the non-tokenized field
- the expression treated as multiple tokens for the tokenized field
?

If possible, I'd like to avoid writing my own query parser.



-Original Message-
From: Jack Krupansky [mailto:j...@basetechnology.com]
Sent: Thursday 28 February 2013 05:05
To: solr-user@lucene.apache.org
Subject: Re: Search in String and Text_en fields simultaneously with edismax

Query text is always "tokenized" (more properly, "parsed"), unless the text
is enclosed in quotes or spaces are escaped with backslash. Try:

q=valueadd:"test . test2"

or

q=valueadd:test\ .\ test2

Parentheses simply provide grouping, either to control boolean operator
evaluation order or to apply a field name to a sequence of query tokens (as
you have written.)

The analyzer or field type is only consulted when the query is generated,
not while it is being parsed. The same identical parsing rules apply to both
tokenized and non-tokenized fields. What a field type's analyzer does with
its value is irrelevant to query parsing.

-- Jack Krupansky

-Original Message-
From: Burgmans, Tom
Sent: Thursday, February 28, 2013 10:48 AM
To: solr-user@lucene.apache.org
Subject: Search in String and Text_en fields simultaneously with edismax

I have a field "valueadd" of type String and field "body" of type text_en
(with tokenization and linguistic processing).

When I search with edismax against field valueadd like this:
q=valueadd:(test . test2)
I see that the parsed query is
(valueadd:test valueadd:. valueadd:test2)~3

Why not (valueadd:test . test2) ? It looks like the query is tokenized while
field type String doesn't have a tokenizer configured.

I know I could construct my query as:
q=valueadd:"test . test2"
in which case the phrase is searched as a whole against valueadd. But why
doesn't that happen without quotes?


The reason I ask:
For a simultaneous search in multiple fields I like to include field
valueadd in the qf parameter which contains String and text_en fields, like:
&qf=valueadd body

How can I search both fields simultaneously without duplicating search
terms, while the query is (whitespace) tokenized for "body" but search as a
phrase for "valueadd"?

Thanks,
Tom Burgmans



Re: SolrCloud as my primary data store

2013-02-28 Thread Michael Sokolov

On 02/21/2013 12:02 AM, jimtronic wrote:

Now that I've been running Solr Cloud for a couple months and gotten
comfortable with it, I think it's time to revisit this subject.


   

...

I'd really like to hear from someone who has made the leap.

Cheers, Jim
   
We use Solr as our primary storage for XML documents for many customer 
installs, and we're quite happy with performance, reliability, update 
speed, and so on.  We don't handle interactive updates from end users: 
only batch updates from an upstream content authoring system of some 
kind.  In this situation, we're quite content.  I looked at other NoSQL 
db's, but couldn't find a compelling reason to switch, and some 
disadvantages in some cases.


-Mike


Re: Search in String and Text_en fields simultaneously with edismax

2013-02-28 Thread Jack Krupansky
The analyzer/query generator for a tokenized field will in fact tokenize the 
value in quotes, but it will generate a "phrase query" to assure that the 
list of terms occur as a phrase in the index.
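
So with q=valueadd:"test . test2" and qf=valueadd body, the parsed query 
should come out roughly like this (a sketch; the exact debug output depends 
on your analyzers):

+DisjunctionMaxQuery((valueadd:test . test2 | body:"test test2"))

i.e. the whole input as a single term against the string field, and a phrase 
query over the tokens "test" and "test2" against the text_en field.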


-- Jack Krupansky

-Original Message- 
From: Burgmans, Tom

Sent: Thursday, February 28, 2013 11:33 AM
To: solr-user@lucene.apache.org
Subject: RE: Search in String and Text_en fields simultaneously with edismax

Ah OK. I didn't have a good view of query parsing vs query generation. 
Thanks for clearing this up.


So it means that searching in a tokenized and non-tokenized field 
simultaneously is not possible when I want

- the expression parsed as phrase for the non-tokenized field
- the expression parsed as multiple tokens for the tokenized field
?

If possible, I'd like to avoid writing my own query parser.



-Original Message-
From: Jack Krupansky [mailto:j...@basetechnology.com]
Sent: Thursday 28 February 2013 05:05
To: solr-user@lucene.apache.org
Subject: Re: Search in String and Text_en fields simultaneously with edismax

Query text is always "tokenized" (more properly, "parsed"), unless the text
is enclosed in quotes or spaces are escaped with backslash. Try:

q=valueadd:"test . test2"

or

q=valueadd:test\ .\ test2

Parentheses simply provide grouping, either to control boolean operator
evaluation order or to apply a field name to a sequence of query tokens (as
you have written.)

The analyzer or field type is only consulted when the query is generated,
not while it is being parsed. The same identical parsing rules apply to both
tokenized and non-tokenized fields. What a field type's analyzer does with
its value is irrelevant to query parsing.

-- Jack Krupansky

-Original Message-
From: Burgmans, Tom
Sent: Thursday, February 28, 2013 10:48 AM
To: solr-user@lucene.apache.org
Subject: Search in String and Text_en fields simultaneously with edismax

I have a field "valueadd" of type String and field "body" of type text_en
(with tokenization and linguistic processing).

When I search with edismax against field valueadd like this:
q=valueadd:(test . test2)
I see that the parsed query is
(valueadd:test valueadd:. valueadd:test2)~3

Why not (valueadd:test . test2) ? It looks like the query is tokenized while
field type String doesn't have a tokenizer configured.

I know I could construct my query as:
q=valueadd:"test . test2"
in which case the phrase is searched as a whole against valueadd. But why
doesn't that happen without quotes?


The reason I ask:
For a simultaneous search in multiple fields I like to include field
valueadd in the qf parameter which contains String and text_en fields, like:
&qf=valueadd body

How can I search both fields simultaneously without duplicating search
terms, while the query is (whitespace) tokenized for "body" but search as a
phrase for "valueadd"?

Thanks,
Tom Burgmans




Sort by currency field using OpenExchangeRatesOrgProvider

2013-02-28 Thread marotosg
Hi,
I have a currency field as part of my schema:
<fieldType ... class="solr.CurrencyField" providerClass="solr.OpenExchangeRatesOrgProvider" ratesFileLocation="http://myurwithjsonfile"/>

It works properly when I filter or query using a specific currency.
Let's say I want to filter by USD:
price_c:[10.00,USD TO 100.00,USD]. It returns documents where the currency
is not in USD and it makes the conversion right.

But when I try to sort by this field, it returns the results ordered by raw
amount and doesn't take the currency into consideration.

Do you know if there is a bug around this?

Thanks
Sergio



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Sort-by-currency-field-using-OpenExchangeRatesOrgProvider-tp4043704.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Custom filter for document permissions

2013-02-28 Thread Timothy Potter
Hi Colin,

Your query is *:* so that is every document. Try a query that only
matches a small subset and see if you get different results.

Cheers,
Tim

On Thu, Feb 28, 2013 at 8:17 AM, Colin Hebert  wrote:
> Thank you Timothy,
>
> With the indication you gave me (and the help of this article
> http://searchhub.org/2012/02/22/custom-security-filtering-in-solr/ ) I
> managed to draft my own filter, but it seems that it doesn't work
> quite as I expected.
>
> Here is what I've done so far:
> https://github.com/ColinHebert/Sakai-Solr/tree/permission/permission/solr/src/main/java/org/sakaiproject/search/solr/permission/filter
>
> But it seems that the filter is applied on every document matched by a
> query (rather than doing that on the range of documents I searched
> for).
>
> I've done some tests with 10k+ documents and the query
> /select?q=*%3A*&fq={!sakai%20userId=admin}&tv=false&start=0&rows=1
> takes ages to execute (and in my application I can see that solr is
> trying to apply the filter on absolutely every document).
>
> Cheers,
> Colin
> Colin Hebert
>
>
> On 26 February 2013 15:30, Timothy Potter  wrote:
>> Hi Colin,
>>
>> I think a filter is definitely the way to go. Moreover, you should
>> look into Solr's PostFilter concept which is intended to work with
>> "expensive" filters. Have a look at Yonik's blog post on this topic:
>> http://yonik.com/posts/advanced-filter-caching-in-solr/
>>
>> Cheers,
>> Tim
>>
>> On Tue, Feb 26, 2013 at 7:24 AM, Colin Hebert  wrote:
>>> Hi,
>>>
>>> I have some troubles to figure out the right thing when it comes to
>>> filtering results for security reasons.
>>>
>>> I work on this application that contains documents that are not
>>> accessible to everyone, so I want to filter the search results, based
>>> on the right to read each document for the user making the search
>>> query.
>>> To do that, right now, I have a filter on the application side that
>>> checks for each document returned by a search query, if it is
>>> accessible by the current user, and removes it from the result list if
>>> it isn't.
>>>
>>> That isn't really optimal as you might get a result page with 7
>>> results instead of 10 because some results were removed (and if you're
>>> smart enough you can figure out the content of those hidden documents
>>> by doing many search queries).
>>>
>>> So I can think of two solutions, either I code a paging system in my
>>> application that will take care of those holes in the result list, but
>>> it adds quite a lot of work that could be useless if solr can take
>>> care of that.
>>> The second solution is having solr filtering those results before
>>> sending them back.
>>>
>>> The second solution seems a bit more clean to me, but I'm not sure if
>>> it is a good practice or not.
>>>
>>> The permission system in the application is a bit 'wild', some
>>> permissions are based on the day of the week, others on the existence
>>> or not of another document, so I can't really get out of this
>>> situation by storing more information in the index and using standard
>>> filters.
>>> If creating a custom filter in Solr isn't too bad, what I was thinking
>>> of would require the solr server making a request to the application
>>> to check if the user (given as a parameter in the query) can access
>>> the document (and that should be done on each document).
>>> Note that I will have to do that security check anyways, so the time
>>> to do a security check isn't (at least shouldn't) be relevant to the
>>> performances of a solution over the other.
>>> What will have an impact though is the fact that the solr server has
>>> to do a request to the application (network connection) for each
>>> document.
>>>
>>> Colin Hebert


Re: Custom filter for document permissions

2013-02-28 Thread Colin Hebert
I know that the query selects everything, this is why I made this
request to test my solution.
If a user makes a query with a very large number of results, with
paging, I expected the post filter to be executed only when necessary
(as it can be expensive).

Colin


On 28 February 2013 17:25, Timothy Potter  wrote:
> Hi Colin,
>
> Your query is *:* so that is every document. Try a query that only
> matches a small subset and see if you get different results.
>
> Cheers,
> Tim
>
> On Thu, Feb 28, 2013 at 8:17 AM, Colin Hebert  wrote:
>> Thank you Timothy,
>>
>> With the indication you gave me (and the help of this article
>> http://searchhub.org/2012/02/22/custom-security-filtering-in-solr/ ) I
>> managed to draft my own filter, but it seems that it doesn't work
>> quite as I expected.
>>
>> Here is what I've done so far:
>> https://github.com/ColinHebert/Sakai-Solr/tree/permission/permission/solr/src/main/java/org/sakaiproject/search/solr/permission/filter
>>
>> But it seems that the filter is applied on every document matched by a
>> query (rather than doing that on the range of documents I searched
>> for).
>>
>> I've done some tests with 10k+ documents and the query
>> /select?q=*%3A*&fq={!sakai%20userId=admin}&tv=false&start=0&rows=1
>> takes ages to execute (and in my application I can see that solr is
>> trying to apply the filter on absolutely every document.
>>
>> Cheers,
>> Colin
>> Colin Hebert
>>
>>
>> On 26 February 2013 15:30, Timothy Potter  wrote:
>>> Hi Colin,
>>>
>>> I think a filter is definitely the way to go. Moreover, you should
>>> look into Solr's PostFilter concept which is intended to work with
>>> "expensive" filters. Have a look at Yonik's blog post on this topic:
>>> http://yonik.com/posts/advanced-filter-caching-in-solr/
>>>
>>> Cheers,
>>> Tim
>>>
>>> On Tue, Feb 26, 2013 at 7:24 AM, Colin Hebert  
>>> wrote:
 Hi,

 I have some troubles to figure out the right thing when it comes to
 filtering results for security reasons.

 I work on this application that contains documents that are not
 accessible to everyone, so I want to filter the search results, based
 on the right to read each document for the user making the search
 query.
 To do that, right now, I have a filter on the application side that
 checks for each document returned by a search query, if it is
 accessible by the current user, and removes it from the result list if
 it isn't.

 That isn't really optimal as you might get a result page with 7
 results instead of 10 because some results were removed (and if you're
 smart enough you can figure out the content of those hidden documents
 by doing many search queries).

 So I can think of two solutions, either I code a paging system in my
 application that will take care of those holes in the result list, but
 it adds quite a lot of work that could be useless if solr can take
 care of that.
 The second solution is having solr filtering those results before
 sending them back.

 The second solution seems a bit more clean to me, but I'm not sure if
 it is a good practice or not.

 The permission system in the application is a bit 'wild', some
 permissions are based on the day of the week, others on the existence
 or not of another document, so I can't really get out of this
 situation by storing more information in the index and using standard
 filters.
 If creating a custom filter in Solr isn't too bad, what I was thinking
 of would require the solr server making a request to the application
 to check if the user (given as a parameter in the query) can access
 the document (and that should be done on each document).
 Note that I will have to do that security check anyways, so the time
 to do a security check isn't (at least shouldn't) be relevant to the
 performances of a solution over the other.
 What will have an impact though is the fact that the solr server has
 to do a request to the application (network connection) for each
 document.

 Colin Hebert


Re: Custom filter for document permissions

2013-02-28 Thread Colin Hebert
Actually, after thinking for a bit, it makes sense to apply the post
filter everywhere, otherwise I wouldn't be able to know the number of
results overall (something I unfortunately really need).
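
For reference, here is roughly the shape of the filter (a minimal sketch 
along the lines of the searchhub article; class and field names here are 
illustrative, not the actual Sakai code):

import java.io.IOException;
import org.apache.lucene.index.AtomicReaderContext;
import org.apache.lucene.search.IndexSearcher;
import org.apache.solr.search.DelegatingCollector;
import org.apache.solr.search.ExtendedQueryBase;
import org.apache.solr.search.PostFilter;

public class PermissionFilterQuery extends ExtendedQueryBase implements PostFilter {
  private final String userId;

  public PermissionFilterQuery(String userId) {
    this.userId = userId;
    setCache(false); // post filters must not be cached
    setCost(100);    // cost > 99 makes Solr run this after all cheaper filters
  }

  @Override
  public DelegatingCollector getFilterCollector(IndexSearcher searcher) {
    return new DelegatingCollector() {
      private AtomicReaderContext context;

      @Override
      public void setNextReader(AtomicReaderContext context) throws IOException {
        this.context = context;
        super.setNextReader(context);
      }

      @Override
      public void collect(int doc) throws IOException {
        // 'doc' is segment-relative; read the stored unique key from the
        // segment reader, then ask the application whether userId may read it.
        String id = context.reader().document(doc).get("id");
        if (isAccessible(userId, id)) {
          super.collect(doc); // keep the document in the results
        }
      }
    };
  }

  private boolean isAccessible(String userId, String docId) {
    // placeholder for the (expensive) callback into the application
    return true;
  }
}

Since the collector runs after every other filter, the external permission 
check only sees documents that already matched the rest of the query; but, 
as noted above, it still runs once per matching document so that the total 
hit count is correct.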

Anyways, thank you Timothy
Colin Hebert


On 28 February 2013 17:38, Colin Hebert  wrote:
> I know that the query selects everything, this is why I made this
> request to test my solution.
> If a user make a query with a very large amount of results with
> paging, I expected the post filter to be executed only when necessary
> (as it can be expensive).
>
> Colin
>
>
> On 28 February 2013 17:25, Timothy Potter  wrote:
>> Hi Colin,
>>
>> Your query is *:* so that is every document. Try a query that only
>> matches a small subset and see if you get different results.
>>
>> Cheers,
>> Tim
>>
>> On Thu, Feb 28, 2013 at 8:17 AM, Colin Hebert  wrote:
>>> Thank you Timothy,
>>>
>>> With the indication you gave me (and the help of this article
>>> http://searchhub.org/2012/02/22/custom-security-filtering-in-solr/ ) I
>>> managed to draft my own filter, but it seems that it doesn't work
>>> quite as I expected.
>>>
>>> Here is what I've done so far:
>>> https://github.com/ColinHebert/Sakai-Solr/tree/permission/permission/solr/src/main/java/org/sakaiproject/search/solr/permission/filter
>>>
>>> But it seems that the filter is applied on every document matched by a
>>> query (rather than doing that on the range of documents I searched
>>> for).
>>>
>>> I've done some tests with 10k+ documents and the query
>>> /select?q=*%3A*&fq={!sakai%20userId=admin}&tv=false&start=0&rows=1
>>> takes ages to execute (and in my application I can see that solr is
>>> trying to apply the filter on absolutely every document.
>>>
>>> Cheers,
>>> Colin
>>> Colin Hebert
>>>
>>>
>>> On 26 February 2013 15:30, Timothy Potter  wrote:
 Hi Colin,

 I think a filter is definitely the way to go. Moreover, you should
 look into Solr's PostFilter concept which is intended to work with
 "expensive" filters. Have a look at Yonik's blog post on this topic:
 http://yonik.com/posts/advanced-filter-caching-in-solr/

 Cheers,
 Tim

 On Tue, Feb 26, 2013 at 7:24 AM, Colin Hebert  
 wrote:
> Hi,
>
> I have some troubles to figure out the right thing when it comes to
> filtering results for security reasons.
>
> I work on this application that contains documents that are not
> accessible to everyone, so I want to filter the search results, based
> on the right to read each document for the user making the search
> query.
> To do that, right now, I have a filter on the application side that
> checks for each document returned by a search query, if it is
> accessible by the current user, and removes it from the result list if
> it isn't.
>
> That isn't really optimal as you might get a result page with 7
> results instead of 10 because some results were removed (and if you're
> smart enough you can figure out the content of those hidden documents
> by doing many search queries).
>
> So I can think of two solutions, either I code a paging system in my
> application that will take care of those holes in the result list, but
> it adds quite a lot of work that could be useless if solr can take
> care of that.
> The second solution is having solr filtering those results before
> sending them back.
>
> The second solution seems a bit more clean to me, but I'm not sure if
> it is a good practice or not.
>
> The permission system in the application is a bit 'wild', some
> permissions are based on the day of the week, others on the existence
> or not of another document, so I can't really get out of this
> situation by storing more information in the index and using standard
> filters.
> If creating a custom filter in Solr isn't too bad, what I was thinking
> of would require the solr server making a request to the application
> to check if the user (given as a parameter in the query) can access
> the document (and that should be done on each document).
> Note that I will have to do that security check anyways, so the time
> to do a security check isn't (at least shouldn't) be relevant to the
> performances of a solution over the other.
> What will have an impact though is the fact that the solr server has
> to do a request to the application (network connection) for each
> document.
>
> Colin Hebert


Re: Problems with documents that are added not showing up in index Solr 3.5

2013-02-28 Thread dboychuck
Yes I confirmed in the logs. I have also committed manually several times
using the updatehandler /update?commit=true



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Problems-with-documents-that-are-added-not-showing-up-in-index-Solr-3-5-tp4043539p4043716.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Problems with documents that are added not showing up in index Solr 3.5

2013-02-28 Thread Upayavira
What do you mean by 'will not show up'? Is numdocs wrong? They don't
show in queries?

Upayavira

On Thu, Feb 28, 2013, at 06:07 PM, dboychuck wrote:
> Yes I confirmed in the logs. I have also committed manually several times
> using the updatehandler /update?commit=true
> 
> 
> 
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Problems-with-documents-that-are-added-not-showing-up-in-index-Solr-3-5-tp4043539p4043716.html
> Sent from the Solr - User mailing list archive at Nabble.com.


Re: Problems with documents that are added not showing up in index Solr 3.5

2013-02-28 Thread dboychuck
numdocs is wrong and they will not show up when I search by uniqueid


On Thu, Feb 28, 2013 at 10:45 AM, Upayavira [via Lucene] <
ml-node+s472066n4043724...@n3.nabble.com> wrote:

> What do you mean by 'will not show up'? Is numdocs wrong? They don't
> show in queries?
>
> Upayavira
>
> On Thu, Feb 28, 2013, at 06:07 PM, dboychuck wrote:
> > Yes I confirmed in the logs. I have also committed manually several
> times
> > using the updatehandler /update?commit=true
> >
> >
> >
> > --
> > View this message in context:
> >
> http://lucene.472066.n3.nabble.com/Problems-with-documents-that-are-added-not-showing-up-in-index-Solr-3-5-tp4043539p4043716.html
>
> > Sent from the Solr - User mailing list archive at Nabble.com.
>
>
>



-- 
*David Boychuck*
Software Engineer
Build.com, Inc.  
Smarter Home Improvement™
P.O. Box 7990 Chico, CA 95927
*P*: 800.375.3403
*F*: 530.566.1893
dboych...@build.com | Network of
Stores




--
View this message in context: 
http://lucene.472066.n3.nabble.com/Problems-with-documents-that-are-added-not-showing-up-in-index-Solr-3-5-tp4043539p4043727.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Solr 3.6.1 Query large field

2013-02-28 Thread Chris Hostetter

: Subject: Solr 3.6.1 Query large field
: In-Reply-To: 

https://people.apache.org/~hossman/#threadhijack
Thread Hijacking on Mailing Lists

When starting a new discussion on a mailing list, please do not reply to 
an existing message, instead start a fresh email.  Even if you change the 
subject line of your email, other mail headers still track which thread 
you replied to and your question is "hidden" in that thread and gets less 
attention.   It makes following discussions in the mailing list archives 
particularly difficult.



-Hoss


Re: Zookeeper Error When Trying to Setup SolrCloud on Weblogic

2013-02-28 Thread Mishra, Shikhar
I was able to move past the error by trying out the fix proposed here:

https://gist.github.com/barkbay/4153107


It feels strange to catch RuntimeException, though.


On 2/27/13 2:09 PM, "Mishra, Shikhar"  wrote:

>Hi,
>
>I'm trying to set up Solr Cloud on Weblogic 12c. I've started Zookeeper in
>an independent mode (Host: localhost:2181). My solr.war is deployed on
>Weblogic with startup arguments:
>
>-Dsolr.solr.home=/opt/wwtdomain/solr
>-Dbootstrap_confdir=/opt/wwtdomain/solr/my_core/conf
>-Dcollection.configName=myconf
>-DzkHost=localhost:2181
>
>Here is the error I get upon Solr startup:
>
>java.lang.IllegalArgumentException: No Configuration was registered that
>can handle the configuration named Client
>   at com.bea.common.security.jdkutils.JAASConfiguration.getAppConfigurationEntry(JAASConfiguration.java:130)
>   at org.apache.zookeeper.client.ZooKeeperSaslClient.<init>(ZooKeeperSaslClient.java:97)
>   at org.apache.zookeeper.ClientCnxn$SendThread.startConnect(ClientCnxn.java:943)
>   at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:993)
>Feb 27, 2013 11:38:31 AM org.apache.zookeeper.ClientCnxn$SendThread run
>
>Thanks,
>Shikhar
>



Syntax for sorting by a sub-query

2013-02-28 Thread Edward Rudd
I'm using solr 4.0 in a project and I need to sort the results based on whether 
they match another filter.

e.g.  I have a "worked_companies" multi-integer field that contains the list of 
company ids some person has worked with before.  I have a series of other fq= 
filters to narrow down the list of users and I want to put the users 
containing, say, 61 in that worked_companies before users that don't have it.

I've tried various incarnations of this.

> &sort=query({qf=worked_companies_im:61})


which yields

> Can't determine a Sort Order (asc or desc) in sort spec 
> 'query({qf=worked_companies_im:61})', pos=34


So I added desc

> &sort=query({qf=worked_companies_im:61}) desc

which yields

> sort param could not be parsed as a query, and is not a field that exists in 
> the index: query({qf=worked_companies_im:61})

Any ideas on how to accomplish what I'm trying to do, and how to word the 
sort function, would be extremely appreciated.

Edward Rudd
OutOfOrder.cc
Skype: outoforder_cc








Re: Syntax for sorting by a sub-query

2013-02-28 Thread Chris Hostetter

: > &sort=query({qf=worked_companies_im:61}) desc
: 
: which yields
: 
: > sort param could not be parsed as a query, and is not a field that 
: exists in the index: query({qf=worked_companies_im:61})

because it can't be parsed as a query (and is not a field)

Try a simple request like...

   http://yoururl/select?q={qf=worked_companies_im:61}

...and you should also see an error.

the "query()" function is strict about requiring nested local param 
syntax (so you can't just pass an arbitrary query string to it) but in 
your case the input just doesn't make sense.

basd on your problem description, i'm guessing what you want is something 
like...

{!dismax qf=worked_companies_im v=61}
{!lucene f=worked_companies_im v=61}
{!term f=worked_companies_im v=61}

or when used in the query function in a sort...

   sort=query({!term f=worked_companies_im v=61}) desc


-Hoss


Re: Role of zookeeper at runtime

2013-02-28 Thread varun srivastava
How can I setup cloud master-slave ? Can you point me to any sample config
or tutorial which describe the steps to get slor cloud in master-slave
setup.

As you know from my previous mails, that I dont need active solr replicas,
I just need a mechanism to copy a given solr cloud index to a new instance
of solr-cloud ( classic master-slave setup)

Eric/Mark,
  We have 10 virtual data centres. It is set up like this because we do
rolling updates: while the 1st dc is getting indexed, the other 9 serve
traffic. Indexing one dc takes 2 hours. With a single shard we used to
index one dc and then quickly replicate the index into the other dcs via a
master-slave setup. In the case of Solr cloud we obviously can't index each
dc sequentially, as it would take 2*10 hours. So we need a way of indexing
1 dc and then somehow quickly propagating the index binary to the others.
What would you recommend for Solr cloud?

Thanks
Varun

On Thu, Feb 28, 2013 at 6:12 AM, Mark Miller  wrote:

>
> On Feb 26, 2013, at 6:49 PM, varun srivastava 
> wrote:
>
> > So does it means while doing "document add" the state of cluster is
> fetched
> > from zookeeper and then depending upon hash of docid the target shard is
> > decided ?
>
> We keep the zookeeper info cached locally. We only updated it when
> ZooKeeper tells us it has changed.
>
> >
> > Assume we have 3 shards ( with no replicas) in which 1 went down while
> > indexing , so will all the documents will be routed to remaining 2 shards
> > or only 2/3 rd of the documents will be indexed ? If answer is remaining
> 2
> > shards will get all the documents , then if later 3rd shard comes up
> online
> > then will solr cloud will do rebalancing ?
>
> All of the updates that hash to the third shard will fail. That is why we
> have replicas - if you have a replica, it will take over as the leader.
>
> >
> > Is anywhere in zookeeper we store the range of docids stored in each
> shard,
> > or any other information about actual docs ?
>
> The range of hashes are stored for each shard in zk.
>
> > We have 2 datacentres (dc1 and
> > dc2) which need to be indexed with exactly same data and we update index
> > only once a day. Both dc1 and dc2 have exact same solrcloud config and
> > machines.
> >
> > Can we populate dc2 by just copying all the index binaries from
> > solr-cores/core0/data of dc1, to the machines in dc2 ( to avoid indexing
> > same documents on dc2). I guess solr replication API doesn't work in
> > solrcloud, hence loooking for work around.
> >
> > Thanks
> > Varun
> >
> > On Tue, Feb 26, 2013 at 3:34 PM, Mark Miller 
> wrote:
> >
> >> ZooKeeper
> >> /
> >> /clusterstate.json - info about the layout and state of the cluster -
> >> collections, shards, urls, etc
> >> /collections - config to use for the collection, shard leader voting zk
> >> nodes
> >> /configs - sets of config files
> >> /live_nodes - ephemeral nodes, one per Solr node
> >> /overseer - work queue for update clusterstate.json, creating new
> >> collections, etc
> >> /overseer_elect - overseer voting zk nodes
> >>
> >> - Mark
> >>
> >> On Feb 26, 2013, at 6:18 PM, varun srivastava 
> >> wrote:
> >>
> >>> Hi Mark,
> >>> One more question
> >>>
> >>> While doing solr doc update/add what information is required from
> >> zookeeper
> >>> ? Can you tell what all information is stored in zookeeper other than
> the
> >>> startup configs.
> >>>
> >>> Thanks
> >>> Varun
> >>>
> >>> On Tue, Feb 26, 2013 at 3:09 PM, Mark Miller 
> >> wrote:
> >>>
> 
>  On Feb 26, 2013, at 5:25 PM, varun srivastava  >
>  wrote:
> 
> > Hi All,
> > I have some questions regarding role of zookeeper in solrcloud
> runtime,
> > while processing the queries .
> >
> > 1) Is zookeeper cluster referred by solr shards for processing every
> > request, or its only used to copy config on startup time ?
> 
>  No, it's not used per request. Solr talks to ZooKeeper on SolrCore
> >> startup
>  - to get configs and set itself up. Then it only talks to ZooKeeper
> >> when a
>  cluster state change happens - in that case, ZooKeeper pings Solr and
> >> Solr
>  will get an update view of the cluster. That view is cached and used
> for
>  requests. In a stable state, Solr is not talking to ZooKeeper other
> than
>  the heartbeat they keep to know a node is up.
> 
> > 2) How loadbalancing is done between replicas ? Is traffic stat
> shared
> > through zookeeper ?
> 
>  Basic round robin. Traffic stats are not currently in Zk.
> 
> > 3) If for any reason zookeeper cluster goes offline for sometime,
> does
>  solr
> > cloud will not be able to server any traffic ?
> 
>  It will stop allowing updates, but continue serving searches.
> 
>  - Mark
> 
> >
> >
> > Thanks
> > Varun
> 
> 
> >>
> >>
>
>


Re: Solr3.5 Vs Solr4.1 - Help please

2013-02-28 Thread Shawn Heisey

On 2/28/2013 5:48 AM, adityab wrote:

Well i did another test. Copied the Index files from perf lab to Dev machine
which has Solr4.1
Now ran solrmeter to generate load on Dev server. We were able to drive the
QPS upto 150 with CPU on avg 35%. but the same index is generating 100% CPU
at 1 QPS in perf lab.

On a side note. has "fl" parameter to do anything with this? coz with my
test using solrmeter i used fl=*,score and q=masterId:. Where as in perf
lab they have fixed 10 field names in "fl".


I have seen some anecdotal evidence that including 'score' in fl makes 
things very slow -- people mentioning things in IRC and the mailing 
list, not sure which because they all blend together in my mind.


I don't think any of those claims have been verified, so you might try 
removing the fl parameter from the system where it is set to *,score. 
If you see a big performance jump, then there may be something to it.


Thanks,
Shawn



Re: Problems with documents that are added not showing up in index Solr 3.5

2013-02-28 Thread Shawn Heisey

On 2/28/2013 11:07 AM, dboychuck wrote:

Yes I confirmed in the logs. I have also committed manually several times
using the updatehandler /update?commit=true


To the experts: Does an empty update request with commit=true work for 
this, or would the user have to send the actual commit command?  Even if 
it does work, I'm pretty sure this URL format would have to be a POST 
request, not a GET, so it could not be done easily in a browser.


The wiki has explicit instructions for doing this using curl to send a 
POST request or a browser with a GET request:


http://wiki.apache.org/solr/UpdateXmlMessages#Updating_a_Data_Record_via_curl
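For reference, forcing a commit from SolrJ amounts to the same thing; a
minimal sketch (the URL is a placeholder for whatever core you are updating):

    import org.apache.solr.client.solrj.impl.HttpSolrServer;

    public class CommitOnly {
      public static void main(String[] args) throws Exception {
        HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr");
        // Sends an update request whose only effect is the commit; SolrJ
        // issues it as a POST to the /update handler under the covers.
        server.commit();
        server.shutdown();
      }
    }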

Thanks,
Shawn



Get page number of searchresult of a pdf in solr

2013-02-28 Thread dev

Hello,

I'm building a web application where users can search for pdf
documents and view them with pdf.js. I would like to display the
search results with a short snippet of the paragraph where the search
term was found and a link to open the document at the right page.

So what I need is the page number and a short text snippet of every
search result.

I'm using SOLR 4.1 for indexing pdf documents. The indexing itself
works fine, but I don't know how to get the page number and paragraph
of a search result. I only get the document in which the search term
was found.


-Gesh



Re: Solr cloud deployment on tomcat in prod

2013-02-28 Thread varun srivastava
Great... I will do it and send it to you all for review.

Thanks
Varun

On Thu, Feb 28, 2013 at 4:50 AM, Erick Erickson wrote:

> Anyone can edit the Wiki, contributions welcome!
>
> Best
> Erick
>
>
> On Mon, Feb 25, 2013 at 5:50 PM, varun srivastava  >wrote:
>
> > Hi,
> >  Is there any official documentation around deployment of solr cloud in
> > production on tomcat ?
> >
> > I am looking for anything as detailed as following one .. It will be good
> > if someone can take the following tutorial and get it on official
> solrcloud
> > wiki after reviewing each step.
> >
> >
> >
> http://www.myjeeva.com/2012/10/solrcloud-cluster-single-collection-deployment/
> >
> >
> > Thanks
> > Varun
> >
>


Re: Problems with documents that are added not showing up in index Solr 3.5

2013-02-28 Thread Mark Miller

On Feb 28, 2013, at 3:22 PM, Shawn Heisey  wrote:

> To the experts: Does an empty update request with commit=true work for this

It should work fine.

- Mark


Re: Get page number of searchresult of a pdf in solr

2013-02-28 Thread Michael Della Bitta
My guess is the best way to do this is to index each page separately
and to store a link to the PDF/page in each document.

That would probably require you to preprocess the PDFs to turn each
one into a single page per PDF, or to extract the text per page
another way.
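A rough sketch of that approach using PDFBox and SolrJ (the file name, field
names, core URL, and id scheme here are all made up, and PDFBox package names
vary between releases):

    import java.io.File;
    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.util.PDFTextStripper;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class PageIndexer {
      public static void main(String[] args) throws Exception {
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr");
        PDDocument pdf = PDDocument.load(new File("manual.pdf"));
        try {
          PDFTextStripper stripper = new PDFTextStripper();
          for (int page = 1; page <= pdf.getNumberOfPages(); page++) {
            stripper.setStartPage(page);  // extract exactly one page...
            stripper.setEndPage(page);    // ...so each Solr doc covers one page
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "manual.pdf#page=" + page);
            doc.addField("page", page);                     // page number for the pdf.js link
            doc.addField("content", stripper.getText(pdf)); // stored text for snippets
            solr.add(doc);
          }
          solr.commit();
        } finally {
          pdf.close();
          solr.shutdown();
        }
      }
    }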

Michael Della Bitta


Appinions
18 East 41st Street, 2nd Floor
New York, NY 10017-6271

www.appinions.com

Where Influence Isn’t a Game


On Thu, Feb 28, 2013 at 3:26 PM,   wrote:
> Hello,
>
> I'm building a web application where users can search for pdf documents and
> view them with pdf.js. I would like to display the search results with a
> short snippet of the paragraph where the search term was found and a link
> to open the document at the right page.
>
> So what I need is the page number and a short text snippet of every search
> result.
>
> I'm using SOLR 4.1 for indexing pdf documents. The indexing itself works
> fine but I don't know how to get the page number and paragraph of a search
> result. I only get the document where the search term was found in.
>
> -Gesh
>


Re: Solr3.5 Vs Solr4.1 - Help please

2013-02-28 Thread adityab
Thanks Shawn,
I did try specifying a fixed set of fields in fl, and also with no score;
neither gave any better performance.

We have a different VM with the same index and Solr4.1 on Jboss 5.1 which does
perfectly fine with all the queries, so this is confusing us a bit more.
We have our VM expert looking at it now, hopefully to find some solution.

Aditya 



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr3-5-Vs-Solr4-1-Help-please-tp4043543p4043786.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr4.1 Loggin Level Screen just shows root

2013-02-28 Thread adityab
The Logging screen seems to be broken on Solr 4.1... any ideas?



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr4-1-Loggin-Level-Screen-just-shows-root-tp4042556p4043787.html
Sent from the Solr - User mailing list archive at Nabble.com.


RE: Get page number of searchresult of a pdf in solr

2013-02-28 Thread Swati Swoboda
You can get the paragraph of the search result via highlights. You'd have to 
mark your field as stored (re-indexing required) and then specify it in the 
highlighting parameters. 

http://wiki.apache.org/solr/HighlightingParameters#hl
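For illustration, the same parameters set through SolrJ might look like this
(the "content" field name is just an example; it must be a stored field):

    import org.apache.solr.client.solrj.SolrQuery;

    public class SnippetQuery {
      // Equivalent to &hl=true&hl.fl=content&hl.snippets=1&hl.fragsize=150
      static SolrQuery build(String userInput) {
        SolrQuery q = new SolrQuery(userInput);
        q.setHighlight(true);
        q.addHighlightField("content");
        q.setHighlightSnippets(1);
        q.setHighlightFragsize(150);  // snippet length, in characters
        return q;
      }
    }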

As for getting the page number, I am not sure if there is more you can do than 
what Michael suggested...



-Original Message-
From: d...@geschan.de [mailto:d...@geschan.de] 
Sent: Thursday, February 28, 2013 3:27 PM
To: solr-user@lucene.apache.org
Subject: Get page number of searchresult of a pdf in solr

Hello,

I'm building a web application where users can search for pdf documents and 
view them with pdf.js. I would like to display the search results with a short 
snippet of the paragraph where the search term was found and a link to open 
the document at the right page.

So what I need is the page number and a short text snippet of every search 
result.

I'm using SOLR 4.1 for indexing pdf documents. The indexing itself works fine 
but I don't know how to get the page number and paragraph of a search result. I 
only get the document in which the search term was found.

-Gesh



What am I doing wrong - writing an OpenNLP Filter

2013-02-28 Thread vybe3142
Since the official OpenNLP filter is not yet in an actual release, I'm
experimenting with the OpenNLP filter implementation described in chapter 8
of the Taming Text Book http://www.manning.com/ingersoll/Sample-ch08.pdf .

The original code is at :
https://github.com/tamingtext/book/tree/master/src/main/java/com/tamingtext/texttamer/solr
, I made a few minor changes to reflect the SOLR 4.x interface changes. The
Name filter described should extract people, dates, locations etc from the
text.

Schema config:


Questions:
1. When I run a query on a term that shouldn't exist, such as the one in
https://gist.github.com/anonymous/5060539 , I actually get 2 results back!

2. Looking at the index, I see "data" field entries (data is the field I
associated with the fieldname text_opennlp) such as ne_location.
Then again, searching for data:ne_location yields far fewer hits than
data:london . I don't expect a perfect match, given that this is an NLP-type
filter, but there appears to be something wrong with the way I'm looking at
this.

3. Running a simple analysis on a multi-sentence block of text, this is what
I see (screenshot at http://i.imgur.com/5ORgnRt.png). The filter appears to
work. However, from the query perspective, how could I query the
processed data for "John as a person" (as opposed to John as an
organization)? I feel this could be better achieved by saving the
relevant information to other dedicated fields (person, organization, place,
etc.) that map to OpenNLP's capabilities. Open to ideas and suggestions
here. I'm still learning.

Thanks






--
View this message in context: 
http://lucene.472066.n3.nabble.com/What-am-I-doing-wrong-writing-an-OpenNLP-Filter-tp4043799.html
Sent from the Solr - User mailing list archive at Nabble.com.


Solr Case-sensitivity issue with search field name

2013-02-28 Thread hyrax
Hi guys,

I'm using Solr 4.0 and I recently noticed an issue that bothers me a lot,
which is that if you define a field in your schema named 'HOST', then in the
query you have to specify this field as 'HOST', while if you used 'host' it
would throw an 'undefined field' error.

I have done some googling, but I only found a jira ticket which says this
issue had been fixed:  https://issues.apache.org/jira/browse/SOLR-873
  

I know I can use <copyField> to accomplish this, but I'm wondering if there is a
way to apply this change to all the fields on the fly, not one by one...

Many many thanks in advance!
Thanks,
Hyrax



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-Case-sensitivity-issue-with-search-field-name-tp4043800.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Timestamp field is changed on update

2013-02-28 Thread Isaac Hebsh
Hoss Man suggested a wonderful solution for this need:
Always set update="add" on the field you want to keep (if it exists), and use
FirstFieldValueUpdateProcessorFactory in the update chain, after
DistributedUpdateProcessorFactory (so the AtomicUpdate will add the
existing field value first, if one exists).

This solution exactly covers my case. Thank you!
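For anyone finding this thread later, a sketch of what that chain might look
like in solrconfig.xml (the chain name and field name are illustrative):

    <updateRequestProcessorChain name="keep-first-value">
      <processor class="solr.LogUpdateProcessorFactory"/>
      <processor class="solr.DistributedUpdateProcessorFactory"/>
      <!-- Runs after the distributed processor, so the atomic "add" has
           already pulled in any existing value; keep only the first one. -->
      <processor class="solr.FirstFieldValueUpdateProcessorFactory">
        <str name="fieldName">created_at</str>
      </processor>
      <processor class="solr.RunUpdateProcessorFactory"/>
    </updateRequestProcessorChain>

Each update then sends the field with update="add", so an existing value, if
any, wins over the newly supplied one.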


On Wed, Feb 20, 2013 at 11:33 PM, Isaac Hebsh  wrote:

> Nobody responded my JIRA issue :(
> Should I commit this patch into SVN's trunk, and set the issue as Resolved?
>
>
> On Sun, Feb 17, 2013 at 9:26 PM, Isaac Hebsh wrote:
>
>> Thank you Alex.
>> Atomic Update allows you to "add" new values into multivalued field, for
>> example... It means that the original document is being read (using
>> RealTimeGet, which depends on updateLog).
>> There is no reason that the list of operations (add/set/inc) will not
>> include a "create-only" operation... I think that throwing it to the client
>> is not a good idea, and even only because the required atomicity (which is
>> handled in the DistributedUpdateProcessor using internal locks).
>>
>> There is no problem when using Atomic Update semantics on non-existent
>> document.
>>
>> Indeed, it will work on stored fields only.
>>
>>
>> On Sun, Feb 17, 2013 at 8:47 AM, Alexandre Rafalovitch <
>> arafa...@gmail.com> wrote:
>>
>>> Unless it is an Atomic Update, right. In which case Solr/Lucene will
>>> actually look at the existing document and - I assume - will preserve
>>> whatever field got already populated as long as it is stored. Should work
>>> for default values as well, right? They get populated on first creation,
>>> then that document gets partially updated.
>>>
>>> But I can't tell from the problem description whether it can be
>>> reformulated as something that fits Atomic Update. I think if the client
>>> does not know whether this is a new record or an update one, Solr will
>>> complain if Atomic Update semantics is used against non-existent
>>> document.
>>>
>>> Regards,
>>>Alex.
>>> P.s. Lots of conjecture here; I haven't tested exactly this use-case.
>>>
>>> Personal blog: http://blog.outerthoughts.com/
>>> LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
>>> - Time is the quality of nature that keeps events from happening all at
>>> once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD book)
>>>
>>>
>>> On Sun, Feb 17, 2013 at 12:40 AM, Walter Underwood <
>>> wun...@wunderwood.org>
>>> wrote:
>>> >
>>> > It is natural part of the update model for Solr (and for many other
>>> search engines). Solr does not do updates. It does add, replace, and
>>> delete.
>>> >
>>> > Every document is processed as if it was new. If there is already a
>>> document with that id, then the new document replaces it. The existing
>>> documents are not read during indexing. This allows indexing to be much
>>> faster than in a relational database.
>>> >
>>> > wunder
>>>
>>
>>
>


Re: Role of zookeeper at runtime

2013-02-28 Thread varun srivastava
Any thoughts on this?

We have 10 virtual data centres. It is set up like this because we do
rolling updates. While the 1st dc is getting indexed, the other 9 serve traffic.
Indexing one dc takes 2 hours. With a single shard we used to index one dc
and then quickly replicate the index into the other dcs by having a master-slave
setup. With solr cloud we obviously can't index each dc
sequentially, as it would take 2*10 hours. So we need a way of indexing 1 dc
and then somehow quickly propagating the index binary to the others. What would
you recommend for solr cloud?

Thanks
Varun

On Thu, Feb 28, 2013 at 11:33 AM, varun srivastava
wrote:

> How can I set up cloud master-slave? Can you point me to any sample config
> or tutorial which describes the steps to get solr cloud in a master-slave
> setup?
>
> As you know from my previous mails, I don't need active solr replicas,
> I just need a mechanism to copy a given solr cloud index to a new instance
> of solr-cloud (classic master-slave setup).
>
> Eric/ Mark,
>   We have 10 virtual data centres. It is set up like this because we do
> rolling updates. While the 1st dc is getting indexed, the other 9 serve traffic.
> Indexing one dc takes 2 hours. With a single shard we used to index one dc
> and then quickly replicate the index into the other dcs by having a master-slave
> setup. With solr cloud we obviously can't index each dc
> sequentially, as it would take 2*10 hours. So we need a way of indexing 1 dc
> and then somehow quickly propagating the index binary to the others. What would
> you recommend for solr cloud?
>
> Thanks
> Varun
>
>
> On Thu, Feb 28, 2013 at 6:12 AM, Mark Miller wrote:
>
>>
>> On Feb 26, 2013, at 6:49 PM, varun srivastava 
>> wrote:
>>
>> > So does it means while doing "document add" the state of cluster is
>> fetched
>> > from zookeeper and then depending upon hash of docid the target shard is
>> > decided ?
>>
>> We keep the zookeeper info cached locally. We only updated it when
>> ZooKeeper tells us it has changed.
>>
>> >
>> > Assume we have 3 shards ( with no replicas) in which 1 went down while
>> > indexing , so will all the documents will be routed to remaining 2
>> shards
>> > or only 2/3 rd of the documents will be indexed ? If answer is
>> remaining 2
>> > shards will get all the documents , then if later 3rd shard comes up
>> online
>> > then will solr cloud will do rebalancing ?
>>
>> All of the updates that hash to the third shard will fail. That is why we
>> have replicas - if you have a replica, it will take over as the leader.
>>
>> >
>> > Is anywhere in zookeeper we store the range of docids stored in each
>> shard,
>> > or any other information about actual docs ?
>>
>> The range of hashes are stored for each shard in zk.
>>
>> > We have 2 datacentres (dc1 and
>> > dc2) which need to be indexed with exactly same data and we update index
>> > only once a day. Both dc1 and dc2 have exact same solrcloud config and
>> > machines.
>> >
>> > Can we populate dc2 by just copying all the index binaries from
>> > solr-cores/core0/data of dc1, to the machines in dc2 ( to avoid indexing
>> > same documents on dc2). I guess solr replication API doesn't work in
>> > solrcloud, hence loooking for work around.
>> >
>> > Thanks
>> > Varun
>> >
>> > On Tue, Feb 26, 2013 at 3:34 PM, Mark Miller 
>> wrote:
>> >
>> >> ZooKeeper
>> >> /
>> >> /clusterstate.json - info about the layout and state of the cluster -
>> >> collections, shards, urls, etc
>> >> /collections - config to use for the collection, shard leader voting zk
>> >> nodes
>> >> /configs - sets of config files
>> >> /live_nodes - ephemeral nodes, one per Solr node
>> >> /overseer - work queue for update clusterstate.json, creating new
>> >> collections, etc
>> >> /overseer_elect - overseer voting zk nodes
>> >>
>> >> - Mark
>> >>
>> >> On Feb 26, 2013, at 6:18 PM, varun srivastava 
>> >> wrote:
>> >>
>> >>> Hi Mark,
>> >>> One more question
>> >>>
>> >>> While doing solr doc update/add what information is required from
>> >> zookeeper
>> >>> ? Can you tell what all information is stored in zookeeper other than
>> the
>> >>> startup configs.
>> >>>
>> >>> Thanks
>> >>> Varun
>> >>>
>> >>> On Tue, Feb 26, 2013 at 3:09 PM, Mark Miller 
>> >> wrote:
>> >>>
>> 
>>  On Feb 26, 2013, at 5:25 PM, varun srivastava <
>> varunmail...@gmail.com>
>>  wrote:
>> 
>> > Hi All,
>> > I have some questions regarding role of zookeeper in solrcloud
>> runtime,
>> > while processing the queries .
>> >
>> > 1) Is zookeeper cluster referred by solr shards for processing every
>> > request, or its only used to copy config on startup time ?
>> 
>>  No, it's not used per request. Solr talks to ZooKeeper on SolrCore
>> >> startup
>>  - to get configs and set itself up. Then it only talks to ZooKeeper
>> >> when a
>>  cluster state change happens - in that case, ZooKeeper pings Solr and
>> >> Solr
>>  will get an update view of the cluster. That view is cached and used
>> for

Re: Solr Case-sensitivity issue with search field name

2013-02-28 Thread Shawn Heisey

On 2/28/2013 3:40 PM, hyrax wrote:

I'm using Solr 4.0 and I recently notice an issue that bothers me a lot
which is that if you define a field in your schema named 'HOST' then in the
query you have to specify this field by 'HOST' while if you used 'host' it
would throw an 'undefined field' error.

I have done some googling while I only found a jira ticket which says this
issue had been fixed:  https://issues.apache.org/jira/browse/SOLR-873


I know I can use  to accomplish this but I'm wonder if there a
way to apply this change all the field on the fly not one by one ...


It appears that the issue you have linked is specific to the dataimport 
handler (importing from a database or another structured data source), 
not searching.  I've always read that fields in a Solr schema are case 
sensitive.


My own recommendation is that you pick a standard, either all uppercase 
or all lowercase, and that you stick with it.  I prefer all lowercase 
myself.


Thanks,
Shawn



Re: Role of zookeeper at runtime

2013-02-28 Thread Shawn Heisey

On 2/28/2013 4:20 PM, varun srivastava wrote:

We have 10 virtual data centres. It is set up like this because we do
rolling updates. While the 1st dc is getting indexed, the other 9 serve traffic.
Indexing one dc takes 2 hours. With a single shard we used to index one dc
and then quickly replicate the index into the other dcs by having a master-slave
setup. With solr cloud we obviously can't index each dc
sequentially, as it would take 2*10 hours. So we need a way of indexing 1 dc
and then somehow quickly propagating the index binary to the others. What would
you recommend for solr cloud?


This is my understanding of how SolrCloud works.  If I am wrong about 
any of this, I'm sure one of the experts will correct me.  I'm still 
learning SolrCloud, so this is an opportunity for me to find out if I 
understand it right:


SolrCloud is not master-slave.  One replica of each shard is designated 
leader.  I think you can influence which one becomes leader, but I don't 
know how to do this.


When you index, the receiving node forwards the request to the leader of 
the correct shard.  The leader then processes the update request locally 
and sends it to all replicas of that shard, so they all index the same 
data independently.


If a node goes down, the remaining replicas handle requests and continue 
to process any updates that come in.  When the down node comes back up, 
the leader will see if it can use its transaction log to sync up the 
recovered node.  If it can, it will do so.  If it can't, it tells the 
recovered node to replicate its index, so you must have the replication 
handler enabled on all SolrCloud nodes, even though it does not use 
traditional master/slave roles.


If the leader goes down, the remaining replicas elect a new leader.

If you want to continue using master/slave semantics, I don't think you 
can use SolrCloud.  SolrCloud will result in a lot of inter-DC traffic 
at all times, which you probably want to avoid.


Thanks,
Shawn



Re: Solr 4.1 Solr Cloud Shard Structure

2013-02-28 Thread Mark Miller
You will pay some in performance, but it's certainly not bad practice. It's a 
good choice for setting up so that you can scale later. You just have to do 
some testing to make sure it fits your requirements. The Collections API even 
has built-in support for this - you can specify more shards than nodes and it 
will overload a node. See the documentation. Later you can start up a new 
replica on another machine and kill/remove the original.

- Mark

On Feb 28, 2013, at 7:10 PM, Chris Simpson  wrote:

> Dear Lucene / Solr Community-
> 
> I recently posted this question on Stackoverflow, but it doesn't seem to be 
> going too far. Then I found this mailing list and was hoping perhaps to have 
> more luck:
> 
> Question-
> 
> If I plan on holding 7TB of data in a Solr Cloud, is it bad practice to begin 
> with 1 server holding 100 shards and then begin populating the collection, 
> so that once the size grows, each shard can ultimately be peeled off into its 
> own dedicated server (holding ~70GB ea with its own dedicated resources and 
> replicas)?
> 
> That is, I would start the collection with 100 shards locally, then as data 
> grew, I could peel off one shard at a time and give it its own server -- 
> dedicated w/plenty of resources.
> 
> Is this okay to do -- or would I somehow incur a massive bottleneck 
> internally by putting that many shards in 1 server to start with while data 
> was low?
> 
> Thank you.
> Chris
> 
> 



Re: Problems with documents that are added not showing up in index Solr 3.5

2013-02-28 Thread Chris Hostetter

: but at a certain add
: all documents after that add will not exist in the index
: what settings could affect this behavior?
: I just need somewhere to start looking
: could it be the merge policy?

Is anything else logged around the time of this special "add"?

What are the numDocs & maxDoc values from /admin/mbeans?stats=true&key=searcher 
?

What is the numFound from a search for *:* ?

What does the fieldType for your id field look like?

What is a specific example of a query url for a missing doc, and what does 
it return?  Can you find & paste the corresponding "add" log message for 
this doc?

https://wiki.apache.org/solr/UsingMailingLists#Information_useful_for_indexing_problems
https://wiki.apache.org/solr/UsingMailingLists#Information_useful_for_searching_problems





-Hoss


Re: Role of zookeeper at runtime

2013-02-28 Thread Mark Miller

On Feb 28, 2013, at 6:20 PM, varun srivastava  wrote:

> So we need way of indexing 1 dc
> and then somehow quickly propagate the index binary to others.

You can replicate from a SolrCloud node still. Just hit its replication 
handler and pass in the master url to replicate to. It doesn't have any 
guarantees in terms of data loss, e.g. it's not part of SolrCloud per se, but 
it's a fast way to move an index.
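For example (host and core names here are placeholders), something like:

    http://dc2-node:8983/solr/core0/replication?command=fetchindex&masterUrl=http://dc1-node:8983/solr/core0/replication

would pull the dc1 node's index onto the dc2 node in one shot.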

- Mark



Re: Problems with Solr 3.6 and Magento

2013-02-28 Thread Chris Hostetter

: I noticed that Magento is using the overwritePending commit directive but I
: can't find any documentation on this. Does the overwritePending directive
: purge any added docs since the last commit? Any help would be appreciated.

overwritePending has never been a commit option.

it was an "add" option in Solr 3.x, but has long been deprecated 
in favor of the singular "overwrite" option...

https://wiki.apache.org/solr/UpdateXmlMessages#Optional_attributes_for_.22add.22

...it's just a way of letting clients tell solr that solr doesn't need to 
do the normal checks to overwrite existing documents with the same key -- 
because the client is taking responsibility for guaranteeing that a 
document with the same key doesn't already exist.


-Hoss


Re: Solr Case-sensitivity issue with search field name

2013-02-28 Thread Walter Underwood
Lower case is safer than upper case. For Unicode, uppercasing is a lossy 
conversion. There are sets of different lower case characters that convert to 
the same upper case character (for example, the German sharp s 'ß' and the 
sequence 'ss' both uppercase to 'SS'). When you convert back to lower case, 
you don't know which one it was originally.

Always use lower case for text. That avoids some really subtle bugs.

wunder

On Feb 28, 2013, at 3:47 PM, Shawn Heisey wrote:

> On 2/28/2013 3:40 PM, hyrax wrote:
>> I'm using Solr 4.0 and I recently notice an issue that bothers me a lot
>> which is that if you define a field in your schema named 'HOST' then in the
>> query you have to specify this field by 'HOST' while if you used 'host' it
>> would throw an 'undefined field' error.
>> 
>> I have done some googling while I only found a jira ticket which says this
>> issue had been fixed:  https://issues.apache.org/jira/browse/SOLR-873
>> 
>> 
>> I know I can use  to accomplish this but I'm wonder if there a
>> way to apply this change all the field on the fly not one by one ...
> 
> It appears that the issue you have linked is specific to the dataimport 
> handler (importing from a database or another structured data source), not 
> searching.  I've always read that fields in a Solr schema are case sensitive.
> 
> My own recommendation is that you pick a standard, either all uppercase or 
> all lowercase, and that you stick with it.  I prefer all lowercase myself.
> 
> Thanks,
> Shawn
> 






Re: Solr 4.1 Solr Cloud Shard Structure

2013-02-28 Thread Walter Underwood
100 shards on a node will almost certainly be slow, but at least it would be 
scalable. 7TB of data on one node is going to be slow regardless of how you 
shard it.

I might choose a number with more useful divisors than 100, perhaps 96 or 144.

wunder

On Feb 28, 2013, at 4:25 PM, Mark Miller wrote:

> You will pay some in performance, but it's certainly not bad practice. It's a 
> good choice for setting up so that you can scale later. You just have to do 
> some testing to make sure it fits your requirments. The Collections API even 
> has built in support for this - you can specify more shards than nodes and it 
> will overload a node. See the documentation. Later you can start up a new 
> replica on another machine and kill/remove the original.
> 
> - Mark
> 
> On Feb 28, 2013, at 7:10 PM, Chris Simpson  
> wrote:
> 
>> Dear Lucene / Solr Community-
>> 
>> I recently posted this question on Stackoverflow, but it doesnt seem to be 
>> going too far. Then I found this mailing list and was hoping perhaps to have 
>> more luck:
>> 
>> Question-
>> 
>> If I plan on holding 7TB of data in a Solr Cloud, is it bad practice to 
>> begin with 1 server holding 100 shards and then begin populating the 
>> collection, so that once the size grows, each shard can ultimately be peeled 
>> off into its own dedicated server (holding ~70GB ea with its own dedicated 
>> resources and replicas)?
>> 
>> That is, I would start the collection with 100 shards locally, then as data 
>> grew, I could peel off one shard at a time and give it its own server -- 
>> dedicated w/plenty of resources.
>> 
>> Is this okay to do -- or would I somehow incur a massive bottleneck 
>> internally by putting that many shards in 1 server to start with while data 
>> was low?
>> 
>> Thank you.
>> Chris
>> 






Re: can we configure spellcheck to be invoked after request processing?

2013-02-28 Thread Jack Krupansky
You can execute search components in any order you want, but their execution 
will be unconditional, so you would be best off to invoke spellcheck in your 
application layer. In other words, a second call to Solr, but only if no 
results were found.
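A sketch of that application-layer fallback in SolrJ (the /spell handler name
and the URL are assumptions; any handler with the spellcheck component wired
in would do):

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class SpellcheckFallback {
      public static void main(String[] args) throws Exception {
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr");
        SolrQuery q = new SolrQuery("jean denim");
        QueryResponse rsp = solr.query(q);
        if (rsp.getResults().getNumFound() == 0) {
          // Second call, made only because the first one found nothing.
          q.setRequestHandler("/spell");
          q.set("spellcheck", true);
          rsp = solr.query(q);
          System.out.println(rsp.getSpellCheckResponse().getSuggestions());
        }
        solr.shutdown();
      }
    }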


-- Jack Krupansky

-Original Message- 
From: roz dev

Sent: Thursday, February 28, 2013 7:33 PM
To: solr-user@lucene.apache.org
Subject: can we configure spellcheck to be invoked after request processing?

Hi All,
I may be asking a stupid question but please bear with me.

Is it possible to configure Spell check to be invoked after Solr has
processed the original query?

My use case is :

I am using DirectSpellChecker and have a document which has "Denim" as a
term and there is another document which has "Jeap".

I am issuing a Search as "Jean" or "Denim"

I am finding that this Solr query is giving me ZERO results and suggesting
"Jeap" as an alternative.

I want Solr to try to run the query for "Jean" or "Denim", and only if there
are no results found, then suggest "Jeap" as an alternative

Is this doable in Solr?

Any suggestions.

-Saroj 



Re: Solr 4.1 Solr Cloud Shard Structure

2013-02-28 Thread Mark Miller

On Feb 28, 2013, at 7:55 PM, Walter Underwood  wrote:

> 100 shards on a node will almost certainly be slow

I think it depends on some things - with one of the largest of those things 
being your hardware. Many have found that you can get much better performance 
out of super-concurrent, beefy hardware by using more cores on a single node. So 
there is some give and take that it's tough to jump to conclusions about. 
Slower at 100? I would assume yes. Slow? It depends.

One thing that will happen is that you will require a lot more threads…

You would want some pretty beefy hardware.

But you don't have to do 100 either. That should just be a rough starting 
number. At some point you have to reindex into a new cluster if you keep 
growing. Or consider shard splitting if it's feasible (and becomes available). 
You can only over shard so much.

So perhaps you do 50 or whatever. It will be faster than you think, I imagine. 
My main concern is the number of threads - you might want to lower the JVM's 
-Xss stack-size setting to minimize their RAM usage at least.

- Mark

Re: Custom filter for document permissions

2013-02-28 Thread Chris Hostetter

: Actually, after thinking for a bit, it makes sense to apply the post
: filter everywhere, otherwise I wouldn't be able to know the number of
: results overall (something I unfortunately really need).

Not to mention things like facet counts, which need access to the full set 
of matching documents.

Situations like this are why using a high "cost" value are really 
important on "expensive" filters -- you want to ensure that all of the 
"cheap" filtering that can be done on your request, is done before your 
PostFilter ever gets executed, to minimize the amount of external logic 
you have to apply.

In a synthetic test like this, matching all documents w/o any other 
filtering, that "cost" is irrelevant, but it's important to remember when 
you start creating your real world requests.
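As an illustration (the "acl" parser name is made up), a request such as

    q=laptop&fq=type:product&fq={!acl cache=false cost=200}user=jdoe

lets the cheap type:product filter run first; with cache=false and a cost of
100 or more, a filter that implements PostFilter is only consulted for
documents that have already matched everything else.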


-Hoss


RE: Repartition solr cloud

2013-02-28 Thread Vaillancourt, Tim
Sort of off-topic, is there a way to do the reverse, ie: split indexes?

This could be useful for people that would like to move to sharding from one 
core and could be interesting under SolrCloud.

Cheers,

Tim

-Original Message-
From: Mark Miller [mailto:markrmil...@gmail.com] 
Sent: Friday, February 22, 2013 5:30 PM
To: solr-user@lucene.apache.org
Subject: Re: Repartition solr cloud

You could copy each shard to a single node and then use the merge index feature 
to merge them into one index and then start up a single Solr node on that. Use 
the same configs.

- Mark

On Feb 22, 2013, at 8:11 PM, Erol Akarsu  wrote:

> I have a solr cloud 7 nodes, each has 2 shards.
> Now, I would like to build another solr server with only one core  (no 
> shards or replica) from whole cloud data.
> What is fastest and safest way to achieve this, making only one SOLR 
> from solr cloud?
> 
> I appreciate your answer
> 
> Erol Akarsu




Re: update fails if one doc is wrong

2013-02-28 Thread Erick Erickson
This has been hanging around for a long time. I did some preliminary work
here: https://issues.apache.org/jira/browse/SOLR-445 but moved on to other
things before committing it. The discussion there might be useful.

FWIW,
Erick


On Wed, Feb 27, 2013 at 5:32 AM, Mikhail Khludnev <
mkhlud...@griddynamics.com> wrote:

> Colleagues,
>
> Here are my considerations
>
> If the exception occurs somewhere in an update processor, we can add a
> special update processor at the head of the update processor chain,
> which will catch the exception from the delegated processAdd call, log and/or
> swallow it.
> If it fits the purpose, we can try to figure out how to return the failed
> doc ids back to the client. I'm not sure, but I think it's possible, just
> because the response writer is quite -dumb- flexible, i.e. if an update processor
> drops something into the response, it should be blindly streamed back to the
> client.
>
> One more consideration.
> Anirudha,
> When you say "re-try them" do you mean to post a failed doc one more time?
> It seems I didn't get your point. Please clarify.
>  On 27.02.2013 at 1:13, "Anirudha Jadhav" 
> wrote:
>
> > Ideally you would want to use SOLRJ or other interface which can catch
> > exceptions/error and re-try them.
> >
> >
> > On Tue, Feb 26, 2013 at 3:45 PM, Walter Underwood  > >wrote:
> >
> > > I've done exactly the same thing. On error, set the batch size to one
> and
> > > try again.
> > >
> > > wunder
> > >
> > > On Feb 26, 2013, at 12:27 PM, Timothy Potter wrote:
> > >
> > > > Here's what I do to work-around failures when processing batches of
> > > updates:
> > > >
> > > > On client side, catch the exception that the batch failed. In the
> > > > exception handler, switch to one-by-one mode for the failed batch
> > > > only.
> > > >
> > > > This allows you to isolate the *bad* documents as well as getting the
> > > > *good* documents in the batch indexed in Solr.
> > > >
> > > > This assumes most batches work so you only pay the one-by-one penalty
> > > > for the occasional batch with a bad doc.
> > > >
> > > > Tim
> > > >
> > > > On Tue, Feb 26, 2013 at 12:08 PM, Isaac Hebsh  >
> > > wrote:
> > > >> Hi.
> > > >>
> > > >> I add documents to Solr by POSTing them to UpdateHandler, as bulks
> of
> > > 
> > > >> commands (DIH is not used).
> > > >>
> > > >> If one document contains any invalid data (e.g. string data into
> > numeric
> > > >> field), Solr returns HTTP 400 Bad Request, and the whole bulk is
> > failed.
> > > >>
> > > >> I'm searching for a way to tell Solr to accept the rest of the
> > > documents...
> > > >> (I'll use RealTimeGet to determine which documents were added).
> > > >>
> > > >> If there is no standard way for doing it, maybe it can be
> implemented
> > by
> > > >> spiltting the  commands into seperate HTTP POSTs. Because of
> > using
> > > >> auto-soft-commit, can I say that it is almost equivalent? What is
> the
> > > >> performance penalty of 100 POST requests (of 1 document each)
> againt 1
> > > >> request of 100 docs, if a soft commit is eventually done.
> > > >>
> > > >> Thanks in advance...
> > >
> > > --
> > > Walter Underwood
> > > wun...@wunderwood.org
> > >
> > >
> > >
> > >
> >
> >
> > --
> > Anirudha P. Jadhav
> >
>


Re: Poll: SolrCloud vs. Master-Slave usage

2013-02-28 Thread Amit Nithian
I don't know a ton about SolrCloud, but for our setup my limited
understanding of it is that you start to bleed operational and
non-operational aspects together, which I am not comfortable doing (i.e.
software load balancing). Also, adding ZooKeeper to the mix is yet another
thing to install, set up, monitor, maintain, etc., which doesn't add any value
above and beyond what we have set up already.

For example, we have a hardware load balancer that can do the actual load
balancing of requests among the slaves and taking slaves in and out of
rotation either on demand or if it's down. We've placed a virtual IP on top
of our multiple masters so that we have redundancy there. While we have
multiple cores, the data volume is small enough to fit on one node, so we
aren't at the data volume necessary for sharding our indices. I suspect
that if we had a sufficiently large dataset that couldn't fit on one box
SolrCloud is perfect but when you can fit on one box, why add more
complexity?

Please correct me if I'm wrong for I'd like to better understand this!




On Thu, Feb 28, 2013 at 12:53 AM, rulinma  wrote:

> I am doing research on SolrCloud.
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Poll-SolrCloud-vs-Master-Slave-usage-tp4042931p4043582.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: MultiValued Search using Filter Query

2013-02-28 Thread Erick Erickson
Well, first are you using that syntax? Because it's not correct.
q=sym:TEST
fq=initials:MKN

[] is the "range" operator.

Second, you are using a "string" type for initials, which isn't analyzed in
any way so you'd have to search on exactly MKN JRT. MKN wouldn't match. JRT
wouldn't match. MKN  JRT (two spaces between) wouldn't match.

The admin/analysis page shows you exactly how data sent to fields is
parsed, it's invaluable. Also, adding &debug=all to the query will show you
how the query gets parsed, which often gives you a clue, although it does
take some practice to know which parts of the output are relevant.

Best
Erick


On Wed, Feb 27, 2013 at 10:12 PM, Deepak  wrote:

> Hi
>
> How can I filter records using filter query on multiValued field.  Here are
> two records
>
>   {
> "sub_count":8,
> "long_name":"Mike",
> "first_name":"John",
> "id":45949,
> "sym":"TEST",
> "type":"T",
> "last_name":"Account",
> "person_id":"3613",
> "short_name":"Molly",
> "initials":["ABC XYZ"],
> "timestamp":"2013-02-28T02:44:02.235Z"
> }
>
>   {
> "sub_count":8,
> "long_name":"Mike",
> "first_name":"John",
> "id":45949,
> "sym":"TEST",
> "type":"T",
> "last_name":"Account",
> "person_id":"3613",
> "short_name":"Molly",
> "initials":["MKN JRT"],
> "timestamp":"2013-02-28T02:44:02.235Z"
> }
>
> <field name="initials" type="string" indexed="true" stored="true" multiValued="true" />
>
> 
>
>
> Note. All values are same except initials field value.
> My search criteria
> q:sym:TEST
> fq:initials[MKN]
>
> This doesn't return the second record.  What am I missing here?
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/MultiValued-Search-using-Filter-Query-tp4043544.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: A few operations questions about the tlog (UpdateLog)

2013-02-28 Thread Erick Erickson
A new tlog gets created and the current one is closed on a hard commit
(openSearcher=true or false doesn't matter). The old one will be kept
around for a bit, I suspect if you'd done it a third time the first one
would go away. For all I know, the code might read if (tlog docs > 100) and
you may be hitting an off-by-one situation.

But you're right. If you follow older practices and _never_ commit until
the end of a really long indexing process, you'll see these logs grow huge.
Even worse, anytime they're replayed they'll then re-index a lot of
documents.

So I'd try the experiment over again with 110 docs. But I can say
they're getting purged on my machine quite regularly with some stress
testing I'm doing.

Best,
Erick


On Thu, Feb 28, 2013 at 7:59 AM, adityab  wrote:

> whats the life cycle of a tlog file. Is it purged after commit (even with
> soft commit) ?
> I posted 100 docs to solr (standalone) did hard commit. Observed a new tlog
> file is created.
> re-posted the same 100 docs and did hard commit. Observed a new tlog file
> is
> created. Old one still exists.
>
> When do they get purged. Concern is we have at least 20K docs published
> every 2hrs so need to understand if its safe to put them in a different
> location where we can have a script to purge old files at regular interval.
>
> thanks
>
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/A-few-operations-questions-about-the-tlog-UpdateLog-tp4042560p4043618.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: Solr3.5 Vs Solr4.1 - Help please

2013-02-28 Thread Erick Erickson
First, did the index you moved get built with 3.x? In that case the fields
won't be compressed. And things will be slower due to 4.1 having to go through
the back-compat layer. If you did do this, try optimizing since that'll
re-write the index in 4.1 format (I'm pretty sure at least).

Second, fl _does_ matter, especially if you have lazy field loading
enabled. Basically you'll have to load/uncompress all the fields mentioned
in the fl list.

Best
Erick


On Thu, Feb 28, 2013 at 4:53 PM, adityab  wrote:

> thanks Shawn,
> I did try with specifying a fixed set of fl and with no score none gave any
> better performance.
>
> We have a different VM with same index and Solr4.1 on Jboss 5.1 which does
> perfectly fine with all the queries. So this is confusing us a bit more.
> Have our VM expert to look now hopefully to find some solution.
>
> Aditya
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Solr3-5-Vs-Solr4-1-Help-please-tp4043543p4043786.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: solrCloud insert data steps

2013-02-28 Thread Erick Erickson
bq: I want to improve performance and want to become a committer if possible

Super! In the words of the great wise one (Yonik), you become a committer
by acting like a committer. There are lots of options:

1> pick some JIRAs that are interesting to you. Ping the dev list to see if
there's active interest in them. Provide patches. Essentially a really good
way is to pick an area of Solr in which you want to become an expert and
"own" that. By that I mean consider it your responsibility to help
maintain/enhance/fix that code, in cooperation with others. A great model
is what David Smiley has done with Lucene Spatial.

2> Document, document, document. There's lots of room for improvement in
both the Wiki and Javadocs. This is an awesome way to become familiar with
an area of the code, and a great help to others.

3> participate in the user's list, helping others out.

4> look at the test coverage on the nightly builds, it'll point out parts
of the code that don't have coverage yet. Build some unit tests to exercise
those bits of code and make patches.

5> There are lots of other ways

Be prepared to dig deeply into some code written by some very good
programmers. You _will not_ be able to understand the entire code-base at
once, trust me on that ...

Best
Erick


On Thu, Feb 28, 2013 at 8:01 PM, rulinma  wrote:

> I want to improve performance and want to become a committer if possible.
>
> THX!
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/SolrCloud-insert-data-steps-tp4043315p4043827.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: Sort by currency field using OpenExchangeRatesOrgProvider

2013-02-28 Thread Chris Hostetter

: I have as part of my schema one currency field.
...
: IT works properly when i filter or do a query using an specific currency.
...
: But when I try to sort by this field it returns the results by amount and it
: doesn't take in consideration the currency.
...
: Do you know if there is a bug around this?

Hmmm, do you only see this problem with the OpenExchangeRatesOrgProvider? 
Have you tried reproducing it with the FileExchangeRateProvider?

The provider used shouldn't have any effect on whether the sorting code 
works, but there also don't seem to be any tests actually exercising the 
OpenExchangeRatesOrgProvider, so it's possible that there might be some 
bugs (I created SOLR-4514 for that), but doing some ad hoc testing I 
couldn't find any.

Here's what i tried...


Using the 4x checkout, I added the following to the example schema...

  <fieldType name="oer_currency" class="solr.CurrencyField"
             providerClass="solr.OpenExchangeRatesOrgProvider"
             ratesFileLocation="open-exchange-rates.json"/>
...
  <field name="price_oer_c" type="oer_currency" indexed="true" stored="true"/>

...and then indexed three simple docs (see below), and queried them with 
asc and desc orders on the currency field -- based on the ordering 
returned in each case, it definitely appears to be paying attention to the 
currency conversion...

http://localhost:8983/solr/select?q=*:*&sort=price_oer_c+desc&wt=json&indent=true&omitHeader=true

{
  "response":{"numFound":3,"start":0,"docs":[
  {
"id":"euro",
"price_oer_c":"0.9,EUR",
"_version_":1428268201637052416},
  {
"id":"doller",
"price_oer_c":"1,USD",
"_version_":1428268009579872256},
  {
"id":"yen",
"price_oer_c":"1.5,JPY",
"_version_":1428268193079623680}]
  }}

http://localhost:8983/solr/select?q=*:*&sort=price_oer_c+asc&wt=json&indent=true&omitHeader=true

{
  "response":{"numFound":3,"start":0,"docs":[
  {
"id":"yen",
"price_oer_c":"1.5,JPY",
"_version_":1428268193079623680},
  {
"id":"doller",
"price_oer_c":"1,USD",
"_version_":1428268009579872256},
  {
"id":"euro",
"price_oer_c":"0.9,EUR",
"_version_":1428268201637052416}]
  }}


...can you give us more details about the specifics of what data you are 
using and what results you are seeing when you sort on it?






-Hoss


Re: Role of zookeeper at runtime

2013-02-28 Thread varun srivastava
"You can replicate from a SolrCloud node still. Just hit it's replication
handler and pass in the master url to replicate to"

How will this work? Let's say s1dc1 is the master of s1dc2, and s2dc1 is the master
for s2dc2... so after hitting replicate, the index binary will get copied, but
then how will the appropriate entries be made in zookeeper? Zookeeper needs to
know which doc id range resides in which shard.


Thanks
Varun

On Thu, Feb 28, 2013 at 4:27 PM, Mark Miller  wrote:

>
> On Feb 28, 2013, at 6:20 PM, varun srivastava 
> wrote:
>
> > So we need way of indexing 1 dc
> > and then somehow quickly propagate the index binary to others.
>
> You can replicate from a SolrCloud node still. Just hit it's replication
> handler and pass in the master url to replicate to. It doesn't have any
> guarantees in terms of data loss, eg it's not part of SolrCloud per say,
> but it's a fast way to move an index.
>
> - Mark
>
>


Re: Custom filter for document permissions

2013-02-28 Thread Erick Erickson
You might get some mileage out of encoding what you can in the documents
and doing a standard fq clause on that part, and then have your post-filter
do the really wild stuff. But you're right, you have to be prepared for the
nightmare scenario of your sysadmin who has rights to see everything firing
off a *:* query...

Other options:

1> fail after a certain number of docs have been evaluated (or perhaps
passed your filter). Return some message about "query too expensive, refine
it please". "Fail" here means that you deny rights to all documents after
some number N is passed.

2> think hard about your permissions model. Perhaps you can encode more of
it into your documents than you think.

3> External File Fields. Maybe you could encode some of the permissions in
an EFF and use that when calculating permissions. Consider your "day of
week" question. While I know nothing of what that means, let's assume it
means that some documents are available only on a particular day of the
week. At midnight on Monday you calculate the EFF field for a field
day_of_week_available for Tuesday and re-load your searcher. Your filter
uses the EFF field to calculate permissions. You might be able to extend
this idea to make the problem tractable.

4> Think really hard about whether your permissions model is really useful
to enough users to be worth the headache. Often you have no choice, but if
you could say "if we don't support feature X, we can give you query speed
Y, is it worth it?".

Best,
Erick


On Thu, Feb 28, 2013 at 8:10 PM, Chris Hostetter
wrote:

>
> : Actually, after thinking for a bit, it makes sense to apply the post
> : filter everywhere, otherwise I wouldn't be able to know the number of
> : results overall (something I unfortunately really need).
>
> Not to mention things like facet counts, which need access to the full set
> of matching documents.
>
> Situations like this are why using a high "cost" value are really
> important on "expensive" filters -- you want to ensure that all of the
> "cheap" filtering that can be done on your request, is done before your
> PostFilter ever gets executed, to minimize the amount of external logic
> you have to apply.
>
> In a synthetic test like this, matching all documents w/o any other
> filtering, that "cost" is irrelevant, but it's important to remember when
> you start creating your real world requests.
>
>
> -Hoss
>


Re: Repartition solr cloud

2013-02-28 Thread Erick Erickson
In the works, high priority:
https://issues.apache.org/jira/browse/SOLR-3755

Best
Erick


On Thu, Feb 28, 2013 at 8:13 PM, Vaillancourt, Tim wrote:

> Sort of off-topic, is there a way to do the reverse, ie: split indexes?
>
> This could be useful for people that would like to move to sharding from
> one core and could be interesting under SolrCloud.
>
> Cheers,
>
> Tim
>
> -Original Message-
> From: Mark Miller [mailto:markrmil...@gmail.com]
> Sent: Friday, February 22, 2013 5:30 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Repartition solr cloud
>
> You could copy each shard to a single node and then use the merge index
> feature to merge them into one index and then start up a single Solr node
> on that. Use the same configs.
>
> - Mark
>
> On Feb 22, 2013, at 8:11 PM, Erol Akarsu  wrote:
>
> > I have a solr cloud 7 nodes, each has 2 shards.
> > Now, I would like to build another solr server with only one core  (no
> > shards or replica) from whole cloud data.
> > What is fastest and safest way to achieve this, making only one SOLR
> > from solr cloud?
> >
> > I appreciate your answer
> >
> > Erol Akarsu
>
>
>


Re: Poll: SolrCloud vs. Master-Slave usage

2013-02-28 Thread Erick Erickson
Amit:

It's a balancing act. If I was starting fresh, even with one shard, I'd
probably use SolrCloud rather than deal with the issues around the "how do
I recover if my master goes down" question. Additionally, SolrCloud allows
one to monitor the health of the entire system by monitoring the state
information kept in Zookeeper rather than build a monitoring system that
understands the changing topology of your network.

And if you need NRT, you just can't get it with traditional M/S setups.

In a mature production system where all the operational issues are figured
out and you don't need NRT, it's easier just to plop 4.x in traditional M/S
setups and not go to SolrCloud. And you're right, you have to understand
Zookeeper which isn't all that difficult, but is another moving part and
I'm a big fan of keeping the number of moving parts down if possible.

It's not a one-size-fits-all situation. From what you've described, I can't
say there's a compelling reason to do the SolrCloud thing. If you find
yourself spending lots of time building monitoring or High
Availability/Disaster Recovery tools, then you might find the cost/benefit
analysis changing.

Personally, I think it's ironic that the memory improvements that came
along _with_ SolrCloud make it less necessary to shard, which means that
traditional M/S setups will suit more people longer.

Best
Erick


On Thu, Feb 28, 2013 at 8:22 PM, Amit Nithian  wrote:

> I don't know a ton about SolrCloud but for our setup and my limited
> understanding of it is that you start to bleed operational and
> non-operational aspects together which I am not comfortable doing (i.e.
> software load balancing). Also adding ZooKeeper to the mix is yet another
> thing to install, setup, monitor, maintain etc which doesn't add any value
> above and beyond what we have setup already.
>
> For example, we have a hardware load balancer that can do the actual load
> balancing of requests among the slaves and taking slaves in and out of
> rotation either on demand or if it's down. We've placed a virtual IP on top
> of our multiple masters so that we have redundancy there. While we have
> multiple cores, the data volume is small enough to fit on one node, so we
> aren't at the data volume necessary for sharding our indices. I suspect
> that if we had a sufficiently large dataset that couldn't fit on one box
> SolrCloud is perfect but when you can fit on one box, why add more
> complexity?
>
> Please correct me if I'm wrong for I'd like to better understand this!
>
>
>
>
> On Thu, Feb 28, 2013 at 12:53 AM, rulinma  wrote:
>
> > I am doing research on SolrCloud.
> >
> >
> >
> > --
> > View this message in context:
> >
> http://lucene.472066.n3.nabble.com/Poll-SolrCloud-vs-Master-Slave-usage-tp4042931p4043582.html
> > Sent from the Solr - User mailing list archive at Nabble.com.
> >
>


Re: filter query on multi-Valued field

2013-02-28 Thread Deepak Parmar
Jack

Thanks for your suggestions. I made "initials" a tokenized field and now
I'm getting the expected result.


On Thu, Feb 28, 2013 at 10:03 AM, Jack Krupansky wrote:

> First, what is your unique key field? If it is "id", then only one of
> these documents will be stored since they have the same id values.
>
> Please provide the exact request URL so we can see exactly what the q and
> fq parameters look like. Your fq looks malformed, but it's hard to say for
> sure without the exact literal URL.
>
> It appears that your multi-valued "initials" field actually has only a
> single value which is a string that contains keywords delimited by blanks.
> You have two choices: 1) make "initials" a tokenized/text field, or 2) be
> sure to add the second set of initials as a separate value, such as:
> "initials":["MKN", "JRT"]
>
> -- Jack Krupansky
>
> -Original Message- From: Deepak
> Sent: Thursday, February 28, 2013 8:05 AM
> To: solr-user@lucene.apache.org
> Subject: filter query on multi-Valued field
>
>
> Hi How can I filter results using filter query on multi-Valued field.  Here
> are sample two records   { "sub_count":8,
> "long_name":"Mike", "first_name":"John", "id":45949,
> "sym":"TEST", "type":"T", "last_name":"Account",
> "person_id":"3613", "short_name":"Molly", "initials":["ABC
> XYZ"], "timestamp":"2013-02-28T02:44:**02.235Z" }   {
> "sub_count":8, "long_name":"Mike", "first_name":"John",
> "id":45949, "sym":"TEST", "type":"T",
> "last_name":"Account", "person_id":"3613",
> "short_name":"Molly", "initials":["MKN JRT"],
> "timestamp":"2013-02-28T02:44:**02.235Z" } Note, all values are same
> except initials field value.   Solr query criteria q:sym:TEST
> fq:initials[MKN] This doesn't return the second record.  What am I missing
> here?
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/filter-query-on-multi-Valued-field-tp4043621.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: Role of zookeeper at runtime

2013-02-28 Thread Mark Miller
Yup - nothing about it will be automatic or easy - multi dc is not really a 
current feature. I'm just saying it's a fast way to move the data. If you set up 
the same cluster on each side though, the appropriate stuff will be in 
ZooKeeper.

- Mark

On Feb 28, 2013, at 9:04 PM, varun srivastava  wrote:

> "You can replicate from a SolrCloud node still. Just hit it's replication
> handler and pass in the master url to replicate to"
> 
> How will this work ? lets say s1dc1 is master of s1dc2 , s2dc1 is master
> for s2dc2 .. so after hitting replicate index binary will get copied but
> then how appropriate entries will be made in zookeeper. Zookeeper need to
> know which doc id range residing in which shard.
> 
> 
> Thanks
> Varun
> 
> On Thu, Feb 28, 2013 at 4:27 PM, Mark Miller  wrote:
> 
>> 
>> On Feb 28, 2013, at 6:20 PM, varun srivastava 
>> wrote:
>> 
>>> So we need way of indexing 1 dc
>>> and then somehow quickly propagate the index binary to others.
>> 
>> You can replicate from a SolrCloud node still. Just hit it's replication
>> handler and pass in the master url to replicate to. It doesn't have any
>> guarantees in terms of data loss, eg it's not part of SolrCloud per say,
>> but it's a fast way to move an index.
>> 
>> - Mark
>> 
>> 



Re: A few operations questions about the tlog (UpdateLog)

2013-02-28 Thread Mark Miller
To add:

Current best practice is to do a hard commit with openSearcher=false every 
minute or so. AutoCommit is great for this. It shouldn't affect your overall 
indexing performance and it will constrain the transaction log.
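In solrconfig.xml that looks roughly like the following (the one-minute
interval is just an example value):

    <autoCommit>
      <maxTime>60000</maxTime>           <!-- hard commit every 60 seconds -->
      <openSearcher>false</openSearcher> <!-- don't open a new searcher -->
    </autoCommit>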

- Mark

On Feb 28, 2013, at 8:35 PM, Erick Erickson  wrote:

> A new tlog gets created and the current one is closed on a hard commit
> (openSearcher=true or false doesn't matter). The old one will be kept
> around for a bit, I suspect if you'd done it a third time the first one
> would go away. For all I know, the code might read if (tlog docs > 100) and
> you may be hitting an off-by-one situation.
> 
> But you're right. If you follow older practices and _never_ commit until
> the end of a really long indexing process, you'll see these logs grow huge.
> Even worse, anytime they're replayed they'll then re-index a lot of
> documents.
> 
> So I'd try the experiment over again with 110 docs But I can say
> they're getting purged on my machine quite regularly with some stress
> testing I'm doing.
> 
> Best,
> Erick
> 
> 
> On Thu, Feb 28, 2013 at 7:59 AM, adityab  wrote:
> 
>> whats the life cycle of a tlog file. Is it purged after commit (even with
>> soft commit) ?
>> I posted 100 docs to solr (standalone) did hard commit. Observed a new tlog
>> file is created.
>> re-posted the same 100 docs and did hard commit. Observed a new tlog file
>> is
>> created. Old one still exists.
>> 
>> When do they get purged. Concern is we have at least 20K docs published
>> every 2hrs so need to understand if its safe to put them in a different
>> location where we can have a script to purge old files at regular interval.
>> 
>> thanks
>> 
>> 
>> 
>> 
>> --
>> View this message in context:
>> http://lucene.472066.n3.nabble.com/A-few-operations-questions-about-the-tlog-UpdateLog-tp4042560p4043618.html
>> Sent from the Solr - User mailing list archive at Nabble.com.
>> 



What makes an Analyzer/Tokenizer/CharFilter/etc suitable for Solr?

2013-02-28 Thread Alexandre Rafalovitch
Hello,

I want to have a unified reference of all different processors one could
use in Solr in various extension points.

I have written a small tool to extract all implementations
of UpdateRequestProcessorFactory, Analyzer, CharFilterFactory, etc
(actually of any root class).

However, I assume not all Lucene Analyzer derivatives can be just plugged
into Solr.

Is it fair to say that the class must:
*) Derive from the appropriate root (is there a list of ALL the roots?)
*) Be public and not abstract (though a common sub-root could be)
*) Have a default empty constructor

My preliminary tests seem to indicate this is the case. Am I missing
anything?
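
For concreteness, here is the sort of minimal (hypothetical) example I have in
mind, written against the Lucene/Solr 4.1 API, which seems to satisfy all
three constraints:

import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.util.Version;

// Public, non-abstract, with a default empty constructor, so it can be
// instantiated reflectively from <analyzer class="com.example.MyAnalyzer"/>.
public class MyAnalyzer extends Analyzer {
  public MyAnalyzer() {}

  @Override
  protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
    // Trivial analysis chain: whitespace tokenization only.
    return new TokenStreamComponents(new WhitespaceTokenizer(Version.LUCENE_41, reader));
  }
}

(I believe Solr will also try a constructor taking a Version when
luceneMatchVersion is configured, falling back to the empty one, but I haven't
verified every code path.)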

Regards,
   Alex.
Personal blog: http://blog.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all at
once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD book)


Re: What makes an Analyzer/Tokenizer/CharFilter/etc suitable for Solr?

2013-02-28 Thread Jack Krupansky

The package Javadoc for Solr analysis is a good start:

http://lucene.apache.org/solr/4_1_0/solr-core/org/apache/solr/analysis/package-tree.html

Especially the AbstractAnalysisFactory:

http://lucene.apache.org/core/4_1_0/analyzers-common/org/apache/lucene/analysis/util/AbstractAnalysisFactory.html
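
As a hypothetical sketch of the factory side under the 4.1 API (public,
concrete, default constructor; Solr calls the inherited init(Map) with the
attributes from the <filter .../> element in schema.xml):

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.util.TokenFilterFactory;
import org.apache.lucene.util.Version;

// Wraps the incoming TokenStream in a LowerCaseFilter.
public class MyLowerCaseFilterFactory extends TokenFilterFactory {
  @Override
  public TokenStream create(TokenStream input) {
    // Version hard-coded for brevity; a real factory would use the
    // luceneMatchVersion it was initialized with.
    return new LowerCaseFilter(Version.LUCENE_41, input);
  }
}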

Also, look at the various "factories" in solrconfig.xml for other Solr 
extension points, including search components, spellcheckers, etc.

-- Jack Krupansky




Re: index storage on other server

2013-02-28 Thread Chris Hostetter

: I need to store the index folder on another server.
: On the local system I have less space, so I want the application server
: (Tomcat) on the local machine, but the index folder can be stored on
: another machine.

That doesn't really make sense.

The entire point of a system like Solr is that it maintains the index 
"under the covers" on the local system, and has fast access to the entire 
index on disk (or in RAM) so it can receive queries from remote clients, 
compute the answers -- which are SMALLER than the index itself -- and 
send them to the client.

If your index were not locally accessible to Solr, and Solr had to use 
network IO to access the index on some remote file server -- remotely 
reading all of the files to scan the term indexes for matching terms/docs, 
and scanning the stored fields of the resulting docs, to answer a client's 
query -- then there wouldn't be much point in having Solr at all: the 
clients might as well talk to your remote storage system directly, and 
do the index scanning and computation themselves, cutting out the middle 
man ... but all that network IO for the entire index would be pretty slow, 
and the only way to make it more efficient would be to find some way to 
minimize the amount of data over the wire, possibly by delegating some of 
the index scanning and computation onto the remote file server ... at which 
point you have essentially decided to run Solr on your remote file server.

See the problem? It's turtles all the way down.

-Hoss


Re: Poll: SolrCloud vs. Master-Slave usage

2013-02-28 Thread Amit Nithian
Erick,

Well put and thanks for the clarification. One question:
"And if you need NRT, you just can't get it with traditional M/S setups."
==> Can you explain how that works with SolrCloud?

I agree with what you said, too: there was an article or discussion I read
saying that highly available masters require some fairly complicated setups,
so I guess I am underestimating how expensive/complicated our setup is
relative to what you can get out of the box with SolrCloud.

Thanks!
Amit


On Thu, Feb 28, 2013 at 6:29 PM, Erick Erickson wrote:

> Amit:
>
> It's a balancing act. If I were starting fresh, even with one shard, I'd
> probably use SolrCloud rather than deal with the issues around the "how do
> I recover if my master goes down" question. Additionally, SolrCloud allows
> one to monitor the health of the entire system by monitoring the state
> information kept in ZooKeeper rather than building a monitoring system that
> understands the changing topology of your network.
>
> And if you need NRT, you just can't get it with traditional M/S setups.
>
> In a mature production system where all the operational issues are figured
> out and you don't need NRT, it's easier just to plop 4.x in traditional M/S
> setups and not go to SolrCloud. And you're right, you have to understand
> ZooKeeper, which isn't all that difficult, but it is another moving part, and
> I'm a big fan of keeping the number of moving parts down if possible.
>
> It's not a one-size-fits-all situation. From what you've described, I can't
> say there's a compelling reason to do the SolrCloud thing. If you find
> yourself spending lots of time building monitoring or High
> Availability/Disaster Recovery tools, then you might find the cost/benefit
> analysis changing.
>
> Personally, I think it's ironic that the memory improvements that came
> along _with_ SolrCloud make it less necessary to shard, which means that
> traditional M/S setups will suit more people for longer.
>
> Best
> Erick
>
>
> On Thu, Feb 28, 2013 at 8:22 PM, Amit Nithian  wrote:
>
> > I don't know a ton about SolrCloud, but my limited understanding is that
> > you start to bleed operational and non-operational aspects together,
> > which I am not comfortable doing (i.e., software load balancing). Also,
> > adding ZooKeeper to the mix is yet another thing to install, set up,
> > monitor, and maintain, which doesn't add any value above and beyond what
> > we have set up already.
> >
> > For example, we have a hardware load balancer that can do the actual load
> > balancing of requests among the slaves and take slaves in and out of
> > rotation either on demand or when one is down. We've placed a virtual IP
> > on top of our multiple masters so that we have redundancy there. While we
> > have multiple cores, the data volume is small enough to fit on one node,
> > so we aren't at the volume necessary for sharding our indices. I suspect
> > that if we had a sufficiently large dataset that couldn't fit on one box,
> > SolrCloud would be perfect, but when you can fit on one box, why add more
> > complexity?
> >
> > Please correct me if I'm wrong, as I'd like to better understand this!
> >
> >
> >
> >
> > On Thu, Feb 28, 2013 at 12:53 AM, rulinma  wrote:
> >
> > > I am doing research on SolrCloud.
> > >
> > >
> > >
> > > --
> > > View this message in context:
> > >
> >
> http://lucene.472066.n3.nabble.com/Poll-SolrCloud-vs-Master-Slave-usage-tp4042931p4043582.html
> > > Sent from the Solr - User mailing list archive at Nabble.com.
> > >
> >
>