Index not fitting in memory (file-cache)

2016-03-24 Thread Robert Brown

Hi,

If my index data directory size is 70G, and I don't have 70G (plus heap, 
etc) in the system, this will occasionally affect search speed right?  
When Solr has to resort to reading from disk?


Before I go out and throw more RAM into the system, in the above 
example, what would you recommend?


Thanks,
Rob




Re: SolrJ Indexing

2016-03-24 Thread fabigol
Hi Shawn
thank for your response.
As you can see in my XML file, I have many entities which are linked to each
other. I know how to do that with DIH, but with SolrJ I don't know. Must I use
the annotations such as @Field?

Moreover, I created a new Solr project with the same XML files - copied the conf
directory - and oddly the indexing is much faster, not just a little but about
100 times as fast.
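
From what I can see, the bean binding with @Field would look roughly like this
(class and field names here are hypothetical, not from my real project):

  import org.apache.solr.client.solrj.SolrClient;
  import org.apache.solr.client.solrj.beans.Field;
  import org.apache.solr.client.solrj.impl.HttpSolrClient;

  public class Entry {
    @Field("id")
    String id;

    @Field("name")
    String name;

    public static void main(String[] args) throws Exception {
      SolrClient client = new HttpSolrClient("http://localhost:8983/solr/mycore");
      Entry e = new Entry();
      e.id = "1";
      e.name = "some name";
      client.addBean(e);  // SolrJ maps the annotated fields onto Solr fields
      client.commit();
      client.close();
    }
  }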







Re: Issue With Manual Lock

2016-03-24 Thread Reth RM
Hi Salman,

The index lock error is generally reported when more than one core or Solr
instance tries to share the same index directory. Please check whether more
than one of those cores points to the same data directory. You can see the
directory path on the "Overview" tab of the admin page.




On Wed, Mar 23, 2016 at 1:59 PM, Salman Ansari 
wrote:

> Hi,
>
> I am facing an issue which I believe has something to do with recent
> changes in Solr. I already have a collection spread on 2 shards (each with
> 2 replicas).  What happened is that my Solr and Zookeeper ensemble went
> down and I restarted the servers. I have performed the following steps
>
> 1) I restarted the machine and performed Windows update
> 2) I started Zookeeper ensemble
> 3) Then I started Solr instances
>
> My issues are (for collections which existed before starting Solr servers)
>
> 1) From time to time, I see some replicas are down on Solr dashboard
> 2) When I try to index some documents, I faced the following exception
>
> SolrNet.Exceptions.SolrConnectionException was unhandled by user code
>
>   HResult=-2146232832
>
>   Message=
>
> <response>
>   <lst name="responseHeader">
>     <int name="status">500</int>
>     <int name="QTime">1021</int>
>   </lst>
>   <lst name="error">
>     <str name="msg">{msg=SolrCore '[myCollection]_shard1_replica1' is not available
> due to init failure: Index locked for write for core
> '[myCollection]_shard1_replica1'. Solr now longer supports forceful
> unlocking via 'unlockOnStartup'. Please verify locks
> manually!,trace=org.apache.solr.common.SolrException: SolrCore
> '[myCollection]_shard1_replica1' is not available due to init failure:
> Index locked for write for core '[myCollection]_shard1_replica1'. Solr now
> longer supports forceful unlocking via 'unlockOnStartup'. Please verify
> locks manually!}</str>
>   </lst>
> </response>
>
> I have tried several steps including
>
> 1) I have removed write.lock file manually from the folders while Solr is
> up and I have tried reloading the core while the Solr is up as well but
> nothing changed (still some replicas are down)
> 2) I have restarted Solr instances but now all replicas are down :)
>
> Any idea how to handle this issue?
>
> Appreciate your comments/feedback.
>
> Regards,
> Salman
>


Re: SyntaxError - Block Join Parent Query

2016-03-24 Thread Charles Sanders
Ah yes. Thank you. Made the correction and I do not get the SyntaxError. 
However, it does not apply the child filter. The query should return only 
TestParent4. But it is returning TestParent2, TestParent3 and TestParent4. All 
of these meet the parent portion of the query (+blue). But only TestParent4 
should meet the child portion of the query. 

q=+blue +{!parent 
which="documentKind:TestParent"v=$childq}&childq=portal_product:("red hat") 


{
  "responseHeader":{
"status":0,
"QTime":15,
"params":{
  "indent":"true",
  "q":" blue  {!parent which=\"documentKind:TestParent\"v=$childq}",
  "childq":"portal_product:(\"red hat\")",
  "wt":"json"}},
  "response":{"numFound":2733,"start":0,"maxScore":3.0138793,"docs":[
  {
"documentKind":"TestParent",
"uri":"https://ping/pong/testparent4";,
"view_uri":"https://ping/pong/testparent4";,
"id":"TestParent4",
"allTitle":"blue",
"sortTitle":"blue",
"_version_":1529622873461751808,
"_root_":["https://ping/pong/testparent4";],
"timestamp":"2016-03-23T19:40:48.211Z",
"language":"en"},
  {
"documentKind":"TestParent",
"uri":"https://ping/pong/testparent3";,
"view_uri":"https://ping/pong/testparent3";,
"id":"TestParent3",
"allTitle":"blue",
"sortTitle":"blue",
"_version_":1529622308758487040,
"_root_":["https://ping/pong/testparent3";],
"timestamp":"2016-03-23T19:31:49.668Z",
"language":"en"},
  {
"documentKind":"TestParent",
"uri":"https://ping/pong/testparent2";,
"view_uri":"https://ping/pong/testparent2";,
"id":"TestParent2",
"allTitle":"blue",
"sortTitle":"blue",
"_version_":1529622293809987584,
"_root_":["https://ping/pong/testparent2";],
"timestamp":"2016-03-23T19:31:35.408Z",
"language":"en"} 

- Original Message -

From: "Mikhail Khludnev"  
To: "solr-user"  
Sent: Wednesday, March 23, 2016 5:34:31 PM 
Subject: Re: SyntaxError - Block Join Parent Query 

On Thu, Mar 24, 2016 at 12:16 AM, Charles Sanders  
wrote: 

> Thanks for the quick reply. But I'm not sure I understand. Did I do 
> something wrong?? 
> 
Absolutely:
'portal_product"red hat")'
You omitted the colon and the opening parenthesis after the field name, didn't you?


> 
> 
> /select?q=+blue%20+{!parent%20which=%22documentKind:TestParent%22%20v=$childq}&childq=portal_product%22red%20hat%22)
>  
> 
> <response>
>   <lst name="responseHeader">
>     <int name="status">400</int>
>     <int name="QTime">2</int>
>     <lst name="params">
>       <str name="q">blue {!parent which="documentKind:TestParent" v=$childq}</str>
>       <str name="childq">portal_product"red hat")</str>
>     </lst>
>   </lst>
>   <lst name="error">
>     <str name="msg">org.apache.solr.search.SyntaxError: Cannot parse 'portal_product"red
> hat")': Encountered " ")" ") "" at line 1, column 23. Was expecting one of:
>  ...  ...  ... "+" ... "-" ...  ... "(" ...
> "*" ... "^" ...  ...  ...  ...  ...
>  ...  ... "[" ... "{" ...  ...  ...
>     </str>
>     <int name="code">400</int>
>   </lst>
> </response>
>  
> 
> - Original Message - 
> 
> From: "Mikhail Khludnev"  
> To: "solr-user"  
> Sent: Wednesday, March 23, 2016 5:02:29 PM 
> Subject: Re: SyntaxError - Block Join Parent Query 
> 
> On Wed, Mar 23, 2016 at 11:09 PM, Charles Sanders  
> wrote: 
> 
> > I'm getting a SyntaxError which I do not understand when I execute a 
> block 
> > join parent query. I'm running Solr5.2.1, with 2 shards. The problem 
> > appears to be in that portion of the query that filters the child 
> document. 
> > Any insight as to where I made the error is greatly appreciated. 
> > 
> > This query produces an error: 
> > q=+blue +{!parent which="documentKind:TestParent"}portal_product:("red 
> > hat") 
> > -- should return TestParent4 
> > 
> q=+blue +{!parent which="documentKind:TestParent" 
> v=$childq}&childq=portal_product:("red hat") 
> 
> 
> > However, this query works: 
> > q=+blue +{!parent which="documentKind:TestParent"}portal_product:rhel 
> > -- should return TestParent2 
> > 
> > Sample data and schema information below: 
> > { 
> > "documentKind": "TestParent", 
> > "uri": "https://ping/pong/testparent1";, 
> > "view_uri": "https://ping/pong/testparent1";, 
> > "id": "TestParent1", 
> > "allTitle": "gold", 
> > "allText": "gold", 
> > "contents": "gold", 
> > "_childDocuments_": [ 
> > { 
> > "documentKind": "TestChild", 
> > "uri": "testchild1", 
> > "id": "testchild1", 
> > "portal_product_version": "6", 
> > "portal_product": "rhel" 
> > } 
> > ] 
> > } 
> > 
> > { 
> > "documentKind": "TestParent", 
> > "uri": "https://ping/pong/testparent2";, 
> > "view_uri": "https://ping/pong/testparent2";, 
> > "id": "TestParent2", 
> > "allTitle": "blue", 
> > "allText": "blue", 
> > "contents": "blue", 
> > "_childDocuments_": [ 
> > { 
> > "documentKind": "TestChild", 
> > "uri": "testchild2", 
> > "id": "testchild2", 
> > "portal_product_version": "6", 
> > "portal_product": "rhel" 
> > } 
> > ] 
> > } 
> > 
> > { 
> > "documentKind": "TestParent", 
> > "uri": "https://ping/pong/testparent3";, 
> > "view_uri": "htt

Re: SolrJ Indexing

2016-03-24 Thread Shawn Heisey
On 3/24/2016 4:06 AM, fabigol wrote:
> I know how to do that with DIH, but with SolrJ I don't know. Must I use
> the annotations such as @Field?
>
> Moreover, I created a new Solr project with the same XML files - copied the
> conf directory - and oddly the indexing is much faster, not just a little
> but about 100 times as fast.

I can't figure out exactly what you're saying here.  I'll respond as
best I can.

I have no idea how to use the annotations with SolrJ.  I just construct
SolrInputDocument objects, put them in a List, and send them to Solr
using the add() method.
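
A minimal sketch of that pattern (server URL and field names are placeholders):

  import java.util.ArrayList;
  import java.util.List;
  import org.apache.solr.client.solrj.SolrClient;
  import org.apache.solr.client.solrj.impl.HttpSolrClient;
  import org.apache.solr.common.SolrInputDocument;

  public class IndexSketch {
    public static void main(String[] args) throws Exception {
      SolrClient client = new HttpSolrClient("http://localhost:8983/solr/mycore");
      List<SolrInputDocument> docs = new ArrayList<>();
      for (int i = 0; i < 1000; i++) {
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "doc-" + i);
        doc.addField("title", "title " + i);
        docs.add(doc);
      }
      client.add(docs);  // one request for the whole batch
      client.commit();
      client.close();
    }
  }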

There could be any number of reasons that this other indexing you're
talking about is faster, and you haven't given enough information to
know what might be happening.  Does it still go 100 times as fast if you
let the index grow to the same size as the other one?

Thanks,
Shawn



Re: Index not fitting in memory (file-cache)

2016-03-24 Thread Shawn Heisey
On 3/24/2016 4:02 AM, Robert Brown wrote:
> If my index data directory size is 70G, and I don't have 70G (plus
> heap, etc) in the system, this will occasionally affect search speed
> right?  When Solr has to resort to reading from disk?
>
> Before I go out and throw more RAM into the system, in the above
> example, what would you recommend?

Having enough memory available to cache all your index data offers the
best possible performance.

You may be able to achieve acceptable performance when you don't have
that much memory, but I would try to make sure there's at least enough
memory available to cache *half* the index data.  Depending on the
nature of your queries and your index, this might not be enough, but
chances are good that it would work well.

I have a dev server where there's only enough memory available to cache
about a tenth of the index -- it's got full copies of all three of my
large indexes on ONE machine, while production runs two copies of these
same indexes on ten machines.  Performance of any single query is not
very good on the dev server, but if I absolutely had to use that server
for production with one of my indexes, it would be slow, but I could
do it.  I don't think it would have enough performance to handle running
all three indexes for production, though.

Thanks,
Shawn



Re: SolrCloud: published host/port

2016-03-24 Thread Shawn Heisey
On 3/24/2016 12:26 AM, Hendrik Haddorp wrote:
> is it possible to instruct Solr to publish a different host/port into
> ZooKeeper than it is actually running on? This is required if the Solr
> node is not directly reachable on its port from outside due to a NAT
> setup or when running Solr as a Docker container with a mapped port.
>
> For what it's worth, ElasticSearch supports this, as documented here [1]:
> - transport.publish_port
> - transport.publish_host

Although it is possible to publish different information than Jetty is
actually using, if you do so, SolrCloud may not work correctly.

SolrCloud uses the information published to zookeeper when it routes
requests from one node to another, so if inter-node communication must
use the real port, publishing only the port that is valid beyond the
firewall will break those internal requests.

Thanks,
Shawn



Re: Index not fitting in memory (file-cache)

2016-03-24 Thread Robert Brown

Thanks Shawn,

One of my indexes is 70G on disk but only has 25G RAM, usually it's fast 
as hell, less than 0.5s for a full API wrapped call, but we do 
occasionally see searches taking 2.5 seconds.


I'm currently shuffling VMs around to increase the RAM, good to hear 
this may solve those random slowdowns - or at least rule it out.




On 03/24/2016 01:44 PM, Shawn Heisey wrote:

On 3/24/2016 4:02 AM, Robert Brown wrote:

If my index data directory size is 70G, and I don't have 70G (plus
heap, etc) in the system, this will occasionally affect search speed
right?  When Solr has to resort to reading from disk?

Before I go out and throw more RAM into the system, in the above
example, what would you recommend?

Having enough memory available to cache all your index data offers the
best possible performance.

You may be able to achieve acceptable performance when you don't have
that much memory, but I would try to make sure there's at least enough
memory available to cache *half* the index data.  Depending on the
nature of your queries and your index, this might not be enough, but
chances are good that it would work well.

I have a dev server where there's only enough memory available to cache
about a tenth of the index -- it's got full copies of all three of my
large indexes on ONE machine, while production runs two copies of these
same indexes on ten machines.  Performance of any single query is not
very good on the dev server, but if I absolutely had to use that server
for production with one of my indexes, it would be slow, but I could
do it.  I don't think it would have enough performance to handle running
all three indexes for production, though.

Thanks,
Shawn





Performance potential for updating (reindexing) documents

2016-03-24 Thread tedsolr
With a properly tuned solr cloud infrastructure and less than 1B total docs
spread out over 50 collections where the largest collection is 100M docs,
what is a reasonable target goal for entirely reindexing a single
collection?

I understand there are a lot of variables, so I'm hypothetically wiping them
away by assuming "a properly tuned infrastructure". So the hardware, RAM,
etc. is configured correctly (not so in my case).

The scenario is to add 3 fields to all the existing docs in one collection.
The fields are the same but the values vary based on the docs. So a search
is performed and finds 100 matches - all 100 docs will get the same updates.
Then another search is performed that matches 15000 docs, and these are
updated. This continues 10-20,000 times until essentially all the docs have
been updated.

The docs all have 100 - 200 fields, mostly text and mostly small in size.
What's the best possible throughput I can expect? 1000 docs/sec? 5000
docs/sec?

Using SolrJ for querying and indexing against a v5.2.1 cloud.





RE: Indexing multiple pdf's and partial update of pdf

2016-03-24 Thread Jay Parashar


Thanks Reth,



Yes, I am using Apache Tika and went by the instructions given in

https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Solr+Cell+using+Apache+Tika



Here I see we can index a pdf "solr-word.pdf" to a document with unique key =
"doc1" as below:



curl 'http://localhost:8983/solr/techproducts/update/extract?literal.id=doc1&commit=true' \
  -F "myfile=@example/exampledocs/solr-word.pdf"



My requirement is to index another separate pdf to this document with key = 
doc1. Basically I need the contents of both pdfs to be searchable and related 
to the id=doc1.



What comes to my mind is to perform an 'extractOnly' as below on both pdf's and 
then index the concatenation of the contents. Is there another less invasive 
way?



curl "http://localhost:8983/solr/techproducts/update/extract?&extractOnly=true"; 
--data-binary @example/exampledocs/sample.html -H 'Content-type:text/html'



Thanks

Jay



-Original Message-
From: Reth RM [mailto:reth.ik...@gmail.com]
Sent: Thursday, March 24, 2016 12:24 AM
To: solr-user@lucene.apache.org
Subject: Re: Indexing multiple pdf's and partial update of pdf



Are you using apache tika parser to parse pdf files?



1) Solr supports parent-child block join, with which you can index more than one
file's data within a document object (if that is what you are looking for):
https://cwiki.apache.org/confluence/display/solr/Other+Parsers#OtherParsers-BlockJoinQueryParsers



2) If the unique key of the document that exists in the index is equal to that
of the new document that you are reindexing, it will be overwritten. If you'd
like to do partial updates via curl, here are some examples:

http://yonik.com/solr/atomic-updates/
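
For instance, a rough sketch of such a partial update via curl (the field name
"content" is hypothetical, and atomic updates require the fields to be stored):

  curl 'http://localhost:8983/solr/techproducts/update?commit=true' \
    -H 'Content-type:application/json' \
    -d '[{"id":"doc1","content":{"add":"text extracted from the second pdf"}}]'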











On Thu, Mar 24, 2016 at 3:43 AM, Jay Parashar <bparas...@slb.com> wrote:

> Hi,
>
> I have a couple of questions regarding indexing files (say pdf).
>
> 1) Is there any way to index more than one file to one document with
> a unique id?
>
> One way I think is to do an "extractOnly" of all the documents and then
> index that extract separately. Is there an easier way?
>
> 2) If my Solr document has existing fields populated and then I index
> a pdf, it seems it overwrites the document with the end result being
> just the contents of the pdf. I know we can do partial updates using
> SolrJ but is it possible to do partial updates of pdf using curl?
>
> Thanks
> Jay

Re: Solr 5.5.0: JVM args warning in console logfile.

2016-03-24 Thread Bram Van Dam
> When I made the change outlined in the patch on SOLR-8145 to my bin/solr
> script, the warning disappeared.  That was not the intended effect of
> the patch, but I'm glad to have the mystery solved.
> 
> Thank you for mentioning the problem so we could track it down.

You're welcome. And thanks for fixing it ;-). We're rather particular
about what appears in our logs.

 - Bram



Re: Next Solr Release - 5.5.1 or 6.0 ?

2016-03-24 Thread Bram Van Dam
On 23/03/16 15:50, Yonik Seeley wrote:
> Kind of a unique situation for a dot-oh release, but from the Solr
> perspective, 6.0 should have *fewer* bugs than 5.5 (for those features
> in 5.5 at least)... we've been squashing a bunch of docValue related
> issues.

I've been led to understand that 6.X (at least the Lucene part?) won't
be backwards compatible with 4.X data. 5.5 at least works fine with data
files from 4.7, for instance. With that in mind, at least from my
selfish perspective, applying fixes to 5.X would be much appreciated ;-)

 - Bram




Re: Next Solr Release - 5.5.1 or 6.0 ?

2016-03-24 Thread Jack Krupansky
Does anybody know if we have doc on the recommended process for upgrading
data after upgrading Solr? Sure the upgraded version will work fine with
that old data, but unless the data is upgraded, the user can't then upgrade
to the next major release after that. This is a case in point - the user is
on 4.x and upgrades to 5.x with that 4.x data, but will want to upgrade to
6.x shortly, but that will require the 4.x data to be rewritten
(force-merged?) to 5.x first.

-- Jack Krupansky

On Thu, Mar 24, 2016 at 11:38 AM, Bram Van Dam  wrote:

> On 23/03/16 15:50, Yonik Seeley wrote:
> > Kind of a unique situation for a dot-oh release, but from the Solr
> > perspective, 6.0 should have *fewer* bugs than 5.5 (for those features
> > in 5.5 at least)... we've been squashing a bunch of docValue related
> > issues.
>
> I've been led to understand that 6.X (at least the Lucene part?) won't
> be backwards compatible with 4.X data. 5.5 at least works fine with data
> files from 4.7, for instance. With that in mind, at least from my
> selfish perspective, applying fixes to 5.X would be much appreciated ;-)
>
>  - Bram
>
>
>


Re: Next Solr Release - 5.5.1 or 6.0 ?

2016-03-24 Thread Yonik Seeley
On Thu, Mar 24, 2016 at 11:45 AM, Yonik Seeley  wrote:
>> I've been led to understand that 6.X (at least the Lucene part?) won't
>> be backwards compatible with 4.X data. 5.5 at least works fine with data
>> files from 4.7, for instance.

It really doesn't seem like much changed at the lucene index-format
level from 5 to 6...
it makes one wonder how much work would be involved in allowing Lucene
6 to directly read a newer 4.x index... maybe it's just down to
version strings in the index and not much else?

-Yonik


Is dataimporter.functions.escapeSql() functional ?

2016-03-24 Thread Joachim DORNBUSCH
Hi,

I wonder if the function ${dataimporter.functions.escapeSql()} is available in 
Solr 5.3.1.

Whenever I use it in my data import handlers, Solr replaces
'${dataimporter.functions.escapeSql(field)}' with '' (an empty string).

How can I escape strings when building SQL queries in the DIH config file?

I have 2 entities: ref_entity0 and ref_entity1.

If ref_entity0's name contains a single quote ('), the following exception
occurs:
ref_entity1:org.apache.solr.handler.dataimport.DataImportHandlerException:
Unable to execute query: select ...

Below is the shape of the code.
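
A minimal sketch with made-up table and column names:

  <document>
    <entity name="ref_entity0" query="select id, name from ref_table0">
      <!-- escapeSql() applied to the parent entity's value -->
      <entity name="ref_entity1"
              query="select * from ref_table1 where parent_name =
                     '${dataimporter.functions.escapeSql(ref_entity0.name)}'"/>
    </entity>
  </document>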

Re: SyntaxError - Block Join Parent Query

2016-03-24 Thread Mikhail Khludnev
I suggest adding debugQuery=true and the fl=*,[child ...] doc transformer, and
coming back with the response.

On Thu, Mar 24, 2016 at 3:23 PM, Charles Sanders 
wrote:

> Ah yes. Thank you. Made the correction and I do not get the SyntaxError.
> However, it does not apply the child filter. The query should return only
> TestParent4. But it is returning TestParent2, TestParent3 and TestParent4.
> All of these meet the parent portion of the query (+blue). But only
> TestParent4 should meet the child portion of the query.
>
> q=+blue +{!parent
> which="documentKind:TestParent"v=$childq}&childq=portal_product:("red hat")
>
>
> {
>   "responseHeader":{
> "status":0,
> "QTime":15,
> "params":{
>   "indent":"true",
>   "q":" blue  {!parent which=\"documentKind:TestParent\"v=$childq}",
>   "childq":"portal_product:(\"red hat\")",
>   "wt":"json"}},
>   "response":{"numFound":2733,"start":0,"maxScore":3.0138793,"docs":[
>   {
> "documentKind":"TestParent",
> "uri":"https://ping/pong/testparent4";,
> "view_uri":"https://ping/pong/testparent4";,
> "id":"TestParent4",
> "allTitle":"blue",
> "sortTitle":"blue",
> "_version_":1529622873461751808,
> "_root_":["https://ping/pong/testparent4";],
> "timestamp":"2016-03-23T19:40:48.211Z",
> "language":"en"},
>   {
> "documentKind":"TestParent",
> "uri":"https://ping/pong/testparent3";,
> "view_uri":"https://ping/pong/testparent3";,
> "id":"TestParent3",
> "allTitle":"blue",
> "sortTitle":"blue",
> "_version_":1529622308758487040,
> "_root_":["https://ping/pong/testparent3";],
> "timestamp":"2016-03-23T19:31:49.668Z",
> "language":"en"},
>   {
> "documentKind":"TestParent",
> "uri":"https://ping/pong/testparent2";,
> "view_uri":"https://ping/pong/testparent2";,
> "id":"TestParent2",
> "allTitle":"blue",
> "sortTitle":"blue",
> "_version_":1529622293809987584,
> "_root_":["https://ping/pong/testparent2";],
> "timestamp":"2016-03-23T19:31:35.408Z",
> "language":"en"}
>
> - Original Message -
>
> From: "Mikhail Khludnev" 
> To: "solr-user" 
> Sent: Wednesday, March 23, 2016 5:34:31 PM
> Subject: Re: SyntaxError - Block Join Parent Query
>
> On Thu, Mar 24, 2016 at 12:16 AM, Charles Sanders 
> wrote:
>
> > Thanks for the quick reply. But I'm not sure I understand. Did I do
> > something wrong??
> >
> Absolutely:
> 'portal_product"red hat")'
> You omitted the colon and the opening parenthesis after the field name, didn't you?
>
>
> >
> >
> >
> /select?q=+blue%20+{!parent%20which=%22documentKind:TestParent%22%20v=$childq}&childq=portal_product%22red%20hat%22)
> >
> > <response>
> >   <lst name="responseHeader">
> >     <int name="status">400</int>
> >     <int name="QTime">2</int>
> >     <lst name="params">
> >       <str name="q">blue {!parent which="documentKind:TestParent" v=$childq}</str>
> >       <str name="childq">portal_product"red hat")</str>
> >     </lst>
> >   </lst>
> >   <lst name="error">
> >     <str name="msg">org.apache.solr.search.SyntaxError: Cannot parse 'portal_product"red
> > hat")': Encountered " ")" ") "" at line 1, column 23. Was expecting one of:
> >  ...  ...  ... "+" ... "-" ...  ... "(" ...
> > "*" ... "^" ...  ...  ...  ...  ...
> >  ...  ... "[" ... "{" ...  ...  ...
> >     </str>
> >     <int name="code">400</int>
> >   </lst>
> > </response>
> >
> > - Original Message -
> >
> > From: "Mikhail Khludnev" 
> > To: "solr-user" 
> > Sent: Wednesday, March 23, 2016 5:02:29 PM
> > Subject: Re: SyntaxError - Block Join Parent Query
> >
> > On Wed, Mar 23, 2016 at 11:09 PM, Charles Sanders 
> > wrote:
> >
> > > I'm getting a SyntaxError which I do not understand when I execute a
> > block
> > > join parent query. I'm running Solr5.2.1, with 2 shards. The problem
> > > appears to be in that portion of the query that filters the child
> > document.
> > > Any insight as to where I made the error is greatly appreciated.
> > >
> > > This query produces an error:
> > > q=+blue +{!parent which="documentKind:TestParent"}portal_product:("red
> > > hat")
> > > -- should return TestParent4
> > >
> > q=+blue +{!parent which="documentKind:TestParent"
> > v=$childq}&childq=portal_product:("red hat")
> >
> >
> > > However, this query works:
> > > q=+blue +{!parent which="documentKind:TestParent"}portal_product:rhel
> > > -- should return TestParent2
> > >
> > > Sample data and schema information below:
> > > {
> > > "documentKind": "TestParent",
> > > "uri": "https://ping/pong/testparent1";,
> > > "view_uri": "https://ping/pong/testparent1";,
> > > "id": "TestParent1",
> > > "allTitle": "gold",
> > > "allText": "gold",
> > > "contents": "gold",
> > > "_childDocuments_": [
> > > {
> > > "documentKind": "TestChild",
> > > "uri": "testchild1",
> > > "id": "testchild1",
> > > "portal_product_version": "6",
> > > "portal_product": "rhel"
> > > }
> > > ]
> > > }
> > >
> > > {
> > > "documentKind": "TestParent",
> > > "uri": "https://ping/pong/testparent2";,
> > > "view_uri": "https://ping/pong/testparent2";,
> > > "id": "TestParent2",
> > > "allTitle": "blue",
> > > "a

Re: Next Solr Release - 5.5.1 or 6.0 ?

2016-03-24 Thread Yonik Seeley
On Thu, Mar 24, 2016 at 11:38 AM, Bram Van Dam  wrote:
> On 23/03/16 15:50, Yonik Seeley wrote:
>> Kind of a unique situation for a dot-oh release, but from the Solr
>> perspective, 6.0 should have *fewer* bugs than 5.5 (for those features
>> in 5.5 at least)... we've been squashing a bunch of docValue related
>> issues.
>
> I've been led to understand that 6.X (at least the Lucene part?) won't
> be backwards compatible with 4.X data. 5.5 at least works fine with data
> files from 4.7, for instance. With that in mind, at least from my
> selfish perspective, applying fixes to 5.X would be much appreciated ;-)

I hear ya...
In the event that someone volunteers to make a 5.6 or a 5.5.1 release,
we should have a big back-porting party :-)

-Yonik


Re: Index not fitting in memory (file-cache)

2016-03-24 Thread Toke Eskildsen
Robert Brown  wrote:
> Before I go out and throw more RAM into the system, in the above
> example, what would you recommend?

That you try to determine what causes the slow response times.

Replay logged queries (thousands of queries, not just a few) and see if the 
pauses are random or tied to specific queries. Turn on GC-logs to see if the 
pauses are caused by garbage collection.

If the long response times are tied to specific queries, then try turning off 
queryResultCache and documentCache and replay a handful of the slow queries a 
few times. This ensures that the data needed are cached by the file system. If 
they continue being slow, then more RAM might not help you.
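
For reference, a minimal way to take those caches out of play in solrconfig.xml
while testing (only the sizes change relative to the stock config):

  <queryResultCache class="solr.LRUCache" size="0" initialSize="0" autowarmCount="0"/>
  <documentCache class="solr.LRUCache" size="0" initialSize="0" autowarmCount="0"/>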

- Toke Eskildsen


Solr 5 and JDK 8 - awful performance

2016-03-24 Thread Dragos Vizireanu
Hi,

I have a big problem with the performance of running Solr 5 with JDK 8.

Details:
- tried with both Solr 5.4.0 and Solr 5.5.0  (even with Solr 4)
- default Solr 5 configuration
- created a new core, for which I am using data import handler to get data
from MySQL


When I am trying to index data using the import handler, the import is very
fast with JDK 7 and awfully slow with JDK 8! Here are the results copied
from Solr gui:

- *JDK 7  - Duration 17 seconds*

Indexing completed. Added/Updated: 5,997 documents. Deleted 0
documents. *(Duration:
17s)*

Requests: 6 (0/s), Fetched: 235,593 (13,858/s), Skipped: 0, Processed:
5,997 (353/s)



- *JDK 8 - Duration 7 minutes*

Indexing completed. Added/Updated: 5,997 documents. Deleted 0
documents. *(Duration:
7m 06s)*
Requests: 6 (0/s), Fetched: 47,160 (111/s), Skipped: 0, Processed: 5,997


As you can see, there is a problem with the performance being awful with
JDK 8. While calling DIH to index, I got no errors in the log, but the
indexing just didn't do anything for some minutes and then started again.
This is very strange I could not find the reason why this is happening.


I also found these ideas on Solr Wiki:

https://wiki.apache.org/solr/SolrPerformanceProblems -> GC pause problems

https://wiki.apache.org/solr/ShawnHeisey#GC_Tuning_for_Solr

The thing is, it says "If you are using the bin/solr or bin\solr
script to start Solr, you already have GC tuning and won't need to worry
about the recommendations here." I did check the scripts in the /bin
directory, and GC tuning is indeed set there ("set GC_TUNE ...").


I would appreciate if you can help with this issue.


Best regards,

Dragos


Re: Solr 5 and JDK 8 - awful performance

2016-03-24 Thread Yonik Seeley
Wow... that's pretty strange.

> indexing just didn't do anything for some minutes and then started again.

I wonder if it's anything to do with DNS lookups or something like that?

-Yonik


On Thu, Mar 24, 2016 at 11:54 AM, Dragos Vizireanu  wrote:
> Hi,
>
> I have a big problem with the performance of running Solr 5 with JDK 8.
>
> Details:
> - tried with both Solr 5.4.0 and Solr 5.5.0  (even with Solr 4)
> - default Solr 5 configuration
> - created a new core, for which I am using data import handler to get data
> from MySQL
>
>
> When I am trying to index data using the import handler, the import is very
> fast with JDK 7 and awfully slow with JDK 8! Here are the results copied
> from Solr gui:
>
> - *JDK 7  - Duration 17 seconds*
>
> Indexing completed. Added/Updated: 5,997 documents. Deleted 0
> documents. *(Duration:
> 17s)*
>
> Requests: 6 (0/s), Fetched: 235,593 (13,858/s), Skipped: 0, Processed:
> 5,997 (353/s)
>
>
>
> - *JDK 8 - Duration 7 minutes*
>
> Indexing completed. Added/Updated: 5,997 documents. Deleted 0
> documents. *(Duration:
> 7m 06s)*
> Requests: 6 (0/s), Fetched: 47,160 (111/s), Skipped: 0, Processed: 5,997
>
>
> As you can see, there is a problem with the performance being awful with
> JDK 8. While calling DIH to index, I got no errors in the log, but the
> indexing just didn't do anything for some minutes and then started again.
> This is very strange I could not find the reason why this is happening.
>
>
> I also found these ideas on Solr Wiki:
>
> https://wiki.apache.org/solr/SolrPerformanceProblems -> GC pause problems
>
> https://wiki.apache.org/solr/ShawnHeisey#GC_Tuning_for_Solr
>
> The thing is, it says "If you are using the bin/solr or bin\solr
> script to start Solr, you already have GC tuning and won't need to worry
> about the recommendations here." I did check the scripts in the /bin
> directory, and GC tuning is indeed set there ("set GC_TUNE ...").
>
>
> I would appreciate if you can help with this issue.
>
>
> Best regards,
>
> Dragos


Re: Next Solr Release - 5.5.1 or 6.0 ?

2016-03-24 Thread Erick Erickson
There's always the IndexUpgrader, one could run the 5x version against
a 4x index and have a 5x-compatible index that would then be readable
by 6x OOB.
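
For reference, invoking it looks roughly like this (jar versions and the index
path are placeholders; the backward-codecs jar is needed to read 4.x segments):

  java -cp lucene-core-5.5.0.jar:lucene-backward-codecs-5.5.0.jar \
    org.apache.lucene.index.IndexUpgrader -delete-prior-commits \
    /var/solr/data/mycollection/index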

A bit convoluted to be sure.

Erick

On Thu, Mar 24, 2016 at 8:49 AM, Yonik Seeley  wrote:
> On Thu, Mar 24, 2016 at 11:45 AM, Yonik Seeley  wrote:
>>> I've been led to understand that 6.X (at least the Lucene part?) won't
>>> be backwards compatible with 4.X data. 5.5 at least works fine with data
>>> files from 4.7, for instance.
>
> It really doesn't seem like much changed at the lucene index-format
> level from 5 to 6...
> it makes one wonder how much work would be involved in allowing Lucene
> 6 to directly read a newer 4.x index... maybe it's just down to
> version strings in the index and not much else?
>
> -Yonik


Re: Next Solr Release - 5.5.1 or 6.0 ?

2016-03-24 Thread Yonik Seeley
On Thu, Mar 24, 2016 at 12:16 PM, Erick Erickson
 wrote:
> There's always the IndexUpgrader, one could run the 5x version against
> a 4x index and have a 5x-compatible index that would then be readable
> by 6x OOB.

This may be the last time that will work.  See the thread on the
changes to all the numeric types... it doesn't seem like the
IndexUpgrader will be able to handle that transition?

Not to mention the fact that Solr 6 is using deprecated Lucene 6
numeric types; if those are removed in Lucene 7, then what?

-Yonik


Re: SolrCloud: published host/port

2016-03-24 Thread Tomás Fernández Löbbe
I believe this can be done by setting the "host" and "hostPort" elements in
solr.xml. In the default solr.xml they are configured in a way to support
also setting them via System properties:

<str name="host">${host:}</str>
<int name="hostPort">${jetty.port:8983}</int>
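
One hedged way to decouple the published values from the real ones, assuming you
swap in your own property names (solr.publish.host and solr.publish.port are
made up here):

  <str name="host">${solr.publish.host:}</str>
  <int name="hostPort">${solr.publish.port:8983}</int>

  bin/solr start -c -z zk1:2181 -p 8983 \
    -Dsolr.publish.host=docker-host.example.com -Dsolr.publish.port=18983

Whether inter-node requests still route correctly in that setup is a separate
question (see Shawn's caveat elsewhere on this thread).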

Tomás

On Wed, Mar 23, 2016 at 11:26 PM, Hendrik Haddorp 
wrote:

> Hi,
>
> is it possible to instruct Solr to publish a different host/port into
> ZooKeeper than it is actually running on? This is required if the Solr
> node is not directly reachable on its port from outside due to a NAT
> setup or when running Solr as a Docker container with a mapped port.
>
> For what it's worth, ElasticSearch supports this, as documented here [1]:
> - transport.publish_port
> - transport.publish_host
>
> regards,
> Hendrik
>
> [1]
>
> https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-transport.html
>


Re: Performance potential for updating (reindexing) documents

2016-03-24 Thread Erick Erickson
Impossible to say if for no other reason than you haven't told us
how many physical machines this is spread over ;).

For the process you've outlined to work, all the fields are stored,
right? So why not use Atomic Updates? You still have to query
the docs.

About querying. If I'm reading this right, you'll form some query
like q=whatever_identifies_docs_that_should_get_values_X_Y_Z
then process each one of those. So, really, all you need here is
the ID of every doc that satisfies that clause. You should
consider the /export handler (Streaming Aggregation). It's
designed to return large result sets with minimal memory.

So the process I'm thinking of is this (and it assumes all your
fields are stored so Atomic updates work).

Use the CloudSolrStream for each query. As the stream
comes back, you get the IDs you need and use them
to do an atomic update that adds the relevant fields.

Note that when _adding_ fields, you can change the schema
to include the new fields on an existing collection. All that
means is that any new docs added can have these fields.

Now, if all the fields are _not_ stored at least once, you can't
use atomic updates and you'll have to re-index from the system
of record.
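
A sketch of the atomic-update side of that (zk hosts, collection, field names,
and values are all placeholders, and it assumes every field is stored):

  import java.util.Collections;
  import org.apache.solr.client.solrj.impl.CloudSolrClient;
  import org.apache.solr.common.SolrInputDocument;

  public class AtomicUpdateSketch {
    public static void main(String[] args) throws Exception {
      CloudSolrClient client = new CloudSolrClient("zk1:2181,zk2:2181,zk3:2181");
      client.setDefaultCollection("mycollection");

      // for each id streamed back by /export, send an atomic update that
      // sets the new fields without re-sending the rest of the document
      SolrInputDocument doc = new SolrInputDocument();
      doc.addField("id", "some-matched-id");
      doc.addField("new_field_1", Collections.singletonMap("set", "X"));
      doc.addField("new_field_2", Collections.singletonMap("set", "Y"));
      doc.addField("new_field_3", Collections.singletonMap("set", "Z"));
      client.add(doc);  // in practice, batch many of these per add() call

      client.commit();
      client.close();
    }
  }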

Best,
Erick

On Thu, Mar 24, 2016 at 7:18 AM, tedsolr  wrote:
> With a properly tuned solr cloud infrastructure and less than 1B total docs
> spread out over 50 collections where the largest collection is 100M docs,
> what is a reasonable target goal for entirely reindexing a single
> collection?
>
> I understand there are a lot of variables, so I'm hypothetically wiping them
> away by assuming "a properly tuned infrastructure". So the hardware, RAM,
> etc. is configured correctly (not so in my case).
>
> The scenario is to add 3 fields to all the existing docs in one collection.
> The fields are the same but the values vary based on the docs. So a search
> is performed and finds 100 matches - all 100 docs will get the same updates.
> Then another search is performed that matches 15000 docs, and these are
> updated. This continues 10-20,000 times until essentially all the docs have
> been updated.
>
> The docs all have 100 - 200 fields, mostly text and mostly small in size.
> What's the best possible throughput I can expect? 1000 docs/sec? 5000
> docs/sec?
>
> Using SolrJ for querying and indexing against a v5.2.1 cloud.
>


Re: Next Solr Release - 5.5.1 or 6.0 ?

2016-03-24 Thread Tomás Fernández Löbbe
> Not to mention the fact that Solr 6 is using deprecated Lucene 6
> numeric types; if those are removed in Lucene 7, then what?

I believe this is going to be an issue. We have SOLR-8396
<https://issues.apache.org/jira/browse/SOLR-8396> open, but it doesn't look
like it's going to make it to 6.0 (I tried to look at it but I didn't have
time the past weeks). We'll have to support it until Solr 8 I guess.

Tomás


Re: Next Solr Release - 5.5.1 or 6.0 ?

2016-03-24 Thread Yago Riveiro
I did the IndexUpgrader path to upgrade my 4.x index to 5.x (15 terabytes of
data and growing). It wasn't an easy task to do it without downtime;
IndexUpgrader doesn't work while the replica is loaded.

With 12T of data, re-indexing is a no-no operation (the time spent on the
re-index could run to several months).

Optimizing one replica at a time doesn't work (all replicas are optimized at
the same time), killing CPU and IO, and as a result the cluster.

Conclusion: if I need to do this again to upgrade to a newer version of Solr,
I'm literally in trouble ...

--
Yago Riveiro

> On Mar 24 2016, at 4:32 pm, Tomás Fernández Löbbe wrote:
>
>> Not to mention the fact that Solr 6 is using deprecated Lucene 6
>> numeric types; if those are removed in Lucene 7, then what?
>
> I believe this is going to be an issue. We have SOLR-8396
> <https://issues.apache.org/jira/browse/SOLR-8396> open, but it doesn't look
> like it's going to make it to 6.0 (I tried to look at it but I didn't have
> time the past weeks). We'll have to support it until Solr 8 I guess.
>
> Tomás



RE: Reload or Reload and Solr Restart

2016-03-24 Thread Matt Kuiper
Based on what I have read, it looks like only a collection reload is needed for 
the scenario below and for that matter for applying any modifications to the 
solrconfig.xml.
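
For reference, the reload in question is just the Collections API RELOAD call
(host and collection name are placeholders):

  curl 'http://localhost:8983/solr/admin/collections?action=RELOAD&name=mycollection'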

Matt

From: Matt Kuiper
Sent: Wednesday, March 23, 2016 10:26 AM
To: solr-user@lucene.apache.org
Subject: Reload or Reload and Solr Restart

Hi,

I am preparing for a Production install.  In this release we will be moving 
from an AutoSuggest feature based on the Suggestor component to one based on an 
Ngram approach.  We will perform a re-index of the source data.

After updating the Solr config for each collection a collection reload (via the 
Solr collection api) will be executed. My question is whether this reload will 
clear the memory used by the Suggestor component or if a Solr restart on each 
Solr node will be necessary to clear the in-memory structure that was 
previously used by the Suggestor component.

Thanks,
Matt



Re: Next Solr Release - 5.5.1 or 6.0 ?

2016-03-24 Thread Yonik Seeley
On Thu, Mar 24, 2016 at 12:32 PM, Tomás Fernández Löbbe
 wrote:
>>
>>
>> Not to mention the fact that Solr 6 is using deprecated Lucene 6
>> numeric types; if those are removed in Lucene 7, then what?
>>
> I believe this is going to be an issue. We have SOLR-8396
> <https://issues.apache.org/jira/browse/SOLR-8396> open, but it doesn't look
> like it's going to make it to 6.0 (I tried to look at it but I didn't have
> time the past weeks). We'll have to support it until Solr 8 I guess.

Even if it did make it for 6.0, it seems like someone couldn't upgrade
from 6->7 (future) without reindexing?
I don't think the IndexUpgrader tool is going to migrate from
(Trie)NumericField->PointField, right?

-Yonik


RE: Solr 5.5 Issue with CJK and mm being ignored when searching with white space.

2016-03-24 Thread Tiffany Goguen
Hi Shawn,

Thank you for the reply.  

I removed defaultOperator parameter from the schema.  I have the following in 
the request handler:

   <str name="defType">edismax</str>
   <str name="mm">100</str>

I reindexed content.

I am still seeing the same incorrect behavior.

mm=100 does not seem to be sticking: with
クイック リファレンス (space between ク and リ) I am still incorrectly getting 1 result.

Tiffany


-Original Message-
From: Shawn Heisey [mailto:apa...@elyograg.org] 
Sent: Wednesday, March 23, 2016 6:57 PM
To: solr-user@lucene.apache.org
Subject: Re: Solr 5.5 Issue with CJK and mm being ignored when searching with 
white space.

On 3/23/2016 8:21 AM, Tiffany Goguen wrote:
> Is this a new bug?
>
> I am using edismax and I have set mm=100% via <solrQueryParser defaultOperator="AND"/>

This config was deprecated in 4.x.  I actually thought support had been removed 
from 5.x already, but perhaps not.  The equivalent in modern configs is the 
q.op parameter ... but when using edismax, you should not use either q.op OR 
the defaultOperator parameter.  You should explicitly set mm, to 100% in this 
case.

The following bug applies to 5.5 and probably explains at least some of what 
you are seeing:

https://issues.apache.org/jira/browse/SOLR-8812

Thanks,
Shawn



Re: SyntaxError - Block Join Parent Query

2016-03-24 Thread Charles Sanders
I tried this on another machine with a clean index. I was able to get the query 
to work. Thank you. 

Couple of related questions. 
1) I was able to get this to work on a single shard machine. But I am not able 
to get this query to work on Solr with two shards (SolrCloud). Any reason why
this does not work with SolrCloud? 
2) The query pattern you supplied does not appear in the documentation. Do you 
know of any reason why the information in the documentation does not work and 
does not mention your pattern? 
https://cwiki.apache.org/confluence/display/solr/Other+Parsers#OtherParsers-BlockJoinQueryParsers
 

Thanks again. 


- Original Message -

From: "Mikhail Khludnev"  
To: "solr-user"  
Sent: Thursday, March 24, 2016 11:31:11 AM 
Subject: Re: SyntaxError - Block Join Parent Query 

I suggest to add debugQuery=true and fl=*,[child ...] doc transformer. And 
come back with response. 

On Thu, Mar 24, 2016 at 3:23 PM, Charles Sanders  
wrote: 

> Ah yes. Thank you. Made the correction and I do not get the SyntaxError. 
> However, it does not apply the child filter. The query should return only 
> TestParent4. But it is returning TestParent2, TestParent3 and TestParent4. 
> All of these meet the parent portion of the query (+blue). But only 
> TestParent4 should meet the child portion of the query. 
> 
> q=+blue +{!parent 
> which="documentKind:TestParent"v=$childq}&childq=portal_product:("red hat") 
> 
> 
> { 
> "responseHeader":{ 
> "status":0, 
> "QTime":15, 
> "params":{ 
> "indent":"true", 
> "q":" blue {!parent which=\"documentKind:TestParent\"v=$childq}", 
> "childq":"portal_product:(\"red hat\")", 
> "wt":"json"}}, 
> "response":{"numFound":2733,"start":0,"maxScore":3.0138793,"docs":[ 
> { 
> "documentKind":"TestParent", 
> "uri":"https://ping/pong/testparent4";, 
> "view_uri":"https://ping/pong/testparent4";, 
> "id":"TestParent4", 
> "allTitle":"blue", 
> "sortTitle":"blue", 
> "_version_":1529622873461751808, 
> "_root_":["https://ping/pong/testparent4";], 
> "timestamp":"2016-03-23T19:40:48.211Z", 
> "language":"en"}, 
> { 
> "documentKind":"TestParent", 
> "uri":"https://ping/pong/testparent3";, 
> "view_uri":"https://ping/pong/testparent3";, 
> "id":"TestParent3", 
> "allTitle":"blue", 
> "sortTitle":"blue", 
> "_version_":1529622308758487040, 
> "_root_":["https://ping/pong/testparent3";], 
> "timestamp":"2016-03-23T19:31:49.668Z", 
> "language":"en"}, 
> { 
> "documentKind":"TestParent", 
> "uri":"https://ping/pong/testparent2";, 
> "view_uri":"https://ping/pong/testparent2";, 
> "id":"TestParent2", 
> "allTitle":"blue", 
> "sortTitle":"blue", 
> "_version_":1529622293809987584, 
> "_root_":["https://ping/pong/testparent2";], 
> "timestamp":"2016-03-23T19:31:35.408Z", 
> "language":"en"} 
> 
> - Original Message - 
> 
> From: "Mikhail Khludnev"  
> To: "solr-user"  
> Sent: Wednesday, March 23, 2016 5:34:31 PM 
> Subject: Re: SyntaxError - Block Join Parent Query 
> 
> On Thu, Mar 24, 2016 at 12:16 AM, Charles Sanders  
> wrote: 
> 
> > Thanks for the quick reply. But I'm not sure I understand. Did I do 
> > something wrong?? 
> > 
> Absolutely:
> 'portal_product"red hat")'
> You omitted the colon and the opening parenthesis after the field name, didn't you?
> 
> 
> > 
> > 
> > 
> /select?q=+blue%20+{!parent%20which=%22documentKind:TestParent%22%20v=$childq}&childq=portal_product%22red%20hat%22)
>  
> > 
> > <response>
> >   <lst name="responseHeader">
> >     <int name="status">400</int>
> >     <int name="QTime">2</int>
> >     <lst name="params">
> >       <str name="q">blue {!parent which="documentKind:TestParent" v=$childq}</str>
> >       <str name="childq">portal_product"red hat")</str>
> >     </lst>
> >   </lst>
> >   <lst name="error">
> >     <str name="msg">org.apache.solr.search.SyntaxError: Cannot parse 'portal_product"red
> > hat")': Encountered " ")" ") "" at line 1, column 23. Was expecting one of:
> >  ...  ...  ... "+" ... "-" ...  ... "(" ...
> > "*" ... "^" ...  ...  ...  ...  ...
> >  ...  ... "[" ... "{" ...  ...  ...
> >     </str>
> >     <int name="code">400</int>
> >   </lst>
> > </response>
> > 
> > - Original Message - 
> > 
> > From: "Mikhail Khludnev"  
> > To: "solr-user"  
> > Sent: Wednesday, March 23, 2016 5:02:29 PM 
> > Subject: Re: SyntaxError - Block Join Parent Query 
> > 
> > On Wed, Mar 23, 2016 at 11:09 PM, Charles Sanders  
> > wrote: 
> > 
> > > I'm getting a SyntaxError which I do not understand when I execute a 
> > block 
> > > join parent query. I'm running Solr5.2.1, with 2 shards. The problem 
> > > appears to be in that portion of the query that filters the child 
> > document. 
> > > Any insight as to where I made the error is greatly appreciated. 
> > > 
> > > This query produces an error: 
> > > q=+blue +{!parent which="documentKind:TestParent"}portal_product:("red 
> > > hat") 
> > > -- should return TestParent4 
> > > 
> > q=+blue +{!parent which="documentKind:TestParent" 
> > v=$childq}&childq=portal_product:("red hat") 
> > 
> > 
> > > However, this query works: 
> > > q=+blue +{!parent which="documentKind:TestParent"}portal_product:rhel 
> > > -- should return TestParent2 
> > > 
> > > Sample data and schema information below: 
> > > { 
> > > "documen

Re: Performance potential for updating (reindexing) documents

2016-03-24 Thread tedsolr
Hi Erick,

My post was scant on details. The numbers I gave for collection sizes are
projections for the future. I am in the midst of an upgrade that will be
completed within a few weeks. My concern is that I may not be able to
produce the throughput necessary to index an entire collection quickly
enough (3 to 4 hours) for a large customer (100M docs).

Currently:
- single Solr instance on one host that is sharing memory and cpu with other
applications
- 4GB dedicated to Solr
~ 20M docs
~ 10GB index size
- using HttpSolrClient for all queries and updates

Very soon:
- two VMs dedicated to Solr (2 nodes)
- up to 16GB available memory
- running in cloud mode, and can now scale horizontally
- all collections are single sharded with 2 replicas

All fields are stored. The scenario I gave is using atomic updates. The
updates are done in large batches of 5000-1 docs. The use case I have is
different than most Solr setups perhaps. Indexing throughput is more
important than qps. We have very few concurrent users that do massive
amounts of doc updates. I am seeing lousy (production) performance currently
(not a surprise - long GC pauses), and have just begun the process of tuning
in a test environment.

After some more weeks of testing and tweaking I hope to get to 5000
updates/sec, but even that may not be enough. So my main concern is that
this business model (of updating entire collections about once a day) cannot
be supported by Solr.





[nesting] Any way to return the whole hierarchical structure when doing Block Join queries?

2016-03-24 Thread Alisa Z .
 Hi all, 

I apologize for duplicating my previous message: 
Solr 5.3:  anything similar to ChildDocTransformerFactory  that does not 
flatten the hierarchical structure?    

However, it is still an open and interesting question:  

Following the example from  https://dzone.com/articles/using-solr-49-new , 
let's say we are given multiple-level nested structure: 


<add>
  <doc>
    <field name="id">1</field>
    <field name="name">I am the parent</field>
    <field name="cat">PARENT</field>
    <doc>
      <field name="id">1.1</field>
      <field name="name">I am the 1st child</field>
      <field name="cat">CHILD</field>
    </doc>
    <doc>
      <field name="id">1.2</field>
      <field name="name">I am the 2nd child</field>
      <field name="cat">CHILD</field>
      <doc>
        <field name="id">1.2.1</field>
        <field name="name">I am a grandchildren</field>
        <field name="cat">GRANDCHILD</field>
      </doc>
    </doc>
  </doc>
</add>

Querying 
q={!parent which="cat:PARENT"}name:(I am +child)&fl=id,name,[child 
parentFilter=cat:PARENT]

will return a flattened structure, where cat:CHILD and cat:GRANDCHILD documents
end up on the same level:

<doc>
  <str name="id">1</str>
  <str name="name">I am the parent</str>
  <str name="cat">PARENT</str>
  <doc>
    <str name="id">1.1</str>
    <str name="name">I am the 1st child</str>
    <str name="cat">CHILD</str>
  </doc>
  <doc>
    <str name="id">1.2</str>
    <str name="name">I am the 2nd child</str>
    <str name="cat">CHILD</str>
  </doc>
  <doc>
    <str name="id">1.2.1</str>
    <str name="name">I am a grandchildren</str>
    <str name="cat">GRANDCHILD</str>
  </doc>
</doc>
Indeed, the Javadocs for ChildDocTransformerFactory say: "This
transformer returns all descendants of each parent document in a flat list 
nested inside the parent document". 

Yet is there any way to preserve the hierarchy in the response? I really need 
to find the way to preserve the structure in the response.  

Thank you in advance! 

-- 
Alisa Zhila
--


Overriding SolrCloud Leader Election and manually assign leadership?-Is it possible?

2016-03-24 Thread ram
Hello,
  We have a setup where we have a 5 server cluster of which 3 are cloud
boxes and 2 are physical boxes. We have external zookeeper setup for the
same.The physical boxes have more capacity and in the past,we have seen
whenever the one of the boxes is leader in solrcloud,the performance seems
to be really good. However the leader election changes from time to time and
most of the time the cloud boxes seem to process most of the traffic
  Currently our solrcloud looks something like this
  Physical Box 1
X ->shard 1  Cloud 1
 - Cloud 2(Leader)
 - Physical Box 2
 -- Cloud 3
   
   Physical Box 1
 ->shard 1  Cloud 1
 - Cloud 2(Leader)
 - Physical Box 2
 -- Cloud 3

 Physical Box 1
 ->shard 1  Cloud 1
 - Cloud 2(Leader)
 - Physical Box 2
 -- Cloud 3


We are looking for a way to assign leadership to one of the physical box
always and if possible distribute the traffic only between Physical Box 1
and Physical Box 2. 

Is it possible to manually assign leadership? Would appreciate your inputs






Re: Next Solr Release - 5.5.1 or 6.0 ?

2016-03-24 Thread Jack Krupansky
Thanks, Erick, I had forgotten about that. I did find one short reference
to it in the doc: "Be sure to run the Lucene IndexUpgrader included with
Solr 4.10 if you might still have old 3x formatted segments in your index.
Alternatively: fully optimize your index with Solr 4.10 to make sure it
consists only of one up-to-date index segment."

See:
https://cwiki.apache.org/confluence/display/solr/Major+Changes+from+Solr+4+to+Solr+5

Note to doc guys and committers: That section needs to be replaced with
"Major Changes from Solr 5 to Solr 6".

Also, that IU reference doesn't link to any doc, even the Lucene Javadoc:
https://lucene.apache.org/core/5_5_0/core/org/apache/lucene/index/IndexUpgrader.html

Feels like there should be some Solr doc as well. For example, can Solr be
running, or does it (each node if SolrCloud) need to be shut down first.
And note that it's needed for each collection. Presumably the collections
can be upgraded in parallel since they are distinct directories. It would
be nice to have a SolrIndexUpgrader to run in one shot and discover and
upgrade all Solr collections.

-- Jack Krupansky

On Thu, Mar 24, 2016 at 12:16 PM, Erick Erickson 
wrote:

> There's always the IndexUpgrader, one could run the 5x version against
> a 4x index and have a 5x-compatible index that would then be readable
> by 6x OOB.
>
> A bit convoluted to be sure.
>
> Erick
>
> On Thu, Mar 24, 2016 at 8:49 AM, Yonik Seeley  wrote:
> > On Thu, Mar 24, 2016 at 11:45 AM, Yonik Seeley 
> wrote:
> >>> I've been led to understand that 6.X (at least the Lucene part?) won't
> >>> be backwards compatible with 4.X data. 5.5 at least works fine with
> data
> >>> files from 4.7, for instance.
> >
> > It really doesn't seem like much changed at the lucene index-format
> > level from 5 to 6...
> > it makes one wonder how much work would be involved in allowing Lucene
> > 6 to directly read a newer 4.x index... maybe it's just down to
> > version strings in the index and not much else?
> >
> > -Yonik
>


Re: Overriding SolrCloud Leader Election and manually assign leadership?-Is it possible?

2016-03-24 Thread Erick Erickson
First of all, for a cluster this size the additional work a leader
does is so small I suspect you'd have a hard time measuring any
performance difference. Personally I wouldn't worry about it. If you
insist, you can look at the collections API call REBALANCELEADERS (you
have to assigned the preferredLeader property). HOWEVER, that
functionality was put in for situations in which 100s of leaders could
be on a single box. At your scale it's highly unlikely to make any
difference so I'd imagine you'd get a lot greater return for effort by
concentrating your efforts somewhere else...
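
If you do go that route, the sequence is roughly this (collection, shard, and
replica names are placeholders):

  curl 'http://localhost:8983/solr/admin/collections?action=ADDREPLICAPROP&collection=mycoll&shard=shard1&replica=core_node1&property=preferredLeader&property.value=true'
  curl 'http://localhost:8983/solr/admin/collections?action=REBALANCELEADERS&collection=mycoll'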

But step back and think about "manually assign leadership". There's no
way to absolutely fix leadership, and you wouldn't want to. Without a
leader, you cannot index documents. So fixing leadership on a
particular node would mean you'd take all the HA/DR away from
SolrCloud, A Very Bad Thing.


bq:  ...if possible distribute the traffic only between Physical Box 1
and Physical Box 2...

I have no idea what this means. Solr will randomly distribute queries
across all nodes in the collection. If you don't want to use some box,
don't put Solr instances on it.

Best,
Erick

On Thu, Mar 24, 2016 at 11:53 AM, ram  wrote:
> Hello,
>   We have a setup where we have a 5 server cluster of which 3 are cloud
> boxes and 2 are physical boxes. We have external zookeeper setup for the
> same.The physical boxes have more capacity and in the past,we have seen
> whenever the one of the boxes is leader in solrcloud,the performance seems
> to be really good. However the leader election changes from time to time and
> most of the time the cloud boxes seem to process most of the traffic
>   Currently our solrcloud looks something like this
>   Physical Box 1
> X ->shard 1  Cloud 1
>  - Cloud 2(Leader)
>  - Physical Box 2
>  -- Cloud 3
>
>    Physical Box 1
>  ->shard 1  Cloud 1
>  - Cloud 2(Leader)
>  - Physical Box 2
>  -- Cloud 3
>
>  Physical Box 1
>  ->shard 1  Cloud 1
>  - Cloud 2(Leader)
>  - Physical Box 2
>  -- Cloud 3
>
>
> We are looking for a way to assign leadership to one of the physical box
> always and if possible distribute the traffic only between Physical Box 1
> and Physical Box 2.
>
> Is it possible to manually assign leadership? Would appreciate your inputs
>
>
>
>


Re: Performance potential for updating (reindexing) documents

2016-03-24 Thread Erick Erickson
Well, for comparison I routinely get 20K docs/second on my Mac Pro
indexing Wikipedia docs. I _think_ I have 4 shards when I do this, all
in the same JVM. I'd be surprised if you can't get your 5K docs/sec,
but you may indeed need more shards.

All that said, 4G  for the JVM is kind of constrained, you already
mentioned GC. There are two pitfalls here:
1> allocating too little memory and spending lots of cycles doing very
small GCs. At 4G this is likelier than:
2> having very large heaps and seeing "stop the world" GC pauses.

So I think you're on the right track looking at memory, at least
that's what I'd be looking at first.

Note: your indexing scaling (assuming you're sending complete docs not
atomic updates) will scale better if you
1> use CloudSolrClient from Java since it routes docs to the right
leader first and avoids an extra hop.
2> batch updates. Sending one doc at a time makes things very slow, see:

https://lucidworks.com/blog/2015/10/05/really-batch-updates-solr-2/
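
A sketch of both points together (zk hosts, collection, and field names are
placeholders):

  import java.util.ArrayList;
  import java.util.List;
  import org.apache.solr.client.solrj.impl.CloudSolrClient;
  import org.apache.solr.common.SolrInputDocument;

  public class BatchIndexSketch {
    public static void main(String[] args) throws Exception {
      CloudSolrClient client = new CloudSolrClient("zk1:2181,zk2:2181,zk3:2181");
      client.setDefaultCollection("mycollection");

      List<SolrInputDocument> batch = new ArrayList<>();
      for (int i = 0; i < 100000; i++) {
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "doc-" + i);
        doc.addField("title", "title " + i);
        batch.add(doc);
        if (batch.size() == 1000) {   // roughly 1000 docs per request
          client.add(batch);          // routed to the correct leaders
          batch.clear();
        }
      }
      if (!batch.isEmpty()) {
        client.add(batch);
      }
      client.commit();
      client.close();
    }
  }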

Best,
Erick

On Thu, Mar 24, 2016 at 10:57 AM, tedsolr  wrote:
> Hi Erick,
>
> My post was scant on details. The numbers I gave for collection sizes are
> projections for the future. I am in the midst of an upgrade that will be
> completed within a few weeks. My concern is that I may not be able to
> produce the throughput necessary to index an entire collection quickly
> enough (3 to 4 hours) for a large customer (100M docs).
>
> Currently:
> - single Solr instance on one host that is sharing memory and cpu with other
> applications
> - 4GB dedicated to Solr
> ~ 20M docs
> ~ 10GB index size
> - using HttpSolrClient for all queries and updates
>
> Very soon:
> - two VMs dedicated to Solr (2 nodes)
> - up to 16GB available memory
> - running in cloud mode, and can now scale horizontally
> - all collections are single sharded with 2 replicas
>
> All fields are stored. The scenario I gave is using atomic updates. The
> updates are done in large batches of 5000-1 docs. The use case I have is
> different than most Solr setups perhaps. Indexing throughput is more
> important than qps. We have very few concurrent users that do massive
> amounts of doc updates. I am seeing lousy (production) performance currently
> (not a surprise - long GC pauses), and have just begun the process of tuning
> in a test environment.
>
> After some more weeks of testing and tweaking I hope to get to 5000
> updates/sec, but even that may not be enough. So my main concern is that
> this business model (of updating entire collections about once a day) cannot
> be supported by Solr.
>
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Performance-potential-for-updating-reindexing-documents-tp4265861p4265922.html
> Sent from the Solr - User mailing list archive at Nabble.com.


Re: [nesting] Any way to return the whole hierarchical structure when doing Block Join queries?

2016-03-24 Thread Mikhail Khludnev
I think you can already kick the tires and contribute a test case to
https://issues.apache.org/jira/browse/SOLR-8208; that's already reachable
there, I believe, but I am still working on the core design.

On Thu, Mar 24, 2016 at 10:02 PM, Alisa Z.  wrote:

>  Hi all,
>
> I apologize for duplicating my previous message:
> Solr 5.3:  anything similar to ChildDocTransformerFactory  that does not
> flatten the hierarchical structure?
>
> However, it is still an open and interesting question:
>
> Following the example from  https://dzone.com/articles/using-solr-49-new
> , let's say we are given multiple-level nested structure:
>
> <doc>
>   <field name="id">1</field>
>   <field name="name">I am the parent</field>
>   <field name="cat">PARENT</field>
>   <doc>
>     <field name="id">1.1</field>
>     <field name="name">I am the 1st child</field>
>     <field name="cat">CHILD</field>
>   </doc>
>   <doc>
>     <field name="id">1.2</field>
>     <field name="name">I am the 2nd child</field>
>     <field name="cat">CHILD</field>
>     <doc>
>       <field name="id">1.2.1</field>
>       <field name="name">I am a grandchildren</field>
>       <field name="cat">GRANDCHILD</field>
>     </doc>
>   </doc>
> </doc>
>
>
> Querying
> q={!parent which="cat:PARENT"}name:(I am +child)&fl=id,name,[child
> parentFilter=cat:PARENT]
>
> will return a flattened structure, where cat:CHILD and cat:GRANDCHILD
> documents end up on the same level:
> <doc>
>   <field name="id">1</field>
>   <field name="name">I am the parent</field>
>   <field name="cat">PARENT</field>
>   <doc>
>     <field name="id">1.1</field>
>     <field name="name">I am the 1st child</field>
>     <field name="cat">CHILD</field>
>   </doc>
>   <doc>
>     <field name="id">1.2</field>
>     <field name="name">I am the 2nd child</field>
>     <field name="cat">CHILD</field>
>   </doc>
>   <doc>
>     <field name="id">1.2.1</field>
>     <field name="name">I am a grandchildren</field>
>     <field name="cat">GRANDCHILD</field>
>   </doc>
> </doc>
>  Indeed, the Javadocs for ChildDocTransformerFactory say: "This
> transformer returns all descendants of each parent document in a flat list
> nested inside the parent document".
>
> Yet is there any way to preserve the hierarchy in the response? I really
> need to find a way to preserve the structure in the response.
>
> Thank you in advance!
>
> --
> Alisa Zhila
> --
>



-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics





Re: SyntaxError - Block Join Parent Query

2016-03-24 Thread Mikhail Khludnev
On Thu, Mar 24, 2016 at 8:31 PM, Charles Sanders 
wrote:

> I tried this on another machine with a clean index. I was able to get the
> query to work. Thank you.
>
> Couple of related questions.
> 1) I was able to get this to work on a single shard machine. But I am not
> able to get this query to work on Solr with two shards (SorlCloud). Any
> reason why this does not work with SolrCloud?
>

I can hardly imagine why. Show me your debugQuery=true output.


> 2) The query pattern you supplied does not appear in the documentation. Do
> you know of any reason why the information in the documentation does not
> work and does not mention your pattern?
>
> https://cwiki.apache.org/confluence/display/solr/Other+Parsers#OtherParsers-BlockJoinQueryParsers

Nobody has written it there yet, but anybody can; it has been written up in
some other places:
http://blog.griddynamics.com/2013/12/grandchildren-and-siblings-with-block.html?q=block+join
https://cwiki.apache.org/confluence/display/solr/Local+Parameters+in+Queries
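
For what it's worth, when {!parent} is one clause among others (as with the
+blue clause later in this thread), the child query has to be passed by
reference (v=$childq) or wrapped in the _query_ magic field; a hedged sketch
of the latter, reusing this thread's field names:

  q=+blue +_query_:"{!parent which='documentKind:TestParent'}portal_product:(\"red hat\")"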


>
>
> Thanks again.
>
>
> - Original Message -
>
> From: "Mikhail Khludnev" 
> To: "solr-user" 
> Sent: Thursday, March 24, 2016 11:31:11 AM
> Subject: Re: SyntaxError - Block Join Parent Query
>
> I suggest to add debugQuery=true and fl=*,[child ...] doc transformer. And
> come back with response.
>
> On Thu, Mar 24, 2016 at 3:23 PM, Charles Sanders 
> wrote:
>
> > Ah yes. Thank you. Made the correction and I do not get the SyntaxError.
> > However, it does not apply the child filter. The query should return only
> > TestParent4. But it is returning TestParent2, TestParent3 and
> TestParent4.
> > All of these meet the parent portion of the query (+blue). But only
> > TestParent4 should meet the child portion of the query.
> >
> > q=+blue +{!parent
> > which="documentKind:TestParent"v=$childq}&childq=portal_product:("red
> hat")
> >
> >
> > {
> > "responseHeader":{
> > "status":0,
> > "QTime":15,
> > "params":{
> > "indent":"true",
> > "q":" blue {!parent which=\"documentKind:TestParent\"v=$childq}",
> > "childq":"portal_product:(\"red hat\")",
> > "wt":"json"}},
> > "response":{"numFound":2733,"start":0,"maxScore":3.0138793,"docs":[
> > {
> > "documentKind":"TestParent",
> > "uri":"https://ping/pong/testparent4";,
> > "view_uri":"https://ping/pong/testparent4";,
> > "id":"TestParent4",
> > "allTitle":"blue",
> > "sortTitle":"blue",
> > "_version_":1529622873461751808,
> > "_root_":["https://ping/pong/testparent4";],
> > "timestamp":"2016-03-23T19:40:48.211Z",
> > "language":"en"},
> > {
> > "documentKind":"TestParent",
> > "uri":"https://ping/pong/testparent3";,
> > "view_uri":"https://ping/pong/testparent3";,
> > "id":"TestParent3",
> > "allTitle":"blue",
> > "sortTitle":"blue",
> > "_version_":1529622308758487040,
> > "_root_":["https://ping/pong/testparent3";],
> > "timestamp":"2016-03-23T19:31:49.668Z",
> > "language":"en"},
> > {
> > "documentKind":"TestParent",
> > "uri":"https://ping/pong/testparent2";,
> > "view_uri":"https://ping/pong/testparent2";,
> > "id":"TestParent2",
> > "allTitle":"blue",
> > "sortTitle":"blue",
> > "_version_":1529622293809987584,
> > "_root_":["https://ping/pong/testparent2";],
> > "timestamp":"2016-03-23T19:31:35.408Z",
> > "language":"en"}
> >
> > - Original Message -
> >
> > From: "Mikhail Khludnev" 
> > To: "solr-user" 
> > Sent: Wednesday, March 23, 2016 5:34:31 PM
> > Subject: Re: SyntaxError - Block Join Parent Query
> >
> > On Thu, Mar 24, 2016 at 12:16 AM, Charles Sanders 
> > wrote:
> >
> > > Thanks for the quick reply. But I'm not sure I understand. Did I do
> > > something wrong??
> > >
> > Absolutely:
> > 'portal_product"red hat")'
> > You omitted a colon and an opening bracket after the field name, didn't you?
> >
> >
> > >
> > >
> > >
> >
> /select?q=+blue%20+{!parent%20which=%22documentKind:TestParent%22%20v=$childq}&childq=portal_product%22red%20hat%22)
> > >
> > > <response>
> > >   <lst name="responseHeader">
> > >     <int name="status">400</int>
> > >     <int name="QTime">2</int>
> > >     <lst name="params">
> > >       <str name="q">blue {!parent which="documentKind:TestParent" v=$childq}</str>
> > >       <str name="childq">portal_product"red hat")</str>
> > >     </lst>
> > >   </lst>
> > >   <lst name="error">
> > >     <str name="msg">org.apache.solr.search.SyntaxError: Cannot parse
> > > 'portal_product"red hat")': Encountered " ")" ") "" at line 1, column 23.
> > > Was expecting one of: <AND> ... <OR> ... <NOT> ... "+" ... "-" ...
> > > <BAREOPER> ... "(" ... "*" ... "^" ... <QUOTED> ... <TERM> ...
> > > <FUZZY_SLOP> ... <PREFIXTERM> ... <WILDTERM> ... <REGEXPTERM> ...
> > > "[" ... "{" ... <LPARAMS> ... <NUMBER> ...</str>
> > >     <int name="code">400</int>
> > >   </lst>
> > > </response>
> > >
> > > - Original Message -
> > >
> > > From: "Mikhail Khludnev" 
> > > To: "solr-user" 
> > > Sent: Wednesday, March 23, 2016 5:02:29 PM
> > > Subject: Re: SyntaxError - Block Join Parent Query
> > >
> > > On Wed, Mar 23, 2016 at 11:09 PM, Charles Sanders  >
> > > wrote:
> > >
> > > > I'm getting a SyntaxError which I do not understand when I execute a
> > > block
> > > > join parent query. I'm running Solr5.2.1, with 2 shards. The problem
> > > > appears to be in that portion of the query that filters the child
> > > document.
> > > > Any insight as to where I made the error is greatly appreciated.

Re: Solr 5 and JDK 8 - awful performance

2016-03-24 Thread Mikhail Khludnev
Dragos, I wonder if you have a ScriptTransformer in your config?
Just a clue: the Solr Admin UI has a Threads tab; sometimes it's possible to
diagnose a severe performance problem just by observing deep stacks there.
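
For reference, a ScriptTransformer sits in data-config.xml roughly like the
sketch below (the entity, SQL, and function names are made up for
illustration). Since JDK 8 replaced the embedded JavaScript engine (Rhino)
with Nashorn, a per-row script is one plausible place for this kind of
slowdown:

  <dataConfig>
    <dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver"
                url="jdbc:mysql://localhost/db" user="u" password="p"/>
    <!-- script functions run once per fetched row -->
    <script><![CDATA[
      function enrichRow(row) {
        row.put('name_upper', String(row.get('name')).toUpperCase());
        return row;
      }
    ]]></script>
    <document>
      <entity name="item" query="SELECT id, name FROM item"
              transformer="script:enrichRow"/>
    </document>
  </dataConfig>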

On Thu, Mar 24, 2016 at 6:54 PM, Dragos Vizireanu 
wrote:

> Hi,
>
> I have a big problem with the performance of running Solr 5 with JDK 8.
>
> Details:
> - tried with both Solr 5.4.0 and Solr 5.5.0  (even with Solr 4)
> - default Solr 5 configuration
> - created a new core, for which I am using data import handler to get data
> from MySQL
>
>
> When I am trying to index data using the import handler, the import is very
> fast with JDK 7 and awfully slow with JDK 8! Here are the results copied
> from Solr gui:
>
> - *JDK 7  - Duration 17 seconds*
>
> Indexing completed. Added/Updated: 5,997 documents. Deleted 0
> documents. *(Duration:
> 17s)*
>
> Requests: 6 (0/s), Fetched: 235,593 (13,858/s), Skipped: 0, Processed:
> 5,997 (353/s)
>
>
>
> - *JDK 8 - Duration 7 minutes*
>
> Indexing completed. Added/Updated: 5,997 documents. Deleted 0
> documents. *(Duration:
> 7m 06s)*
> Requests: 6 (0/s), Fetched: 47,160 (111/s), Skipped: 0, Processed: 5,997
>
>
> As you can see, there is a problem with the performance being awful with
> JDK 8. While calling DIH to index, I got no errors in the log, but the
> indexing just didn't do anything for some minutes and then started again.
> This is very strange; I could not find the reason why this is happening.
>
>
> I also found these ideas on Solr Wiki:
>
> https://wiki.apache.org/solr/SolrPerformanceProblems -> GC pause problems
>
> https://wiki.apache.org/solr/ShawnHeisey#GC_Tuning_for_Solr
>
> The thing is that it says "If you are using the bin/solr or bin\solr
> script to start Solr, you already have GC tuning and won't need to worry
> about the recommendations here." I did check the scripts in the /bin
> directory, and GC tuning ("set GC_TUNE ...") is indeed set there.
>
>
> I would appreciate if you can help with this issue.
>
>
> Best regards,
>
> Dragos
>



-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics





Re: No live SolrServers available to handle this request

2016-03-24 Thread Elaine Cario
Anil,

I've seen situations where, if there was a problem with a specific query
and every shard responded with the same error, the actual exception got
hidden by a "No live SolrServers..." exception. We originally saw this with
wildcard queries: every shard reported a "too many expansions..." type
error, but the exception in the response was the "No live SolrServers..."
error.

You mention that you are using collapse/expand, and that you have shards -
that could possibly cause some issue, as I think collapse and expand only
work correctly if the data for any particular collapse value resides on one
shard.
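
If co-location is the issue, one hedged option (the collection and field
names below are placeholders) is the compositeId router's shard-key prefix
in the uniqueKey, which sends every document sharing the prefix to the same
shard:

  # all docs with the "groupA!" prefix land on one shard
  curl 'http://localhost:8983/solr/mycollection/update?commit=true' \
    -H 'Content-Type: application/json' -d '
  [{"id":"groupA!doc1","group_s":"groupA"},
   {"id":"groupA!doc2","group_s":"groupA"}]'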

On Sat, Mar 19, 2016 at 1:04 PM, Shawn Heisey  wrote:

> On 3/18/2016 9:55 PM, Anil wrote:
> > Thanks for your response.
> > CDH is a Cloudera (third party) distribution. is there any to get the
> > notifications copy of it when cluster state changed ? in logs ?
> >
> > I can assume that the exception is result of no availability of replicas
> > only. Agree?
>
> Yes, I think that Solr believes there are no replicas for at least one
> shard.  As for why it believes that, I cannot say.
>
> If Solr logged every single thing that happened where zookeeper (or even
> just the clusterstate) is involved, you'd be drowning in logs.  Much
> more than already happens.  The logfile is already very verbose.
>
> Chances are that at least one of your Solr nodes *did* log something
> related to a problem with that collection before you got the error
> you're asking about.
>
> The "No live SolrServers" error is one that people are seeing quite
> frequently.  There may be some instances where Solr isn't behaving
> correctly, but I think when this happens, it usually indicates there's a
> real problem of some kind.
>
> To troubleshoot, we'll need to see any errors or warnings you find in
> your Solr logfiles from the time before you get an error on a request.
> You'll need to check the logfile on all Solr nodes.
>
> It might be a good idea to also involve Cloudera support, see what they
> think.
>
> Thanks,
> Shawn
>
>


Use default field, if more specific field does not exist

2016-03-24 Thread Georg Sorst
Hi list,

we use Solr to search ecommerce products.

Items have a default price which can be overwritten per user. So when
searching we have to return the user price if it is set, otherwise the
default price. Same goes for building facets and when filtering by price.

What's the best way to achieve this in Solr? We know the user ID when
sending the request to Solr so we could do something like this:

* Add the default price in the field "price" to the items
* Add all the user prices in a field like "price_<userid>"

Now for displaying the correct price this is fine: just check whether there
is a "price_<userid>" field on the result item, and otherwise display the
value of the "price" field.

The tricky part is faceting and filtering. Which field do we use?
"price_<userid>"? What happens for users that don't have a user price set
for an item? In this case the "price_<userid>" field does not exist, so
faceting and filtering will not work.

We thought about adding a "price_<userid>" field for every item and every
user, filling in the item's default price if the user does not have an
overridden price for that item. This would potentially make our index
unnecessarily large. Consider 10,000 items and 10,000 users (quite
realistic): that's 100,000,000 "price_<userid>" fields, even though maybe
only a few users have overridden prices.

What I've been (unsuccessfully) looking for is some sort of field fallback
where I can tell Solr something like "use price_<userid> for the results,
facets and filter queries, and if that does not exist for an item, use
price instead". At first sight field aliases seemed like that, but it turns
out they just rename the field in the result items.
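
Not from this thread, but one hedged approximation of such a fallback for
sorting and filtering is the def() function query, which returns its second
argument when the first is missing ("price_1234" below is a hypothetical
per-user field name):

  # sort by the user price, falling back to the default price
  &sort=def(price_1234,price) asc

  # filter on the effective price with a function range query
  &fq={!frange l=10 u=50}def(price_1234,price)

As far as I know, range faceting cannot run over a function directly, so the
facet side would still need another approach (e.g. one frange facet.query
per bucket).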

So, is there something like this or is there a better solution anyway?

Thanks,
Georg
-- 
*Georg M. Sorst I CTO*
FINDOLOGIC GmbH



Jakob-Haringer-Str. 5a | 5020 Salzburg I T.: +43 662 456708
E.: g.so...@findologic.com
www.findologic.com


Re: Indexing multiple pdf's and partial update of pdf

2016-03-24 Thread Alexandre Rafalovitch
An approach that comes to mind is to use DataImportHandler with the PDF
parsing in an inner entity definition while the indexed entity sits at
the parent level. The main issue is how to ensure that Tika output from one
PDF does not map to the same fields as output from the second one. Maybe
give them different prefixes. Then you might be able to do it either with
an UpdateRequestProcessor or with copyFields from those two prefixes into
a common field, ignoring the source prefixes.

Disclaimer: not tested.
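
Separately, on the partial-update half of the question: atomic updates do
work over plain curl as long as all fields are stored; a hedged sketch,
where "extracted_text_txt" is a hypothetical stored multiValued field
holding the second PDF's extractOnly output:

  curl 'http://localhost:8983/solr/techproducts/update?commit=true' \
    -H 'Content-Type: application/json' -d '
  [{"id":"doc1",
    "extracted_text_txt":{"add":"text pulled from the second pdf"}}]'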

Regards,
   Alex.

Newsletter and resources for Solr beginners and intermediates:
http://www.solr-start.com/


On 25 March 2016 at 01:39, Jay Parashar  wrote:
>
>
> Thanks Reth,
>
>
>
> Yes I am using Apache Tika and went by the instructions given in
>
> https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Solr+Cell+using+Apache+Tika
>
>
>
> Here I see we can index a pdf "solr-word.pdf" to a document with unique key
> = "doc1" as below:
>
>
>
> curl 
> 'http://localhost:8983/solr/techproducts/update/extract?literal.id=doc1&commit=true'
>  -F 
> "myfile=@example/exampledocs/solr-word.pdf"
>
>
>
> My requirement is to index another separate pdf to this document with key = 
> doc1. Basically I need the contents of both pdfs to be searchable and related 
> to the id=doc1.
>
>
>
> What comes to my mind is to perform an 'extractOnly' as below on both pdf's 
> and then index the concatenation of the contents. Is there another less 
> invasive way?
>
>
>
> curl 
> "http://localhost:8983/solr/techproducts/update/extract?&extractOnly=true"; 
> --data-binary @example/exampledocs/sample.html -H 'Content-type:text/html'
>
>
>
> Thanks
>
> Jay
>
>
>
> -Original Message-
> From: Reth RM [mailto:reth.ik...@gmail.com]
> Sent: Thursday, March 24, 2016 12:24 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Indexing multiple pdf's and partial update of pdf
>
>
>
> Are you using apache tika parser to parse pdf files?
>
>
>
> 1) Solr supports parent-child block join, using which you can index more
> than one file's data within a document object (if that is what you are looking for):
> https://cwiki.apache.org/confluence/display/solr/Other+Parsers#OtherParsers-BlockJoinQueryParsers
>
>
>
> 2) If the unique key of the document that exists in the index is equal to
> that of the new document you are reindexing, it will be overwritten. If
> you'd like to do partial updates via curl, here are some examples:
>
> http://yonik.com/solr/atomic-updates/
>
>
>
>
>
>
>
>
>
>
>
> On Thu, Mar 24, 2016 at 3:43 AM, Jay Parashar <bparas...@slb.com> wrote:
>
>
>
>> Hi,
>
>>
>
>> I have couple of questions regarding indexing files (say pdf).
>
>>
>
>> 1)  Is there any way to index more than one file to one document with
>
>> a unique id?
>
>>
>
>> One way I think is to do a “extractOnly” of all the documents and then
>
>> index that extract separately. Is there an easier way?
>
>>
>
>> 2)  If my Solr document has existing fields populated and then I index
>
>> a pdf, it seems it overwrites the document with the end result being
>
>> just the contents of the pdf. I know we can do partial updates using
>
>> SolrJ but is it possible to do partial updates of pdf using curl?
>
>>
>
>>
>
>> Thanks
>
>> Jay
>
>>


Re: how to update billions of docs

2016-03-24 Thread Mohsin Beg Beg

An update on how I ended up implementing the requirement, in case it helps
others. There is a lot of other code I did not include, but the general
logic is below.

While performance is still not great, it is 10x faster than atomic updates
(because RealTimeGetComponent.getInputDocument() is not needed).


1. Wrote an update handler
   /myupdater?q=*:* & sort=fieldx desc & fl=fieldx, fieldy & 
stream.file=exampledocs/oldvalueToNewValue.properties & update.chain=myprocessor


2. In the handler read the map from content stream and invoke the export 
handler for the query params
   SolrRequestHandler handler = core.getRequestHandler("/export");
   core.execute(handler, req, rsp);
   numFound = (Integer) req.getContext().get("totalHits");
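
Filling in steps 1 and 2, a hedged self-contained sketch (class, handler,
and response-key names are placeholders, not the poster's actual code):

  import java.io.InputStream;
  import java.util.Properties;
  import org.apache.solr.common.util.ContentStream;
  import org.apache.solr.handler.RequestHandlerBase;
  import org.apache.solr.request.SolrQueryRequest;
  import org.apache.solr.request.SolrRequestHandler;
  import org.apache.solr.response.SolrQueryResponse;

  public class BulkRewriteHandler extends RequestHandlerBase {

    @Override
    public void handleRequestBody(SolrQueryRequest req, SolrQueryResponse rsp)
        throws Exception {
      // 1. read the oldValue=newValue pairs sent via stream.file=...
      Properties valueMap = new Properties();
      if (req.getContentStreams() != null) {
        for (ContentStream cs : req.getContentStreams()) {
          try (InputStream in = cs.getStream()) {
            valueMap.load(in);
          }
        }
      }

      // 2. run the q/sort/fl params through the /export handler
      SolrRequestHandler export = req.getCore().getRequestHandler("/export");
      req.getCore().execute(export, req, rsp);
      Integer numFound = (Integer) req.getContext().get("totalHits");
      rsp.add("rewriteCandidates", numFound == null ? 0 : numFound);

      // 3. iterate the index leaves and rewrite values per valueMap (below)
    }

    @Override
    public String getDescription() {
      return "bulk field-value rewriter (sketch)";
    }
  }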


3. Iterate over the /export handler response, similar to the
SortingResponseWriter.write() method

   List<LeafReaderContext> leaves = req.getSearcher().getTopReaderContext().leaves();
   for (int i = 0; i < leaves.size(); i++) { ... }

On ..., Ishan wrote:

> Hi Mohsin,
> There's some work in progress for in-place updates to docValued fields,
> https://issues.apache.org/jira/browse/SOLR-5944. Can you try the latest
> patch there (or ping me if you need a git branch)?
> It would be nice to know how fast the updates go for your usecase with that
> patch. Please note that for that patch, both the version field and the
> updated field needs to have stored=false, indexed=false, docValues=true.
> Regards,
> Ishan
>
>
> On Thu, Mar 17, 2016 at 10:55 PM, Jack Krupansky  >
> wrote:
>
> > It would be nice to have a wiki/doc for "Bulk Field Update" that listed
> all
> > of these techniques and tricks.
> >
> > And, of course, it would be so much better to have an explicit Lucene
> > feature for this. It could work in the background like merge and process
> > one segment at a time as efficiently as possible.
> >
> > Have several modes:
> >
> > 1. Set a field of all documents to explicit value.
> > 2. Set a field of query documents to an explicit value.
> > 3. Increment by n.
> > 4. Add new field to all document, or maybe by query.
> > 5. Delete existing field for all documents.
> > 6. Delete field value for all documents or a specified query.
> >
> >
> > -- Jack Krupansky
> >
> > On Thu, Mar 17, 2016 at 12:31 PM, Ken Krugler <
> kkrugler_li...@transpac.com
> > >
> > wrote:
> >
> > > As others noted, currently updating a field means deleting and
> inserting
> > > the entire document.
> > >
> > > Depending on how you use the field, you might be able to create another
> > > core/container with that one field (plus the key field), and use join
> > > support.
> > >
> > > Note that https://issues.apache.org/jira/browse/LUCENE-6352 is an
> > > improvement, which looks like it's in the 5.x code line, though I don't
> > see
> > > a fix version.
> > >
> > > -- Ken
> > >
> > > > From: Mohsin Beg Beg
> > > > Sent: March 16, 2016 3:52:47pm PDT
> > > > To: solr-user@lucene.apache.org
> > > > Subject: how to update billions of docs
> > > >
> > > > Hi,
> > > >
> > > > I have a requirement to replace a value of a field in 100B's of docs
> in
> > > 100's of cores.
> > > > The field is multiValued=false docValues=true type=StrField
> stored=true
> > > indexed=true.
> > > >
> > > > Atomic Updates performance is on the order of 5K docs per sec per
> core
> > > in solr 5.3 (other fields are quite big).
> > > >
> > > > Any suggestions ?
> > > >
> > > > -Mohsin
> > >
> > >
> > > --
> > > Ken Krugler
> > > +1 530-210-6378
> > > http://www.scaleunlimited.com
> > > custom big data solutions & training
> > > Hadoop, Cascading, Cassandra & Solr
> > >
> > >
> > >
> > >
> > >
> > >
> >
>