Re: TikaEntityProcessor Exception Handling

2014-04-07 Thread akash2489
Any updates on this?





ArrayIndexOutOfBoundsException while reindexing via DIH

2014-04-07 Thread Ralf Matulat

Hi,
we are currently facing a new problem while reindexing one of our SOLR 
4.4 instances:


We are using SOLR 4.4 getting data via DIH out of a MySQL Server.
The data is constantly growing.

We have reindexed our data a lot of times without any trouble.
The problem can be reproduced.

There is another server, configured exactly the same way (via git), which 
was reindexed 3 days ago against the same MySQL server without problems.
But: that server has more RAM and more powerful CPUs than the one causing 
headaches today.


The error log says:

java.lang.ArrayIndexOutOfBoundsException
    at org.apache.lucene.util.packed.Packed64SingleBlock$Packed64SingleBlock4.get(Packed64SingleBlock.java:336)
    at org.apache.lucene.util.packed.GrowableWriter.get(GrowableWriter.java:56)
    at org.apache.lucene.util.packed.AbstractPagedMutable.get(AbstractPagedMutable.java:88)
    at org.apache.lucene.util.fst.NodeHash.addNew(NodeHash.java:151)
    at org.apache.lucene.util.fst.NodeHash.rehash(NodeHash.java:169)
    at org.apache.lucene.util.fst.NodeHash.add(NodeHash.java:133)
    at org.apache.lucene.util.fst.Builder.compileNode(Builder.java:197)
    at org.apache.lucene.util.fst.Builder.freezeTail(Builder.java:289)
    at org.apache.lucene.util.fst.Builder.add(Builder.java:394)
    at org.apache.lucene.codecs.BlockTreeTermsWriter$PendingBlock.append(BlockTreeTermsWriter.java:474)
    at org.apache.lucene.codecs.BlockTreeTermsWriter$PendingBlock.compileIndex(BlockTreeTermsWriter.java:438)
    at org.apache.lucene.codecs.BlockTreeTermsWriter$TermsWriter.writeBlocks(BlockTreeTermsWriter.java:569)
    at org.apache.lucene.codecs.BlockTreeTermsWriter$TermsWriter$FindBlocks.freeze(BlockTreeTermsWriter.java:544)
    at org.apache.lucene.util.fst.Builder.freezeTail(Builder.java:214)
    at org.apache.lucene.util.fst.Builder.finish(Builder.java:463)
    at org.apache.lucene.codecs.BlockTreeTermsWriter$TermsWriter.finish(BlockTreeTermsWriter.java:1010)
    at org.apache.lucene.index.FreqProxTermsWriterPerField.flush(FreqProxTermsWriterPerField.java:553)
    at org.apache.lucene.index.FreqProxTermsWriter.flush(FreqProxTermsWriter.java:85)
    at org.apache.lucene.index.TermsHash.flush(TermsHash.java:116)
    at org.apache.lucene.index.DocInverter.flush(DocInverter.java:53)
    at org.apache.lucene.index.DocFieldProcessor.flush(DocFieldProcessor.java:81)
    at org.apache.lucene.index.DocumentsWriterPerThread.flush(DocumentsWriterPerThread.java:501)
    at org.apache.lucene.index.DocumentsWriter.doFlush(DocumentsWriter.java:478)
    at org.apache.lucene.index.DocumentsWriter.postUpdate(DocumentsWriter.java:372)
    at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:445)
    at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java)
    at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:212)
    at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:69)
    at org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:51)
    at org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(DistributedUpdateProcessor.java:572)
    at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:435)
    at org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:100)
    at org.apache.solr.handler.dataimport.SolrWriter.upload(SolrWriter.java:70)
    at org.apache.solr.handler.dataimport.DataImportHandler$1.upload(DataImportHandler.java:237)
    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:504)
    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:408)
    at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:323)
    at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:231)
    at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:411)
    at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:476)
    at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:457)


Any suggestions are welcome.
Best regards
Ralf


Routing distance with Solr?

2014-04-07 Thread Matteo Tarantino
Hi all,
this is my first message on this mailing list, so I hope I'm doing everything
correctly.

My problem is: I have to create a search engine for dealers that are within a
well-defined routing distance of the address entered by the user. I have
already used Solr for some previous work, but I never needed geospatial
search, so I'm a newbie in this field.

On the web I have read that Solr can only calculate the distance "as the
crow flies" between two points, but for my purposes I need the exact
routing distance. Can you confirm that this is not possible with Solr? (If
so, I think I'll have to refine the results with additional calculations via the
Google Maps APIs or some OSM tool like GraphHopper.)


Thank you in advance!
Matteo


RE: Query and field name with wildcard

2014-04-07 Thread Croci Francesco Luigi (ID SWS)
Hello Alex,

I saw your example and took it as template for my needs.

I tried with the aliasing, but, maybe because I did it wrong, it does not 
work...

"error": {
"msg": "undefined field all",
"code": 400
  }

Here is a snippet of my solrconfig.xml:

...


explicit


rmDocumentTitle rmDocumentArt 
rmDocumentClass rmDocumentSubclass rmDocumentCatName rmDocumentCatNameEn 
fullText




  
edismax
fullText_en
full_Text
json
true
  
  
language:en
fullText_en
rmDocumentTitle rmDocumentArt 
rmDocumentClass rmDocumentSubclass rmDocumentCatName rmDocumentCatNameEn 
fullText_en
* -fullText_*
*,fullText:fullText_en
  



  
edismax
fullText_de
full_Text
json
true
  
  
language:de
fullText_de
rmDocumentTitle rmDocumentArt 
rmDocumentClass rmDocumentSubclass rmDocumentCatName rmDocumentCatNameEn 
fullText_de
* -fullText_*
*,fullText:fullText_de
  

...

What am I missing/ doing wrong?


Regards,
Francesco

-Original Message-
From: Alexandre Rafalovitch [mailto:arafa...@gmail.com] 
Sent: Freitag, 4. April 2014 11:08
To: solr-user@lucene.apache.org
Subject: Re: Query and field name with wildcard

Are you using eDisMax? That gives a lot of options, including field aliasing, 
which lets you map a single name to multiple fields:
http://wiki.apache.org/solr/ExtendedDisMax#Field_aliasing_.2F_renaming
(with example on p77 of my book
http://www.packtpub.com/apache-solr-for-indexing-data/book :-)

Regards,
   Alex.

Personal website: http://www.outerthoughts.com/ Current project: 
http://www.solr-start.com/ - Accelerating your Solr proficiency
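
As an aside, the same aliasing can also be passed as plain request parameters, which makes it easy to try out before touching solrconfig.xml. A rough SolrJ sketch using the field names from this thread; the core URL is an assumption, not something from the original mails:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;

public class EdismaxAliasExample {
    public static void main(String[] args) throws Exception {
        // SolrJ 4.x client; URL is an assumption
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");

        SolrQuery q = new SolrQuery("all:some_word");
        q.set("defType", "edismax");
        // Define the pseudo-field "all" as an alias over several real fields
        q.set("f.all.qf", "rmDocumentTitle rmDocumentClass rmDocumentSubclass rmDocumentArt");

        System.out.println(solr.query(q).getResults().getNumFound());
        solr.shutdown();
    }
}

This only helps if the request actually goes through edismax, which matches the resolution later in this thread.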


On Fri, Apr 4, 2014 at 3:52 PM, Croci  Francesco Luigi (ID SWS) 
 wrote:
> In my index I have some fields which have the same prefix(rmDocumentTitle, 
> rmDocumentClass, rmDocumentSubclass, rmDocumentArt). Apparently it is not 
> possible to specify a query like this:
>
> q = rm* : some_word
>
> Is there a way to do this without having to write a long list of ORs?
>
> Another question is if it is really not possible to search a word over 
> the entire index. Something like this: q = * : some_word
>
> Thank you
> Francesco


Re: Block Join Parent Query across children docs

2014-04-07 Thread mertens
Thanks Hoss, with the filter queries it works. I was trying to use a normal
query from Mikhail's blog that looked like this:

q={!parent which=type_s:parent}+search_t:item1 +search_t:item2
-search_t:item3

That query doesn't work for me but the filter query does just what I want.

PS: last year's Stump the Chump was great, and it looks like you're still not
stumped.

Cheers,
Luke
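
For reference, a rough SolrJ sketch of the filter-query form from Hoss' reply quoted below; the field names come from the example data in the thread, and the core URL is an assumption:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class BlockJoinFilterExample {
    public static void main(String[] args) throws Exception {
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");

        SolrQuery q = new SolrQuery("*:*");
        q.set("p_filt", "type_s:parent");   // referenced from the fq's via $p_filt
        q.addFilterQuery("{!parent which=$p_filt}search_t:item1");
        q.addFilterQuery("{!parent which=$p_filt}search_t:item2");
        q.addFilterQuery("-{!parent which=$p_filt}search_t:item3");

        QueryResponse rsp = solr.query(q);
        System.out.println("Matching parents: " + rsp.getResults().getNumFound());
        solr.shutdown();
    }
}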


On Thu, Apr 3, 2014 at 1:39 AM, Chris Hostetter-3 [via Lucene] <
ml-node+s472066n4128734...@n3.nabble.com> wrote:

>
> : Thanks for your response. Here is an example of what I'm trying to do.
> If I
> : had the following documents:
>
> what you are attempting is fairly trivial -- you want to query for all
> parent documents, then apply 3 filters:
>
>  * parent of a child matching item1
>  * parent of a child matching item2
>  * not a parent of a child matching item3
>
> Part of your problem may be that (in your example you posted anyway)
> you appear to be trying to use a *string* field for listing multiple terms
> with commas and then seem to want to match on those individual terms --
> that's not going to work.  either make your string field a true
> multivalued field, or use a text field with tokenization.
>
> With the modified example data you provided below (using search_t instead
> of search_s) this query seems to do exactly what you want...
>
> http://localhost:8983/solr/select?p_filt=type_s:parent&q=*:*&fq={!parent%20which=$p_filt}search_t:item2&fq={!parent%20which=$p_filt}search_t:item1&fq=-{!parent%20which=$p_filt}search_t:item3
>
>
>  q = *:*
> p_filt = type_s:parent
> wt = json
> fq =  {!parent which=$p_filt}search_t:item2
> fq =  {!parent which=$p_filt}search_t:item1
> fq = -{!parent which=$p_filt}search_t:item3
>
>
> -Hoss
> http://www.lucidworks.com/
>
>
> --
>  If you reply to this email, your message will be added to the discussion
> below:
>
> http://lucene.472066.n3.nabble.com/Block-Join-Parent-Query-across-children-docs-tp4127637p4128734.html
>  To unsubscribe from Block Join Parent Query across children docs, click
> here
> .
> NAML
>





RE: Query and field name with wildcard

2014-04-07 Thread Croci Francesco Luigi (ID SWS)
Sorry, found the problem myself...

I used the /select handler, where edismax was not defined. 
The other two, /selectEN and /selectDE, worked.

Adding the edismax to the /select made it work too.

Ciao
Francesco

-Original Message-
From: Croci Francesco Luigi (ID SWS) [mailto:fcr...@id.ethz.ch] 
Sent: Montag, 7. April 2014 11:20
To: solr-user@lucene.apache.org
Subject: RE: Query and field name with wildcard

Hello Alex,

I saw your example and took it as template for my needs.

I tried with the aliasing, but, maybe because I did it wrong, it does not 
work...

"error": {
"msg": "undefined field all",
"code": 400
  }

Here is a snippet of my solrconfig.xml:

...


explicit


rmDocumentTitle rmDocumentArt 
rmDocumentClass rmDocumentSubclass rmDocumentCatName rmDocumentCatNameEn 
fullText




  
edismax
fullText_en
full_Text
json
true
  
  
language:en
fullText_en
rmDocumentTitle rmDocumentArt 
rmDocumentClass rmDocumentSubclass rmDocumentCatName rmDocumentCatNameEn 
fullText_en
* -fullText_*
*,fullText:fullText_en
  



  
edismax
fullText_de
full_Text
json
true
  
  
language:de
fullText_de
rmDocumentTitle rmDocumentArt 
rmDocumentClass rmDocumentSubclass rmDocumentCatName rmDocumentCatNameEn 
fullText_de
* -fullText_*
*,fullText:fullText_de
  

...

What am I missing/ doing wrong?


Regards,
Francesco

-Original Message-
From: Alexandre Rafalovitch [mailto:arafa...@gmail.com]
Sent: Freitag, 4. April 2014 11:08
To: solr-user@lucene.apache.org
Subject: Re: Query and field name with wildcard

Are you using eDisMax? That gives a lot of options, including field aliasing, 
which lets you map a single name to multiple fields:
http://wiki.apache.org/solr/ExtendedDisMax#Field_aliasing_.2F_renaming
(with example on p77 of my book
http://www.packtpub.com/apache-solr-for-indexing-data/book :-)

Regards,
   Alex.

Personal website: http://www.outerthoughts.com/ Current project: 
http://www.solr-start.com/ - Accelerating your Solr proficiency


On Fri, Apr 4, 2014 at 3:52 PM, Croci  Francesco Luigi (ID SWS) 
 wrote:
> In my index I have some fields which have the same prefix(rmDocumentTitle, 
> rmDocumentClass, rmDocumentSubclass, rmDocumentArt). Apparently it is not 
> possible to specify a query like this:
>
> q = rm* : some_word
>
> Is there a way to do this without having to write a long list of ORs?
>
> Another question is if it is really not possible to search a word over 
> the entire index. Something like this: q = * : some_word
>
> Thank you
> Francesco


Bad request on update.distrib=FROMLEADER

2014-04-07 Thread Gastone Penzo
Hello,
I have a problem with bad requests during indexing.
I have four nodes with SolrCloud. The architecture is this:

10.0.0.86   10.0.0.87
NODE1  NODE 2
 |  |
 |  |
 |  |
 |  |
NODE 3 NODE 4
10.0.0.88   10.0.0.89

2 shards (node1 and node 2) with 2 replicas (node 3 and node4)


I tried to index data on node1 with the DataImportHandler (MySQL) and a full-import.
The index was created, but only half of it, and I had this error:

bad request

request:
http://10.0.0.88:9002/solr/collection1/update?update.distrib=FROMLEADER&distrib.from=http://10.0.0.86:9000/solr/collection1/&wt=javabin&version=2
    at org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer$Runner.run(ConcurrentUpdateSolrServer.java:240)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:724)

I think node 1 calls node 2 to give it half of the index, but the parameter
distrib.from is incomplete. Why?
If I create the index with post.jar there are no problems. Is it a problem with
the DataImportHandler?

thank you


-- 
*Gastone Penzo*


Re: Using Sentence Information For Snippet Generation

2014-04-07 Thread Dmitry Kan
Furkan,

I haven't worked with the boundary scanner before, but one thing I had to
tweak with position increments was the highlighter component itself,
because it started to throw exceptions. The solution is described in this
thread (a conversation with myself :) )

http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201301.mbox/%3CCAHUAEU_qjKcgzrxtM=x90_j8i5v0a5h0mtq4b0+0etxc7q0...@mail.gmail.com%3E

HTH,
Dmitry
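
For anyone trying the position-gap idea described in the quoted message below, here is a rough sketch of a Lucene TokenFilter that swallows a sentence-boundary marker token and adds a large position gap before the next token. The marker string and gap size are assumptions, not something from the original mails:

import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

/** Adds a large position gap after each sentence-boundary marker token. */
public final class SentenceGapFilter extends TokenFilter {
    private static final String MARKER = "$SENT$"; // hypothetical boundary marker
    private static final int GAP = 10000;          // assumed gap size

    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
    private final PositionIncrementAttribute posIncrAtt = addAttribute(PositionIncrementAttribute.class);
    private boolean pendingGap = false;

    public SentenceGapFilter(TokenStream input) {
        super(input);
    }

    @Override
    public boolean incrementToken() throws IOException {
        while (input.incrementToken()) {
            if (MARKER.contentEquals(termAtt)) {
                pendingGap = true;   // swallow the marker, remember to add a gap
                continue;
            }
            if (pendingGap) {
                posIncrAtt.setPositionIncrement(posIncrAtt.getPositionIncrement() + GAP);
                pendingGap = false;
            }
            return true;
        }
        return false;
    }

    @Override
    public void reset() throws IOException {
        super.reset();
        pendingGap = false;
    }
}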


On Sun, Apr 6, 2014 at 12:44 AM, Furkan KAMACI wrote:

> Hi Dmitry;
>
> I think that such kind of hacking may reduce the search speed. I think that
> it should be done with boundary scanner isn't it? I think that bs.type=LINE
> is what I am looking for? There is one more point. I want to do that for
> Turkish language and I think that I should customize it or if I put special
> characters to point boundaries I can use simple boundary scanner?
>
> Thanks;
> Furkan KAMACI
>
>
>
> 2014-03-24 21:14 GMT+02:00 Dmitry Kan :
>
> > Hi Furkan,
> >
> > I have done an implementation with a custom filler (special character)
> > sequence in between sentences. A better solution I landed at was increasing
> > the position of each sentence's first token by a large number, like 10000
> > (perhaps, a smaller number could be used too). Then a user search can be
> > conducted with a proximity query: "some tokens" ~5000 (the recently
> > committed complexphrase parser supports rich phrase syntax, for example).
> > This of course expects that a sentence fits the 5000 window size and the
> > total number of sentences in the field * 10k does not exceed
> > Integer.MAX_VALUE. Then on the highlighter side you'd get the hits within
> > sentences naturally.
> >
> > Is this something you are looking for?
> >
> > Dmitry
> >
> >
> >
> > On Mon, Mar 24, 2014 at 5:43 PM, Furkan KAMACI  > >wrote:
> >
> > > Hi;
> > >
> > > When I generate snippet via Solr I do not want to remove beginning of
> any
> > > sentence at the snippet. So I need to do a sentence detection. I think
> > that
> > > I can do it before I send documents into Solr. I can put some special
> > > characters that signs beginning or end of a sentence. Then I can use
> that
> > > information when generating snippet. On the other hand I should not
> show
> > > that special character to the user.
> > >
> > > What do you think that how can I do it or do you have any other ideas
> for
> > > my purpose?
> > >
> > > PS: I do not do it for English sentences.
> > >
> > > Thanks;
> > > Furkan KAMACI
> > >
> >
> >
> >
> > --
> > Dmitry
> > Blog: http://dmitrykan.blogspot.com
> > Twitter: http://twitter.com/dmitrykan
> >
>



-- 
Dmitry
Blog: http://dmitrykan.blogspot.com
Twitter: http://twitter.com/dmitrykan


Re: Block Join Parent Query across children docs

2014-04-07 Thread Mikhail Khludnev
for sake of completeness, here is the same query w/o fq

q=+{!parent which=type_s:parent}search_t:item1 +{!parent
which=type_s:parent}search_t:item2 -{!parent
which=type_s:parent}search_t:item3

here is more detail about the first symbol magic
http://www.mail-archive.com/solr-user@lucene.apache.org/msg96796.html


On Mon, Apr 7, 2014 at 1:23 PM, mertens  wrote:

> Thanks Hoss, with the filter queries it works. I was trying to use a normal
> query from Mikhail's blog that looked like this:
>
> q={!parent which=type_s:parent}+search_t:item1 +search_t:item2
> -search_t:item3
>
> That query doesn't work for me but the filter query does just what I want.
>
> ps last years stump the chump was great, and it looks like you're still not
> stumped.
>
> Cheers,
> Luke
>
>
> On Thu, Apr 3, 2014 at 1:39 AM, Chris Hostetter-3 [via Lucene] <
> ml-node+s472066n4128734...@n3.nabble.com> wrote:
>
> >
> > : Thanks for your response. Here is an example of what I'm trying to do.
> > If I
> > : had the following documents:
> >
> > what you are attempting is fairly trivial -- you want to query for all
> > parent documents, then apply 3 filters:
> >
> >  * parent of a child matching item1
> >  * parent of a child matching item2
> >  * not a parent of a child matching item3
> >
> > Part of your problem may be that (in your example you posted anyway)
> > you appear to be trying to use a *string* field for listing multiple
> terms
> > with commas and then seem to want to match on those individual terms --
> > that's not going to work.  either make your string field a true
> > multivalued field, or use a text field with tokenization.
> >
> > With the modified example data you provided below (using search_t instead
> > of search_s) this query seems to do exactly what you want...
> >
> >
> http://localhost:8983/solr/select?p_filt=type_s:parent&q=*:*&fq={!parent%20which=$p_filt}search_t:item2&fq={!parent%20which=$p_filt}search_t:item1&fq=-{!parent%20which=$p_filt}search_t:item3
> >
> >
> >  q = *:*
> > p_filt = type_s:parent
> > wt = json
> > fq =  {!parent which=$p_filt}search_t:item2
> > fq =  {!parent which=$p_filt}search_t:item1
> > fq = -{!parent which=$p_filt}search_t:item3
> >
> >
> > -Hoss
> > http://www.lucidworks.com/
> >
> >
> > --
> >  If you reply to this email, your message will be added to the discussion
> > below:
> >
> >
> http://lucene.472066.n3.nabble.com/Block-Join-Parent-Query-across-children-docs-tp4127637p4128734.html
> >  To unsubscribe from Block Join Parent Query across children docs, click
> > here<
> http://lucene.472066.n3.nabble.com/template/NamlServlet.jtp?macro=unsubscribe_by_code&node=4127637&code=bG1lcnRlbnNAZ21haWwuY29tfDQxMjc2Mzd8LTU0NDAxMzQzMw==
> >
> > .
> > NAML<
> http://lucene.472066.n3.nabble.com/template/NamlServlet.jtp?macro=macro_viewer&id=instant_html%21nabble%3Aemail.naml&base=nabble.naml.namespaces.BasicNamespace-nabble.view.web.template.NabbleNamespace-nabble.view.web.template.NodeNamespace&breadcrumbs=notify_subscribers%21nabble%3Aemail.naml-instant_emails%21nabble%3Aemail.naml-send_instant_email%21nabble%3Aemail.naml
> >
> >
>
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Block-Join-Parent-Query-across-children-docs-tp4127637p4129588.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>



-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics





what is geodist default value

2014-04-07 Thread Aman Tandon
Hello,

In my index, i am using the LatlonType, for using the geodist to calculate
the distance, and i am using it like geodist(lat, lon, location). Can
anybody told me what value the geodist will return if i will pass
geodist(0, 0, location)

Thanks
Aman Tandon


Re: Block Join Parent Query across children docs

2014-04-07 Thread mertens
Yeah, that works also for me. Thanks Mikhail.


On Mon, Apr 7, 2014 at 12:42 PM, Mikhail Khludnev [via Lucene] <
ml-node+s472066n4129604...@n3.nabble.com> wrote:

> for sake of completeness, here is the same query w/o fq
>
> q=+{!parent which=type_s:parent}search_t:item1 +{!parent
> which=type_s:parent}search_t:item2 -{!parent
> which=type_s:parent}search_t:item3
>
> here is more detail about the first symbol magic
> http://www.mail-archive.com/solr-user@.../msg96796.html
>
>
>
> On Mon, Apr 7, 2014 at 1:23 PM, mertens <[hidden 
> email]>
> wrote:
>
> > Thanks Hoss, with the filter queries it works. I was trying to use a
> normal
> > query from Mikhail's blog that looked like this:
> >
> > q={!parent which=type_s:parent}+search_t:item1 +search_t:item2
> > -search_t:item3
> >
> > That query doesn't work for me but the filter query does just what I
> want.
> >
> > ps last years stump the chump was great, and it looks like you're still
> not
> > stumped.
> >
> > Cheers,
> > Luke
> >
> >
> > On Thu, Apr 3, 2014 at 1:39 AM, Chris Hostetter-3 [via Lucene] <
> > [hidden email] >
> wrote:
> >
> > >
> > > : Thanks for your response. Here is an example of what I'm trying to
> do.
> > > If I
> > > : had the following documents:
> > >
> > > what you are attempting is fairly trivial -- you want to query for all
> > > parent documents, then apply 3 filters:
> > >
> > >  * parent of a child matching item1
> > >  * parent of a child matching item2
> > >  * not a parent of a child matching item3
> > >
> > > Part of your problem may be that (in your example you posted
> anyway)
> > > you appear to be trying to use a *string* field for listing multiple
> > terms
> > > with commas and then seem to want to match on those individual terms
> --
> > > that's not going to work.  either make your string field a true
> > > multivalued field, or use a text field with tokenization.
> > >
> > > With the modified example data you provided below (using search_t
> instead
> > > of search_s) this query seems to do exactly what you want...
> > >
> > >
> > http://localhost:8983/solr/select?p_filt=type_s:parent&q=*:*&fq={!parent%20which=$p_filt}search_t:item2&fq={!parent%20which=$p_filt}search_t:item1&fq=-{!parent%20which=$p_filt}search_t:item3
>
> > >
> > >
> > >  q = *:*
> > > p_filt = type_s:parent
> > > wt = json
> > > fq =  {!parent which=$p_filt}search_t:item2
> > > fq =  {!parent which=$p_filt}search_t:item1
> > > fq = -{!parent which=$p_filt}search_t:item3
> > >
> > >
> > > -Hoss
> > > http://www.lucidworks.com/
> > >
> > >
> > > --
> > >  If you reply to this email, your message will be added to the
> discussion
> > > below:
> > >
> > >
> >
> http://lucene.472066.n3.nabble.com/Block-Join-Parent-Query-across-children-docs-tp4127637p4128734.html
> > >  To unsubscribe from Block Join Parent Query across children docs,
> click
> > > here<
> >
> >
> > > .
> > > NAML<
> >
> http://lucene.472066.n3.nabble.com/template/NamlServlet.jtp?macro=macro_viewer&id=instant_html%21nabble%3Aemail.naml&base=nabble.naml.namespaces.BasicNamespace-nabble.view.web.template.NabbleNamespace-nabble.view.web.template.NodeNamespace&breadcrumbs=notify_subscribers%21nabble%3Aemail.naml-instant_emails%21nabble%3Aemail.naml-send_instant_email%21nabble%3Aemail.naml
>
> > >
> > >
> >
> >
> >
> >
> > --
> > View this message in context:
> >
> http://lucene.472066.n3.nabble.com/Block-Join-Parent-Query-across-children-docs-tp4127637p4129588.html
>
> > Sent from the Solr - User mailing list archive at Nabble.com.
> >
>
>
>
> --
> Sincerely yours
> Mikhail Khludnev
> Principal Engineer,
> Grid Dynamics
>
> 
> <[hidden email] >
>
>
> --
>  If you reply to this email, your message will be added to the discussion
> below:
>
> http://lucene.472066.n3.nabble.com/Block-Join-Parent-Query-across-children-docs-tp4127637p4129604.html
>  To unsubscribe from Block Join Parent Query across children docs, click
> here
> .
> NAML
>





RE: Solr interface

2014-04-07 Thread Jonathan Varsanik
Do you mean to tell me that the people on this list that are indexing 100s of 
millions of documents are doing this over http?  I have been using custom 
Lucene code to index files, as I thought this would be faster for many 
documents and I wanted some non-standard OCR and index fields.  Is there a 
better way?

To the OP: You can also use Lucene to locally index files for Solr.



-Original Message-
From: Erik Hatcher [mailto:erik.hatc...@gmail.com] 
Sent: Thursday, April 03, 2014 8:47 AM
To: solr-user@lucene.apache.org
Cc: Solr User
Subject: Re: Solr interface

Yes. But why?

DataImportHandler kinda does this (still use http to kick off an indexing job). 
 And there's EmbeddedSolrServer too. 

Erik
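
For reference, a rough sketch of the EmbeddedSolrServer route (no HTTP involved). The solr home path and core name are assumptions, and the exact CoreContainer setup varies a bit between 4.x releases:

import org.apache.solr.client.solrj.embedded.EmbeddedSolrServer;
import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.core.CoreContainer;

public class EmbeddedIndexExample {
    public static void main(String[] args) throws Exception {
        // Point at a normal solr home containing solr.xml and the core's conf/
        CoreContainer container = new CoreContainer("/path/to/solr/home");
        container.load();
        EmbeddedSolrServer solr = new EmbeddedSolrServer(container, "collection1");

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "doc-1");
        doc.addField("title", "indexed without HTTP");
        solr.add(doc);
        solr.commit();

        solr.shutdown();
    }
}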

> On Apr 3, 2014, at 8:39, Александр Вандышев  wrote:
> 
> Is it possible to index files not via HTTP interface?


converting 4.7 index to 4.3.1

2014-04-07 Thread Dmitry Kan
Dear list,

We have been generating solr indices with the solr-hadoop contrib module
(SOLR-1301). Our current Solr in use is version 4.3.1. Is there any tool
that could do the backward conversion, i.e. 4.7->4.3.1? Or is the upgrade
the only way to go?

-- 
Dmitry
Blog: http://dmitrykan.blogspot.com
Twitter: http://twitter.com/dmitrykan


Re: ngramfilter minGramSize problem

2014-04-07 Thread Andreas Owen
It works well. Now why does the search only find something when the field name
is added to a query that contains stopwords?


"cug" -> 9 hits
"mit cug" -> 0 hits
"plain_text:mit cug" -> 9 hits

Why is this so? Could it be a problem that stopwords aren't used in the
query because not all fields that are searched have the stopword filter?



On Mon, 07 Apr 2014 00:37:15 +0200, Furkan KAMACI   
wrote:



Correction: My patch is at SOLR-5152
7 Nis 2014 01:05 tarihinde "Andreas Owen"  yazdı:


I thought I could use  to index and search words that are only 1 or 2 chars long. It
seems to work but I have to test it some more.


On Sun, 06 Apr 2014 22:24:20 +0200, Andreas Owen 
wrote:

I have a fieldtype that uses the ngramfilter while indexing. Is there a
setting that can force the ngramfilter to index smaller words than the
minGramSize? Mine is set to 3 and the search won't find words that are
only 1 or 2 chars long. I would like to not set minGramSize=1 because the
results would be too diverse.

fieldtype:


   
 
 

ignoreCase="true"
words="lang/stopwords_de.txt" format="snowball"  
enablePositionIncrements="true"/>


 
 



   
   


 

class="solr.SnowballPorterFilterFactory"

language="German"/>

   
 




--
Using Opera's mail client: http://www.opera.com/mail/




--
Using Opera's mail client: http://www.opera.com/mail/


Exactly Matching for Elevator

2014-04-07 Thread Furkan KAMACI
I've defined an elevator like this:


 
   
 
 
   
 
 
   
 
 
   
 


When I send a query it gives an error of:
org.apache.solr.common.SolrException: Boosting query defined twice for query

When I check the source code it says:

map.containsKey( elev.analyzed )

What I want is that:

when a user enters a query, e.g.:

rüna telecom

I want to show id1. But when a user enters just:

telecom

I do not want to elevate it.

Thanks;
Furkan KAMACI


Re: Commit Within and /update/extract handler

2014-04-07 Thread Erick Erickson
You say you see the commit happen in the log, is openSearcher
specified? This sounds like you're somehow getting a commit
with openSearcher=false...

Best,
Erick
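
For what it's worth, a rough SolrJ sketch that sets commitWithin explicitly on an extract request, which can help narrow down whether the parameter reaches the handler at all. The file path, content type and literal.id value are assumptions:

import java.io.File;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;

public class ExtractCommitWithinExample {
    public static void main(String[] args) throws Exception {
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");

        ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/extract");
        req.addFile(new File("/tmp/example.pdf"), "application/pdf");
        req.setParam("literal.id", "doc-1"); // unique key for the extracted document
        req.setCommitWithin(10000);          // ask Solr to commit within 10 seconds

        solr.request(req);
        solr.shutdown();
    }
}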

On Sun, Apr 6, 2014 at 5:37 PM, Jamie Johnson  wrote:
> I'm running solr 4.6.0 and am noticing that commitWithin doesn't seem to
> work when I am using the /update/extract request handler.  It looks like a
> commit is happening from the logs, but the documents don't become available
> for search until I do a commit manually.  Could this be some type of
> configuration issue?


Re: Solr XML Messages

2014-04-07 Thread Erick Erickson
See: https://tika.apache.org/1.4/formats.html

short answer "yes".

Longer answer: It would be a lot easier to reply meaningfully if you
told us what you were trying to do.

You might want to review:

http://wiki.apache.org/solr/UsingMailingLists

Best,
Erick

On Sun, Apr 6, 2014 at 11:20 PM, Александр Вандышев
 wrote:
> Tell me whether it is possible to use Solr XML Messages for indexing via the
> update/extract handler?


Regex for hl.bs.chars

2014-04-07 Thread Furkan KAMACI
Could I define a pattern for hl.bs.chars? I mean, *$* marks the start or end
of a string in my documents and I want to define it as a regex for hl.bs.chars.

On the other hand, I do not currently use termVectors=on, termPositions=on
and termOffsets=on on my fields. Does that cause a performance issue or
break the expected behavior, too?


RE: Solr interface

2014-04-07 Thread Toke Eskildsen
On Mon, 2014-04-07 at 13:52 +0200, Jonathan Varsanik wrote:
> Do you mean to tell me that the people on this list that are indexing
> 100s of millions of documents are doing this over http?

Some of us do. Our net archive indexer runs a lot of Tika processes that
send their analysed documents through HTTP. We're building 1TB indexes
of about 300-400M documents each. The Tika analysis is by far the heavy
part of the setup: 1 Solr instance easily keeps up with 30 Tikas on a 24
core machine (or 48, depending on how you count). This setup makes it
easy to scale up & out, basically by starting new Tika processes on
whatever machines we have available.

In other setups, where the pre-index analysis is lighter, the choice of
transport layer might matter more. As always, optimize where it is
needed.

- Toke Eskildsen, State and University Library, Denmark




Re: Routing distance with Solr?

2014-04-07 Thread david.w.smi...@gmail.com
Hi,
This is definitely not possible with Solr.  Use GraphHopper.
~ David


On Mon, Apr 7, 2014 at 5:09 AM, Matteo Tarantino  wrote:

> Hi all,
> this is my first message on this mailing list, so I hope I'm doing all
> correctly.
>
> My problem is: I have to create a search engine of dealers that are in a
> well defined routing distance from the address entered by the user. I have
> already used Solr for some previous works, but I never needed geospatial
> search, so i'm a newbie in this field.
>
> On the web I have read that Solr can calculate only the distance "as the
> crow flies" between two points, but for my purposes I need the exact
> routing distance. This is not possible with Solr, can you confirm this? (If
> so, I think I'll have to refine results with additional calculations with
> GoogleMap Api's or some OSM tools like GraphHopper)
>
>
> Thank you in advance!
> Matteo
>


Re: what is geodist default value

2014-04-07 Thread david.w.smi...@gmail.com
Hi,

I'm not sure why you are asking or maybe I'm not getting what you *really*
want to know.  You'll get the geodesic distance (i.e. the "great circle
distance", the distance on the surface of a sphere) from 0,0 (off the coast
of Africa), to each point indexed in your "location" field.

~ David
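
As a side note, the more common pattern is to pass the query point via pt/sfield and use geodist() as a pseudo-field or sort. A rough SolrJ sketch; the field name, coordinates and core URL are assumptions:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrDocument;

public class GeodistExample {
    public static void main(String[] args) throws Exception {
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");

        SolrQuery q = new SolrQuery("*:*");
        q.set("sfield", "location");          // the LatLonType field
        q.set("pt", "52.52,13.40");           // the point to measure from (example coordinates)
        q.setFields("id", "dist:geodist()");  // great-circle distance (km) as a pseudo-field
        q.setSort("geodist()", SolrQuery.ORDER.asc);

        for (SolrDocument d : solr.query(q).getResults()) {
            System.out.println(d.get("id") + " -> " + d.get("dist") + " km");
        }
        solr.shutdown();
    }
}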



On Mon, Apr 7, 2014 at 7:06 AM, Aman Tandon  wrote:

> Hello,
>
> In my index, i am using the LatlonType, for using the geodist to calculate
> the distance, and i am using it like geodist(lat, lon, location). Can
> anybody told me what value the geodist will return if i will pass
> geodist(0, 0, location)
>
> Thanks
> Aman Tandon
>


Re: Solr interface

2014-04-07 Thread Andre Bois-Crettez

You can use SolrJ: https://wiki.apache.org/solr/Solrj
Anyway, even using HTTP the performance is good.

André
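
A minimal SolrJ indexing sketch over HTTP, for reference; the URL and field names are assumptions:

import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class SolrjIndexExample {
    public static void main(String[] args) throws Exception {
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "doc-1");
        doc.addField("title", "An example document");

        // add() sends the document over HTTP; it becomes searchable after a commit
        solr.add(doc);
        solr.commit();
        solr.shutdown();
    }
}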

On 2014-04-07 13:52, Jonathan Varsanik wrote:

Do you mean to tell me that the people on this list that are indexing 100s of 
millions of documents are doing this over http?  I have been using custom 
Lucene code to index files, as I thought this would be faster for many 
documents and I wanted some non-standard OCR and index fields.  Is there a 
better way?

To the OP: You can also use Lucene to locally index files for Solr.



-Original Message-
From: Erik Hatcher [mailto:erik.hatc...@gmail.com]
Sent: Thursday, April 03, 2014 8:47 AM
To:solr-user@lucene.apache.org
Cc: Solr User
Subject: Re: Solr interface

Yes. But why?

DataImportHandler kinda does this (still use http to kick off an indexing job). 
 And there's EmbeddedSolrServer too.

 Erik


On Apr 3, 2014, at 8:39, Александр Вандышев  wrote:

Is it possible to index files not via HTTP interface?


--
André Bois-Crettez

Software Architect
Big Data Developer
http://www.kelkoo.com/


Kelkoo SAS
Société par Actions Simplifiée
Au capital de € 4.168.964,30
Siège social : 8, rue du Sentier 75002 Paris
425 093 069 RCS Paris

This message and its attachments are confidential and intended exclusively 
for their addressees. If you are not the intended recipient of this message, 
please delete it and notify the sender.


Re: Solr Search on Fields name

2014-04-07 Thread anuragwalia
Thanks Ahmat and Jack for replying.

I found another way to solve the problem, by using a filter query.

fq=RuleA:*+OR+RuleC:*

but due to the development platform, query parsing got stuck somewhere else.

Hopefully after the platform fix it will work for me.

I will get back to you if any other issue occurs.








Re: ArrayIndexOutOfBoundsException while reindexing via DIH

2014-04-07 Thread Shawn Heisey

On 4/7/2014 3:00 AM, Ralf Matulat wrote:
we are currently facing a new problem while reindexing one of our SOLR 
4.4 instances:


We are using SOLR 4.4 getting data via DIH out of a MySQL Server.
The data is constantly growing.

We have reindexed our data a lot of times without any trouble.
The problem can be reproduced.

There is another server, configured exactly the same way (via git) 
which was reindexed 3 days ago against the same MySQL server without 
problems.
But: that server has more RAM and more powerful CPUs than the one 
causing headaches today.


The error log says:

java.lang.ArrayIndexOutOfBoundsException
    at org.apache.lucene.util.packed.Packed64SingleBlock$Packed64SingleBlock4.get(Packed64SingleBlock.java:336)
    at org.apache.lucene.util.packed.GrowableWriter.get(GrowableWriter.java:56)
    at org.apache.lucene.util.packed.AbstractPagedMutable.get(AbstractPagedMutable.java:88)
    at org.apache.lucene.util.fst.NodeHash.addNew(NodeHash.java:151)
    at org.apache.lucene.util.fst.NodeHash.rehash(NodeHash.java:169)
    at org.apache.lucene.util.fst.NodeHash.add(NodeHash.java:133)
    at org.apache.lucene.util.fst.Builder.compileNode(Builder.java:197)
    at org.apache.lucene.util.fst.Builder.freezeTail(Builder.java:289)
    at org.apache.lucene.util.fst.Builder.add(Builder.java:394)


This looks a little bit like a problem that recently surfaced in 
automated testing.  That particular problem was caused by IBM's J9 Java 
(based on JDK7) miscompiling a low-level lucene function.


Are you using a JVM from a vendor other than Oracle?  At the moment, the 
JVM recommendation is Oracle Java 7u25.  When 7u60 comes out (expected 
in May 2014), that will most likely be the recommended version.  Are 
there other differences between these two systems, like the garbage 
collector being used, 32-bit vs. 64-bit, different max heap size, 
running in a different servlet container, etc?


Are there any other errors, such as an OutOfMemory error?

I could be completely wrong with my guess.

Thanks,
Shawn



Re: Commit Within and /update/extract handler

2014-04-07 Thread Erick Erickson
What does the call look like? Are you opening a new searcher
or not? That should be in the log line where the commit is recorded...

FWIW,
Erick

On Sun, Apr 6, 2014 at 5:37 PM, Jamie Johnson  wrote:
> I'm running solr 4.6.0 and am noticing that commitWithin doesn't seem to
> work when I am using the /update/extract request handler.  It looks like a
> commit is happening from the logs, but the documents don't become available
> for search until I do a commit manually.  Could this be some type of
> configuration issue?


Duplicate Unique Key

2014-04-07 Thread Simon
Hi all,

I know someone has posted a similar question before.  But my case is a little
different, as I don't have the schema setup issue mentioned in those posts
but still get duplicate records.

My unique key in schema is 




id$



Search on Solr- admin UI:   id$:1

I got two documents
{
   "id$": "1",
   "_version_": 1464225014071951400,
"_root_": 1
},
{
"id$": "1",
"_version_": 1464236728284872700,
"_root_": 1
}

I use the SolrJ API to add documents.  My understanding is that the Solr uniqueKey
is like a database primary key. I am wondering how I could end up with two
documents with the same uniqueKey in the index.

Thanks,
Simon






Re: Solr interface

2014-04-07 Thread Shawn Heisey

On 4/7/2014 5:52 AM, Jonathan Varsanik wrote:

Do you mean to tell me that the people on this list that are indexing 100s of 
millions of documents are doing this over http?  I have been using custom 
Lucene code to index files, as I thought this would be faster for many 
documents and I wanted some non-standard OCR and index fields.  Is there a 
better way?

To the OP: You can also use Lucene to locally index files for Solr.


My sharded index has 94 million docs in it.  All normal indexing and 
maintenance is done with SolrJ, over HTTP. Currently full rebuilds are 
done with the dataimport handler loading from MySQL, but that is 
legacy.  This is NOT a SolrCloud installation.  It is also not a 
replicated setup -- my indexing program keeps both copies up to date 
independently, similar to what happens behind the scenes with SolrCloud.


The single-thread DIH is very well optimized, and is faster than what I 
have written myself -- also single-threaded.


The real reason that we still use DIH for rebuilds is that I can run the 
DIH simultaneously on all shards.  A full rebuild that way takes about 5 
hours.  A SolrJ process feeding all shards with a single thread would 
take a lot longer.  Once I have time to work on it, I can make the SolrJ 
rebuild multi-threaded, and I expect it will be similar to DIH in 
rebuild speed.  Hopefully I can make it faster.


There is always overhead with HTTP.  On a gigabit LAN, I don't think 
it's high enough to matter.


Using Lucene to index files for Solr is an option -- but that requires 
writing a custom Lucene application, and knowledge about how to turn the 
Solr schema into Lucene code.  A lot of users on this list (me included) 
do not have the skills required.  I know SolrJ reasonably well, but 
Lucene is a nut that I haven't cracked.


Thanks,
Shawn



Regex For *|* at hl.regex.pattern

2014-04-07 Thread Furkan KAMACI
Hi;

I tried that but it does not work; am I missing anything?

q=portu&hl.regex.pattern=.*\*\|\*.*&hl.fragsize=120&hl.regex.slop=0.2

My aim is to check whether it includes *|* or not (that's why I've put .* at the
beginning and end of the regex, to match whatever comes before and after).

How to fix it?

Thanks;
Furkan KAMACI


Re: Distributed tracing for Solr via adding HTTP headers?

2014-04-07 Thread Gregg Donovan
That was my first attempt, but it's much trickier than I anticipated.

A filter that calls HttpServletRequest#getParameter() before
SolrDispatchFilter will trigger an exception  -- see
getParameterIncompatibilityException [1] -- if the request is a POST. It
seems that Solr depends on the configured per-core SolrRequestParser to
properly parse the request parameters. A servlet filter that came before
SolrDispatchFilter would need to fetch the correct SolrRequestParser for
the requested core, parse the request, and reset the InputStream before
pulling the data into the MDC. It also duplicates the work of request
parsing. It's especially tricky if you want to remove the tracing
parameters from the SolrParams and just have them in the MDC to avoid them
being logged twice.


[1]
https://github.com/apache/lucene-solr/blob/trunk/solr/core/src/java/org/apache/solr/servlet/SolrRequestParsers.java#L621:L628
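
One way around that is to carry the UUID in an HTTP header instead of a request parameter, since reading headers never touches the request body. A rough sketch of such a filter, placed in front of SolrDispatchFilter; the header name and MDC key are hypothetical:

import java.io.IOException;
import javax.servlet.*;
import javax.servlet.http.HttpServletRequest;
import org.slf4j.MDC;

public class RequestUuidFilter implements Filter {
    private static final String HEADER = "X-Request-UUID"; // hypothetical header name
    private static final String MDC_KEY = "requestUuid";   // hypothetical MDC key

    public void init(FilterConfig config) {}

    public void doFilter(ServletRequest req, ServletResponse resp, FilterChain chain)
            throws IOException, ServletException {
        // Reading a header does not consume the request body, so POSTs stay intact
        String uuid = ((HttpServletRequest) req).getHeader(HEADER);
        if (uuid != null) {
            MDC.put(MDC_KEY, uuid);
        }
        try {
            chain.doFilter(req, resp);
        } finally {
            MDC.remove(MDC_KEY); // don't leak the value to the next request on this thread
        }
    }

    public void destroy() {}
}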


On Sun, Apr 6, 2014 at 2:20 PM, Alexandre Rafalovitch wrote:

> On the second thought,
>
> If you are already managing to pass the value using the request
> parameters, what stops you from just having a servlet filter looking
> for that parameter and assigning it directly to the MDC context?
>
> Regards,
>Alex.
> Personal website: http://www.outerthoughts.com/
> Current project: http://www.solr-start.com/ - Accelerating your Solr
> proficiency
>
>
> On Sat, Apr 5, 2014 at 7:45 AM, Alexandre Rafalovitch
>  wrote:
> > I like the idea. No comments about implementation, leave it to others.
> >
> > But if it is done, maybe somebody very familiar with logging can also
> > review Solr's current logging config. I suspect it is not optimized
> > for troubleshooting at this point.
> >
> > Regards,
> >Alex.
> > Personal website: http://www.outerthoughts.com/
> > Current project: http://www.solr-start.com/ - Accelerating your Solr
> proficiency
> >
> >
> > On Sat, Apr 5, 2014 at 3:16 AM, Gregg Donovan 
> wrote:
> >> We have some metadata -- e.g. a request UUID -- that we log to every log
> >> line using Log4J's MDC [1]. The UUID logging allows us to connect any
> log
> >> lines we have for a given request across servers. Sort of like Zipkin
> [2].
> >>
> >> Currently we're using EmbeddedSolrServer without sharding, so adding the
> >> UUID is fairly simple, since everything is in one process and one
> thread.
> >> But, we're testing a sharded HTTP implementation and running into some
> >> difficulties getting this data passed around in a way that lets us trace
> >> all log lines generated by a request to its UUID.
> >>
>


Re: Duplicate Unique Key

2014-04-07 Thread Erick Erickson
Hmmm, that's odd. I just tried it (admittedly with post.jar rather
than SolrJ) and it works just fine.

What server are you using (e.g. CloudSolrServer)? And can you create a
self-contained program that illustrates the problem?

Best,
Erick
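
Something along these lines would do as a self-contained starting point; the core URL is an assumption and the uniqueKey name follows the schema snippet quoted below:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class DuplicateKeyCheck {
    public static void main(String[] args) throws Exception {
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");

        // Add the same uniqueKey twice; the second add should overwrite the first
        for (int i = 0; i < 2; i++) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id$", "1");
            solr.add(doc);
        }
        solr.commit();

        long found = solr.query(new SolrQuery("id$:1")).getResults().getNumFound();
        System.out.println("numFound for id$:1 = " + found + " (expected: 1)");
        solr.shutdown();
    }
}

If this ever prints 2, that would reproduce the problem.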

On Mon, Apr 7, 2014 at 8:50 AM, Simon  wrote:
> Hi all,
>
> I know someone has posted similar question before.  But my case is little
> different as I don't have the schema set up issue mentioned in those posts
> but still get duplicate records.
>
> My unique key in schema is
>
>  multiValued="false" required="true"/>
>
>
> id$
>
>
>
> Search on Solr- admin UI:   id$:1
>
> I got two documents
> {
>"id$": "1",
>"_version_": 1464225014071951400,
> "_root_": 1
> },
> {
> "id$": "1",
> "_version_": 1464236728284872700,
> "_root_": 1
> }
>
> I use SolrJ api to add documents.  My understanding solr uniqueKey is like a
> database primary key. I am wondering how could I end up with two documents
> with same uniqueKey in the index.
>
> Thanks,
> Simon
>
>
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Duplicate-Unique-Key-tp4129651.html
> Sent from the Solr - User mailing list archive at Nabble.com.


Re: Regex For *|* at hl.regex.pattern

2014-04-07 Thread Furkan KAMACI
One more question: does that regex work on the analyzed field or on the raw data?


2014-04-07 19:21 GMT+03:00 Furkan KAMACI :

> Hi;
>
> I try that but it does not work do I miss anything:
>
> q=portu&hl.regex.pattern=.*\*\|\*.*&hl.fragsize=120&hl.regex.slop=0.2
>
> My aim is to check whether it includes *|* or not (that's why I've put .*
> beginning and end of the regex to achieve whatever you match)
>
> How to fix it?
>
> Thanks;
> Furkan KAMACI
>


Ranking code

2014-04-07 Thread azhar2007
Hi, does anybody know where the ranking code is held? Which file in Solr
stores it, the schema.xml or the solrconfig.xml file?







Re: Reading Solr index

2014-04-07 Thread François Schiettecatte
Maybe you should try a more recent release of Luke:

https://github.com/DmitryKey/luke/releases

François

On Apr 7, 2014, at 12:27 PM, azhar2007  wrote:

> Hi All,
> 
> I have a Solr index which was indexed in Solr 4.7.0.
> 
> I've attempted to open the index with Luke 4.0.0 and also other versions, with
> no luck.
> It gives me an error message.
> 
> Is there a way of reading the data?
> 
> I would like to convert the file to a readable format where i can see the
> terms it holds from the documents etc. 
> 
> Please Help!!
> 
> 
> 
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Reading-Solr-index-tp4129662.html
> Sent from the Solr - User mailing list archive at Nabble.com.





Reading Solr index

2014-04-07 Thread azhar2007
Hi All,

I have a Solr index which was indexed in Solr 4.7.0.

I've attempted to open the index with Luke 4.0.0 and also other versions, with
no luck.
It gives me an error message.

Is there a way of reading the data?

I would like to convert the file to a readable format where I can see the
terms it holds from the documents etc.

Please Help!!





Re: Solr interface

2014-04-07 Thread Daniel Collins
I have to agree with Shawn.  We have a SolrCloud setup with 256 shards,
~400M documents in total, with 4-way replication (so it's quite a big
setup!)  I had thought that HTTP would slow things down, so we recently
trialed a JNI approach (clients are C++) so we could call SolrJ and get the
benefits of JavaBin encoding for our indexing.

Once we had done benchmarks with both solutions, I think we saved about 1ms
per document (on average) with JNI, so it wasn't as big a gain as we were
expecting.  There are other benefits of SolrJ (zookeeper integration,
better routing, etc) and we were doing local HTTP (so it was literally just
a TCP port to localhost, no actual net traffic) but that just goes to prove
what other posters have said here.  Check whether HTTP really *is* the
bottleneck before you try to replace it!


On 7 April 2014 17:05, Shawn Heisey  wrote:

> On 4/7/2014 5:52 AM, Jonathan Varsanik wrote:
>
>> Do you mean to tell me that the people on this list that are indexing
>> 100s of millions of documents are doing this over http?  I have been using
>> custom Lucene code to index files, as I thought this would be faster for
>> many documents and I wanted some non-standard OCR and index fields.  Is
>> there a better way?
>>
>> To the OP: You can also use Lucene to locally index files for Solr.
>>
>
> My sharded index has 94 million docs in it.  All normal indexing and
> maintenance is done with SolrJ, over http.Currently full rebuilds are done
> with the dataimport handler loading from MySQL, but that is legacy.  This
> is NOT a SolrCloud installation.  It is also not a replicated setup -- my
> indexing program keeps both copies up to date independently, similar to
> what happens behind the scenes with SolrCloud.
>
> The single-thread DIH is very well optimized, and is faster than what I
> have written myself -- also single-threaded.
>
> The real reason that we still use DIH for rebuilds is that I can run the
> DIH simultaenously on all shards.  A full rebuild that way takes about 5
> hours.  A SolrJ process feeding all shards with a single thread would take
> a lot longer.  Once I have time to work on it, I can make the SolrJ rebuild
> multi-threaded, and I expect it will be similar to DIH in rebuild speed.
>  Hopefully I can make it faster.
>
> There is always overhead with HTTP.  On a gigabit LAN, I don't think it's
> high enough to matter.
>
> Using Lucene to index files for Solr is an option -- but that requires
> writing a custom Lucene application, and knowledge about how to turn the
> Solr schema into Lucene code.  A lot of users on this list (me included) do
> not have the skills required.  I know SolrJ reasonably well, but Lucene is
> a nut that I haven't cracked.
>
> Thanks,
> Shawn
>
>


Re: Distributed tracing for Solr via adding HTTP headers?

2014-04-07 Thread Alexandre Rafalovitch
So to rephrase:

Solr will barf at unknown parameters, so we cannot currently send them in
band.

And the out-of-band approach does not work due to the POST body handling complexity.

You are proposing effectively a dynamic set with common prefix to stop the
complaints. Plus the code to propagate those params.

Is that a good general description? I am just wondering if this can be
matched to some other real issues as well.

Regards,
 Alex
On 07/04/2014 11:23 pm, "Gregg Donovan"  wrote:

> That was my first attempt, but it's much trickier than I anticipated.
>
> A filter that calls HttpServletRequest#getParameter() before
> SolrDispatchFilter will trigger an exception  -- see
> getParameterIncompatibilityException [1] -- if the request is a POST. It
> seems that Solr depends on the configured per-core SolrRequestParser to
> properly parse the request parameters. A servlet filter that came before
> SolrDispatchFilter would need to fetch the correct SolrRequestParser for
> the requested core, parse the request, and reset the InputStream before
> pulling the data into the MDC. It also duplicates the work of request
> parsing. It's especially tricky if you want to remove the tracing
> parameters from the SolrParams and just have them in the MDC to avoid them
> being logged twice.
>
>
> [1]
>
> https://github.com/apache/lucene-solr/blob/trunk/solr/core/src/java/org/apache/solr/servlet/SolrRequestParsers.java#L621:L628
>
>
> On Sun, Apr 6, 2014 at 2:20 PM, Alexandre Rafalovitch  >wrote:
>
> > On the second thought,
> >
> > If you are already managing to pass the value using the request
> > parameters, what stops you from just having a servlet filter looking
> > for that parameter and assigning it directly to the MDC context?
> >
> > Regards,
> >Alex.
> > Personal website: http://www.outerthoughts.com/
> > Current project: http://www.solr-start.com/ - Accelerating your Solr
> > proficiency
> >
> >
> > On Sat, Apr 5, 2014 at 7:45 AM, Alexandre Rafalovitch
> >  wrote:
> > > I like the idea. No comments about implementation, leave it to others.
> > >
> > > But if it is done, maybe somebody very familiar with logging can also
> > > review Solr's current logging config. I suspect it is not optimized
> > > for troubleshooting at this point.
> > >
> > > Regards,
> > >Alex.
> > > Personal website: http://www.outerthoughts.com/
> > > Current project: http://www.solr-start.com/ - Accelerating your Solr
> > proficiency
> > >
> > >
> > > On Sat, Apr 5, 2014 at 3:16 AM, Gregg Donovan 
> > wrote:
> > >> We have some metadata -- e.g. a request UUID -- that we log to every
> log
> > >> line using Log4J's MDC [1]. The UUID logging allows us to connect any
> > log
> > >> lines we have for a given request across servers. Sort of like Zipkin
> > [2].
> > >>
> > >> Currently we're using EmbeddedSolrServer without sharding, so adding
> the
> > >> UUID is fairly simple, since everything is in one process and one
> > thread.
> > >> But, we're testing a sharded HTTP implementation and running into some
> > >> difficulties getting this data passed around in a way that lets us
> trace
> > >> all log lines generated by a request to its UUID.
> > >>
> >
>


Re: How do I add another unrelated query results to solr index

2014-04-07 Thread sanjay92
I think it was not just rootEntity="true".

We need to add transformer="TemplateTransformer" and make sure that each
entity has some kind of unique column across all entities; e.g. in this case
doc_id is a made-up column and its values should be unique across all
entities. The template clause works like a transformation; e.g. the doc_id values
are made up by prefixing salg_ to the values of ${salgrade.GRADE} in the first
entity section, while the second entity section uses a different prefix and a
different variable to make it unique.
different variable to make it Unique.

schema.xml have   doc_id
and also add following :
   
   
   
   
   
   
   


   

   




  
  




  

  





   
   



  






Re: Distributed tracing for Solr via adding HTTP headers?

2014-04-07 Thread Michael Sokolov
I had to grapple with something like this problem when I wrote Lux's 
app-server.  I extended SolrDispatchFilter and handle parameter 
swizzling to keep everything nicey-nicey for Solr while being able to 
play games with parameters of my own.  Perhaps this will give you some 
ideas:


https://github.com/msokolov/lux/blob/master/src/main/java/lux/solr/LuxDispatchFilter.java

It's definitely hackish, but it seems to get the job done - for me. It's 
not a reusable component, but it might serve as an illustration of one way 
to handle the problem.


-Mike

On 04/07/2014 12:23 PM, Gregg Donovan wrote:

That was my first attempt, but it's much trickier than I anticipated.

A filter that calls HttpServletRequest#getParameter() before
SolrDispatchFilter will trigger an exception  -- see
getParameterIncompatibilityException [1] -- if the request is a POST. It
seems that Solr depends on the configured per-core SolrRequestParser to
properly parse the request parameters. A servlet filter that came before
SolrDispatchFilter would need to fetch the correct SolrRequestParser for
the requested core, parse the request, and reset the InputStream before
pulling the data into the MDC. It also duplicates the work of request
parsing. It's especially tricky if you want to remove the tracing
parameters from the SolrParams and just have them in the MDC to avoid them
being logged twice.


[1]
https://github.com/apache/lucene-solr/blob/trunk/solr/core/src/java/org/apache/solr/servlet/SolrRequestParsers.java#L621:L628


On Sun, Apr 6, 2014 at 2:20 PM, Alexandre Rafalovitch wrote:


On the second thought,

If you are already managing to pass the value using the request
parameters, what stops you from just having a servlet filter looking
for that parameter and assigning it directly to the MDC context?

Regards,
Alex.
Personal website: http://www.outerthoughts.com/
Current project: http://www.solr-start.com/ - Accelerating your Solr
proficiency


On Sat, Apr 5, 2014 at 7:45 AM, Alexandre Rafalovitch
 wrote:

I like the idea. No comments about implementation, leave it to others.

But if it is done, maybe somebody very familiar with logging can also
review Solr's current logging config. I suspect it is not optimized
for troubleshooting at this point.

Regards,
Alex.
Personal website: http://www.outerthoughts.com/
Current project: http://www.solr-start.com/ - Accelerating your Solr

proficiency


On Sat, Apr 5, 2014 at 3:16 AM, Gregg Donovan 

wrote:

We have some metadata -- e.g. a request UUID -- that we log to every log
line using Log4J's MDC [1]. The UUID logging allows us to connect any

log

lines we have for a given request across servers. Sort of like Zipkin

[2].

Currently we're using EmbeddedSolrServer without sharding, so adding the
UUID is fairly simple, since everything is in one process and one

thread.

But, we're testing a sharded HTTP implementation and running into some
difficulties getting this data passed around in a way that lets us trace
all log lines generated by a request to its UUID.





Re: Ranking code

2014-04-07 Thread Shawn Heisey

On 4/7/2014 10:29 AM, azhar2007 wrote:

Hi does anybody know where the ranking code is held. Which file in Solr
stores it the solr schema.xml or solrconfig.xml file?


Your question is very generic.  It needs to be more specific -- what are 
you actually trying to do?


The generic answer is "both" ... query parameters that affect relevancy 
ranking can go in solrconfig.xml or included on an individual query.  
You can change which similarity class is used in schema.xml.  The 
analysis chain and field parameters you choose can also affect relevancy 
ranking, and those live in schema.xml.


https://wiki.apache.org/solr/SchemaXml#Similarity
https://wiki.apache.org/solr/SolrRelevancyFAQ

The actual code is not in either file -- it's in the java source code 
files that get compiled into Lucene and Solr.


Thanks,
Shawn



Re: Duplicate Unique Key

2014-04-07 Thread Simon
Erick,

It's indeed quite odd.  After I triggered re-indexing of all documents (via
the normal process of the existing program), the duplication was gone.  It can
not be reproduced easily, but it did occur occasionally, and that makes it a
frustrating task to troubleshoot.

Thanks,
Simon



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Duplicate-Unique-Key-tp4129651p4129701.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Fetching uniqueKey and other int quickly from documentCache?

2014-04-07 Thread Gregg Donovan
Yonik,

Requesting
fl=unique_key:field(unique_key),secondary_key:field(secondary_key),score vs
fl=unique_key,secondary_key,score was a nice performance win, as unique_key
and secondary_key were both already in the fieldCache. We removed our
documentCache, in fact, as it got so little use.
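
(A minimal SolrJ sketch of that fl usage, with a placeholder core URL and field names:)

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class FieldCacheKeysQuery {
    public static void main(String[] args) throws Exception {
        HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");
        SolrQuery q = new SolrQuery("*:*");
        // Request both keys as pseudo-fields via field(), so they come from the
        // fieldCache instead of being read and decompressed from stored fields.
        q.setFields("unique_key:field(unique_key)",
                    "secondary_key:field(secondary_key)",
                    "score");
        q.setRows(1000);
        QueryResponse rsp = server.query(q);
        System.out.println("numFound=" + rsp.getResults().getNumFound());
    }
}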

We do see a code path that fetches stored fields, though, in
BinaryResponseWriter, for the case of *only* pseudo-fields being requested.
I opened a ticket and attached a patch to
https://issues.apache.org/jira/browse/SOLR-5968.




On Mon, Mar 3, 2014 at 11:30 AM, Yonik Seeley  wrote:

> On Mon, Mar 3, 2014 at 11:14 AM, Gregg Donovan  wrote:
> > Yonik,
> >
> > That's a very clever idea. Unfortunately, I think that will skip the
> > distributed query optimization we were hoping to take advantage of in
> > SOLR-1880 [1], but it should work with the proposed distrib.singlePass
> > optimization in SOLR-5768 [2]. Does that sound right?
>
>
> Yep, the two together should do the trick.
>
> -Yonik
> http://heliosearch.org - native off-heap filters and fieldcache for solr
>
>
> > --Gregg
> >
> > [1] https://issues.apache.org/jira/browse/SOLR-1880
> > [2] https://issues.apache.org/jira/browse/SOLR-5768
> >
> >
> > On Wed, Feb 26, 2014 at 8:53 PM, Yonik Seeley 
> wrote:
> >
> >> You could try forcing things to go through function queries (via
> >> pseudo-fields):
> >>
> >> fl=field(id), field(myfield)
> >>
> >> If you're not requesting any stored fields, that *might* currently
> >> skip that step.
> >>
> >> -Yonik
> >> http://heliosearch.org - native off-heap filters and fieldcache for
> solr
> >>
> >>
> >> On Mon, Feb 24, 2014 at 9:58 PM, Gregg Donovan 
> wrote:
> >> > We fetch a large number of documents -- 1000+ -- for each search. Each
> >> > request fetches only the uniqueKey or the uniqueKey plus one secondary
> >> > integer key. Despite this, we find that we spent a sizable amount of
> time
> >> > in SolrIndexSearcher#doc(int docId, Set fields). Time is spent
> >> > fetching the two stored fields, LZ4 decoding, etc.
> >> >
> >> > I would love to be able to tell Solr to always fetch these two fields
> >> from
> >> > memory. We have them both in the fieldCache so we're already spending
> the
> >> > RAM. I've seen this asked previously [1], so it seems like a fairly
> >> common
> >> > need, especially for distributed search. Any ideas?
> >> >
> >> > A few possible ideas I had:
> >> >
> >> > --Check FieldCache.html#getCacheEntries() before going to stored
> fields.
> >> > --Give the documentCache config a list of fields it should load from
> the
> >> > fieldCache
> >> >
> >> >
> >> > Having an in-memory mapping from docId->uniqueKey has come up for us
> >> > before. We've used a custom SolrCache maintaining that mapping to
> quickly
> >> > filter over personalized collections. Maybe the uniqueKey should be
> more
> >> > optimized out of the box? Perhaps a custom "uniqueKey" codec that also
> >> > maintained the docId->uniqueKey mapping in memory?
> >> >
> >> > --Gregg
> >> >
> >> > [1] http://search-lucene.com/m/oCUKJ1heHUU1
> >>
>


Re: Distributed tracing for Solr via adding HTTP headers?

2014-04-07 Thread Gregg Donovan
Michael,

Thanks! Unfortunately, as we use POSTs, that approach would trigger the
getParameterIncompatibilityException call due to the Enumeration of
getParameterNames before SolrDispatchFilter has a chance to access the
InputStream.

I opened https://issues.apache.org/jira/browse/SOLR-5969 to discuss further
and attached our current patch.
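
(For the client side, a sketch of attaching a tracing header to every outgoing SolrJ 
request; the header name, MDC key, and use of SLF4J's MDC are assumptions, not the 
attached patch:)

import org.apache.http.HttpRequest;
import org.apache.http.HttpRequestInterceptor;
import org.apache.http.impl.client.DefaultHttpClient;
import org.apache.http.protocol.HttpContext;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.slf4j.MDC;

public class TracingSolrClientFactory {
    public static HttpSolrServer create(String baseUrl) {
        DefaultHttpClient httpClient = new DefaultHttpClient();
        // Copy the request UUID from the calling thread's MDC into an HTTP header.
        httpClient.addRequestInterceptor(new HttpRequestInterceptor() {
            public void process(HttpRequest request, HttpContext context) {
                String uuid = MDC.get("requestUUID");
                if (uuid != null) {
                    request.addHeader("X-Request-UUID", uuid);
                }
            }
        });
        return new HttpSolrServer(baseUrl, httpClient);
    }
}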


On Mon, Apr 7, 2014 at 2:02 PM, Michael Sokolov <
msoko...@safaribooksonline.com> wrote:

> I had to grapple with something like this problem when I wrote Lux's
> app-server.  I extended SolrDispatchFilter and handle parameter swizzling
> to keep everything nicey-nicey for Solr while being able to play games with
> parameters of my own.  Perhaps this will give you some ideas:
>
> https://github.com/msokolov/lux/blob/master/src/main/java/
> lux/solr/LuxDispatchFilter.java
>
> It's definitely hackish, but seems to get the job done - for me - it's not
> a reusable component, but might serve as an illustration of one way to
> handle the problem
>
> -Mike
>
>
> On 04/07/2014 12:23 PM, Gregg Donovan wrote:
>
>> That was my first attempt, but it's much trickier than I anticipated.
>>
>> A filter that calls HttpServletRequest#getParameter() before
>> SolrDispatchFilter will trigger an exception  -- see
>> getParameterIncompatibilityException [1] -- if the request is a POST. It
>> seems that Solr depends on the configured per-core SolrRequestParser to
>> properly parse the request parameters. A servlet filter that came before
>> SolrDispatchFilter would need to fetch the correct SolrRequestParser for
>> the requested core, parse the request, and reset the InputStream before
>> pulling the data into the MDC. It also duplicates the work of request
>> parsing. It's especially tricky if you want to remove the tracing
>> parameters from the SolrParams and just have them in the MDC to avoid them
>> being logged twice.
>>
>>
>> [1]
>> https://github.com/apache/lucene-solr/blob/trunk/solr/
>> core/src/java/org/apache/solr/servlet/SolrRequestParsers.java#L621:L628
>>
>>
>> On Sun, Apr 6, 2014 at 2:20 PM, Alexandre Rafalovitch > >wrote:
>>
>>  On the second thought,
>>>
>>> If you are already managing to pass the value using the request
>>> parameters, what stops you from just having a servlet filter looking
>>> for that parameter and assigning it directly to the MDC context?
>>>
>>> Regards,
>>> Alex.
>>> Personal website: http://www.outerthoughts.com/
>>> Current project: http://www.solr-start.com/ - Accelerating your Solr
>>> proficiency
>>>
>>>
>>> On Sat, Apr 5, 2014 at 7:45 AM, Alexandre Rafalovitch
>>>  wrote:
>>>
 I like the idea. No comments about implementation, leave it to others.

 But if it is done, maybe somebody very familiar with logging can also
 review Solr's current logging config. I suspect it is not optimized
 for troubleshooting at this point.

 Regards,
 Alex.
 Personal website: http://www.outerthoughts.com/
 Current project: http://www.solr-start.com/ - Accelerating your Solr

>>> proficiency
>>>

 On Sat, Apr 5, 2014 at 3:16 AM, Gregg Donovan 

>>> wrote:
>>>
 We have some metadata -- e.g. a request UUID -- that we log to every log
> line using Log4J's MDC [1]. The UUID logging allows us to connect any
>
 log
>>>
 lines we have for a given request across servers. Sort of like Zipkin
>
 [2].
>>>
 Currently we're using EmbeddedSolrServer without sharding, so adding the
> UUID is fairly simple, since everything is in one process and one
>
 thread.
>>>
 But, we're testing a sharded HTTP implementation and running into some
> difficulties getting this data passed around in a way that lets us
> trace
> all log lines generated by a request to its UUID.
>
>
>


Re: Full Indexing is Causing a Java Heap Out of Memory Exception

2014-04-07 Thread Candygram For Mongo
I wanted to take a moment and say thank you for your help.  We haven't
solved the problem yet but it seems like we may be on the path.

Responses to your questions below:

1) We are using settings of 6GBs for -Xmx and -Xms on a production server
where this process is failing on about 30 million relatively small records.
 We have the need to execute the same processes on much larger data sets
(10x or more).  There seems to be a somewhat linear requirement for memory
which is not sustainable.

2) We do not use the MDSolrDIHTransformer.jar.  That jar is some legacy
code that is commented out.  We are using the following jars:
common.jar, webapp.jar, commons-pool-1.4.jar.
 The first two contain our custom code, including filters.  The last
is from Apache.

3) We have Solr configured to switch what it uses based on the environment.
 Looking at the INFOSTREAM.txt file, it is using MMap in the environment in
question.

4) Incrementing the batchSize to 5,000 or 10,000 accelerates the OOM error
(using the 64MB heap size) and it is not able to execute the query.  See
the error below:



*java.sql.SQLException: Protocol violation: [2]*

*at oracle.jdbc.driver.T4CTTIfun.receive(T4CTTIfun.java:527)*

*at oracle.jdbc.driver.T4CTTIfun.doRPC(T4CTTIfun.java:227)*

*at
oracle.jdbc.driver.T4C7Ocommoncall.doOLOGOFF(T4C7Ocommoncall.java:61)*

*at oracle.jdbc.driver.T4CConnection.logoff(T4CConnection.java:574)*

*at
oracle.jdbc.driver.PhysicalConnection.close(PhysicalConnection.java:4011)*

*at
org.apache.solr.handler.dataimport.JdbcDataSource.closeConnection(JdbcDataSource.java:410)*

*at
org.apache.solr.handler.dataimport.JdbcDataSource.close(JdbcDataSource.java:395)*

*at
org.apache.solr.handler.dataimport.DocBuilder.closeEntityProcessorWrappers(DocBuilder.java:284)*

*at
org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:273)*

*at
org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:422)*

*at
org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:487)*

*at
org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:468)*



*Apr 07, 2014 11:11:54 AM org.apache.solr.common.SolrException log*

*SEVERE: Full Import failed:java.lang.RuntimeException:
java.lang.RuntimeException: org.apache.solr.handler.dataimport.Data*

*at
org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:266)*

*at
org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:422)*

*at
org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:487)*

*at
org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:468)*

*Caused by: java.lang.RuntimeException:
org.apache.solr.handler.dataimport.DataImportHandlerException:
java.lang.OutOfMemor*

*at
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:406)*

*at
org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:319)*

*at
org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:227)*

*... 3 more*

*Caused by: org.apache.solr.handler.dataimport.DataImportHandlerException:
java.lang.OutOfMemoryError: Java heap space*

*at
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:535)*

*at
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:404)*



We also suspect that the copyfield may be the culprit.  We are trying the
CSV process now.





On Sat, Apr 5, 2014 at 3:16 AM, Ahmet Arslan  wrote:

> Hi,
>
> Now we have a more informative error
> : org.apache.solr.handler.dataimport.DataImportHandlerException:
> java.lang.OutOfMemoryError: Java heap space
>
> Caused by: org.apache.solr.handler.dataimport.DataImportHandlerException:
> java.lang.OutOfMemoryError: Java heap space
> at
> org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:535)
> at
> org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:404)
>
> 1) Does this happen when you increase -Xmx64m -Xms64m ?
>
> 2) I see you use custom jars called "MDSolrDIHTransformer JARs inside"
>  But I don't see any Transformers used in database.xml -- why is that? I would
> remove them just to be sure.
>
> 3) I see you have org.apache.solr.core.StandardDirectoryFactory declared
> in solrconfig.xml. Assuming you are using 64-bit Windows, it is recommended to
> use MMap
> http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html
>
>
> 4) In your previous mail you had batch size set, now there is not
> batchSize defined in database.xml. For MySQL it is recommended to use -1.
> Not sure about oracle, I personally used 10,000 once for Oracle.
> http://wiki.apache.org/solr/DataImportHandlerFaq#I.27m_using_DataImportHandler_with_a_MySQL_database._My_table_is_huge_and_DataImportHandler_is_going
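
(For reference, the batchSize Ahmet mentions goes on the DIH dataSource element; the 
driver, URL, and credentials below are placeholders:)

<dataSource type="JdbcDataSource"
            driver="oracle.jdbc.OracleDriver"
            url="jdbc:oracle:thin:@//dbhost:1521/ORCL"
            user="solr_reader"
            password="********"
            batchSize="10000"
            readOnly="true"/>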

Re: Duplicate Unique Key

2014-04-07 Thread Erick Erickson
Oh my yes! I feel a great sense of relief every time an intermittent
problem becomes reproducible... The problem is not solved, but at
least I have a good feeling that once I don't see it any more it's
_really_ gone!

One possibility is index merging, see:
https://wiki.apache.org/solr/MergingSolrIndexes. When you merge
indexes, there is no duplicate id checking performed, so you can well
have duplicates. That's a wild shot in the dark though.

Best,
Erick

On Mon, Apr 7, 2014 at 12:26 PM, Simon  wrote:
> Erick,
>
> It's indeed quite odd.  And after I trigger re-indexing all documents (via
> the normal process of existing program). The duplication is gone.  It can
> not be reproduced easily.  But it did occur occasionally and that makes it a
> frustrating task to troubleshoot.
>
> Thanks,
> Simon
>
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Duplicate-Unique-Key-tp4129651p4129701.html
> Sent from the Solr - User mailing list archive at Nabble.com.


Re: Analysis of Japanese characters

2014-04-07 Thread T. Kuro Kurosaka

Tom,
You should be using JapaneseAnalyzer (kuromoji).
Neither CJK nor ICU tokenize at word boundaries.

On 04/02/2014 10:33 AM, Tom Burton-West wrote:

Hi Shawn,

I'm not sure I understand the problem, or why you need to solve it at the
ICUTokenizer level rather than the CJKBigramFilter.
Can you perhaps give a few examples of the problem?

Have you looked at the flags for the CJKBigramfilter?
You can tell it to make bigrams of different Japanese character sets.  For
example the config given in the JavaDocs tells it to make bigrams across 3
of the different Japanese character sets.  (Is the issue related to Romaji?)

  



http://lucene.apache.org/core/4_7_1/analyzers-common/org/apache/lucene/analysis/cjk/CJKBigramFilterFactory.html
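
An illustrative version of such a configuration (not necessarily the exact JavaDoc 
snippet) that bigrams the three Japanese scripts looks like this:

<!-- emit bigrams across Han, Hiragana and Katakana; leave Hangul and unigrams off -->
<filter class="solr.CJKBigramFilterFactory"
        han="true" hiragana="true" katakana="true"
        hangul="false" outputUnigrams="false"/>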

Tom


On Wed, Apr 2, 2014 at 1:19 PM, Shawn Heisey  wrote:


My company is setting up a system for a customer from Japan.  We have an
existing system that handles primarily English.

Here's my general text analysis chain:

http://apaste.info/xa5

After talking to the customer about problems they are encountering with
search, we have determined that some of the problems are caused because
ICUTokenizer splits on *any* character set change, including changes
between different Japanese character sets.

Knowing the risk of this being an XY problem, here's my question: Can
someone help me develop a rule file for the ICU Tokenizer that will *not*
split when the character set changes from one of the japanese character
sets to another japanese character set, but still split on other character
set changes?

Thanks,
Shawn






Re: Analysis of Japanese characters

2014-04-07 Thread Shawn Heisey

On 4/7/2014 2:07 PM, T. Kuro Kurosaka wrote:

Tom,
You should be using JapaneseAnalyzer (kuromoji).
Neither CJK nor ICU tokenize at word boundaries.


Is JapaneseAnalyzer configurable with regard to what it does with 
non-japanese text?  If it's not, it won't work for me.


We use a combination of tokenizers and filters because there are no full 
analyzers that do what we require.  My analysis chain (for our index 
that's primarily english) has evolved over the last few years into its 
current form:


http://apaste.info/xa5

For our Japanese customer, we have recently changed from 
ICUFoldingFilter to ASCIIFoldingFilter and ICUNormalizer2Filter, because 
they do not want us to fold accent marks on Japanese characters.  I do 
not understand enough about Japanese to have an opinion on this, beyond 
the general "we should normalize EVERYTHING" approach.  The data from 
this customer is not purely Japanese - there is a lot of English as 
well, and quite possibly a small amount of other languages.
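
A sketch of that substitution in the analysis chain (the normalization form and filter 
order here are assumptions, not the actual config):

<!-- previously: <filter class="solr.ICUFoldingFilterFactory"/> -->
<filter class="solr.ASCIIFoldingFilterFactory"/>
<filter class="solr.ICUNormalizer2FilterFactory" name="nfkc" mode="compose"/>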


Thanks,
Shawn



Re: Solr interface

2014-04-07 Thread Michael Della Bitta
The speed of ingest via HTTP improves greatly once you do two things:

1. Batch multiple documents into a single request.
2. Index with multiple threads at once.
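
A minimal SolrJ sketch of both points (URL, document fields, batch size, and thread 
count are illustrative):

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class BatchIndexer {
    public static void main(String[] args) throws Exception {
        // HttpSolrServer is thread-safe, so one instance can be shared by all workers.
        final HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");
        ExecutorService pool = Executors.newFixedThreadPool(4);      // point 2: several threads

        for (int t = 0; t < 4; t++) {
            final int offset = t;
            pool.submit(new Runnable() {
                public void run() {
                    try {
                        List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
                        for (int i = offset; i < 100000; i += 4) {
                            SolrInputDocument doc = new SolrInputDocument();
                            doc.addField("id", Integer.toString(i));
                            doc.addField("title", "document " + i);
                            batch.add(doc);
                            if (batch.size() == 500) {               // point 1: batch per request
                                server.add(batch);
                                batch.clear();
                            }
                        }
                        if (!batch.isEmpty()) {
                            server.add(batch);
                        }
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
        server.commit();
    }
}

ConcurrentUpdateSolrServer is another option; it batches and streams documents to Solr in 
the background for you.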

Michael Della Bitta

Applications Developer

o: +1 646 532 3062

appinions inc.

"The Science of Influence Marketing"

18 East 41st Street

New York, NY 10017

t: @appinions  | g+:
plus.google.com/appinions
w: appinions.com 


On Mon, Apr 7, 2014 at 12:40 PM, Daniel Collins wrote:

> I have to agree with Shawn.  We have a SolrCloud setup with 256 shards,
> ~400M documents in total, with 4-way replication (so its quite a big
> setup!)  I had thought that HTTP would slow things down, so we recently
> trialed a JNI approach (clients are C++) so we could call SolrJ and get the
> benefits of JavaBin encoding for our indexing.
>
> Once we had done benchmarks with both solutions, I think we saved about 1ms
> per document (on average) with JNI, so it wasn't as big a gain as we were
> expecting.  There are other benefits of SolrJ (zookeeper integration,
> better routing, etc) and we were doing local HTTP (so it was literally just
> a TCP port to localhost, no actual net traffic) but that just goes to prove
> what other posters have said here.  Check whether HTTP really *is* the
> bottleneck before you try to replace it!
>
>
> On 7 April 2014 17:05, Shawn Heisey  wrote:
>
> > On 4/7/2014 5:52 AM, Jonathan Varsanik wrote:
> >
> >> Do you mean to tell me that the people on this list that are indexing
> >> 100s of millions of documents are doing this over http?  I have been
> using
> >> custom Lucene code to index files, as I thought this would be faster for
> >> many documents and I wanted some non-standard OCR and index fields.  Is
> >> there a better way?
> >>
> >> To the OP: You can also use Lucene to locally index files for Solr.
> >>
> >
> > My sharded index has 94 million docs in it.  All normal indexing and
> > maintenance is done with SolrJ, over HTTP. Currently full rebuilds are done
> > with the dataimport handler loading from MySQL, but that is legacy.  This
> > is NOT a SolrCloud installation.  It is also not a replicated setup -- my
> > indexing program keeps both copies up to date independently, similar to
> > what happens behind the scenes with SolrCloud.
> >
> > The single-thread DIH is very well optimized, and is faster than what I
> > have written myself -- also single-threaded.
> >
> > The real reason that we still use DIH for rebuilds is that I can run the
> > DIH simultaneously on all shards.  A full rebuild that way takes about 5
> > hours.  A SolrJ process feeding all shards with a single thread would
> take
> > a lot longer.  Once I have time to work on it, I can make the SolrJ
> rebuild
> > multi-threaded, and I expect it will be similar to DIH in rebuild speed.
> >  Hopefully I can make it faster.
> >
> > There is always overhead with HTTP.  On a gigabit LAN, I don't think it's
> > high enough to matter.
> >
> > Using Lucene to index files for Solr is an option -- but that requires
> > writing a custom Lucene application, and knowledge about how to turn the
> > Solr schema into Lucene code.  A lot of users on this list (me included)
> do
> > not have the skills required.  I know SolrJ reasonably well, but Lucene
> is
> > a nut that I haven't cracked.
> >
> > Thanks,
> > Shawn
> >
> >
>


Re: Full Indexing is Causing a Java Heap Out of Memory Exception

2014-04-07 Thread Ahmet Arslan
Hi,

I had similar problems before. We were trying to do the same thing as you, fetching 
too many small documents from Oracle with DIH. We were getting 

Caused by: java.sql.SQLException: ORA-01652: unable to extend temp segment by 
128 in tablespace TS_TEMP ORA-06512: at "IZCI.GET_FEED_KEYWORDS", line 20 at 
oracle.jdbc.driver.T4CTTIoer.processError(T4CTTIoer.java:450) at 
oracle.jdbc.driver.T4CTTIoer.processError(T4CTTIoer.java:399) at 
oracle.jdbc.driver.T4C8Oall.processError(T4C8Oall.java:837) at 
oracle.jdbc.driver.T4CTTIfun.receive(T4CTTIfun.java:459) at 
oracle.jdbc.driver.T4CTTIfun.doRPC(T4CTTIfun.java:193) at 
oracle.jdbc.driver.T4C8Oall.doOALL(T4C8Oall.java:531) at 
oracle.jdbc.driver.T4CStatement.doOall8(T4CStatement.java:197) at 
oracle.jdbc.driver.T4CStatement.fetch(T4CStatement.java:1348) at 
oracle.jdbc.driver.OracleResultSetImpl.close_or_fetch_from_next(OracleResultSetImpl.java:635)
 at oracle.jdbc.driver.OracleResultSetImpl.next(OracleResultSetImpl.java:514) 
at 
org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.hasnext(JdbcDataSource.java:334)
 ... 12 more


DB admins explained it but I don't remember. At the end we sliced our SQL 
sentence and did smaller imports with clean=false. 
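
For example, each slice can then be loaded with its own full-import call that keeps what 
the previous slices put in the index (host and core name are placeholders):

http://localhost:8983/solr/core1/dataimport?command=full-import&clean=false&commit=true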


Ahmet


On Monday, April 7, 2014 11:00 PM, Candygram For Mongo <> wrote:
I wanted to take a moment and say thank you for your help.  We haven't
solved the problem yet but it seems like we may be on the path.

Responses to your questions below:

1) We are using settings of 6GBs for -Xmx and -Xms on a production server
where this process is failing on about 30 million relatively small records.
We have the need to execute the same processes on much larger data sets
(10x or more).  There seems to be a somewhat linear requirement for memory
which is not sustainable.

2) We do not use the MDSolrDIHTransformer.jar.  That jar is some legacy
code that is commented out.  We are using the following jars:
common.jar, webapp.jar, commons-pool-1.4.jar.
The first two have our custom code in it that include filters.  The last
is from Apache.

3) We have Solr configured to switch what it uses based on the environment.
Looking at the INFOSTREAM.txt file, it is using MMap in the environment in
question.

4) Incrementing the batchSize to 5,000 or 10,000 accelerates the OOM error
(using the 64MB heap size) and it is not able to execute the query.  See
the error below:



*java.sql.SQLException: Protocol violation: [2]*

*        at oracle.jdbc.driver.T4CTTIfun.receive(T4CTTIfun.java:527)*

*        at oracle.jdbc.driver.T4CTTIfun.doRPC(T4CTTIfun.java:227)*

*        at
oracle.jdbc.driver.T4C7Ocommoncall.doOLOGOFF(T4C7Ocommoncall.java:61)*

*        at oracle.jdbc.driver.T4CConnection.logoff(T4CConnection.java:574)*

*        at
oracle.jdbc.driver.PhysicalConnection.close(PhysicalConnection.java:4011)*

*        at
org.apache.solr.handler.dataimport.JdbcDataSource.closeConnection(JdbcDataSource.java:410)*

*        at
org.apache.solr.handler.dataimport.JdbcDataSource.close(JdbcDataSource.java:395)*

*        at
org.apache.solr.handler.dataimport.DocBuilder.closeEntityProcessorWrappers(DocBuilder.java:284)*

*        at
org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:273)*

*        at
org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:422)*

*        at
org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:487)*

*        at
org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:468)*



*Apr 07, 2014 11:11:54 AM org.apache.solr.common.SolrException log*

*SEVERE: Full Import failed:java.lang.RuntimeException:
java.lang.RuntimeException: org.apache.solr.handler.dataimport.Data*

*        at
org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:266)*

*        at
org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:422)*

*        at
org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:487)*

*        at
org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:468)*

*Caused by: java.lang.RuntimeException:
org.apache.solr.handler.dataimport.DataImportHandlerException:
java.lang.OutOfMemor*

*        at
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:406)*

*        at
org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:319)*

*        at
org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:227)*

*        ... 3 more*

*Caused by: org.apache.solr.handler.dataimport.DataImportHandlerException:
java.lang.OutOfMemoryError: Java heap space*

*        at
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:535)*

*        at
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:404)*



We also suspect that the copyfield may be the culprit.  We are trying the
CSV process now.





On Sat, Apr 5, 2014 at 3:16 AM, Ahmet Arslan  wrote:

> Hi,
>
> Now we ha

Re: Searching multivalue fields.

2014-04-07 Thread Vijay Kokatnur
Yes I did restart solr, but did not re-index.  Is that necessary?  We've
got 80G of indexed data, is there a "preferred" way of doing it without
impacting performance?


On Sat, Apr 5, 2014 at 9:44 AM, Ahmet Arslan  wrote:

> Hi,
>
> Did you restart Solr and re-index after the schema change?
>On Saturday, April 5, 2014 2:39 AM, Vijay Kokatnur <
> kokatnur.vi...@gmail.com> wrote:
>  I had already tested with omitTermFreqAndPositions="false" .  I still
> got the same error.
>
> Is there something that I am overlooking?
>
> On Fri, Apr 4, 2014 at 2:45 PM, Ahmet Arslan  wrote:
>
> Hi Vijay,
>
> Add omitTermFreqAndPositions="false"  attribute to fieldType definitions.
>
>  omitTermFreqAndPositions="false" sortMissingLast="true" />
>
> omitTermFreqAndPositions="false" precisionStep="0"
> positionIncrementGap="0"/>
>
> You don't need termVectors  for this.
>
>1.2: omitTermFreqAndPositions attribute introduced, true by default
> except for text fields.
>
> And please reply to solr user mail, so others can use the threat later on.
>
> Ahmet
>   On Saturday, April 5, 2014 12:18 AM, Vijay Kokatnur <
> kokatnur.vi...@gmail.com> wrote:
>   Hey Ahmet,
>
> Sorry it took some time to test this.  But the schema definition seems to
> conflict with SpanQuery.  I get the following error when I use Spans
>
>  field "OrderLineType" was indexed without position data; cannot run
> SpanTermQuery (term=11)
>
> I changed the field definition in the schema but can't find the right
> attribute to set this.  My last attempt was with the following definition
>
> multiValued="true" *termVectors="true" termPositions="true"
> termOffsets="true"*/>
>
>  Any ideas what I am doing wrong?
>
> Thanks,
> -Vijay
>
> On Wed, Mar 26, 2014 at 1:54 PM, Ahmet Arslan  wrote:
>
> Hi Vijay,
>
> After reading the documentation it seems that following query is what you
> are after. It will return OrderId:345 without matching OrderId:123
>
> SpanQuery q1  = new SpanTermQuery(new Term("BookingRecordId", "234"));
> SpanQuery q2  = new SpanTermQuery(new Term("OrderLineType", "11"));
> SpanQuery q2m = new FieldMaskingSpanQuery(q2, "BookingRecordId");
> Query q = new SpanNearQuery(new SpanQuery[]{q1, q2m}, -1, false);
>
> Ahmet
>
>
>
> On Wednesday, March 26, 2014 10:39 PM, Ahmet Arslan 
> wrote:
> Hi Vijay,
>
> I personally don't understand joins very well. Just a guess: maybe
> FieldMaskingSpanQuery could be used?
>
>
> http://blog.griddynamics.com/2011/07/solr-experience-search-parent-child.html
>
>
> Ahmet
>
>
>
>
> On Wednesday, March 26, 2014 9:46 PM, Vijay Kokatnur <
> kokatnur.vi...@gmail.com> wrote:
> Hi,
>
> I am bumping this thread again one last time to see if anyone has a
> solution.
>
> In its current state, our application is storing child items as multivalue
> fields.  Consider some orders, for example -
>
>
> {
> OrderId:123
> BookingRecordId : ["145", "987", "*234*"]
> OrderLineType : ["11", "12", "*13*"]
> .
> }
> {
> OrderId:345
> BookingRecordId : ["945", "882", "*234*"]
> OrderLineType : ["1", "12", "*11*"]
> .
> }
> {
> OrderId:678
> BookingRecordId : ["444"]
> OrderLineType : ["11"]
> .
> }
>
>
> Here, if you look up an Order with BookingRecordId:234 AND
> OrderLineType:11, you will get two orders, with OrderId 123 and 345,
> which is correct.  Both orders have arrays that satisfy this
> condition.
>
> However, for OrderId:123, the value at the 3rd index of the OrderLineType array
> is 13 and not 11 (it is 11 for OrderId:345).  So OrderId 123 should be
> excluded. This is what I am trying to achieve.
>
> I got some suggestions from a solr-user to use FieldsCollapsing, Join,
> Block-join or string concatenation.  None of these approaches can be used
> without changing the schema and re-indexing.
>
> Has anyone found a non-invasive solution for this?
>
> Thanks,
>
> -Vijay
>
>
>
>
>
>
>
>


Re: Distributed tracing for Solr via adding HTTP headers?

2014-04-07 Thread Michael Sokolov
Yes, I see.  SolrDispatchFilter is  - not really written with 
extensibility in mind.


-Mike

On 4/7/14 3:50 PM, Gregg Donovan wrote:

Michael,

Thanks! Unfortunately, as we use POSTs, that approach would trigger the
getParameterIncompatibilityException call due to the Enumeration of
getParameterNames before SolrDispatchFilter has a chance to access the
InputStream.

I opened https://issues.apache.org/jira/browse/SOLR-5969 to discuss further
and attached our current patch.


On Mon, Apr 7, 2014 at 2:02 PM, Michael Sokolov <
msoko...@safaribooksonline.com> wrote:


I had to grapple with something like this problem when I wrote Lux's
app-server.  I extended SolrDispatchFilter and handle parameter swizzling
to keep everything nicey-nicey for Solr while being able to play games with
parameters of my own.  Perhaps this will give you some ideas:

https://github.com/msokolov/lux/blob/master/src/main/java/
lux/solr/LuxDispatchFilter.java

It's definitely hackish, but seems to get the job done - for me - it's not
a reusable component, but might serve as an illustration of one way to
handle the problem

-Mike


On 04/07/2014 12:23 PM, Gregg Donovan wrote:


That was my first attempt, but it's much trickier than I anticipated.

A filter that calls HttpServletRequest#getParameter() before
SolrDispatchFilter will trigger an exception  -- see
getParameterIncompatibilityException [1] -- if the request is a POST. It
seems that Solr depends on the configured per-core SolrRequestParser to
properly parse the request parameters. A servlet filter that came before
SolrDispatchFilter would need to fetch the correct SolrRequestParser for
the requested core, parse the request, and reset the InputStream before
pulling the data into the MDC. It also duplicates the work of request
parsing. It's especially tricky if you want to remove the tracing
parameters from the SolrParams and just have them in the MDC to avoid them
being logged twice.


[1]
https://github.com/apache/lucene-solr/blob/trunk/solr/
core/src/java/org/apache/solr/servlet/SolrRequestParsers.java#L621:L628


On Sun, Apr 6, 2014 at 2:20 PM, Alexandre Rafalovitch 
wrote:

  On the second thought,

If you are already managing to pass the value using the request
parameters, what stops you from just having a servlet filter looking
for that parameter and assigning it directly to the MDC context?

Regards,
 Alex.
Personal website: http://www.outerthoughts.com/
Current project: http://www.solr-start.com/ - Accelerating your Solr
proficiency


On Sat, Apr 5, 2014 at 7:45 AM, Alexandre Rafalovitch
 wrote:


I like the idea. No comments about implementation, leave it to others.

But if it is done, maybe somebody very familiar with logging can also
review Solr's current logging config. I suspect it is not optimized
for troubleshooting at this point.

Regards,
 Alex.
Personal website: http://www.outerthoughts.com/
Current project: http://www.solr-start.com/ - Accelerating your Solr


proficiency


On Sat, Apr 5, 2014 at 3:16 AM, Gregg Donovan 


wrote:


We have some metadata -- e.g. a request UUID -- that we log to every log

line using Log4J's MDC [1]. The UUID logging allows us to connect any


log
lines we have for a given request across servers. Sort of like Zipkin
[2].
Currently we're using EmbeddedSolrServer without sharding, so adding the

UUID is fairly simple, since everything is in one process and one


thread.
But, we're testing a sharded HTTP implementation and running into some

difficulties getting this data passed around in a way that lets us
trace
all log lines generated by a request to its UUID.






Re: Distributed tracing for Solr via adding HTTP headers?

2014-04-07 Thread Steve Davids
I have had this exact same use case and we ended up just setting a header 
value, then in a Servlet Filter we read the header value and set the MDC 
property within the filter. By reading the header value it didn’t complain 
about reading the request before making it to the SolrDispatchFilter. We used 
the Jetty web defaults to jam this functionality at the beginning of the 
servlet processing chain without having to crack open the war.
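
A minimal sketch of that filter (header name, MDC key, and the SLF4J import are 
assumptions, not the actual code):

import java.io.IOException;
import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;
import javax.servlet.http.HttpServletRequest;
import org.slf4j.MDC;

public class RequestUuidFilter implements Filter {
    public void init(FilterConfig config) { }

    public void doFilter(ServletRequest req, ServletResponse resp, FilterChain chain)
            throws IOException, ServletException {
        // Reading a header (unlike a parameter) does not consume the POST body,
        // so SolrDispatchFilter can still parse the request afterwards.
        String uuid = ((HttpServletRequest) req).getHeader("X-Request-UUID");
        if (uuid != null) {
            MDC.put("requestUUID", uuid);
        }
        try {
            chain.doFilter(req, resp);
        } finally {
            MDC.remove("requestUUID");
        }
    }

    public void destroy() { }
}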

-Steve

On Apr 7, 2014, at 8:01 PM, Michael Sokolov  
wrote:

> Yes, I see.  SolrDispatchFilter is  - not really written with extensibility 
> in mind.
> 
> -Mike
> 
> On 4/7/14 3:50 PM, Gregg Donovan wrote:
>> Michael,
>> 
>> Thanks! Unfortunately, as we use POSTs, that approach would trigger the
>> getParameterIncompatibilityException call due to the Enumeration of
>> getParameterNames before SolrDispatchFilter has a chance to access the
>> InputStream.
>> 
>> I opened https://issues.apache.org/jira/browse/SOLR-5969 to discuss further
>> and attached our current patch.
>> 
>> 
>> On Mon, Apr 7, 2014 at 2:02 PM, Michael Sokolov <
>> msoko...@safaribooksonline.com> wrote:
>> 
>>> I had to grapple with something like this problem when I wrote Lux's
>>> app-server.  I extended SolrDispatchFilter and handle parameter swizzling
>>> to keep everything nicey-nicey for Solr while being able to play games with
>>> parameters of my own.  Perhaps this will give you some ideas:
>>> 
>>> https://github.com/msokolov/lux/blob/master/src/main/java/
>>> lux/solr/LuxDispatchFilter.java
>>> 
>>> It's definitely hackish, but seems to get the job done - for me - it's not
>>> a reusable component, but might serve as an illustration of one way to
>>> handle the problem
>>> 
>>> -Mike
>>> 
>>> 
>>> On 04/07/2014 12:23 PM, Gregg Donovan wrote:
>>> 
 That was my first attempt, but it's much trickier than I anticipated.
 
 A filter that calls HttpServletRequest#getParameter() before
 SolrDispatchFilter will trigger an exception  -- see
 getParameterIncompatibilityException [1] -- if the request is a POST. It
 seems that Solr depends on the configured per-core SolrRequestParser to
 properly parse the request parameters. A servlet filter that came before
 SolrDispatchFilter would need to fetch the correct SolrRequestParser for
 the requested core, parse the request, and reset the InputStream before
 pulling the data into the MDC. It also duplicates the work of request
 parsing. It's especially tricky if you want to remove the tracing
 parameters from the SolrParams and just have them in the MDC to avoid them
 being logged twice.
 
 
 [1]
 https://github.com/apache/lucene-solr/blob/trunk/solr/
 core/src/java/org/apache/solr/servlet/SolrRequestParsers.java#L621:L628
 
 
 On Sun, Apr 6, 2014 at 2:20 PM, Alexandre Rafalovitch  wrote:
  On the second thought,
> If you are already managing to pass the value using the request
> parameters, what stops you from just having a servlet filter looking
> for that parameter and assigning it directly to the MDC context?
> 
> Regards,
> Alex.
> Personal website: http://www.outerthoughts.com/
> Current project: http://www.solr-start.com/ - Accelerating your Solr
> proficiency
> 
> 
> On Sat, Apr 5, 2014 at 7:45 AM, Alexandre Rafalovitch
>  wrote:
> 
>> I like the idea. No comments about implementation, leave it to others.
>> 
>> But if it is done, maybe somebody very familiar with logging can also
>> review Solr's current logging config. I suspect it is not optimized
>> for troubleshooting at this point.
>> 
>> Regards,
>> Alex.
>> Personal website: http://www.outerthoughts.com/
>> Current project: http://www.solr-start.com/ - Accelerating your Solr
>> 
> proficiency
> 
>> On Sat, Apr 5, 2014 at 3:16 AM, Gregg Donovan 
>> 
> wrote:
> 
>> We have some metadata -- e.g. a request UUID -- that we log to every log
>>> line using Log4J's MDC [1]. The UUID logging allows us to connect any
>>> 
>> log
>> lines we have for a given request across servers. Sort of like Zipkin
>> [2].
>> Currently we're using EmbeddedSolrServer without sharding, so adding the
>>> UUID is fairly simple, since everything is in one process and one
>>> 
>> thread.
>> But, we're testing a sharded HTTP implementation and running into some
>>> difficulties getting this data passed around in a way that lets us
>>> trace
>>> all log lines generated by a request to its UUID.
>>> 
>>> 
> 



Re: Solr interface

2014-04-07 Thread Jason Hellman
This.  And so much this.  As much this as you can muster.

On Apr 7, 2014, at 1:49 PM, Michael Della Bitta 
 wrote:

> The speed of ingest via HTTP improves greatly once you do two things:
> 
> 1. Batch multiple documents into a single request.
> 2. Index with multiple threads at once.
> 
> Michael Della Bitta
> 
> Applications Developer
> 
> o: +1 646 532 3062
> 
> appinions inc.
> 
> "The Science of Influence Marketing"
> 
> 18 East 41st Street
> 
> New York, NY 10017
> 
> t: @appinions  | g+:
> plus.google.com/appinions
> w: appinions.com 
> 
> 
> On Mon, Apr 7, 2014 at 12:40 PM, Daniel Collins wrote:
> 
>> I have to agree with Shawn.  We have a SolrCloud setup with 256 shards,
>> ~400M documents in total, with 4-way replication (so its quite a big
>> setup!)  I had thought that HTTP would slow things down, so we recently
>> trialed a JNI approach (clients are C++) so we could call SolrJ and get the
>> benefits of JavaBin encoding for our indexing
>> 
>> Once we had done benchmarks with both solutions, I think we saved about 1ms
>> per document (on average) with JNI, so it wasn't as big a gain as we were
>> expecting.  There are other benefits of SolrJ (zookeeper integration,
>> better routing, etc) and we were doing local HTTP (so it was literally just
>> a TCP port to localhost, no actual net traffic) but that just goes to prove
>> what other posters have said here.  Check whether HTTP really *is* the
>> bottleneck before you try to replace it!
>> 
>> 
>> On 7 April 2014 17:05, Shawn Heisey  wrote:
>> 
>>> On 4/7/2014 5:52 AM, Jonathan Varsanik wrote:
>>> 
 Do you mean to tell me that the people on this list that are indexing
 100s of millions of documents are doing this over http?  I have been
>> using
 custom Lucene code to index files, as I thought this would be faster for
 many documents and I wanted some non-standard OCR and index fields.  Is
 there a better way?
 
 To the OP: You can also use Lucene to locally index files for Solr.
 
>>> 
>>> My sharded index has 94 million docs in it.  All normal indexing and
>>> maintenance is done with SolrJ, over HTTP. Currently full rebuilds are done
>>> with the dataimport handler loading from MySQL, but that is legacy.  This
>>> is NOT a SolrCloud installation.  It is also not a replicated setup -- my
>>> indexing program keeps both copies up to date independently, similar to
>>> what happens behind the scenes with SolrCloud.
>>> 
>>> The single-thread DIH is very well optimized, and is faster than what I
>>> have written myself -- also single-threaded.
>>> 
>>> The real reason that we still use DIH for rebuilds is that I can run the
>>> DIH simultaneously on all shards.  A full rebuild that way takes about 5
>>> hours.  A SolrJ process feeding all shards with a single thread would
>> take
>>> a lot longer.  Once I have time to work on it, I can make the SolrJ
>> rebuild
>>> multi-threaded, and I expect it will be similar to DIH in rebuild speed.
>>> Hopefully I can make it faster.
>>> 
>>> There is always overhead with HTTP.  On a gigabit LAN, I don't think it's
>>> high enough to matter.
>>> 
>>> Using Lucene to index files for Solr is an option -- but that requires
>>> writing a custom Lucene application, and knowledge about how to turn the
>>> Solr schema into Lucene code.  A lot of users on this list (me included)
>> do
>>> not have the skills required.  I know SolrJ reasonably well, but Lucene
>> is
>>> a nut that I haven't cracked.
>>> 
>>> Thanks,
>>> Shawn
>>> 
>>> 
>> 



Re: Commit Within and /update/extract handler

2014-04-07 Thread Jamie Johnson
Below is the log showing what I believe to be the commit

07-Apr-2014 23:40:55.846 INFO [catalina-exec-5]
org.apache.solr.update.processor.LogUpdateProcessor.finish [forums]
webapp=/solr path=/update/extract
params={uprefix=attr_&literal.source_id=e4bb4bb6-96ab-4f8f-8a2a-1cf37dc1bcce&literal.content_group=File&
literal.id=e4bb4bb6-96ab-4f8f-8a2a-1cf37dc1bcce&literal.forum_id=3&literal.content_type=application/octet-stream&wt=javabin&literal.uploaded_by=+&version=2&literal.content_type=application/octet-stream&literal.file_name=exclusions}
{add=[e4bb4bb6-96ab-4f8f-8a2a-1cf37dc1bcce (1464785652471037952)]} 0 563
07-Apr-2014 23:41:10.847 INFO [commitScheduler-10-thread-1]
org.apache.solr.update.DirectUpdateHandler2.commit start
commit{,optimize=false,openSearcher=false,waitSearcher=true,expungeDeletes=false,softCommit=false,prepareCommit=false}
07-Apr-2014 23:41:10.847 INFO [commitScheduler-10-thread-1]
org.apache.solr.update.LoggingInfoStream.message
[IW][commitScheduler-10-thread-1]: commit: start
07-Apr-2014 23:41:10.848 INFO [commitScheduler-10-thread-1]
org.apache.solr.update.LoggingInfoStream.message
[IW][commitScheduler-10-thread-1]: commit: enter lock
07-Apr-2014 23:41:10.848 INFO [commitScheduler-10-thread-1]
org.apache.solr.update.LoggingInfoStream.message
[IW][commitScheduler-10-thread-1]: commit: now prepare
07-Apr-2014 23:41:10.848 INFO [commitScheduler-10-thread-1]
org.apache.solr.update.LoggingInfoStream.message
[IW][commitScheduler-10-thread-1]: prepareCommit: flush
07-Apr-2014 23:41:10.849 INFO [commitScheduler-10-thread-1]
org.apache.solr.update.LoggingInfoStream.message
[IW][commitScheduler-10-thread-1]:   index before flush _y(4.6):C1
_10(4.6):C1 _11(4.6):C1 _12(4.6):C1
07-Apr-2014 23:41:10.849 INFO [commitScheduler-10-thread-1]
org.apache.solr.update.LoggingInfoStream.message
[DW][commitScheduler-10-thread-1]: commitScheduler-10-thread-1
startFullFlush
07-Apr-2014 23:41:10.849 INFO [commitScheduler-10-thread-1]
org.apache.solr.update.LoggingInfoStream.message
[DW][commitScheduler-10-thread-1]: anyChanges? numDocsInRam=1 deletes=true
hasTickets:false pendingChangesInFullFlush: false
07-Apr-2014 23:41:10.850 INFO [commitScheduler-10-thread-1]
org.apache.solr.update.LoggingInfoStream.message
[DWFC][commitScheduler-10-thread-1]: addFlushableState
DocumentsWriterPerThread [pendingDeletes=gen=0, segment=_14,
aborting=false, numDocsInRAM=1, deleteQueue=DWDQ: [ generation: 2 ]]
07-Apr-2014 23:41:10.852 INFO [commitScheduler-10-thread-1]
org.apache.solr.update.LoggingInfoStream.message
[DWPT][commitScheduler-10-thread-1]: flush postings as segment _14 numDocs=1
07-Apr-2014 23:41:10.904 INFO [commitScheduler-10-thread-1]
org.apache.solr.update.LoggingInfoStream.message
[DWPT][commitScheduler-10-thread-1]: new segment has 0 deleted docs
07-Apr-2014 23:41:10.904 INFO [commitScheduler-10-thread-1]
org.apache.solr.update.LoggingInfoStream.message
[DWPT][commitScheduler-10-thread-1]: new segment has no vectors; norms; no
docValues; prox; freqs
07-Apr-2014 23:41:10.904 INFO [commitScheduler-10-thread-1]
org.apache.solr.update.LoggingInfoStream.message
[DWPT][commitScheduler-10-thread-1]: flushedFiles=[_14.nvd,
_14_Lucene41_0.pos, _14_Lucene41_0.tip, _14_Lucene41_0.tim, _14.nvm,
_14.fdx, _14_Lucene41_0.doc, _14.fnm, _14.fdt]
07-Apr-2014 23:41:10.905 INFO [commitScheduler-10-thread-1]
org.apache.solr.update.LoggingInfoStream.message
[DWPT][commitScheduler-10-thread-1]: flushed codec=Lucene46
07-Apr-2014 23:41:10.905 INFO [commitScheduler-10-thread-1]
org.apache.solr.update.LoggingInfoStream.message
[DWPT][commitScheduler-10-thread-1]: flushed: segment=_14 ramUsed=0.122 MB
newFlushedSize(includes docstores)=0.003 MB docs/MB=322.937
07-Apr-2014 23:41:10.907 INFO [commitScheduler-10-thread-1]
org.apache.solr.update.LoggingInfoStream.message
[DW][commitScheduler-10-thread-1]: publishFlushedSegment seg-private
updates=null
07-Apr-2014 23:41:10.907 INFO [commitScheduler-10-thread-1]
org.apache.solr.update.LoggingInfoStream.message
[IW][commitScheduler-10-thread-1]: publishFlushedSegment
07-Apr-2014 23:41:10.907 INFO [commitScheduler-10-thread-1]
org.apache.solr.update.LoggingInfoStream.message
[BD][commitScheduler-10-thread-1]: push deletes  1 deleted terms (unique
count=1) bytesUsed=1024 delGen=4 packetCount=1 totBytesUsed=1024
07-Apr-2014 23:41:10.907 INFO [commitScheduler-10-thread-1]
org.apache.solr.update.LoggingInfoStream.message
[IW][commitScheduler-10-thread-1]: publish sets newSegment delGen=5
seg=_14(4.6):C1
07-Apr-2014 23:41:10.908 INFO [commitScheduler-10-thread-1]
org.apache.solr.update.LoggingInfoStream.message
[IFD][commitScheduler-10-thread-1]: now checkpoint "_y(4.6):C1 _10(4.6):C1
_11(4.6):C1 _12(4.6):C1 _14(4.6):C1" [5 segments ; isCommit = false]
07-Apr-2014 23:41:10.908 INFO [commitScheduler-10-thread-1]
org.apache.solr.update.LoggingInfoStream.message
[IFD][commitScheduler-10-thread-1]: 0 msec to checkpoint
07-Apr-2014 23:41:10.908 INFO [commitScheduler-10-thread-1]
org.apache.solr.update.L

Re: Regex For *|* at hl.regex.pattern

2014-04-07 Thread Jack Krupansky
The regex pattern should match the text of the fragment. IOW, exclude 
whatever delimiters are not allowed in the fragment.


The default is:

[-\w ,\n"']{20,200}
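
So for the "*|*" case above, a pattern that simply excludes the delimiter characters 
should keep fragments from crossing it, for example (an untested sketch; remember to 
URL-encode it when passing it as a request parameter):

hl.regex.pattern=[^|*]{20,200}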

-- Jack Krupansky

-Original Message- 
From: Furkan KAMACI

Sent: Monday, April 7, 2014 10:21 AM
To: solr-user@lucene.apache.org
Subject: Regex For *|* at hl.regex.pattern

Hi;

I tried that but it does not work. Am I missing anything here:

q=portu&hl.regex.pattern=.*\*\|\*.*&hl.fragsize=120&hl.regex.slop=0.2

My aim is to check whether it includes *|* or not (that's why I've put .* at the
beginning and end of the regex, to match whatever comes before and after)

How to fix it?

Thanks;
Furkan KAMACI 



Re: Reading Solr index

2014-04-07 Thread Dmitry Kan
Thanks, François.

azhar2007: remember to set the perm gen size:

java -XX:MaxPermSize=512m -jar luke-with-deps.jar

Dmitry


On Mon, Apr 7, 2014 at 7:29 PM, François Schiettecatte <
fschietteca...@gmail.com> wrote:

> Maybe you should try a more recent release of Luke:
>
> https://github.com/DmitryKey/luke/releases
>
> François
>
> On Apr 7, 2014, at 12:27 PM, azhar2007  wrote:
>
> > Hi All,
> >
> > I have a Solr index which was indexed in Solr 4.7.0.
> >
> > I've attempted to open the index with Luke 4.0.0 and also other versions
> > with no luck.
> > It gives me an error message.
> >
> > Is there a way of reading the data?
> >
> > I would like to convert the file to a readable format where I can see the
> > terms it holds from the documents etc.
> >
> > Please Help!!
> >
> >
> >
> > --
> > View this message in context:
> http://lucene.472066.n3.nabble.com/Reading-Solr-index-tp4129662.html
> > Sent from the Solr - User mailing list archive at Nabble.com.
>
>


-- 
Dmitry
Blog: http://dmitrykan.blogspot.com
Twitter: http://twitter.com/dmitrykan


Re: Bad request on update.distrib=FROMLEADER

2014-04-07 Thread Cihad Guzel
Hi,

Do all of your nodes have the same configuration?


2014-04-07 12:45 GMT+03:00 Gastone Penzo :

> Hello,
> I have a problem with bad requests while indexing data.
> I have four nodes with SolrCloud. The architecture is this:
>
> 10.0.0.86   10.0.0.87
> NODE1  NODE 2
>  |  |
>  |  |
>  |  |
>  |  |
> NODE 3 NODE 4
> 10.0.0.88   10.0.0.89
>
> 2 shards (node1 and node 2) with 2 replicas (node 3 and node4)
>
>
> I tried to index data in node1 with DataImportHandler (MySQL) and a
> full-import.
> The index was created, but only half of it, and I had this error:
>
> bad request
>
> request:
>
> http://10.0.0.88:9002/solr/collection1/update?update.distrib=FROMLEADER&distrib.from=http://10.0.0.86:9000/solr/collection1/&wt=javabin&version=2
> at org.apache.solr.client.solrj.i
>
> mpl.ConcurrentUpdateSolrServer$Runner.run(ConcurrentUpdateSolrServer.java:240)
> at
>
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at
>
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:724)
>
> I think node 1 calls node 2 to give it half of the index, but the parameter
> distrib.from is incomplete. Why?
> If I create the index with post.jar there are no problems. Is it a problem with
> the DataImportHandler?
>
> thank you
>
>
> --
> *Gastone Penzo*
>