Re: TikaEntityProcessor Exception Handling

2014-04-07 Thread akash2489
Any updates on this?





ArrayIndexOutOfBoundsException while reindexing via DIH

2014-04-07 Thread Ralf Matulat

Hi,
we are currently facing a new problem while reindexing one of our SOLR 
4.4 instances:


We are using SOLR 4.4 getting data via DIH out of a MySQL Server.
The data is constantly growing.

We have reindexed our data a lot of times without any trouble.
The problem can be reproduced.

There is another server, configured exactly the same way (via git), which 
was reindexed 3 days ago against the same MySQL server without problems.
But: that server has more RAM and more powerful CPUs than the one causing 
headaches today.


The error log says:

java.lang.ArrayIndexOutOfBoundsException
    at org.apache.lucene.util.packed.Packed64SingleBlock$Packed64SingleBlock4.get(Packed64SingleBlock.java:336)
    at org.apache.lucene.util.packed.GrowableWriter.get(GrowableWriter.java:56)
    at org.apache.lucene.util.packed.AbstractPagedMutable.get(AbstractPagedMutable.java:88)
    at org.apache.lucene.util.fst.NodeHash.addNew(NodeHash.java:151)
    at org.apache.lucene.util.fst.NodeHash.rehash(NodeHash.java:169)
    at org.apache.lucene.util.fst.NodeHash.add(NodeHash.java:133)
    at org.apache.lucene.util.fst.Builder.compileNode(Builder.java:197)
    at org.apache.lucene.util.fst.Builder.freezeTail(Builder.java:289)
    at org.apache.lucene.util.fst.Builder.add(Builder.java:394)
    at org.apache.lucene.codecs.BlockTreeTermsWriter$PendingBlock.append(BlockTreeTermsWriter.java:474)
    at org.apache.lucene.codecs.BlockTreeTermsWriter$PendingBlock.compileIndex(BlockTreeTermsWriter.java:438)
    at org.apache.lucene.codecs.BlockTreeTermsWriter$TermsWriter.writeBlocks(BlockTreeTermsWriter.java:569)
    at org.apache.lucene.codecs.BlockTreeTermsWriter$TermsWriter$FindBlocks.freeze(BlockTreeTermsWriter.java:544)
    at org.apache.lucene.util.fst.Builder.freezeTail(Builder.java:214)
    at org.apache.lucene.util.fst.Builder.finish(Builder.java:463)
    at org.apache.lucene.codecs.BlockTreeTermsWriter$TermsWriter.finish(BlockTreeTermsWriter.java:1010)
    at org.apache.lucene.index.FreqProxTermsWriterPerField.flush(FreqProxTermsWriterPerField.java:553)
    at org.apache.lucene.index.FreqProxTermsWriter.flush(FreqProxTermsWriter.java:85)
    at org.apache.lucene.index.TermsHash.flush(TermsHash.java:116)
    at org.apache.lucene.index.DocInverter.flush(DocInverter.java:53)
    at org.apache.lucene.index.DocFieldProcessor.flush(DocFieldProcessor.java:81)
    at org.apache.lucene.index.DocumentsWriterPerThread.flush(DocumentsWriterPerThread.java:501)
    at org.apache.lucene.index.DocumentsWriter.doFlush(DocumentsWriter.java:478)
    at org.apache.lucene.index.DocumentsWriter.postUpdate(DocumentsWriter.java:372)
    at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:445)
    at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java)
    at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:212)
    at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:69)
    at org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:51)
    at org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(DistributedUpdateProcessor.java:572)
    at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:435)
    at org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:100)
    at org.apache.solr.handler.dataimport.SolrWriter.upload(SolrWriter.java:70)
    at org.apache.solr.handler.dataimport.DataImportHandler$1.upload(DataImportHandler.java:237)
    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:504)
    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:408)
    at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:323)
    at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:231)
    at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:411)
    at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:476)
    at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:457)


Any suggestions are welcome.
Best regards
Ralf


Routing distance with Solr?

2014-04-07 Thread Matteo Tarantino
Hi all,
this is my first message on this mailing list, so I hope I'm doing everything
correctly.

My problem is: I have to create a search engine for dealers that are within a
well-defined routing distance of the address entered by the user. I have
already used Solr for some previous work, but I never needed geospatial
search, so I'm a newbie in this field.

On the web I have read that Solr can only calculate the distance "as the
crow flies" between two points, but for my purposes I need the exact
routing distance. Can you confirm that this is not possible with Solr? (If
so, I think I'll have to refine the results with additional calculations via the
Google Maps APIs or some OSM tool like GraphHopper.)


Thank you in advance!
Matteo


RE: Query and field name with wildcard

2014-04-07 Thread Croci Francesco Luigi (ID SWS)
Hello Alex,

I saw your example and took it as template for my needs.

I tried with the aliasing, but, maybe because I did it wrong, it does not 
work...

"error": {
"msg": "undefined field all",
"code": 400
  }

Here is a snippet of my solrconfig.xml:

...


explicit


rmDocumentTitle rmDocumentArt 
rmDocumentClass rmDocumentSubclass rmDocumentCatName rmDocumentCatNameEn 
fullText




  
edismax
fullText_en
full_Text
json
true
  
  
language:en
fullText_en
rmDocumentTitle rmDocumentArt 
rmDocumentClass rmDocumentSubclass rmDocumentCatName rmDocumentCatNameEn 
fullText_en
* -fullText_*
*,fullText:fullText_en
  



  
edismax
fullText_de
full_Text
json
true
  
  
language:de
fullText_de
rmDocumentTitle rmDocumentArt 
rmDocumentClass rmDocumentSubclass rmDocumentCatName rmDocumentCatNameEn 
fullText_de
* -fullText_*
*,fullText:fullText_de
  

...

What am I missing/ doing wrong?


Regards,
Francesco

-Original Message-
From: Alexandre Rafalovitch [mailto:arafa...@gmail.com] 
Sent: Freitag, 4. April 2014 11:08
To: solr-user@lucene.apache.org
Subject: Re: Query and field name with wildcard

Are you using eDisMax? That gives a lot of options, including field aliasing, 
which lets you map a single name to multiple fields:
http://wiki.apache.org/solr/ExtendedDisMax#Field_aliasing_.2F_renaming
(with example on p77 of my book
http://www.packtpub.com/apache-solr-for-indexing-data/book :-)

Regards,
   Alex.

Personal website: http://www.outerthoughts.com/ Current project: 
http://www.solr-start.com/ - Accelerating your Solr proficiency
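
As an aside, the same aliasing can also be passed as plain request parameters, which makes it easy to try out before touching solrconfig.xml. A rough SolrJ sketch using the field names from this thread; the core URL is an assumption, not something from the original mails:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;

public class EdismaxAliasExample {
    public static void main(String[] args) throws Exception {
        // SolrJ 4.x client; URL is an assumption
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");

        SolrQuery q = new SolrQuery("all:some_word");
        q.set("defType", "edismax");
        // Define the pseudo-field "all" as an alias over several real fields
        q.set("f.all.qf", "rmDocumentTitle rmDocumentClass rmDocumentSubclass rmDocumentArt");

        System.out.println(solr.query(q).getResults().getNumFound());
        solr.shutdown();
    }
}

This only helps if the request actually goes through edismax, which matches the resolution later in this thread.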


On Fri, Apr 4, 2014 at 3:52 PM, Croci  Francesco Luigi (ID SWS) 
 wrote:
> In my index I have some fields which have the same prefix(rmDocumentTitle, 
> rmDocumentClass, rmDocumentSubclass, rmDocumentArt). Apparently it is not 
> possible to specify a query like this:
>
> q = rm* : some_word
>
> Is there a way to do this without having to write a long list of ORs?
>
> Another question is if it is really not possible to search a word over 
> the entire index. Something like this: q = * : some_word
>
> Thank you
> Francesco


Re: Block Join Parent Query across children docs

2014-04-07 Thread mertens
Thanks Hoss, with the filter queries it works. I was trying to use a normal
query from Mikhail's blog that looked like this:

q={!parent which=type_s:parent}+search_t:item1 +search_t:item2
-search_t:item3

That query doesn't work for me but the filter query does just what I want.

PS: last year's Stump the Chump was great, and it looks like you're still not
stumped.

Cheers,
Luke
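
For reference, a rough SolrJ sketch of the filter-query form from Hoss' reply quoted below; the field names come from the example data in the thread, and the core URL is an assumption:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class BlockJoinFilterExample {
    public static void main(String[] args) throws Exception {
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");

        SolrQuery q = new SolrQuery("*:*");
        q.set("p_filt", "type_s:parent");   // referenced from the fq's via $p_filt
        q.addFilterQuery("{!parent which=$p_filt}search_t:item1");
        q.addFilterQuery("{!parent which=$p_filt}search_t:item2");
        q.addFilterQuery("-{!parent which=$p_filt}search_t:item3");

        QueryResponse rsp = solr.query(q);
        System.out.println("Matching parents: " + rsp.getResults().getNumFound());
        solr.shutdown();
    }
}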


On Thu, Apr 3, 2014 at 1:39 AM, Chris Hostetter-3 [via Lucene] <
ml-node+s472066n4128734...@n3.nabble.com> wrote:

>
> : Thanks for your response. Here is an example of what I'm trying to do.
> If I
> : had the following documents:
>
> what you are attempting is fairly trivial -- you want to query for all
> parent documents, then apply 3 filters:
>
>  * parent of a child matching item1
>  * parent of a child matching item2
>  * not a parent of a child matching item3
>
> Part of your problem may be that (in your example you posted anyway)
> you appear to be trying to use a *string* field for listing multiple terms
> with commas and then seem to want to match on those individual terms --
> that's not going to work.  either make your string field a true
> multivalued field, or use a text field with tokenization.
>
> With the modified example data you provided below (using search_t instead
> of search_s) this query seems to do exactly what you want...
>
> http://localhost:8983/solr/select?p_filt=type_s:parent&q=*:*&fq={!parent%20which=$p_filt}search_t:item2&fq={!parent%20which=$p_filt}search_t:item1&fq=-{!parent%20which=$p_filt}search_t:item3
>
>
>  q = *:*
> p_filt = type_s:parent
> wt = json
> fq =  {!parent which=$p_filt}search_t:item2
> fq =  {!parent which=$p_filt}search_t:item1
> fq = -{!parent which=$p_filt}search_t:item3
>
>
> -Hoss
> http://www.lucidworks.com/
>
>
> --
>  If you reply to this email, your message will be added to the discussion
> below:
>
> http://lucene.472066.n3.nabble.com/Block-Join-Parent-Query-across-children-docs-tp4127637p4128734.html
>  To unsubscribe from Block Join Parent Query across children docs, click
> here
> .
> NAML
>





RE: Query and field name with wildcard

2014-04-07 Thread Croci Francesco Luigi (ID SWS)
Sorry, found the problem myself...

I used the /select handler, where edismax was not defined. 
The other two, /selectEN and /selectDE, worked.

Adding the edismax to the /select made it work too.

Ciao
Francesco

-Original Message-
From: Croci Francesco Luigi (ID SWS) [mailto:fcr...@id.ethz.ch] 
Sent: Montag, 7. April 2014 11:20
To: solr-user@lucene.apache.org
Subject: RE: Query and field name with wildcard

Hello Alex,

I saw your example and took it as template for my needs.

I tried with the aliasing, but, maybe because I did it wrong, it does not 
work...

"error": {
"msg": "undefined field all",
"code": 400
  }

Here is a snippet of my solrconfig.xml:

...


explicit


rmDocumentTitle rmDocumentArt 
rmDocumentClass rmDocumentSubclass rmDocumentCatName rmDocumentCatNameEn 
fullText




  
edismax
fullText_en
full_Text
json
true
  
  
language:en
fullText_en
rmDocumentTitle rmDocumentArt 
rmDocumentClass rmDocumentSubclass rmDocumentCatName rmDocumentCatNameEn 
fullText_en
* -fullText_*
*,fullText:fullText_en
  



  
edismax
fullText_de
full_Text
json
true
  
  
language:de
fullText_de
rmDocumentTitle rmDocumentArt 
rmDocumentClass rmDocumentSubclass rmDocumentCatName rmDocumentCatNameEn 
fullText_de
* -fullText_*
*,fullText:fullText_de
  

...

What am I missing/ doing wrong?


Regards,
Francesco

-Original Message-
From: Alexandre Rafalovitch [mailto:arafa...@gmail.com]
Sent: Freitag, 4. April 2014 11:08
To: solr-user@lucene.apache.org
Subject: Re: Query and field name with wildcard

Are you using eDisMax? That gives a lot of options, including field aliasing, 
which lets you map a single name to multiple fields:
http://wiki.apache.org/solr/ExtendedDisMax#Field_aliasing_.2F_renaming
(with example on p77 of my book
http://www.packtpub.com/apache-solr-for-indexing-data/book :-)

Regards,
   Alex.

Personal website: http://www.outerthoughts.com/ Current project: 
http://www.solr-start.com/ - Accelerating your Solr proficiency


On Fri, Apr 4, 2014 at 3:52 PM, Croci  Francesco Luigi (ID SWS) 
 wrote:
> In my index I have some fields which have the same prefix(rmDocumentTitle, 
> rmDocumentClass, rmDocumentSubclass, rmDocumentArt). Apparently it is not 
> possible to specify a query like this:
>
> q = rm* : some_word
>
> Is there a way to do this without having to write a long list of ORs?
>
> Another question is if it is really not possible to search a word over 
> the entire index. Something like this: q = * : some_word
>
> Thank you
> Francesco


Bad request on update.distrib=FROMLEADER

2014-04-07 Thread Gastone Penzo
Hello,
I have a problem with bad requests during indexing.
I have four nodes with SolrCloud. The architecture is this:

10.0.0.86   10.0.0.87
NODE1  NODE 2
 |  |
 |  |
 |  |
 |  |
NODE 3 NODE 4
10.0.0.88   10.0.0.89

2 shards (node1 and node 2) with 2 replicas (node 3 and node4)


I tried to index data on node1 with the DataImportHandler (MySQL) and a full-import.
The index was created, but only half of it, and I had this error:

bad request

request:
http://10.0.0.88:9002/solr/collection1/update?update.distrib=FROMLEADER&distrib.from=http://10.0.0.86:9000/solr/collection1/&wt=javabin&version=2
    at org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer$Runner.run(ConcurrentUpdateSolrServer.java:240)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:724)

I think node 1 calls node 2 to give it half of the index, but the parameter
distrib.from is incomplete. Why?
If I create the index with post.jar there are no problems. Is it a problem with
the DataImportHandler?

thank you


-- 
*Gastone Penzo*


Re: Using Sentence Information For Snippet Generation

2014-04-07 Thread Dmitry Kan
Furkan,

I haven't worked with the boundary scanner before, but one thing I had to
tweak with position increments was the highlighter component itself,
because it started to throw exceptions. The solution is described in this
thread (a conversation with myself :) )

http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201301.mbox/%3CCAHUAEU_qjKcgzrxtM=x90_j8i5v0a5h0mtq4b0+0etxc7q0...@mail.gmail.com%3E

HTH,
Dmitry
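
For anyone trying the position-gap idea described in the quoted message below, here is a rough sketch of a Lucene TokenFilter that swallows a sentence-boundary marker token and adds a large position gap before the next token. The marker string and gap size are assumptions, not something from the original mails:

import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

/** Adds a large position gap after each sentence-boundary marker token. */
public final class SentenceGapFilter extends TokenFilter {
    private static final String MARKER = "$SENT$"; // hypothetical boundary marker
    private static final int GAP = 10000;          // assumed gap size

    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
    private final PositionIncrementAttribute posIncrAtt = addAttribute(PositionIncrementAttribute.class);
    private boolean pendingGap = false;

    public SentenceGapFilter(TokenStream input) {
        super(input);
    }

    @Override
    public boolean incrementToken() throws IOException {
        while (input.incrementToken()) {
            if (MARKER.contentEquals(termAtt)) {
                pendingGap = true;   // swallow the marker, remember to add a gap
                continue;
            }
            if (pendingGap) {
                posIncrAtt.setPositionIncrement(posIncrAtt.getPositionIncrement() + GAP);
                pendingGap = false;
            }
            return true;
        }
        return false;
    }

    @Override
    public void reset() throws IOException {
        super.reset();
        pendingGap = false;
    }
}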


On Sun, Apr 6, 2014 at 12:44 AM, Furkan KAMACI wrote:

> Hi Dmitry;
>
> I think that such kind of hacking may reduce the search speed. I think that
> it should be done with boundary scanner isn't it? I think that bs.type=LINE
> is what I am looking for? There is one more point. I want to do that for
> Turkish language and I think that I should customize it or if I put special
> characters to point boundaries I can use simple boundary scanner?
>
> Thanks;
> Furkan KAMACI
>
>
>
> 2014-03-24 21:14 GMT+02:00 Dmitry Kan :
>
> > Hi Furkan,
> >
> > I have done an implementation with a custom filler (special character)
> > sequence in between sentences. A better solution I landed at was increasing
> > the position of each sentence's first token by a large number, like 10000
> > (perhaps, a smaller number could be used too). Then a user search can be
> > conducted with a proximity query: "some tokens" ~5000 (the recently
> > committed complexphrase parser supports rich phrase syntax, for example).
> > This of course expects that a sentence fits the 5000 window size and the
> > total number of sentences in the field * 10k does not exceed
> > Integer.MAX_VALUE. Then on the highlighter side you'd get the hits within
> > sentences naturally.
> >
> > Is this something you are looking for?
> >
> > Dmitry
> >
> >
> >
> > On Mon, Mar 24, 2014 at 5:43 PM, Furkan KAMACI  > >wrote:
> >
> > > Hi;
> > >
> > > When I generate snippet via Solr I do not want to remove beginning of
> any
> > > sentence at the snippet. So I need to do a sentence detection. I think
> > that
> > > I can do it before I send documents into Solr. I can put some special
> > > characters that signs beginning or end of a sentence. Then I can use
> that
> > > information when generating snippet. On the other hand I should not
> show
> > > that special character to the user.
> > >
> > > What do you think that how can I do it or do you have any other ideas
> for
> > > my purpose?
> > >
> > > PS: I do not do it for English sentences.
> > >
> > > Thanks;
> > > Furkan KAMACI
> > >
> >
> >
> >
> > --
> > Dmitry
> > Blog: http://dmitrykan.blogspot.com
> > Twitter: http://twitter.com/dmitrykan
> >
>



-- 
Dmitry
Blog: http://dmitrykan.blogspot.com
Twitter: http://twitter.com/dmitrykan


Re: Block Join Parent Query across children docs

2014-04-07 Thread Mikhail Khludnev
for sake of completeness, here is the same query w/o fq

q=+{!parent which=type_s:parent}search_t:item1 +{!parent
which=type_s:parent}search_t:item2 -{!parent
which=type_s:parent}search_t:item3

here is more detail about the first symbol magic
http://www.mail-archive.com/solr-user@lucene.apache.org/msg96796.html


On Mon, Apr 7, 2014 at 1:23 PM, mertens  wrote:

> Thanks Hoss, with the filter queries it works. I was trying to use a normal
> query from Mikhail's blog that looked like this:
>
> q={!parent which=type_s:parent}+search_t:item1 +search_t:item2
> -search_t:item3
>
> That query doesn't work for me but the filter query does just what I want.
>
> ps last years stump the chump was great, and it looks like you're still not
> stumped.
>
> Cheers,
> Luke
>
>
> On Thu, Apr 3, 2014 at 1:39 AM, Chris Hostetter-3 [via Lucene] <
> ml-node+s472066n4128734...@n3.nabble.com> wrote:
>
> >
> > : Thanks for your response. Here is an example of what I'm trying to do.
> > If I
> > : had the following documents:
> >
> > what you are attempting is fairly trivial -- you want to query for all
> > parent documents, then apply 3 filters:
> >
> >  * parent of a child matching item1
> >  * parent of a child matching item2
> >  * not a parent of a child matching item3
> >
> > Part of your problem may be that (in your example you posted anyway)
> > you appear to be trying to use a *string* field for listing multiple
> terms
> > with commas and then seem to want to match on those individual terms --
> > that's not going to work.  either make your string field a true
> > multivalued field, or use a text field with tokenization.
> >
> > With the modified example data you provided below (using search_t instead
> > of search_s) this query seems to do exactly what you want...
> >
> >
> http://localhost:8983/solr/select?p_filt=type_s:parent&q=*:*&fq={!parent%20which=$p_filt}search_t:item2&fq={!parent%20which=$p_filt}search_t:item1&fq=-{!parent%20which=$p_filt}search_t:item3
> >
> >
> >  q = *:*
> > p_filt = type_s:parent
> > wt = json
> > fq =  {!parent which=$p_filt}search_t:item2
> > fq =  {!parent which=$p_filt}search_t:item1
> > fq = -{!parent which=$p_filt}search_t:item3
> >
> >
> > -Hoss
> > http://www.lucidworks.com/
> >
> >
> > --
> >  If you reply to this email, your message will be added to the discussion
> > below:
> >
> >
> http://lucene.472066.n3.nabble.com/Block-Join-Parent-Query-across-children-docs-tp4127637p4128734.html
> >  To unsubscribe from Block Join Parent Query across children docs, click
> > here<
> http://lucene.472066.n3.nabble.com/template/NamlServlet.jtp?macro=unsubscribe_by_code&node=4127637&code=bG1lcnRlbnNAZ21haWwuY29tfDQxMjc2Mzd8LTU0NDAxMzQzMw==
> >
> > .
> > NAML<
> http://lucene.472066.n3.nabble.com/template/NamlServlet.jtp?macro=macro_viewer&id=instant_html%21nabble%3Aemail.naml&base=nabble.naml.namespaces.BasicNamespace-nabble.view.web.template.NabbleNamespace-nabble.view.web.template.NodeNamespace&breadcrumbs=notify_subscribers%21nabble%3Aemail.naml-instant_emails%21nabble%3Aemail.naml-send_instant_email%21nabble%3Aemail.naml
> >
> >
>
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Block-Join-Parent-Query-across-children-docs-tp4127637p4129588.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>



-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics





what is geodist default value

2014-04-07 Thread Aman Tandon
Hello,

In my index, i am using the LatlonType, for using the geodist to calculate
the distance, and i am using it like geodist(lat, lon, location). Can
anybody told me what value the geodist will return if i will pass
geodist(0, 0, location)

Thanks
Aman Tandon


Re: Block Join Parent Query across children docs

2014-04-07 Thread mertens
Yeah, that works also for me. Thanks Mikhail.


On Mon, Apr 7, 2014 at 12:42 PM, Mikhail Khludnev [via Lucene] <
ml-node+s472066n4129604...@n3.nabble.com> wrote:

> for sake of completeness, here is the same query w/o fq
>
> q=+{!parent which=type_s:parent}search_t:item1 +{!parent
> which=type_s:parent}search_t:item2 -{!parent
> which=type_s:parent}search_t:item3
>
> here is more detail about the first symbol magic
> http://www.mail-archive.com/solr-user@.../msg96796.html
>
>
>
> On Mon, Apr 7, 2014 at 1:23 PM, mertens <[hidden 
> email]>
> wrote:
>
> > Thanks Hoss, with the filter queries it works. I was trying to use a
> normal
> > query from Mikhail's blog that looked like this:
> >
> > q={!parent which=type_s:parent}+search_t:item1 +search_t:item2
> > -search_t:item3
> >
> > That query doesn't work for me but the filter query does just what I
> want.
> >
> > ps last years stump the chump was great, and it looks like you're still
> not
> > stumped.
> >
> > Cheers,
> > Luke
> >
> >
> > On Thu, Apr 3, 2014 at 1:39 AM, Chris Hostetter-3 [via Lucene] <
> > [hidden email] >
> wrote:
> >
> > >
> > > : Thanks for your response. Here is an example of what I'm trying to
> do.
> > > If I
> > > : had the following documents:
> > >
> > > what you are attempting is fairly trivial -- you want to query for all
> > > parent documents, then apply 3 filters:
> > >
> > >  * parent of a child matching item1
> > >  * parent of a child matching item2
> > >  * not a parent of a child matching item3
> > >
> > > Part of your problem may be that (in your example you posted
> anyway)
> > > you appear to be trying to use a *string* field for listing multiple
> > terms
> > > with commas and then seem to want to match on those individual terms
> --
> > > that's not going to work.  either make your string field a true
> > > multivalued field, or use a text field with tokenization.
> > >
> > > With the modified example data you provided below (using search_t
> instead
> > > of search_s) this query seems to do exactly what you want...
> > >
> > >
> > http://localhost:8983/solr/select?p_filt=type_s:parent&q=*:*&fq={!parent%20which=$p_filt}search_t:item2&fq={!parent%20which=$p_filt}search_t:item1&fq=-{!parent%20which=$p_filt}search_t:item3
>
> > >
> > >
> > >  q = *:*
> > > p_filt = type_s:parent
> > > wt = json
> > > fq =  {!parent which=$p_filt}search_t:item2
> > > fq =  {!parent which=$p_filt}search_t:item1
> > > fq = -{!parent which=$p_filt}search_t:item3
> > >
> > >
> > > -Hoss
> > > http://www.lucidworks.com/
> > >
> > >
> > > --
> > >  If you reply to this email, your message will be added to the
> discussion
> > > below:
> > >
> > >
> >
> http://lucene.472066.n3.nabble.com/Block-Join-Parent-Query-across-children-docs-tp4127637p4128734.html
> > >  To unsubscribe from Block Join Parent Query across children docs,
> click
> > > here<
> >
> >
> > > .
> > > NAML<
> >
> http://lucene.472066.n3.nabble.com/template/NamlServlet.jtp?macro=macro_viewer&id=instant_html%21nabble%3Aemail.naml&base=nabble.naml.namespaces.BasicNamespace-nabble.view.web.template.NabbleNamespace-nabble.view.web.template.NodeNamespace&breadcrumbs=notify_subscribers%21nabble%3Aemail.naml-instant_emails%21nabble%3Aemail.naml-send_instant_email%21nabble%3Aemail.naml
>
> > >
> > >
> >
> >
> >
> >
> > --
> > View this message in context:
> >
> http://lucene.472066.n3.nabble.com/Block-Join-Parent-Query-across-children-docs-tp4127637p4129588.html
>
> > Sent from the Solr - User mailing list archive at Nabble.com.
> >
>
>
>
> --
> Sincerely yours
> Mikhail Khludnev
> Principal Engineer,
> Grid Dynamics
>
> 
> <[hidden email] >
>
>
> --
>  If you reply to this email, your message will be added to the discussion
> below:
>
> http://lucene.472066.n3.nabble.com/Block-Join-Parent-Query-across-children-docs-tp4127637p4129604.html
>  To unsubscribe from Block Join Parent Query across children docs, click
> here
> .
> NAML
>





RE: Solr interface

2014-04-07 Thread Jonathan Varsanik
Do you mean to tell me that the people on this list that are indexing 100s of 
millions of documents are doing this over http?  I have been using custom 
Lucene code to index files, as I thought this would be faster for many 
documents and I wanted some non-standard OCR and index fields.  Is there a 
better way?

To the OP: You can also use Lucene to locally index files for Solr.



-Original Message-
From: Erik Hatcher [mailto:erik.hatc...@gmail.com] 
Sent: Thursday, April 03, 2014 8:47 AM
To: solr-user@lucene.apache.org
Cc: Solr User
Subject: Re: Solr interface

Yes. But why?

DataImportHandler kinda does this (still use http to kick off an indexing job). 
 And there's EmbeddedSolrServer too. 

Erik
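
For reference, a rough sketch of the EmbeddedSolrServer route (no HTTP involved). The solr home path and core name are assumptions, and the exact CoreContainer setup varies a bit between 4.x releases:

import org.apache.solr.client.solrj.embedded.EmbeddedSolrServer;
import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.core.CoreContainer;

public class EmbeddedIndexExample {
    public static void main(String[] args) throws Exception {
        // Point at a normal solr home containing solr.xml and the core's conf/
        CoreContainer container = new CoreContainer("/path/to/solr/home");
        container.load();
        EmbeddedSolrServer solr = new EmbeddedSolrServer(container, "collection1");

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "doc-1");
        doc.addField("title", "indexed without HTTP");
        solr.add(doc);
        solr.commit();

        solr.shutdown();
    }
}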

> On Apr 3, 2014, at 8:39, Александр Вандышев  wrote:
> 
> Is it possible to index files not via HTTP interface?


converting 4.7 index to 4.3.1

2014-04-07 Thread Dmitry Kan
Dear list,

We have been generating solr indices with the solr-hadoop contrib module
(SOLR-1301). Our current Solr in use is version 4.3.1. Is there any tool
that could do the backward conversion, i.e. 4.7->4.3.1? Or is the upgrade
the only way to go?

-- 
Dmitry
Blog: http://dmitrykan.blogspot.com
Twitter: http://twitter.com/dmitrykan


Re: ngramfilter minGramSize problem

2014-04-07 Thread Andreas Owen
It works well. Now why does the search only find something when the field name
is added to a query that contains stopwords?


"cug" -> 9 hits
"mit cug" -> 0 hits
"plain_text:mit cug" -> 9 hits

Why is this so? Could it be a problem that stopwords aren't used in the
query because not all fields that are searched have the stopword filter?



On Mon, 07 Apr 2014 00:37:15 +0200, Furkan KAMACI   
wrote:



Correction: My patch is at SOLR-5152
7 Nis 2014 01:05 tarihinde "Andreas Owen"  yazdı:


I thought I could use  to index and search words that are only 1 or 2 chars long. It
seems to work but I have to test it some more.


On Sun, 06 Apr 2014 22:24:20 +0200, Andreas Owen 
wrote:

I have a fieldtype that uses the ngramfilter while indexing. Is there a
setting that can force the ngramfilter to index smaller words than the
minGramSize? Mine is set to 3 and the search won't find words that are
only 1 or 2 chars long. I would like to not set minGramSize=1 because the
results would be too diverse.

fieldtype:


   
 
 

ignoreCase="true"
words="lang/stopwords_de.txt" format="snowball"  
enablePositionIncrements="true"/>


 
 



   
   


 

class="solr.SnowballPorterFilterFactory"

language="German"/>

   
 




--
Using Opera's mail client: http://www.opera.com/mail/




--
Using Opera's mail client: http://www.opera.com/mail/


Exactly Matching for Elevator

2014-04-07 Thread Furkan KAMACI
I've defined an elevator like this:


 
   
 
 
   
 
 
   
 
 
   
 


When I send a query it gives an error of:
org.apache.solr.common.SolrException: Boosting query defined twice for query

When I check the source code it says:

map.containsKey( elev.analyzed )

What I want is that:

when a user enters a query, e.g.:

rüna telecom

I want to show id1. But when a user enters just:

telecom

I do not want to elevate it.

Thanks;
Furkan KAMACI


Re: Commit Within and /update/extract handler

2014-04-07 Thread Erick Erickson
You say you see the commit happen in the log, is openSearcher
specified? This sounds like you're somehow getting a commit
with openSearcher=false...

Best,
Erick
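
For what it's worth, a rough SolrJ sketch that sets commitWithin explicitly on an extract request, which can help narrow down whether the parameter reaches the handler at all. The file path, content type and literal.id value are assumptions:

import java.io.File;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;

public class ExtractCommitWithinExample {
    public static void main(String[] args) throws Exception {
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");

        ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/extract");
        req.addFile(new File("/tmp/example.pdf"), "application/pdf");
        req.setParam("literal.id", "doc-1"); // unique key for the extracted document
        req.setCommitWithin(10000);          // ask Solr to commit within 10 seconds

        solr.request(req);
        solr.shutdown();
    }
}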

On Sun, Apr 6, 2014 at 5:37 PM, Jamie Johnson  wrote:
> I'm running solr 4.6.0 and am noticing that commitWithin doesn't seem to
> work when I am using the /update/extract request handler.  It looks like a
> commit is happening from the logs, but the documents don't become available
> for search until I do a commit manually.  Could this be some type of
> configuration issue?


Re: Solr XML Messages

2014-04-07 Thread Erick Erickson
See: https://tika.apache.org/1.4/formats.html

short answer "yes".

Longer answer: It would be a lot easier to reply meaningfully if you
told us what you were trying to do.

You might want to review:

http://wiki.apache.org/solr/UsingMailingLists

Best,
Erick

On Sun, Apr 6, 2014 at 11:20 PM, Александр Вандышев
 wrote:
> Tell me whether it is possible to use Solr XML Messages for indexing via the
> update/extract handler?


Regex for hl.bs.chars

2014-04-07 Thread Furkan KAMACI
Could I define a pattern for hl.bs.chars? I mean, *$* marks the start or end
of a string in my documents and I want to define it as a regex for hl.bs.chars.

On the other hand, I do not currently use termVectors=on, termPositions=on
and termOffsets=on on my fields. Does that cause a performance issue or
break the expected behavior, too?


RE: Solr interface

2014-04-07 Thread Toke Eskildsen
On Mon, 2014-04-07 at 13:52 +0200, Jonathan Varsanik wrote:
> Do you mean to tell me that the people on this list that are indexing
> 100s of millions of documents are doing this over http?

Some of us do. Our net archive indexer runs a lot of Tika processes that
send their analysed documents through HTTP. We're building 1TB indexes
of about 300-400M documents each. The Tika analysis is by far the heavy
part of the setup: 1 Solr instance easily keeps up with 30 Tikas on a 24
core machine (or 48, depending on how you count). This setup makes it
easy to scale up & out, basically by starting new Tika processes on
whatever machines we have available.

In other setups, where the pre-index analysis is lighter, the choice of
transport layer might matter more. As always, optimize where it is
needed.

- Toke Eskildsen, State and University Library, Denmark




Re: Routing distance with Solr?

2014-04-07 Thread david.w.smi...@gmail.com
Hi,
This is definitely not possible with Solr.  Use GraphHopper.
~ David


On Mon, Apr 7, 2014 at 5:09 AM, Matteo Tarantino  wrote:

> Hi all,
> this is my first message on this mailing list, so I hope I'm doing all
> correctly.
>
> My problem is: I have to create a search engine of dealers that are in a
> well defined routing distance from the address entered by the user. I have
> already used Solr for some previous works, but I never needed geospatial
> search, so i'm a newbie in this field.
>
> On the web I have read that Solr can calculate only the distance "as the
> crow flies" between two points, but for my purposes I need the exact
> routing distance. This is not possible with Solr, can you confirm this? (If
> so, I think I'll have to refine results with additional calculations with
> GoogleMap Api's or some OSM tools like GraphHopper)
>
>
> Thank you in advance!
> Matteo
>


Re: what is geodist default value

2014-04-07 Thread david.w.smi...@gmail.com
Hi,

I'm not sure why you are asking or maybe I'm not getting what you *really*
want to know.  You'll get the geodesic distance (i.e. the "great circle
distance", the distance on the surface of a sphere) from 0,0 (off the coast
of Africa), to each point indexed in your "location" field.

~ David
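
As a side note, the more common pattern is to pass the query point via pt/sfield and use geodist() as a pseudo-field or sort. A rough SolrJ sketch; the field name, coordinates and core URL are assumptions:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrDocument;

public class GeodistExample {
    public static void main(String[] args) throws Exception {
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");

        SolrQuery q = new SolrQuery("*:*");
        q.set("sfield", "location");          // the LatLonType field
        q.set("pt", "52.52,13.40");           // the point to measure from (example coordinates)
        q.setFields("id", "dist:geodist()");  // great-circle distance (km) as a pseudo-field
        q.setSort("geodist()", SolrQuery.ORDER.asc);

        for (SolrDocument d : solr.query(q).getResults()) {
            System.out.println(d.get("id") + " -> " + d.get("dist") + " km");
        }
        solr.shutdown();
    }
}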



On Mon, Apr 7, 2014 at 7:06 AM, Aman Tandon  wrote:

> Hello,
>
> In my index, i am using the LatlonType, for using the geodist to calculate
> the distance, and i am using it like geodist(lat, lon, location). Can
> anybody told me what value the geodist will return if i will pass
> geodist(0, 0, location)
>
> Thanks
> Aman Tandon
>


Re: Solr interface

2014-04-07 Thread Andre Bois-Crettez

You can use SolrJ: https://wiki.apache.org/solr/Solrj
Anyway, even using HTTP the performance is good.

André
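
A minimal SolrJ indexing sketch over HTTP, for reference; the URL and field names are assumptions:

import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class SolrjIndexExample {
    public static void main(String[] args) throws Exception {
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "doc-1");
        doc.addField("title", "An example document");

        // add() sends the document over HTTP; it becomes searchable after a commit
        solr.add(doc);
        solr.commit();
        solr.shutdown();
    }
}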

On 2014-04-07 13:52, Jonathan Varsanik wrote:

Do you mean to tell me that the people on this list that are indexing 100s of 
millions of documents are doing this over http?  I have been using custom 
Lucene code to index files, as I thought this would be faster for many 
documents and I wanted some non-standard OCR and index fields.  Is there a 
better way?

To the OP: You can also use Lucene to locally index files for Solr.



-Original Message-
From: Erik Hatcher [mailto:erik.hatc...@gmail.com]
Sent: Thursday, April 03, 2014 8:47 AM
To:solr-user@lucene.apache.org
Cc: Solr User
Subject: Re: Solr interface

Yes. But why?

DataImportHandler kinda does this (still use http to kick off an indexing job). 
 And there's EmbeddedSolrServer too.

 Erik


On Apr 3, 2014, at 8:39, Александр Вандышев  wrote:

Is it possible to index files not via HTTP interface?


--
André Bois-Crettez

Software Architect
Big Data Developer
http://www.kelkoo.com/


Kelkoo SAS
Société par Actions Simplifiée
Au capital de € 4.168.964,30
Siège social : 8, rue du Sentier 75002 Paris
425 093 069 RCS Paris

This message and its attachments are confidential and intended exclusively 
for their addressees. If you are not the intended recipient of this message, 
please delete it and notify the sender.


Re: Solr Search on Fields name

2014-04-07 Thread anuragwalia
Thanks Ahmat and Jack for replying.

I found another way to solve the problem, by using a filter query.

fq=RuleA:*+OR+RuleC:*

but due to the development platform, query parsing got stuck somewhere else.

Hopefully after the platform fix it will work for me.

I will get back to you if any other issue occurs.








Re: ArrayIndexOutOfBoundsException while reindexing via DIH

2014-04-07 Thread Shawn Heisey

On 4/7/2014 3:00 AM, Ralf Matulat wrote:
we are currently facing a new problem while reindexing one of our SOLR 
4.4 instances:


We are using SOLR 4.4 getting data via DIH out of a MySQL Server.
The data is constantly growing.

We have reindexed our data a lot of times without any trouble.
The problem can be reproduced.

There is another server, configured exactly the same way (via git) 
which was reindexed 3 days ago against the same MySQL server without 
problems.
But: that server has more RAM and more powerful CPUs than the one 
causing headaches today.


The error log says:

java.lang.ArrayIndexOutOfBoundsException
    at org.apache.lucene.util.packed.Packed64SingleBlock$Packed64SingleBlock4.get(Packed64SingleBlock.java:336)
    at org.apache.lucene.util.packed.GrowableWriter.get(GrowableWriter.java:56)
    at org.apache.lucene.util.packed.AbstractPagedMutable.get(AbstractPagedMutable.java:88)
    at org.apache.lucene.util.fst.NodeHash.addNew(NodeHash.java:151)
    at org.apache.lucene.util.fst.NodeHash.rehash(NodeHash.java:169)
    at org.apache.lucene.util.fst.NodeHash.add(NodeHash.java:133)
    at org.apache.lucene.util.fst.Builder.compileNode(Builder.java:197)
    at org.apache.lucene.util.fst.Builder.freezeTail(Builder.java:289)
    at org.apache.lucene.util.fst.Builder.add(Builder.java:394)


This looks a little bit like a problem that recently surfaced in 
automated testing.  That particular problem was caused by IBM's J9 Java 
(based on JDK7) miscompiling a low-level lucene function.


Are you using a JVM from a vendor other than Oracle?  At the moment, the 
JVM recommendation is Oracle Java 7u25.  When 7u60 comes out (expected 
in May 2014), that will most likely be the recommended version.  Are 
there other differences between these two systems, like the garbage 
collector being used, 32-bit vs. 64-bit, different max heap size, 
running in a different servlet container, etc?


Are there any other errors, such as an OutOfMemory error?

I could be completely wrong with my guess.

Thanks,
Shawn



Re: Commit Within and /update/extract handler

2014-04-07 Thread Erick Erickson
What does the call look like? Are you opening a new searcher
or not? That should be in the log line where the commit is recorded...

FWIW,
Erick

On Sun, Apr 6, 2014 at 5:37 PM, Jamie Johnson  wrote:
> I'm running solr 4.6.0 and am noticing that commitWithin doesn't seem to
> work when I am using the /update/extract request handler.  It looks like a
> commit is happening from the logs, but the documents don't become available
> for search until I do a commit manually.  Could this be some type of
> configuration issue?


Duplicate Unique Key

2014-04-07 Thread Simon
Hi all,

I know someone has posted a similar question before.  But my case is a little
different, as I don't have the schema setup issue mentioned in those posts
but still get duplicate records.

My unique key in schema is 




id$



Search on Solr- admin UI:   id$:1

I got two documents
{
   "id$": "1",
   "_version_": 1464225014071951400,
"_root_": 1
},
{
"id$": "1",
"_version_": 1464236728284872700,
"_root_": 1
}

I use the SolrJ API to add documents.  My understanding is that the Solr uniqueKey
is like a database primary key. I am wondering how I could end up with two
documents with the same uniqueKey in the index.

Thanks,
Simon






Re: Solr interface

2014-04-07 Thread Shawn Heisey

On 4/7/2014 5:52 AM, Jonathan Varsanik wrote:

Do you mean to tell me that the people on this list that are indexing 100s of 
millions of documents are doing this over http?  I have been using custom 
Lucene code to index files, as I thought this would be faster for many 
documents and I wanted some non-standard OCR and index fields.  Is there a 
better way?

To the OP: You can also use Lucene to locally index files for Solr.


My sharded index has 94 million docs in it.  All normal indexing and 
maintenance is done with SolrJ, over HTTP. Currently full rebuilds are 
done with the dataimport handler loading from MySQL, but that is 
legacy.  This is NOT a SolrCloud installation.  It is also not a 
replicated setup -- my indexing program keeps both copies up to date 
independently, similar to what happens behind the scenes with SolrCloud.


The single-thread DIH is very well optimized, and is faster than what I 
have written myself -- also single-threaded.


The real reason that we still use DIH for rebuilds is that I can run the 
DIH simultaneously on all shards.  A full rebuild that way takes about 5 
hours.  A SolrJ process feeding all shards with a single thread would 
take a lot longer.  Once I have time to work on it, I can make the SolrJ 
rebuild multi-threaded, and I expect it will be similar to DIH in 
rebuild speed.  Hopefully I can make it faster.


There is always overhead with HTTP.  On a gigabit LAN, I don't think 
it's high enough to matter.


Using Lucene to index files for Solr is an option -- but that requires 
writing a custom Lucene application, and knowledge about how to turn the 
Solr schema into Lucene code.  A lot of users on this list (me included) 
do not have the skills required.  I know SolrJ reasonably well, but 
Lucene is a nut that I haven't cracked.


Thanks,
Shawn



Regex For *|* at hl.regex.pattern

2014-04-07 Thread Furkan KAMACI
Hi;

I tried that but it does not work; am I missing anything?

q=portu&hl.regex.pattern=.*\*\|\*.*&hl.fragsize=120&hl.regex.slop=0.2

My aim is to check whether it includes *|* or not (that's why I've put .* at the
beginning and end of the regex, to match whatever comes before and after).

How to fix it?

Thanks;
Furkan KAMACI


Re: Distributed tracing for Solr via adding HTTP headers?

2014-04-07 Thread Gregg Donovan
That was my first attempt, but it's much trickier than I anticipated.

A filter that calls HttpServletRequest#getParameter() before
SolrDispatchFilter will trigger an exception  -- see
getParameterIncompatibilityException [1] -- if the request is a POST. It
seems that Solr depends on the configured per-core SolrRequestParser to
properly parse the request parameters. A servlet filter that came before
SolrDispatchFilter would need to fetch the correct SolrRequestParser for
the requested core, parse the request, and reset the InputStream before
pulling the data into the MDC. It also duplicates the work of request
parsing. It's especially tricky if you want to remove the tracing
parameters from the SolrParams and just have them in the MDC to avoid them
being logged twice.


[1]
https://github.com/apache/lucene-solr/blob/trunk/solr/core/src/java/org/apache/solr/servlet/SolrRequestParsers.java#L621:L628
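
One way around that is to carry the UUID in an HTTP header instead of a request parameter, since reading headers never touches the request body. A rough sketch of such a filter, placed in front of SolrDispatchFilter; the header name and MDC key are hypothetical:

import java.io.IOException;
import javax.servlet.*;
import javax.servlet.http.HttpServletRequest;
import org.slf4j.MDC;

public class RequestUuidFilter implements Filter {
    private static final String HEADER = "X-Request-UUID"; // hypothetical header name
    private static final String MDC_KEY = "requestUuid";   // hypothetical MDC key

    public void init(FilterConfig config) {}

    public void doFilter(ServletRequest req, ServletResponse resp, FilterChain chain)
            throws IOException, ServletException {
        // Reading a header does not consume the request body, so POSTs stay intact
        String uuid = ((HttpServletRequest) req).getHeader(HEADER);
        if (uuid != null) {
            MDC.put(MDC_KEY, uuid);
        }
        try {
            chain.doFilter(req, resp);
        } finally {
            MDC.remove(MDC_KEY); // don't leak the value to the next request on this thread
        }
    }

    public void destroy() {}
}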


On Sun, Apr 6, 2014 at 2:20 PM, Alexandre Rafalovitch wrote:

> On the second thought,
>
> If you are already managing to pass the value using the request
> parameters, what stops you from just having a servlet filter looking
> for that parameter and assigning it directly to the MDC context?
>
> Regards,
>Alex.
> Personal website: http://www.outerthoughts.com/
> Current project: http://www.solr-start.com/ - Accelerating your Solr
> proficiency
>
>
> On Sat, Apr 5, 2014 at 7:45 AM, Alexandre Rafalovitch
>  wrote:
> > I like the idea. No comments about implementation, leave it to others.
> >
> > But if it is done, maybe somebody very familiar with logging can also
> > review Solr's current logging config. I suspect it is not optimized
> > for troubleshooting at this point.
> >
> > Regards,
> >Alex.
> > Personal website: http://www.outerthoughts.com/
> > Current project: http://www.solr-start.com/ - Accelerating your Solr
> proficiency
> >
> >
> > On Sat, Apr 5, 2014 at 3:16 AM, Gregg Donovan 
> wrote:
> >> We have some metadata -- e.g. a request UUID -- that we log to every log
> >> line using Log4J's MDC [1]. The UUID logging allows us to connect any
> log
> >> lines we have for a given request across servers. Sort of like Zipkin
> [2].
> >>
> >> Currently we're using EmbeddedSolrServer without sharding, so adding the
> >> UUID is fairly simple, since everything is in one process and one
> thread.
> >> But, we're testing a sharded HTTP implementation and running into some
> >> difficulties getting this data passed around in a way that lets us trace
> >> all log lines generated by a request to its UUID.
> >>
>


Re: Duplicate Unique Key

2014-04-07 Thread Erick Erickson
Hmmm, that's odd. I just tried it (admittedly with post.jar rather
than SolrJ) and it works just fine.

What server are you using (e.g. CloudSolrServer)? And can you create a
self-contained program that illustrates the problem?

Best,
Erick
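
Something along these lines would do as a self-contained starting point; the core URL is an assumption and the uniqueKey name follows the schema snippet quoted below:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class DuplicateKeyCheck {
    public static void main(String[] args) throws Exception {
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");

        // Add the same uniqueKey twice; the second add should overwrite the first
        for (int i = 0; i < 2; i++) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id$", "1");
            solr.add(doc);
        }
        solr.commit();

        long found = solr.query(new SolrQuery("id$:1")).getResults().getNumFound();
        System.out.println("numFound for id$:1 = " + found + " (expected: 1)");
        solr.shutdown();
    }
}

If this ever prints 2, that would reproduce the problem.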

On Mon, Apr 7, 2014 at 8:50 AM, Simon  wrote:
> Hi all,
>
> I know someone has posted similar question before.  But my case is little
> different as I don't have the schema set up issue mentioned in those posts
> but still get duplicate records.
>
> My unique key in schema is
>
>  multiValued="false" required="true"/>
>
>
> id$
>
>
>
> Search on Solr- admin UI:   id$:1
>
> I got two documents
> {
>"id$": "1",
>"_version_": 1464225014071951400,
> "_root_": 1
> },
> {
> "id$": "1",
> "_version_": 1464236728284872700,
> "_root_": 1
> }
>
> I use SolrJ api to add documents.  My understanding solr uniqueKey is like a
> database primary key. I am wondering how could I end up with two documents
> with same uniqueKey in the index.
>
> Thanks,
> Simon
>
>
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Duplicate-Unique-Key-tp4129651.html
> Sent from the Solr - User mailing list archive at Nabble.com.


Re: Regex For *|* at hl.regex.pattern

2014-04-07 Thread Furkan KAMACI
One more question: does that regex work on the analyzed field or on the raw data?


2014-04-07 19:21 GMT+03:00 Furkan KAMACI :

> Hi;
>
> I try that but it does not work do I miss anything:
>
> q=portu&hl.regex.pattern=.*\*\|\*.*&hl.fragsize=120&hl.regex.slop=0.2
>
> My aim is to check whether it includes *|* or not (that's why I've put .*
> beginning and end of the regex to achieve whatever you match)
>
> How to fix it?
>
> Thanks;
> Furkan KAMACI
>


Ranking code

2014-04-07 Thread azhar2007
Hi, does anybody know where the ranking code is held? Which file in Solr
stores it, the schema.xml or the solrconfig.xml file?







Re: Reading Solr index

2014-04-07 Thread François Schiettecatte
Maybe you should try a more recent release of Luke:

https://github.com/DmitryKey/luke/releases

François

On Apr 7, 2014, at 12:27 PM, azhar2007  wrote:

> Hi All,
> 
> I have a Solr index which was indexed in Solr 4.7.0.
> 
> I've attempted to open the index with Luke 4.0.0 and also other versions, with
> no luck.
> It gives me an error message.
> 
> Is there a way of reading the data?
> 
> I would like to convert the file to a readable format where i can see the
> terms it holds from the documents etc. 
> 
> Please Help!!
> 
> 
> 
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Reading-Solr-index-tp4129662.html
> Sent from the Solr - User mailing list archive at Nabble.com.





Reading Solr index

2014-04-07 Thread azhar2007
Hi All,

I have a Solr index which was indexed in Solr 4.7.0.

I've attempted to open the index with Luke 4.0.0 and also other versions, with
no luck.
It gives me an error message.

Is there a way of reading the data?

I would like to convert the file to a readable format where I can see the
terms it holds from the documents etc.

Please Help!!





Re: Solr interface

2014-04-07 Thread Daniel Collins
I have to agree with Shawn.  We have a SolrCloud setup with 256 shards,
~400M documents in total, with 4-way replication (so it's quite a big
setup!)  I had thought that HTTP would slow things down, so we recently
trialed a JNI approach (clients are C++) so we could call SolrJ and get the
benefits of JavaBin encoding for our indexing.

Once we had done benchmarks with both solutions, I think we saved about 1ms
per document (on average) with JNI, so it wasn't as big a gain as we were
expecting.  There are other benefits of SolrJ (zookeeper integration,
better routing, etc) and we were doing local HTTP (so it was literally just
a TCP port to localhost, no actual net traffic) but that just goes to prove
what other posters have said here.  Check whether HTTP really *is* the
bottleneck before you try to replace it!


On 7 April 2014 17:05, Shawn Heisey  wrote:

> On 4/7/2014 5:52 AM, Jonathan Varsanik wrote:
>
>> Do you mean to tell me that the people on this list that are indexing
>> 100s of millions of documents are doing this over http?  I have been using
>> custom Lucene code to index files, as I thought this would be faster for
>> many documents and I wanted some non-standard OCR and index fields.  Is
>> there a better way?
>>
>> To the OP: You can also use Lucene to locally index files for Solr.
>>
>
> My sharded index has 94 million docs in it.  All normal indexing and
> maintenance is done with SolrJ, over http.Currently full rebuilds are done
> with the dataimport handler loading from MySQL, but that is legacy.  This
> is NOT a SolrCloud installation.  It is also not a replicated setup -- my
> indexing program keeps both copies up to date independently, similar to
> what happens behind the scenes with SolrCloud.
>
> The single-thread DIH is very well optimized, and is faster than what I
> have written myself -- also single-threaded.
>
> The real reason that we still use DIH for rebuilds is that I can run the
> DIH simultaenously on all shards.  A full rebuild that way takes about 5
> hours.  A SolrJ process feeding all shards with a single thread would take
> a lot longer.  Once I have time to work on it, I can make the SolrJ rebuild
> multi-threaded, and I expect it will be similar to DIH in rebuild speed.
>  Hopefully I can make it faster.
>
> There is always overhead with HTTP.  On a gigabit LAN, I don't think it's
> high enough to matter.
>
> Using Lucene to index files for Solr is an option -- but that requires
> writing a custom Lucene application, and knowledge about how to turn the
> Solr schema into Lucene code.  A lot of users on this list (me included) do
> not have the skills required.  I know SolrJ reasonably well, but Lucene is
> a nut that I haven't cracked.
>
> Thanks,
> Shawn
>
>


Re: Distributed tracing for Solr via adding HTTP headers?

2014-04-07 Thread Alexandre Rafalovitch
So to rephrase:

Solr will barf at unknown parameters, so we cannot currently send them in
band.

And the out-of-band approach does not work due to the POST body handling complexity.

You are proposing effectively a dynamic set with common prefix to stop the
complaints. Plus the code to propagate those params.

Is that a good general description? I am just wondering if this can be
matched to some other real issues as well.

Regards,
 Alex
On 07/04/2014 11:23 pm, "Gregg Donovan"  wrote:

> That was my first attempt, but it's much trickier than I anticipated.
>
> A filter that calls HttpServletRequest#getParameter() before
> SolrDispatchFilter will trigger an exception  -- see
> getParameterIncompatibilityException [1] -- if the request is a POST. It
> seems that Solr depends on the configured per-core SolrRequestParser to
> properly parse the request parameters. A servlet filter that came before
> SolrDispatchFilter would need to fetch the correct SolrRequestParser for
> the requested core, parse the request, and reset the InputStream before
> pulling the data into the MDC. It also duplicates the work of request
> parsing. It's especially tricky if you want to remove the tracing
> parameters from the SolrParams and just have them in the MDC to avoid them
> being logged twice.
>
>
> [1]
>
> https://github.com/apache/lucene-solr/blob/trunk/solr/core/src/java/org/apache/solr/servlet/SolrRequestParsers.java#L621:L628
>
>
> On Sun, Apr 6, 2014 at 2:20 PM, Alexandre Rafalovitch  >wrote:
>
> > On the second thought,
> >
> > If you are already managing to pass the value using the request
> > parameters, what stops you from just having a servlet filter looking
> > for that parameter and assigning it directly to the MDC context?
> >
> > Regards,
> >Alex.
> > Personal website: http://www.outerthoughts.com/
> > Current project: http://www.solr-start.com/ - Accelerating your Solr
> > proficiency
> >
> >
> > On Sat, Apr 5, 2014 at 7:45 AM, Alexandre Rafalovitch
> >  wrote:
> > > I like the idea. No comments about implementation, leave it to others.
> > >
> > > But if it is done, maybe somebody very familiar with logging can also
> > > review Solr's current logging config. I suspect it is not optimized
> > > for troubleshooting at this point.
> > >
> > > Regards,
> > >Alex.
> > > Personal website: http://www.outerthoughts.com/
> > > Current project: http://www.solr-start.com/ - Accelerating your Solr
> > proficiency
> > >
> > >
> > > On Sat, Apr 5, 2014 at 3:16 AM, Gregg Donovan 
> > wrote:
> > >> We have some metadata -- e.g. a request UUID -- that we log to every
> log
> > >> line using Log4J's MDC [1]. The UUID logging allows us to connect any
> > log
> > >> lines we have for a given request across servers. Sort of like Zipkin
> > [2].
> > >>
> > >> Currently we're using EmbeddedSolrServer without sharding, so adding
> the
> > >> UUID is fairly simple, since everything is in one process and one
> > thread.
> > >> But, we're testing a sharded HTTP implementation and running into some
> > >> difficulties getting this data passed around in a way that lets us
> trace
> > >> all log lines generated by a request to its UUID.
> > >>
> >
>


Re: How do I add another unrelated query results to solr index

2014-04-07 Thread sanjay92
I think it was not just rootEntity="true".

We need to add transformer="TemplateTransformer" and make sure that each
entity has some kind of unique column across all entities; e.g. in this case
doc_id is a made-up column and its values should be unique across all
entities. The template clause works like a transformation; e.g. the doc_id values
are made up by prefixing salg_ to the values of ${salgrade.GRADE} in the first
entity section, while the second entity section uses a different prefix and a
different variable to make it unique.
different variable to make it Unique.

schema.xml have   doc_id
and also add following :
   
   
   
   
   
   
   


   

   




  
  




  

  





   
   



  






Re: Distributed tracing for Solr via adding HTTP headers?

2014-04-07 Thread Michael Sokolov
I had to grapple with something like this problem when I wrote Lux's 
app-server.  I extended SolrDispatchFilter and handle parameter 
swizzling to keep everything nicey-nicey for Solr while being able to 
play games with parameters of my own.  Perhaps this will give you some 
ideas:


https://github.com/msokolov/lux/blob/master/src/main/java/lux/solr/LuxDispatchFilter.java

It's definitely hackish, but it seems to get the job done - for me. It's 
not a reusable component, but it might serve as an illustration of one way 
to handle the problem.


-Mike

On 04/07/2014 12:23 PM, Gregg Donovan wrote:

That was my first attempt, but it's much trickier than I anticipated.

A filter that calls HttpServletRequest#getParameter() before
SolrDispatchFilter will trigger an exception  -- see
getParameterIncompatibilityException [1] -- if the request is a POST. It
seems that Solr depends on the configured per-core SolrRequestParser to
properly parse the request parameters. A servlet filter that came before
SolrDispatchFilter would need to fetch the correct SolrRequestParser for
the requested core, parse the request, and reset the InputStream before
pulling the data into the MDC. It also duplicates the work of request
parsing. It's especially tricky if you want to remove the tracing
parameters from the SolrParams and just have them in the MDC to avoid them
being logged twice.


[1]
https://github.com/apache/lucene-solr/blob/trunk/solr/core/src/java/org/apache/solr/servlet/SolrRequestParsers.java#L621:L628


On Sun, Apr 6, 2014 at 2:20 PM, Alexandre Rafalovitch wrote:


On the second thought,

If you are already managing to pass the value using the request
parameters, what stops you from just having a servlet filter looking
for that parameter and assigning it directly to the MDC context?

Regards,
Alex.
Personal website: http://www.outerthoughts.com/
Current project: http://www.solr-start.com/ - Accelerating your Solr
proficiency


On Sat, Apr 5, 2014 at 7:45 AM, Alexandre Rafalovitch
 wrote:

I like the idea. No comments about implementation, leave it to others.

But if it is done, maybe somebody very familiar with logging can also
review Solr's current logging config. I suspect it is not optimized
for troubleshooting at this point.

Regards,
Alex.
Personal website: http://www.outerthoughts.com/
Current project: http://www.solr-start.com/ - Accelerating your Solr

proficiency


On Sat, Apr 5, 2014 at 3:16 AM, Gregg Donovan 

wrote:

We have some metadata -- e.g. a request UUID -- that we log to every log
line using Log4J's MDC [1]. The UUID logging allows us to connect any

log

lines we have for a given request across servers. Sort of like Zipkin

[2].

Currently we're using EmbeddedSolrServer without sharding, so adding the
UUID is fairly simple, since everything is in one process and one

thread.

But, we're testing a sharded HTTP implementation and running into some
difficulties getting this data passed around in a way that lets us trace
all log lines generated by a request to its UUID.





Re: Ranking code

2014-04-07 Thread Shawn Heisey

On 4/7/2014 10:29 AM, azhar2007 wrote:

Hi does anybody know where the ranking code is held. Which file in Solr
stores it the solr schema.xml or solrconfig.xml file?


Your question is very generic.  It needs to be more specific -- what are 
you actually trying to do?


The generic answer is "both" ... query parameters that affect relevancy 
ranking can go in solrconfig.xml or included on an individual query.  
You can change which similarity class is used in schema.xml.  The 
analysis chain and field parameters you choose can also affect relevancy 
ranking, and those live in schema.xml.


https://wiki.apache.org/solr/SchemaXml#Similarity
https://wiki.apache.org/solr/SolrRelevancyFAQ

The actual code is not in either file -- it's in the java source code 
files that get compiled into Lucene and Solr.


Thanks,
Shawn



Re: Duplicate Unique Key

2014-04-07 Thread Simon
Erick,

It's indeed quite odd.  After I triggered re-indexing of all documents (via
the normal process of the existing program), the duplication was gone.  It can
not be reproduced easily, but it did occur occasionally, and that makes it a
frustrating task to troubleshoot.

Thanks,
Simon



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Duplicate-Unique-Key-tp4129651p4129701.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Fetching uniqueKey and other int quickly from documentCache?

2014-04-07 Thread Gregg Donovan
Yonik,

Requesting
fl=unique_key:field(unique_key),secondary_key:field(secondary_key),score vs
fl=unique_key,secondary_key,score was a nice performance win, as unique_key
and secondary_key were both already in the fieldCache. We removed our
documentCache, in fact, as it got so little use.
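
(A minimal SolrJ sketch of that fl usage, with a placeholder core URL and field names:)

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class FieldCacheKeysQuery {
    public static void main(String[] args) throws Exception {
        HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");
        SolrQuery q = new SolrQuery("*:*");
        // Request both keys as pseudo-fields via field(), so they come from the
        // fieldCache instead of being read and decompressed from stored fields.
        q.setFields("unique_key:field(unique_key)",
                    "secondary_key:field(secondary_key)",
                    "score");
        q.setRows(1000);
        QueryResponse rsp = server.query(q);
        System.out.println("numFound=" + rsp.getResults().getNumFound());
    }
}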

We do see a code path that fetches stored fields, though, in
BinaryResponseWriter, for the case of *only* pseudo-fields being requested.
I opened a ticket and attached a patch to
https://issues.apache.org/jira/browse/SOLR-5968.




On Mon, Mar 3, 2014 at 11:30 AM, Yonik Seeley  wrote:

> On Mon, Mar 3, 2014 at 11:14 AM, Gregg Donovan  wrote:
> > Yonik,
> >
> > That's a very clever idea. Unfortunately, I think that will skip the
> > distributed query optimization we were hoping to take advantage of in
> > SOLR-1880 [1], but it should work with the proposed distrib.singlePass
> > optimization in SOLR-5768 [2]. Does that sound right?
>
>
> Yep, the two together should do the trick.
>
> -Yonik
> http://heliosearch.org - native off-heap filters and fieldcache for solr
>
>
> > --Gregg
> >
> > [1] https://issues.apache.org/jira/browse/SOLR-1880
> > [2] https://issues.apache.org/jira/browse/SOLR-5768
> >
> >
> > On Wed, Feb 26, 2014 at 8:53 PM, Yonik Seeley 
> wrote:
> >
> >> You could try forcing things to go through function queries (via
> >> pseudo-fields):
> >>
> >> fl=field(id), field(myfield)
> >>
> >> If you're not requesting any stored fields, that *might* currently
> >> skip that step.
> >>
> >> -Yonik
> >> http://heliosearch.org - native off-heap filters and fieldcache for
> solr
> >>
> >>
> >> On Mon, Feb 24, 2014 at 9:58 PM, Gregg Donovan 
> wrote:
> >> > We fetch a large number of documents -- 1000+ -- for each search. Each
> >> > request fetches only the uniqueKey or the uniqueKey plus one secondary
> >> > integer key. Despite this, we find that we spent a sizable amount of
> time
> >> > in SolrIndexSearcher#doc(int docId, Set fields). Time is spent
> >> > fetching the two stored fields, LZ4 decoding, etc.
> >> >
> >> > I would love to be able to tell Solr to always fetch these two fields
> >> from
> >> > memory. We have them both in the fieldCache so we're already spending
> the
> >> > RAM. I've seen this asked previously [1], so it seems like a fairly
> >> common
> >> > need, especially for distributed search. Any ideas?
> >> >
> >> > A few possible ideas I had:
> >> >
> >> > --Check FieldCache.html#getCacheEntries() before going to stored
> fields.
> >> > --Give the documentCache config a list of fields it should load from
> the
> >> > fieldCache
> >> >
> >> >
> >> > Having an in-memory mapping from docId->uniqueKey has come up for us
> >> > before. We've used a custom SolrCache maintaining that mapping to
> quickly
> >> > filter over personalized collections. Maybe the uniqueKey should be
> more
> >> > optimized out of the box? Perhaps a custom "uniqueKey" codec that also
> >> > maintained the docId->uniqueKey mapping in memory?
> >> >
> >> > --Gregg
> >> >
> >> > [1] http://search-lucene.com/m/oCUKJ1heHUU1
> >>
>


Re: Distributed tracing for Solr via adding HTTP headers?

2014-04-07 Thread Gregg Donovan
Michael,

Thanks! Unfortunately, as we use POSTs, that approach would trigger the
getParameterIncompatibilityException call due to the Enumeration of
getParameterNames before SolrDispatchFilter has a chance to access the
InputStream.

I opened https://issues.apache.org/jira/browse/SOLR-5969 to discuss further
and attached our current patch.
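
(For the client side, a sketch of attaching a tracing header to every outgoing SolrJ 
request; the header name, MDC key, and use of SLF4J's MDC are assumptions, not the 
attached patch:)

import org.apache.http.HttpRequest;
import org.apache.http.HttpRequestInterceptor;
import org.apache.http.impl.client.DefaultHttpClient;
import org.apache.http.protocol.HttpContext;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.slf4j.MDC;

public class TracingSolrClientFactory {
    public static HttpSolrServer create(String baseUrl) {
        DefaultHttpClient httpClient = new DefaultHttpClient();
        // Copy the request UUID from the calling thread's MDC into an HTTP header.
        httpClient.addRequestInterceptor(new HttpRequestInterceptor() {
            public void process(HttpRequest request, HttpContext context) {
                String uuid = MDC.get("requestUUID");
                if (uuid != null) {
                    request.addHeader("X-Request-UUID", uuid);
                }
            }
        });
        return new HttpSolrServer(baseUrl, httpClient);
    }
}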


On Mon, Apr 7, 2014 at 2:02 PM, Michael Sokolov <
msoko...@safaribooksonline.com> wrote:

> I had to grapple with something like this problem when I wrote Lux's
> app-server.  I extended SolrDispatchFilter and handle parameter swizzling
> to keep everything nicey-nicey for Solr while being able to play games with
> parameters of my own.  Perhaps this will give you some ideas:
>
> https://github.com/msokolov/lux/blob/master/src/main/java/
> lux/solr/LuxDispatchFilter.java
>
> It's definitely hackish, but seems to get the job done - for me - it's not
> a reusable component, but might serve as an illustration of one way to
> handle the problem
>
> -Mike
>
>
> On 04/07/2014 12:23 PM, Gregg Donovan wrote:
>
>> That was my first attempt, but it's much trickier than I anticipated.
>>
>> A filter that calls HttpServletRequest#getParameter() before
>> SolrDispatchFilter will trigger an exception  -- see
>> getParameterIncompatibilityException [1] -- if the request is a POST. It
>> seems that Solr depends on the configured per-core SolrRequestParser to
>> properly parse the request parameters. A servlet filter that came before
>> SolrDispatchFilter would need to fetch the correct SolrRequestParser for
>> the requested core, parse the request, and reset the InputStream before
>> pulling the data into the MDC. It also duplicates the work of request
>> parsing. It's especially tricky if you want to remove the tracing
>> parameters from the SolrParams and just have them in the MDC to avoid them
>> being logged twice.
>>
>>
>> [1]
>> https://github.com/apache/lucene-solr/blob/trunk/solr/
>> core/src/java/org/apache/solr/servlet/SolrRequestParsers.java#L621:L628
>>
>>
>> On Sun, Apr 6, 2014 at 2:20 PM, Alexandre Rafalovitch > >wrote:
>>
>>  On the second thought,
>>>
>>> If you are already managing to pass the value using the request
>>> parameters, what stops you from just having a servlet filter looking
>>> for that parameter and assigning it directly to the MDC context?
>>>
>>> Regards,
>>> Alex.
>>> Personal website: http://www.outerthoughts.com/
>>> Current project: http://www.solr-start.com/ - Accelerating your Solr
>>> proficiency
>>>
>>>
>>> On Sat, Apr 5, 2014 at 7:45 AM, Alexandre Rafalovitch
>>>  wrote:
>>>
 I like the idea. No comments about implementation, leave it to others.

 But if it is done, maybe somebody very familiar with logging can also
 review Solr's current logging config. I suspect it is not optimized
 for troubleshooting at this point.

 Regards,
 Alex.
 Personal website: http://www.outerthoughts.com/
 Current project: http://www.solr-start.com/ - Accelerating your Solr

>>> proficiency
>>>

 On Sat, Apr 5, 2014 at 3:16 AM, Gregg Donovan 

>>> wrote:
>>>
 We have some metadata -- e.g. a request UUID -- that we log to every log
> line using Log4J's MDC [1]. The UUID logging allows us to connect any
>
 log
>>>
 lines we have for a given request across servers. Sort of like Zipkin
>
 [2].
>>>
 Currently we're using EmbeddedSolrServer without sharding, so adding the
> UUID is fairly simple, since everything is in one process and one
>
 thread.
>>>
 But, we're testing a sharded HTTP implementation and running into some
> difficulties getting this data passed around in a way that lets us
> trace
> all log lines generated by a request to its UUID.
>
>
>


Re: Full Indexing is Causing a Java Heap Out of Memory Exception

2014-04-07 Thread Candygram For Mongo
I wanted to take a moment and say thank you for your help.  We haven't
solved the problem yet but it seems like we may be on the path.

Responses to your questions below:

1) We are using settings of 6GBs for -Xmx and -Xms on a production server
where this process is failing on about 30 million relatively small records.
 We have the need to execute the same processes on much larger data sets
(10x or more).  There seems to be a somewhat linear requirement for memory
which is not sustainable.

2) We do not use the MDSolrDIHTransformer.jar.  That jar is some legacy
code that is commented out.  We are using the following jars:
common.jar, webapp.jar, commons-pool-1.4.jar.
 The first two contain our custom code, including filters.  The last
is from Apache.

3) We have Solr configured to switch what it uses based on the environment.
 Looking at the INFOSTREAM.txt file, it is using MMap in the environment in
question.

4) Incrementing the batchSize to 5,000 or 10,000 accelerates the OOM error
(using the 64MB heap size) and it is not able to execute the query.  See
the error below:



*java.sql.SQLException: Protocol violation: [2]*

*at oracle.jdbc.driver.T4CTTIfun.receive(T4CTTIfun.java:527)*

*at oracle.jdbc.driver.T4CTTIfun.doRPC(T4CTTIfun.java:227)*

*at
oracle.jdbc.driver.T4C7Ocommoncall.doOLOGOFF(T4C7Ocommoncall.java:61)*

*at oracle.jdbc.driver.T4CConnection.logoff(T4CConnection.java:574)*

*at
oracle.jdbc.driver.PhysicalConnection.close(PhysicalConnection.java:4011)*

*at
org.apache.solr.handler.dataimport.JdbcDataSource.closeConnection(JdbcDataSource.java:410)*

*at
org.apache.solr.handler.dataimport.JdbcDataSource.close(JdbcDataSource.java:395)*

*at
org.apache.solr.handler.dataimport.DocBuilder.closeEntityProcessorWrappers(DocBuilder.java:284)*

*at
org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:273)*

*at
org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:422)*

*at
org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:487)*

*at
org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:468)*



*Apr 07, 2014 11:11:54 AM org.apache.solr.common.SolrException log*

*SEVERE: Full Import failed:java.lang.RuntimeException:
java.lang.RuntimeException: org.apache.solr.handler.dataimport.Data*

*at
org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:266)*

*at
org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:422)*

*at
org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:487)*

*at
org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:468)*

*Caused by: java.lang.RuntimeException:
org.apache.solr.handler.dataimport.DataImportHandlerException:
java.lang.OutOfMemor*

*at
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:406)*

*at
org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:319)*

*at
org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:227)*

*... 3 more*

*Caused by: org.apache.solr.handler.dataimport.DataImportHandlerException:
java.lang.OutOfMemoryError: Java heap space*

*at
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:535)*

*at
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:404)*



We also suspect that the copyfield may be the culprit.  We are trying the
CSV process now.





On Sat, Apr 5, 2014 at 3:16 AM, Ahmet Arslan  wrote:

> Hi,
>
> Now we have a more informative error
> : org.apache.solr.handler.dataimport.DataImportHandlerException:
> java.lang.OutOfMemoryError: Java heap space
>
> Caused by: org.apache.solr.handler.dataimport.DataImportHandlerException:
> java.lang.OutOfMemoryError: Java heap space
> at
> org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:535)
> at
> org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:404)
>
> 1) Does this happen when you increase -Xmx64m -Xms64m ?
>
> 2) I see you use custom jars called "MDSolrDIHTransformer JARs inside"
>  But I don't see any Transformers used in database.xml -- why is that? I would
> remove them just to be sure.
>
> 3) I see you have org.apache.solr.core.StandardDirectoryFactory declared
> in solrconfig.xml. Assuming you are using 64-bit Windows, it is recommended to
> use MMap
> http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html
>
>
> 4) In your previous mail you had batch size set, now there is not
> batchSize defined in database.xml. For MySQL it is recommended to use -1.
> Not sure about oracle, I personally used 10,000 once for Oracle.
> http://wiki.apache.org/solr/DataImportHandlerFaq#I.27m_using_DataImportHandler_with_a_MySQL_database._My_table_is_huge_and_DataImportHandler_is_going
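
(For reference, the batchSize Ahmet mentions goes on the DIH dataSource element; the 
driver, URL, and credentials below are placeholders:)

<dataSource type="JdbcDataSource"
            driver="oracle.jdbc.OracleDriver"
            url="jdbc:oracle:thin:@//dbhost:1521/ORCL"
            user="solr_reader"
            password="********"
            batchSize="10000"
            readOnly="true"/>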

Re: Duplicate Unique Key

2014-04-07 Thread Erick Erickson
Oh my yes! I feel a great sense of relief every time an intermittent
problem becomes reproducible... The problem is not solved, but at
least I have a good feeling that once I don't see it any more it's
_really_ gone!

One possibility is index merging, see:
https://wiki.apache.org/solr/MergingSolrIndexes. When you merge
indexes, there is no duplicate id checking performed, so you can well
have duplicates. That's a wild shot in the dark though.

Best,
Erick

On Mon, Apr 7, 2014 at 12:26 PM, Simon  wrote:
> Erick,
>
> It's indeed quite odd.  And after I trigger re-indexing all documents (via
> the normal process of existing program). The duplication is gone.  It can
> not be reproduced easily.  But it did occur occasionally and that makes it a
> frustrating task to troubleshoot.
>
> Thanks,
> Simon
>
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Duplicate-Unique-Key-tp4129651p4129701.html
> Sent from the Solr - User mailing list archive at Nabble.com.


Re: Analysis of Japanese characters

2014-04-07 Thread T. Kuro Kurosaka

Tom,
You should be using JapaneseAnalyzer (kuromoji).
Neither CJK nor ICU tokenize at word boundaries.

On 04/02/2014 10:33 AM, Tom Burton-West wrote:

Hi Shawn,

I'm not sure I understand the problem, or why you need to solve it at the
ICUTokenizer level rather than the CJKBigramFilter.
Can you perhaps give a few examples of the problem?

Have you looked at the flags for the CJKBigramfilter?
You can tell it to make bigrams of different Japanese character sets.  For
example the config given in the JavaDocs tells it to make bigrams across 3
of the different Japanese character sets.  (Is the issue related to Romaji?)

  



http://lucene.apache.org/core/4_7_1/analyzers-common/org/apache/lucene/analysis/cjk/CJKBigramFilterFactory.html
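
An illustrative version of such a configuration (not necessarily the exact JavaDoc 
snippet) that bigrams the three Japanese scripts looks like this:

<!-- emit bigrams across Han, Hiragana and Katakana; leave Hangul and unigrams off -->
<filter class="solr.CJKBigramFilterFactory"
        han="true" hiragana="true" katakana="true"
        hangul="false" outputUnigrams="false"/>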

Tom


On Wed, Apr 2, 2014 at 1:19 PM, Shawn Heisey  wrote:


My company is setting up a system for a customer from Japan.  We have an
existing system that handles primarily English.

Here's my general text analysis chain:

http://apaste.info/xa5

After talking to the customer about problems they are encountering with
search, we have determined that some of the problems are caused because
ICUTokenizer splits on *any* character set change, including changes
between different Japanese character sets.

Knowing the risk of this being an XY problem, here's my question: Can
someone help me develop a rule file for the ICU Tokenizer that will *not*
split when the character set changes from one of the japanese character
sets to another japanese character set, but still split on other character
set changes?

Thanks,
Shawn






Re: Analysis of Japanese characters

2014-04-07 Thread Shawn Heisey

On 4/7/2014 2:07 PM, T. Kuro Kurosaka wrote:

Tom,
You should be using JapaneseAnalyzer (kuromoji).
Neither CJK nor ICU tokenize at word boundaries.


Is JapaneseAnalyzer configurable with regard to what it does with 
non-japanese text?  If it's not, it won't work for me.


We use a combination of tokenizers and filters because there are no full 
analyzers that do what we require.  My analysis chain (for our index 
that's primarily english) has evolved over the last few years into its 
current form:


http://apaste.info/xa5

For our Japanese customer, we have recently changed from 
ICUFoldingFilter to ASCIIFoldingFilter and ICUNormalizer2Filter, because 
they do not want us to fold accent marks on Japanese characters.  I do 
not understand enough about Japanese to have an opinion on this, beyond 
the general "we should normalize EVERYTHING" approach.  The data from 
this customer is not purely Japanese - there is a lot of English as 
well, and quite possibly a small amount of other languages.
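
A sketch of that substitution in the analysis chain (the normalization form and filter 
order here are assumptions, not the actual config):

<!-- previously: <filter class="solr.ICUFoldingFilterFactory"/> -->
<filter class="solr.ASCIIFoldingFilterFactory"/>
<filter class="solr.ICUNormalizer2FilterFactory" name="nfkc" mode="compose"/>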


Thanks,
Shawn



Re: Solr interface

2014-04-07 Thread Michael Della Bitta
The speed of ingest via HTTP improves greatly once you do two things:

1. Batch multiple documents into a single request.
2. Index with multiple threads at once.
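
A minimal SolrJ sketch of both points (URL, document fields, batch size, and thread 
count are illustrative):

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class BatchIndexer {
    public static void main(String[] args) throws Exception {
        // HttpSolrServer is thread-safe, so one instance can be shared by all workers.
        final HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");
        ExecutorService pool = Executors.newFixedThreadPool(4);      // point 2: several threads

        for (int t = 0; t < 4; t++) {
            final int offset = t;
            pool.submit(new Runnable() {
                public void run() {
                    try {
                        List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
                        for (int i = offset; i < 100000; i += 4) {
                            SolrInputDocument doc = new SolrInputDocument();
                            doc.addField("id", Integer.toString(i));
                            doc.addField("title", "document " + i);
                            batch.add(doc);
                            if (batch.size() == 500) {               // point 1: batch per request
                                server.add(batch);
                                batch.clear();
                            }
                        }
                        if (!batch.isEmpty()) {
                            server.add(batch);
                        }
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
        server.commit();
    }
}

ConcurrentUpdateSolrServer is another option; it batches and streams documents to Solr in 
the background for you.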

Michael Della Bitta

Applications Developer

o: +1 646 532 3062

appinions inc.

"The Science of Influence Marketing"

18 East 41st Street

New York, NY 10017

t: @appinions  | g+:
plus.google.com/appinions
w: appinions.com 


On Mon, Apr 7, 2014 at 12:40 PM, Daniel Collins wrote:

> I have to agree with Shawn.  We have a SolrCloud setup with 256 shards,
> ~400M documents in total, with 4-way replication (so its quite a big
> setup!)  I had thought that HTTP would slow things down, so we recently
> trialed a JNI approach (clients are C++) so we could call SolrJ and get the
> benefits of JavaBin encoding for our indexing.
>
> Once we had done benchmarks with both solutions, I think we saved about 1ms
> per document (on average) with JNI, so it wasn't as big a gain as we were
> expecting.  There are other benefits of SolrJ (zookeeper integration,
> better routing, etc) and we were doing local HTTP (so it was literally just
> a TCP port to localhost, no actual net traffic) but that just goes to prove
> what other posters have said here.  Check whether HTTP really *is* the
> bottleneck before you try to replace it!
>
>
> On 7 April 2014 17:05, Shawn Heisey  wrote:
>
> > On 4/7/2014 5:52 AM, Jonathan Varsanik wrote:
> >
> >> Do you mean to tell me that the people on this list that are indexing
> >> 100s of millions of documents are doing this over http?  I have been
> using
> >> custom Lucene code to index files, as I thought this would be faster for
> >> many documents and I wanted some non-standard OCR and index fields.  Is
> >> there a better way?
> >>
> >> To the OP: You can also use Lucene to locally index files for Solr.
> >>
> >
> > My sharded index has 94 million docs in it.  All normal indexing and
> > maintenance is done with SolrJ, over HTTP. Currently full rebuilds are done
> > with the dataimport handler loading from MySQL, but that is legacy.  This
> > is NOT a SolrCloud installation.  It is also not a replicated setup -- my
> > indexing program keeps both copies up to date independently, similar to
> > what happens behind the scenes with SolrCloud.
> >
> > The single-thread DIH is very well optimized, and is faster than what I
> > have written myself -- also single-threaded.
> >
> > The real reason that we still use DIH for rebuilds is that I can run the
> > DIH simultaneously on all shards.  A full rebuild that way takes about 5
> > hours.  A SolrJ process feeding all shards with a single thread would
> take
> > a lot longer.  Once I have time to work on it, I can make the SolrJ
> rebuild
> > multi-threaded, and I expect it will be similar to DIH in rebuild speed.
> >  Hopefully I can make it faster.
> >
> > There is always overhead with HTTP.  On a gigabit LAN, I don't think it's
> > high enough to matter.
> >
> > Using Lucene to index files for Solr is an option -- but that requires
> > writing a custom Lucene application, and knowledge about how to turn the
> > Solr schema into Lucene code.  A lot of users on this list (me included)
> do
> > not have the skills required.  I know SolrJ reasonably well, but Lucene
> is
> > a nut that I haven't cracked.
> >
> > Thanks,
> > Shawn
> >
> >
>


Re: Full Indexing is Causing a Java Heap Out of Memory Exception

2014-04-07 Thread Ahmet Arslan
Hi,

I had similar problems before. We were trying to do the same thing as you, fetching 
too many small documents from Oracle with DIH. We were getting 

Caused by: java.sql.SQLException: ORA-01652: unable to extend temp segment by 
128 in tablespace TS_TEMP ORA-06512: at "IZCI.GET_FEED_KEYWORDS", line 20 at 
oracle.jdbc.driver.T4CTTIoer.processError(T4CTTIoer.java:450) at 
oracle.jdbc.driver.T4CTTIoer.processError(T4CTTIoer.java:399) at 
oracle.jdbc.driver.T4C8Oall.processError(T4C8Oall.java:837) at 
oracle.jdbc.driver.T4CTTIfun.receive(T4CTTIfun.java:459) at 
oracle.jdbc.driver.T4CTTIfun.doRPC(T4CTTIfun.java:193) at 
oracle.jdbc.driver.T4C8Oall.doOALL(T4C8Oall.java:531) at 
oracle.jdbc.driver.T4CStatement.doOall8(T4CStatement.java:197) at 
oracle.jdbc.driver.T4CStatement.fetch(T4CStatement.java:1348) at 
oracle.jdbc.driver.OracleResultSetImpl.close_or_fetch_from_next(OracleResultSetImpl.java:635)
 at oracle.jdbc.driver.OracleResultSetImpl.next(OracleResultSetImpl.java:514) 
at 
org.apache.solr.handler.dataimport.JdbcDataSource$ResultSetIterator.hasnext(JdbcDataSource.java:334)
 ... 12 more


DB admins explained it but I don't remember. At the end we sliced our SQL 
sentence and did smaller imports with clean=false. 
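
For example, each slice can then be loaded with its own full-import call that keeps what 
the previous slices put in the index (host and core name are placeholders):

http://localhost:8983/solr/core1/dataimport?command=full-import&clean=false&commit=true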


Ahmet


On Monday, April 7, 2014 11:00 PM, Candygram For Mongo <> wrote:
I wanted to take a moment and say thank you for your help.  We haven't
solved the problem yet but it seems like we may be on the path.

Responses to your questions below:

1) We are using settings of 6GBs for -Xmx and -Xms on a production server
where this process is failing on about 30 million relatively small records.
We have the need to execute the same processes on much larger data sets
(10x or more).  There seems to be a somewhat linear requirement for memory
which is not sustainable.

2) We do not use the MDSolrDIHTransformer.jar.  That jar is some legacy
code that is commented out.  We are using the following jars:
common.jar, webapp.jar, commons-pool-1.4.jar.
The first two have our custom code in it that include filters.  The last
is from Apache.

3) We have Solr configured to switch what it uses based on the environment.
Looking at the INFOSTREAM.txt file, it is using MMap in the environment in
question.

4) Incrementing the batchSize to 5,000 or 10,000 accelerates the OOM error
(using the 64MB heap size) and it is not able to execute the query.  See
the error below:



*java.sql.SQLException: Protocol violation: [2]*

*        at oracle.jdbc.driver.T4CTTIfun.receive(T4CTTIfun.java:527)*

*        at oracle.jdbc.driver.T4CTTIfun.doRPC(T4CTTIfun.java:227)*

*        at
oracle.jdbc.driver.T4C7Ocommoncall.doOLOGOFF(T4C7Ocommoncall.java:61)*

*        at oracle.jdbc.driver.T4CConnection.logoff(T4CConnection.java:574)*

*        at
oracle.jdbc.driver.PhysicalConnection.close(PhysicalConnection.java:4011)*

*        at
org.apache.solr.handler.dataimport.JdbcDataSource.closeConnection(JdbcDataSource.java:410)*

*        at
org.apache.solr.handler.dataimport.JdbcDataSource.close(JdbcDataSource.java:395)*

*        at
org.apache.solr.handler.dataimport.DocBuilder.closeEntityProcessorWrappers(DocBuilder.java:284)*

*        at
org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:273)*

*        at
org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:422)*

*        at
org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:487)*

*        at
org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:468)*



*Apr 07, 2014 11:11:54 AM org.apache.solr.common.SolrException log*

*SEVERE: Full Import failed:java.lang.RuntimeException:
java.lang.RuntimeException: org.apache.solr.handler.dataimport.Data*

*        at
org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:266)*

*        at
org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:422)*

*        at
org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:487)*

*        at
org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:468)*

*Caused by: java.lang.RuntimeException:
org.apache.solr.handler.dataimport.DataImportHandlerException:
java.lang.OutOfMemor*

*        at
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:406)*

*        at
org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:319)*

*        at
org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:227)*

*        ... 3 more*

*Caused by: org.apache.solr.handler.dataimport.DataImportHandlerException:
java.lang.OutOfMemoryError: Java heap space*

*        at
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:535)*

*        at
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:404)*



We also suspect that the copyfield may be the culprit.  We are trying the
CSV process now.





On Sat, Apr 5, 2014 at 3:16 AM, Ahmet Arslan  wrote:

> Hi,
>
> Now we ha

Re: Searching multivalue fields.

2014-04-07 Thread Vijay Kokatnur
Yes I did restart solr, but did not re-index.  Is that necessary?  We've
got 80G of indexed data, is there a "preferred" way of doing it without
impacting performance?


On Sat, Apr 5, 2014 at 9:44 AM, Ahmet Arslan  wrote:

> Hi,
>
> Did you restart Solr and re-index after the schema change?
>On Saturday, April 5, 2014 2:39 AM, Vijay Kokatnur <
> kokatnur.vi...@gmail.com> wrote:
>  I had already tested with omitTermFreqAndPositions="false" .  I still
> got the same error.
>
> Is there something that I am overlooking?
>
> On Fri, Apr 4, 2014 at 2:45 PM, Ahmet Arslan  wrote:
>
> Hi Vijay,
>
> Add omitTermFreqAndPositions="false"  attribute to fieldType definitions.
>
>  omitTermFreqAndPositions="false" sortMissingLast="true" />
>
> omitTermFreqAndPositions="false" precisionStep="0"
> positionIncrementGap="0"/>
>
> You don't need termVectors  for this.
>
>1.2: omitTermFreqAndPositions attribute introduced, true by default
> except for text fields.
>
> And please reply to solr user mail, so others can use the threat later on.
>
> Ahmet
>   On Saturday, April 5, 2014 12:18 AM, Vijay Kokatnur <
> kokatnur.vi...@gmail.com> wrote:
>   Hey Ahmet,
>
> Sorry it took some time to test this.  But the schema definition seems to
> conflict with SpanQuery.  I get the following error when I use Spans
>
>  field "OrderLineType" was indexed without position data; cannot run
> SpanTermQuery (term=11)
>
> I changed the field definition in the schema but can't find the right
> attribute to set this.  My last attempt was with the following definition
>
> multiValued="true" *termVectors="true" termPositions="true"
> termOffsets="true"*/>
>
>  Any ideas what I am doing wrong?
>
> Thanks,
> -Vijay
>
> On Wed, Mar 26, 2014 at 1:54 PM, Ahmet Arslan  wrote:
>
> Hi Vijay,
>
> After reading the documentation it seems that following query is what you
> are after. It will return OrderId:345 without matching OrderId:123
>
> SpanQuery q1  = new SpanTermQuery(new Term("BookingRecordId", "234"));
> SpanQuery q2  = new SpanTermQuery(new Term("OrderLineType", "11"));
> SpanQuery q2m = new FieldMaskingSpanQuery(q2, "BookingRecordId");
> Query q = new SpanNearQuery(new SpanQuery[]{q1, q2m}, -1, false);
>
> Ahmet
>
>
>
> On Wednesday, March 26, 2014 10:39 PM, Ahmet Arslan 
> wrote:
> Hi Vijay,
>
> I personally don't understand joins very well. Just a guess: maybe
> FieldMaskingSpanQuery could be used?
>
>
> http://blog.griddynamics.com/2011/07/solr-experience-search-parent-child.html
>
>
> Ahmet
>
>
>
>
> On Wednesday, March 26, 2014 9:46 PM, Vijay Kokatnur <
> kokatnur.vi...@gmail.com> wrote:
> Hi,
>
> I am bumping this thread again one last time to see if anyone has a
> solution.
>
> In its current state, our application is storing child items as multivalue
> fields.  Consider some orders, for example -
>
>
> {
> OrderId:123
> BookingRecordId : ["145", "987", "*234*"]
> OrderLineType : ["11", "12", "*13*"]
> .
> }
> {
> OrderId:345
> BookingRecordId : ["945", "882", "*234*"]
> OrderLineType : ["1", "12", "*11*"]
> .
> }
> {
> OrderId:678
> BookingRecordId : ["444"]
> OrderLineType : ["11"]
> .
> }
>
>
> Here, if you look up an Order with BookingRecordId:234 AND
> OrderLineType:11, you will get two orders, with OrderId 123 and 345,
> which is correct.  Both orders have arrays that satisfy this
> condition.
>
> However, for OrderId:123, the value at the 3rd index of the OrderLineType array
> is 13 and not 11 (it is 11 for OrderId:345).  So OrderId 123 should be
> excluded. This is what I am trying to achieve.
>
> I got some suggestions from a solr-user to use FieldsCollapsing, Join,
> Block-join or string concatenation.  None of these approaches can be used
> without changing the schema and re-indexing.
>
> Has anyone found a non-invasive solution for this?
>
> Thanks,
>
> -Vijay
>
>
>
>
>
>
>
>


Re: Distributed tracing for Solr via adding HTTP headers?

2014-04-07 Thread Michael Sokolov
Yes, I see.  SolrDispatchFilter is  - not really written with 
extensibility in mind.


-Mike

On 4/7/14 3:50 PM, Gregg Donovan wrote:

Michael,

Thanks! Unfortunately, as we use POSTs, that approach would trigger the
getParameterIncompatibilityException call due to the Enumeration of
getParameterNames before SolrDispatchFilter has a chance to access the
InputStream.

I opened https://issues.apache.org/jira/browse/SOLR-5969 to discuss further
and attached our current patch.


On Mon, Apr 7, 2014 at 2:02 PM, Michael Sokolov <
msoko...@safaribooksonline.com> wrote:


I had to grapple with something like this problem when I wrote Lux's
app-server.  I extended SolrDispatchFilter and handle parameter swizzling
to keep everything nicey-nicey for Solr while being able to play games with
parameters of my own.  Perhaps this will give you some ideas:

https://github.com/msokolov/lux/blob/master/src/main/java/
lux/solr/LuxDispatchFilter.java

It's definitely hackish, but seems to get the job done - for me - it's not
a reusable component, but might serve as an illustration of one way to
handle the problem

-Mike


On 04/07/2014 12:23 PM, Gregg Donovan wrote:


That was my first attempt, but it's much trickier than I anticipated.

A filter that calls HttpServletRequest#getParameter() before
SolrDispatchFilter will trigger an exception  -- see
getParameterIncompatibilityException [1] -- if the request is a POST. It
seems that Solr depends on the configured per-core SolrRequestParser to
properly parse the request parameters. A servlet filter that came before
SolrDispatchFilter would need to fetch the correct SolrRequestParser for
the requested core, parse the request, and reset the InputStream before
pulling the data into the MDC. It also duplicates the work of request
parsing. It's especially tricky if you want to remove the tracing
parameters from the SolrParams and just have them in the MDC to avoid them
being logged twice.


[1]
https://github.com/apache/lucene-solr/blob/trunk/solr/
core/src/java/org/apache/solr/servlet/SolrRequestParsers.java#L621:L628


On Sun, Apr 6, 2014 at 2:20 PM, Alexandre Rafalovitch 
wrote:

  On the second thought,

If you are already managing to pass the value using the request
parameters, what stops you from just having a servlet filter looking
for that parameter and assigning it directly to the MDC context?

Regards,
 Alex.
Personal website: http://www.outerthoughts.com/
Current project: http://www.solr-start.com/ - Accelerating your Solr
proficiency


On Sat, Apr 5, 2014 at 7:45 AM, Alexandre Rafalovitch
 wrote:


I like the idea. No comments about implementation, leave it to others.

But if it is done, maybe somebody very familiar with logging can also
review Solr's current logging config. I suspect it is not optimized
for troubleshooting at this point.

Regards,
 Alex.
Personal website: http://www.outerthoughts.com/
Current project: http://www.solr-start.com/ - Accelerating your Solr


proficiency


On Sat, Apr 5, 2014 at 3:16 AM, Gregg Donovan 


wrote:


We have some metadata -- e.g. a request UUID -- that we log to every log

line using Log4J's MDC [1]. The UUID logging allows us to connect any


log
lines we have for a given request across servers. Sort of like Zipkin
[2].
Currently we're using EmbeddedSolrServer without sharding, so adding the

UUID is fairly simple, since everything is in one process and one


thread.
But, we're testing a sharded HTTP implementation and running into some

difficulties getting this data passed around in a way that lets us
trace
all log lines generated by a request to its UUID.






Re: Distributed tracing for Solr via adding HTTP headers?

2014-04-07 Thread Steve Davids
I have had this exact same use case and we ended up just setting a header 
value, then in a Servlet Filter we read the header value and set the MDC 
property within the filter. By reading the header value it didn’t complain 
about reading the request before making it to the SolrDispatchFilter. We used 
the Jetty web defaults to jam this functionality at the beginning of the 
servlet processing chain without having to crack open the war.
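
A minimal sketch of that filter (header name, MDC key, and the SLF4J import are 
assumptions, not the actual code):

import java.io.IOException;
import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;
import javax.servlet.http.HttpServletRequest;
import org.slf4j.MDC;

public class RequestUuidFilter implements Filter {
    public void init(FilterConfig config) { }

    public void doFilter(ServletRequest req, ServletResponse resp, FilterChain chain)
            throws IOException, ServletException {
        // Reading a header (unlike a parameter) does not consume the POST body,
        // so SolrDispatchFilter can still parse the request afterwards.
        String uuid = ((HttpServletRequest) req).getHeader("X-Request-UUID");
        if (uuid != null) {
            MDC.put("requestUUID", uuid);
        }
        try {
            chain.doFilter(req, resp);
        } finally {
            MDC.remove("requestUUID");
        }
    }

    public void destroy() { }
}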

-Steve

On Apr 7, 2014, at 8:01 PM, Michael Sokolov  
wrote:

> Yes, I see.  SolrDispatchFilter is  - not really written with extensibility 
> in mind.
> 
> -Mike
> 
> On 4/7/14 3:50 PM, Gregg Donovan wrote:
>> Michael,
>> 
>> Thanks! Unfortunately, as we use POSTs, that approach would trigger the
>> getParameterIncompatibilityException call due to the Enumeration of
>> getParameterNames before SolrDispatchFilter has a chance to access the
>> InputStream.
>> 
>> I opened https://issues.apache.org/jira/browse/SOLR-5969 to discuss further
>> and attached our current patch.
>> 
>> 
>> On Mon, Apr 7, 2014 at 2:02 PM, Michael Sokolov <
>> msoko...@safaribooksonline.com> wrote:
>> 
>>> I had to grapple with something like this problem when I wrote Lux's
>>> app-server.  I extended SolrDispatchFilter and handle parameter swizzling
>>> to keep everything nicey-nicey for Solr while being able to play games with
>>> parameters of my own.  Perhaps this will give you some ideas:
>>> 
>>> https://github.com/msokolov/lux/blob/master/src/main/java/
>>> lux/solr/LuxDispatchFilter.java
>>> 
>>> It's definitely hackish, but seems to get the job done - for me - it's not
>>> a reusable component, but might serve as an illustration of one way to
>>> handle the problem
>>> 
>>> -Mike
>>> 
>>> 
>>> On 04/07/2014 12:23 PM, Gregg Donovan wrote:
>>> 
 That was my first attempt, but it's much trickier than I anticipated.
 
 A filter that calls HttpServletRequest#getParameter() before
 SolrDispatchFilter will trigger an exception  -- see
 getParameterIncompatibilityException [1] -- if the request is a POST. It
 seems that Solr depends on the configured per-core SolrRequestParser to
 properly parse the request parameters. A servlet filter that came before
 SolrDispatchFilter would need to fetch the correct SolrRequestParser for
 the requested core, parse the request, and reset the InputStream before
 pulling the data into the MDC. It also duplicates the work of request
 parsing. It's especially tricky if you want to remove the tracing
 parameters from the SolrParams and just have them in the MDC to avoid them
 being logged twice.
 
 
 [1]
 https://github.com/apache/lucene-solr/blob/trunk/solr/
 core/src/java/org/apache/solr/servlet/SolrRequestParsers.java#L621:L628
 
 
 On Sun, Apr 6, 2014 at 2:20 PM, Alexandre Rafalovitch  wrote:
  On the second thought,
> If you are already managing to pass the value using the request
> parameters, what stops you from just having a servlet filter looking
> for that parameter and assigning it directly to the MDC context?
> 
> Regards,
> Alex.
> Personal website: http://www.outerthoughts.com/
> Current project: http://www.solr-start.com/ - Accelerating your Solr
> proficiency
> 
> 
> On Sat, Apr 5, 2014 at 7:45 AM, Alexandre Rafalovitch
>  wrote:
> 
>> I like the idea. No comments about implementation, leave it to others.
>> 
>> But if it is done, maybe somebody very familiar with logging can also
>> review Solr's current logging config. I suspect it is not optimized
>> for troubleshooting at this point.
>> 
>> Regards,
>> Alex.
>> Personal website: http://www.outerthoughts.com/
>> Current project: http://www.solr-start.com/ - Accelerating your Solr
>> 
> proficiency
> 
>> On Sat, Apr 5, 2014 at 3:16 AM, Gregg Donovan 
>> 
> wrote:
> 
>> We have some metadata -- e.g. a request UUID -- that we log to every log
>>> line using Log4J's MDC [1]. The UUID logging allows us to connect any
>>> 
>> log
>> lines we have for a given request across servers. Sort of like Zipkin
>> [2].
>> Currently we're using EmbeddedSolrServer without sharding, so adding the
>>> UUID is fairly simple, since everything is in one process and one
>>> 
>> thread.
>> But, we're testing a sharded HTTP implementation and running into some
>>> difficulties getting this data passed around in a way that lets us
>>> trace
>>> all log lines generated by a request to its UUID.
>>> 
>>> 
> 



Re: Solr interface

2014-04-07 Thread Jason Hellman
This.  And so much this.  As much this as you can muster.

On Apr 7, 2014, at 1:49 PM, Michael Della Bitta 
 wrote:

> The speed of ingest via HTTP improves greatly once you do two things:
> 
> 1. Batch multiple documents into a single request.
> 2. Index with multiple threads at once.
> 
> Michael Della Bitta
> 
> Applications Developer
> 
> o: +1 646 532 3062
> 
> appinions inc.
> 
> "The Science of Influence Marketing"
> 
> 18 East 41st Street
> 
> New York, NY 10017
> 
> t: @appinions  | g+:
> plus.google.com/appinions
> w: appinions.com 
> 
> 
> On Mon, Apr 7, 2014 at 12:40 PM, Daniel Collins wrote:
> 
>> I have to agree with Shawn.  We have a SolrCloud setup with 256 shards,
>> ~400M documents in total, with 4-way replication (so its quite a big
>> setup!)  I had thought that HTTP would slow things down, so we recently
>> trialed a JNI approach (clients are C++) so we could call SolrJ and get the
>> benefits of JavaBin encoding for our indexing
>> 
>> Once we had done benchmarks with both solutions, I think we saved about 1ms
>> per document (on average) with JNI, so it wasn't as big a gain as we were
>> expecting.  There are other benefits of SolrJ (zookeeper integration,
>> better routing, etc) and we were doing local HTTP (so it was literally just
>> a TCP port to localhost, no actual net traffic) but that just goes to prove
>> what other posters have said here.  Check whether HTTP really *is* the
>> bottleneck before you try to replace it!
>> 
>> 
>> On 7 April 2014 17:05, Shawn Heisey  wrote:
>> 
>>> On 4/7/2014 5:52 AM, Jonathan Varsanik wrote:
>>> 
 Do you mean to tell me that the people on this list that are indexing
 100s of millions of documents are doing this over http?  I have been
>> using
 custom Lucene code to index files, as I thought this would be faster for
 many documents and I wanted some non-standard OCR and index fields.  Is
 there a better way?
 
 To the OP: You can also use Lucene to locally index files for Solr.
 
>>> 
>>> My sharded index has 94 million docs in it.  All normal indexing and
>>> maintenance is done with SolrJ, over HTTP. Currently full rebuilds are done
>>> with the dataimport handler loading from MySQL, but that is legacy.  This
>>> is NOT a SolrCloud installation.  It is also not a replicated setup -- my
>>> indexing program keeps both copies up to date independently, similar to
>>> what happens behind the scenes with SolrCloud.
>>> 
>>> The single-thread DIH is very well optimized, and is faster than what I
>>> have written myself -- also single-threaded.
>>> 
>>> The real reason that we still use DIH for rebuilds is that I can run the
>>> DIH simultaneously on all shards.  A full rebuild that way takes about 5
>>> hours.  A SolrJ process feeding all shards with a single thread would
>> take
>>> a lot longer.  Once I have time to work on it, I can make the SolrJ
>> rebuild
>>> multi-threaded, and I expect it will be similar to DIH in rebuild speed.
>>> Hopefully I can make it faster.
>>> 
>>> There is always overhead with HTTP.  On a gigabit LAN, I don't think it's
>>> high enough to matter.
>>> 
>>> Using Lucene to index files for Solr is an option -- but that requires
>>> writing a custom Lucene application, and knowledge about how to turn the
>>> Solr schema into Lucene code.  A lot of users on this list (me included)
>> do
>>> not have the skills required.  I know SolrJ reasonably well, but Lucene
>> is
>>> a nut that I haven't cracked.
>>> 
>>> Thanks,
>>> Shawn
>>> 
>>> 
>> 



Re: Commit Within and /update/extract handler

2014-04-07 Thread Jamie Johnson
Below is the log showing what I believe to be the commit

07-Apr-2014 23:40:55.846 INFO [catalina-exec-5]
org.apache.solr.update.processor.LogUpdateProcessor.finish [forums]
webapp=/solr path=/update/extract
params={uprefix=attr_&literal.source_id=e4bb4bb6-96ab-4f8f-8a2a-1cf37dc1bcce&literal.content_group=File&
literal.id=e4bb4bb6-96ab-4f8f-8a2a-1cf37dc1bcce&literal.forum_id=3&literal.content_type=application/octet-stream&wt=javabin&literal.uploaded_by=+&version=2&literal.content_type=application/octet-stream&literal.file_name=exclusions}
{add=[e4bb4bb6-96ab-4f8f-8a2a-1cf37dc1bcce (1464785652471037952)]} 0 563
07-Apr-2014 23:41:10.847 INFO [commitScheduler-10-thread-1]
org.apache.solr.update.DirectUpdateHandler2.commit start
commit{,optimize=false,openSearcher=false,waitSearcher=true,expungeDeletes=false,softCommit=false,prepareCommit=false}
07-Apr-2014 23:41:10.847 INFO [commitScheduler-10-thread-1]
org.apache.solr.update.LoggingInfoStream.message
[IW][commitScheduler-10-thread-1]: commit: start
07-Apr-2014 23:41:10.848 INFO [commitScheduler-10-thread-1]
org.apache.solr.update.LoggingInfoStream.message
[IW][commitScheduler-10-thread-1]: commit: enter lock
07-Apr-2014 23:41:10.848 INFO [commitScheduler-10-thread-1]
org.apache.solr.update.LoggingInfoStream.message
[IW][commitScheduler-10-thread-1]: commit: now prepare
07-Apr-2014 23:41:10.848 INFO [commitScheduler-10-thread-1]
org.apache.solr.update.LoggingInfoStream.message
[IW][commitScheduler-10-thread-1]: prepareCommit: flush
07-Apr-2014 23:41:10.849 INFO [commitScheduler-10-thread-1]
org.apache.solr.update.LoggingInfoStream.message
[IW][commitScheduler-10-thread-1]:   index before flush _y(4.6):C1
_10(4.6):C1 _11(4.6):C1 _12(4.6):C1
07-Apr-2014 23:41:10.849 INFO [commitScheduler-10-thread-1]
org.apache.solr.update.LoggingInfoStream.message
[DW][commitScheduler-10-thread-1]: commitScheduler-10-thread-1
startFullFlush
07-Apr-2014 23:41:10.849 INFO [commitScheduler-10-thread-1]
org.apache.solr.update.LoggingInfoStream.message
[DW][commitScheduler-10-thread-1]: anyChanges? numDocsInRam=1 deletes=true
hasTickets:false pendingChangesInFullFlush: false
07-Apr-2014 23:41:10.850 INFO [commitScheduler-10-thread-1]
org.apache.solr.update.LoggingInfoStream.message
[DWFC][commitScheduler-10-thread-1]: addFlushableState
DocumentsWriterPerThread [pendingDeletes=gen=0, segment=_14,
aborting=false, numDocsInRAM=1, deleteQueue=DWDQ: [ generation: 2 ]]
07-Apr-2014 23:41:10.852 INFO [commitScheduler-10-thread-1]
org.apache.solr.update.LoggingInfoStream.message
[DWPT][commitScheduler-10-thread-1]: flush postings as segment _14 numDocs=1
07-Apr-2014 23:41:10.904 INFO [commitScheduler-10-thread-1]
org.apache.solr.update.LoggingInfoStream.message
[DWPT][commitScheduler-10-thread-1]: new segment has 0 deleted docs
07-Apr-2014 23:41:10.904 INFO [commitScheduler-10-thread-1]
org.apache.solr.update.LoggingInfoStream.message
[DWPT][commitScheduler-10-thread-1]: new segment has no vectors; norms; no
docValues; prox; freqs
07-Apr-2014 23:41:10.904 INFO [commitScheduler-10-thread-1]
org.apache.solr.update.LoggingInfoStream.message
[DWPT][commitScheduler-10-thread-1]: flushedFiles=[_14.nvd,
_14_Lucene41_0.pos, _14_Lucene41_0.tip, _14_Lucene41_0.tim, _14.nvm,
_14.fdx, _14_Lucene41_0.doc, _14.fnm, _14.fdt]
07-Apr-2014 23:41:10.905 INFO [commitScheduler-10-thread-1]
org.apache.solr.update.LoggingInfoStream.message
[DWPT][commitScheduler-10-thread-1]: flushed codec=Lucene46
07-Apr-2014 23:41:10.905 INFO [commitScheduler-10-thread-1]
org.apache.solr.update.LoggingInfoStream.message
[DWPT][commitScheduler-10-thread-1]: flushed: segment=_14 ramUsed=0.122 MB
newFlushedSize(includes docstores)=0.003 MB docs/MB=322.937
07-Apr-2014 23:41:10.907 INFO [commitScheduler-10-thread-1]
org.apache.solr.update.LoggingInfoStream.message
[DW][commitScheduler-10-thread-1]: publishFlushedSegment seg-private
updates=null
07-Apr-2014 23:41:10.907 INFO [commitScheduler-10-thread-1]
org.apache.solr.update.LoggingInfoStream.message
[IW][commitScheduler-10-thread-1]: publishFlushedSegment
07-Apr-2014 23:41:10.907 INFO [commitScheduler-10-thread-1]
org.apache.solr.update.LoggingInfoStream.message
[BD][commitScheduler-10-thread-1]: push deletes  1 deleted terms (unique
count=1) bytesUsed=1024 delGen=4 packetCount=1 totBytesUsed=1024
07-Apr-2014 23:41:10.907 INFO [commitScheduler-10-thread-1]
org.apache.solr.update.LoggingInfoStream.message
[IW][commitScheduler-10-thread-1]: publish sets newSegment delGen=5
seg=_14(4.6):C1
07-Apr-2014 23:41:10.908 INFO [commitScheduler-10-thread-1]
org.apache.solr.update.LoggingInfoStream.message
[IFD][commitScheduler-10-thread-1]: now checkpoint "_y(4.6):C1 _10(4.6):C1
_11(4.6):C1 _12(4.6):C1 _14(4.6):C1" [5 segments ; isCommit = false]
07-Apr-2014 23:41:10.908 INFO [commitScheduler-10-thread-1]
org.apache.solr.update.LoggingInfoStream.message
[IFD][commitScheduler-10-thread-1]: 0 msec to checkpoint
07-Apr-2014 23:41:10.908 INFO [commitScheduler-10-thread-1]
org.apache.solr.update.L

Re: Regex For *|* at hl.regex.pattern

2014-04-07 Thread Jack Krupansky
The regex pattern should match the text of the fragment. IOW, exclude 
whatever delimiters are not allowed in the fragment.


The default is:

[-\w ,\n"']{20,200}
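
So for the "*|*" case above, a pattern that simply excludes the delimiter characters 
should keep fragments from crossing it, for example (an untested sketch; remember to 
URL-encode it when passing it as a request parameter):

hl.regex.pattern=[^|*]{20,200}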

-- Jack Krupansky

-Original Message- 
From: Furkan KAMACI

Sent: Monday, April 7, 2014 10:21 AM
To: solr-user@lucene.apache.org
Subject: Regex For *|* at hl.regex.pattern

Hi;

I tried that but it does not work. Am I missing anything here:

q=portu&hl.regex.pattern=.*\*\|\*.*&hl.fragsize=120&hl.regex.slop=0.2

My aim is to check whether it includes *|* or not (that's why I've put .* at the
beginning and end of the regex, to match whatever comes before and after)

How to fix it?

Thanks;
Furkan KAMACI 



Re: Reading Solr index

2014-04-07 Thread Dmitry Kan
Thanks, François.

azhar2007: remember to set the perm gen size:

java -XX:MaxPermSize=512m -jar luke-with-deps.jar

Dmitry


On Mon, Apr 7, 2014 at 7:29 PM, François Schiettecatte <
fschietteca...@gmail.com> wrote:

> Maybe you should try a more recent release of Luke:
>
> https://github.com/DmitryKey/luke/releases
>
> François
>
> On Apr 7, 2014, at 12:27 PM, azhar2007  wrote:
>
> > Hi All,
> >
> > I have a Solr index which was indexed in Solr 4.7.0.
> >
> > I've attempted to open the index with Luke 4.0.0 and also other versions
> > with no luck.
> > It gives me an error message.
> >
> > Is there a way of reading the data?
> >
> > I would like to convert the file to a readable format where I can see the
> > terms it holds from the documents etc.
> >
> > Please Help!!
> >
> >
> >
> > --
> > View this message in context:
> http://lucene.472066.n3.nabble.com/Reading-Solr-index-tp4129662.html
> > Sent from the Solr - User mailing list archive at Nabble.com.
>
>


-- 
Dmitry
Blog: http://dmitrykan.blogspot.com
Twitter: http://twitter.com/dmitrykan


Re: Bad request on update.distrib=FROMLEADER

2014-04-07 Thread Cihad Guzel
Hi,

Do all of your nodes have the same configuration?


2014-04-07 12:45 GMT+03:00 Gastone Penzo :

> Hello,
> I have a problem with bad requests while indexing data.
> I have four nodes with SolrCloud. The architecture is this:
>
> 10.0.0.86   10.0.0.87
> NODE1  NODE 2
>  |  |
>  |  |
>  |  |
>  |  |
> NODE 3 NODE 4
> 10.0.0.88   10.0.0.89
>
> 2 shards (node1 and node 2) with 2 replicas (node 3 and node4)
>
>
> I tried to index data in node1 with DataImportHandler (MySQL) and a
> full-import.
> The index was created, but only half of it, and I had this error:
>
> bad request
>
> request:
>
> http://10.0.0.88:9002/solr/collection1/update?update.distrib=FROMLEADER&distrib.from=http://10.0.0.86:9000/solr/collection1/&wt=javabin&version=2
> at org.apache.solr.client.solrj.i
>
> mpl.ConcurrentUpdateSolrServer$Runner.run(ConcurrentUpdateSolrServer.java:240)
> at
>
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at
>
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:724)
>
> I think node 1 calls node 2 to give it half of the index, but the parameter
> distrib.from is incomplete. Why?
> If I create the index with post.jar there are no problems. Is it a problem with
> the DataImportHandler?
>
> thank you
>
>
> --
> *Gastone Penzo*
>