Re: indexing unique keys

2014-09-05 Thread Mikhail Khludnev
Hello,

You are asking without giving any context. What's the size of the sets, the
desired TPS, the key length, and even the values?
It's hard to answer definitively. This isn't a primary use case for Lucene, and
it adds some unnecessary overhead. However, the community has collected a few
workarounds for this kind of problem. On the other hand, as far as I know,
executing queries like WHERE x IN (1, ..., 2324) is not a piece of cake for SQL
servers either.

You can follow the link at
https://plus.google.com/u/0/+MichaelMcCandless/posts/8VNydNi3wvK to find a
relevant benchmark. It might help you get at least rough estimates for the
Lucene solution.
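For what it's worth, here is a minimal SolrJ sketch of the intersection idea.
The core name "hashes" and the field name "hash" are made up, and the {!terms}
query parser it relies on was only added in Solr 4.10, if I recall correctly:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class HashIntersection {
    public static void main(String[] args) throws Exception {
        HttpSolrServer server =
                new HttpSolrServer("http://localhost:8983/solr/hashes");
        // Intersect the indexed "index set" with a comparison set in one request.
        SolrQuery query = new SolrQuery("{!terms f=hash}h1,h2");
        query.setFields("hash");
        QueryResponse rsp = server.query(query);
        for (SolrDocument doc : rsp.getResults()) {
            System.out.println(doc.getFieldValue("hash")); // prints h1, h2
        }
        server.shutdown();
    }
}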



On Thu, Sep 4, 2014 at 5:53 PM, Mark N wrote:

> I have a use case where we want to store unique keys (hashes) which would
> be used to compare against another set of keys (hashes).
>
> For example
>
>  Index  set= { h1, h2 , h3 , h4 }
>
> comparison set = { h1, h2 }
>
> result set = h1,h2
>
> Would it be an advantage to store the "index set" in Solr instead of in a
> traditional database?
>
> Thanks in advance
>
>
>
>
>
>
> *Nipen Mark*
>



-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics





Re: Solr add document over 20 times slower after upgrade from 4.0 to 4.9

2014-09-05 Thread Li, Ryan
Hi Shawn,

Thanks for your reply.

The memory settings of my Solr box are:

12G physical memory.
4G for Java (-Xmx4096m).
The index size is around 4G in Solr 4.9; I think it was over 6G in Solr 4.0.

I do think the Java RAM size is one of the reasons for this slowness. I'm
doing one big commit, and when the ingestion process is 50% finished, I can see
the Solr server has already used over 90% of the available memory.

I'll try to assign more RAM to the Solr Java process. But from your experience,
does 4G sound like a good number for the Java heap size in my scenario? Is
there any way to reduce memory usage during indexing? (One thing I know of is
to do a few commits instead of one.) My concern is that, given I have 12G in
total, if I assign too much to the Solr server, I may not have enough left for
the OS to cache the Solr index files.

I had a look at the Solr config file but couldn't find anything obviously
wrong. Just wondering which parts of that config file would impact indexing
time?

Thanks,
Ryan





One possible source of problems with that particular upgrade is the fact
that stored field compression was added in 4.1, and termvector
compression was added in 4.2.  They are on by default and cannot be
turned off.  The compression is typically fast, but with very large
documents like yours, it might result in pretty major computational
overhead.  It can also require additional java heap, which ties into
what follows:

Another problem might be RAM-related.

If your java heap is very large, or just a little bit too small, there
can be major performance issues from garbage collection.  Based on the
fact that the earlier version performed well, a too-small heap is more
likely than a very large heap.

If your index size is such that it can't be effectively cached by the
amount of total RAM on the machine (minus the java heap assigned to
Solr), that can cause performance problems.  Your index size is likely
to be several gigabytes, and might even reach double-digit gigabytes.
Can you relate those numbers -- index size, java heap size, and total
system RAM?  If you can, it would also be a good idea to share your
solrconfig.xml.
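(As a worked example of that balance: with 12GB of total RAM and a 4GB heap,
roughly 8GB is left for the OS page cache, which should comfortably hold a
~4GB index; trouble starts when total RAM minus heap falls well below the
index size.)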

Here's a wiki page that goes into more detail about possible performance
issues.  It doesn't mention the possible compression problem:

http://wiki.apache.org/solr/SolrPerformanceProblems

Thanks,
Shawn


RE: Solr add document over 20 times slower after upgrade from 4.0 to 4.9

2014-09-05 Thread Li, Ryan
Hi Erick,

As Ryan Ernst noticed, those big fields (e.g. majorTextSignalStem) are not
stored. There are a few stored fields in my schema, but they are very small
fields, basically the name or id of the document. I tried turning them off
(only storing the id field) and that didn't make any difference.

Thanks,
Ryan

Ryan:

As it happens, there's a discussion on the dev list about this.

If at all possible, could you try a brief experiment? Turn off
all the storage, i.e. set stored="false" on all fields. It's a lot
to ask, but it'd help the discussion.

Or join the discussion at https://issues.apache.org/jira/browse/LUCENE-5914.

Best,
Erick




Re: Solr add document over 20 times slower after upgrade from 4.0 to 4.9

2014-09-05 Thread Li, Ryan
Hi Guys,

Just some update.

I've tried Solr 4.10 (same code as for Solr 4.9), and it has the same indexing
speed as 4.0. The only problem left now is that Solr 4.10 takes more memory
than 4.0, so I'm trying to figure out the best number for the Java heap size.

I think that proves there is a performance issue in Solr 4.9 when indexing
big documents (even just over 1MB).

Thanks,
Ryan


FAST-like document vector data structures in Solr?

2014-09-05 Thread Jürgen Wagner (DVT)
Hello all,
  as the migration from FAST to Solr is a relevant topic for several of
our customers, there is one issue that does not seem to be addressed by
Lucene/Solr: document vectors FAST-style. These document vectors are
used to form metrics of similarity, i.e., they may be used as a
"semantic fingerprint" of documents to define similarity relations. I
can think of several ways of approximating a mapping of this mechanism
to Solr, but there are always drawbacks - mostly performance-wise.

Has anybody else encountered and possibly approached this challenge so far?

Is there anything in the roadmap of Solr that has not revealed itself to
me, addressing this issue?

Your input is greatly appreciated!

Cheers,
--Jürgen



SolrJ 4.10.0 errors

2014-09-05 Thread Guido Medina

Hi,

I have upgraded from Solr 4.9 to 4.10 and the server side seems fine 
but the client is reporting the following exception:


org.apache.solr.client.solrj.SolrServerException: IOException occured 
when talking to server at: solr_host.somedomain
at 
org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:562)
at 
org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:210)
at 
org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:206)
at 
org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:124)

at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:68)
at org.apache.solr.client.solrj.SolrServer.add(SolrServer.java:54)
at ... (company's related packages)
Caused by: org.apache.http.NoHttpResponseException: solr_host.somedomain 
failed to respond
at 
org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:143)
at 
org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:57)
at 
org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:260)
at 
org.apache.http.impl.DefaultBHttpClientConnection.receiveResponseHeader(DefaultBHttpClientConnection.java:161)
at 
org.apache.http.impl.conn.CPoolProxy.receiveResponseHeader(CPoolProxy.java:153)
at 
org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:271)
at 
org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:123)
at 
org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:254)
at 
org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:195)
at 
org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:86)
at 
org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:108)
at 
org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:184)
at 
org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
at 
org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:106)
at 
org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:57)
at 
org.apache.solr.client.solrj.impl.HttpSolrServer.executeMethod(HttpSolrServer.java:448)

... 9 more

To test I downgraded the client to 4.9 and the error is gone.

Best regards,

Guido.


Re: FAST-like document vector data structures in Solr?

2014-09-05 Thread jim ferenczi
Hi,
Something like this?:
https://cwiki.apache.org/confluence/display/solr/The+Term+Vector+Component
And just to show some impressive search functionality of the wiki: ;)
https://cwiki.apache.org/confluence/dosearchsite.action?where=solr&spaceSearch=true&queryString=document+vectors
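To use it, the field needs term vectors enabled in schema.xml and a handler
wired to the tvComponent (the example config ships a /tvrh handler); the names
below are placeholders:

<field name="features" type="text_general" indexed="true" stored="true"
       termVectors="true" termPositions="true" termOffsets="true"/>

Then something like /tvrh?q=id:123&tv.fl=features&tv.tf_idf=true returns the
per-document term vectors.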

Cheers,
Jim


2014-09-05 9:44 GMT+02:00 "Jürgen Wagner (DVT)" :

> Hello all,
>   as the migration from FAST to Solr is a relevant topic for several of
> our customers, there is one issue that does not seem to be addressed by
> Lucene/Solr: document vectors FAST-style. These document vectors are
> used to form metrics of similarity, i.e., they may be used as a
> "semantic fingerprint" of documents to define similarity relations. I
> can think of several ways of approximating a mapping of this mechanism
> to Solr, but there are always drawbacks - mostly performance-wise.
>
> Has anybody else encountered and possibly approached this challenge so far?
>
> Is there anything in the roadmap of Solr that has not revealed itself to
> me, addressing this issue?
>
> Your input is greatly appreciated!
>
> Cheers,
> --Jürgen
>
>


Re: SolrJ 4.10.0 errors

2014-09-05 Thread Guido Medina
Sorry, I didn't give enough information, so I'm adding to it. The SolrJ 
client is in our webapp and the documents are getting indexed properly 
into Solr. The only problem we are seeing is that with SolrJ 4.10, once 
the Solr server response comes back, the SolrJ client doesn't seem to know 
what to do with it and reports the exception I mentioned. I then 
downgraded the SolrJ client to 4.9 and the exception is now gone. I'm 
using the following relevant libraries:


Java 7u67 64-bit on both the webapp client side and Jetty's side
HTTP client/mime 4.3.5
HTTP core 4.3.2

Here is a list of my modified Solr war lib folder; I usually don't stay 
with the standard jars because I believe most of them are out of date if 
you are running JDK 7u55+:


   antlr-runtime-3.5.jar
   asm-4.2.jar
   asm-commons-4.2.jar
   commons-cli-1.2.jar
   commons-codec-1.9.jar
   commons-configuration-1.9.jar
   commons-fileupload-1.3.1.jar
   commons-io-2.4.jar
   commons-lang-2.6.jar
   concurrentlinkedhashmap-lru-1.4.jar
   dom4j-1.6.1.jar
   guava-18.0.jar
   hadoop-annotations-2.2.0.jar
   hadoop-auth-2.2.0.jar
   hadoop-common-2.2.0.jar
   hadoop-hdfs-2.2.0.jar
   hppc-0.5.2.jar
   httpclient-4.3.5.jar
   httpcore-4.3.2.jar
   httpmime-4.3.5.jar
   joda-time-2.2.jar
   lucene-analyzers-common-4.10.0.jar
   lucene-analyzers-kuromoji-4.10.0.jar
   lucene-analyzers-phonetic-4.10.0.jar
   lucene-codecs-4.10.0.jar
   lucene-core-4.10.0.jar
   lucene-expressions-4.10.0.jar
   lucene-grouping-4.10.0.jar
   lucene-highlighter-4.10.0.jar
   lucene-join-4.10.0.jar
   lucene-memory-4.10.0.jar
   lucene-misc-4.10.0.jar
   lucene-queries-4.10.0.jar
   lucene-queryparser-4.10.0.jar
   lucene-spatial-4.10.0.jar
   lucene-suggest-4.10.0.jar
   noggit-0.5.jar
   org.restlet-2.1.1.jar
   org.restlet.ext.servlet-2.1.1.jar
   protobuf-java-2.6.0.jar
   solr-core-4.10.0.jar
   solr-solrj-4.10.0.jar
   spatial4j-0.4.1.jar
   wstx-asl-3.2.7.jar
   zookeeper-3.4.6.jar

Best regards,

Guido.





Re: FAST-like document vector data structures in Solr?

2014-09-05 Thread Jürgen Wagner (DVT)
Hello Jim,
  yes, I am aware of the TermVector and MoreLikeThis stuff. I am
presently mapping docvectors to these mechanisms and create term vectors
myself from third-party text mining components.

However, it's not quite like the FAST docvectors. Particularly, the
performance of MoreLikeThis queries based on TermVectors is suboptimal
on large document sets, so more efficient support for such retrievals
in the Lucene kernel would be preferred.

Cheers,
--Jürgen

On 05.09.2014 10:55, jim ferenczi wrote:
> Hi,
> Something like this?:
> https://cwiki.apache.org/confluence/display/solr/The+Term+Vector+Component
> And just to show some impressive search functionality of the wiki: ;)
> https://cwiki.apache.org/confluence/dosearchsite.action?where=solr&spaceSearch=true&queryString=document+vectors
>
> Cheers,
> Jim
>
>
> 2014-09-05 9:44 GMT+02:00 "Jürgen Wagner (DVT)":
>> Hello all,
>>   as the migration from FAST to Solr is a relevant topic for several of
>> our customers, there is one issue that does not seem to be addressed by
>> Lucene/Solr: document vectors FAST-style. These document vectors are
>> used to form metrics of similarity, i.e., they may be used as a
>> "semantic fingerprint" of documents to define similarity relations. I
>> can think of several ways of approximating a mapping of this mechanism
>> to Solr, but there are always drawbacks - mostly performance-wise.
>>
>> Has anybody else encountered and possibly approached this challenge so far?
>>
>> Is there anything in the roadmap of Solr that has not revealed itself to
>> me, addressing this issue?
>>
>> Your input is greatly appreciated!
>>
>> Cheers,
>> --Jürgen
>>
>>


-- 

Mit freundlichen Grüßen/Kind regards/Cordialement vôtre/Atentamente/С
уважением
*i.A. Jürgen Wagner*
Head of Competence Center "Intelligence"
& Senior Cloud Consultant

Devoteam GmbH, Industriestr. 3, 70565 Stuttgart, Germany
Phone: +49 6151 868-8725, Fax: +49 711 13353-53, Mobile: +49 171 864 1543
E-Mail: juergen.wag...@devoteam.com
, URL: www.devoteam.de



Managing Board: Jürgen Hatzipantelis (CEO)
Address of Record: 64331 Weiterstadt, Germany; Commercial Register:
Amtsgericht Darmstadt HRB 6450; Tax Number: DE 172 993 071




Re: Solr add document over 20 times slower after upgrade from 4.0 to 4.9

2014-09-05 Thread Alexandre Rafalovitch
Why do one big commit? You could do hard commits along the way, keep the
existing searcher open, and not see the changes until the end.

Obviously a separate issue from the memory consumption discussion, but I
thought I'd add it anyway.

Regards,
 Alex
On 05/09/2014 3:30 am, "Li, Ryan"  wrote:

> Hi Shawn,
>
> Thanks for your reply.
>
> The memory setting of my Solr box is
>
> 12G physically memory.
> 4G for java (-Xmx4096m)
> The index size is around 4G in Solr 4.9, I think it was over 6G in Solr
> 4.0.
>
> I do think the RAM size of java is one of the reasons for this slowness.
> I'm doing one big commit and when the ingestion process finished 50%, I can
> see the solr server already used over 90% of full memory.
>
> I'll try to assign more RAM to Solr Java. But from your experience, does
> 4G sounds like a good number for Java heap size for my scenario? Is there
> any way to reduce memory usage during index time? (One thing I know is do a
> few commits instead of one commit. )  My concern is providing I have 12 G
> in total, If I assign too much to Solr server, I may not have enough for
> the OS to cache Solr index file.
>
> I had a look to solr config file, but couldn't find anything that
> obviously wrong, Just wondering which part of that config file would impact
> the index time?
>
> Thanks,
> Ryan
>
>
>
>
>
> One possible source of problems with that particular upgrade is the fact
> that stored field compression was added in 4.1, and termvector
> compression was added in 4.2.  They are on by default and cannot be
> turned off.  The compression is typically fast, but with very large
> documents like yours, it might result in pretty major computational
> overhead.  It can also require additional java heap, which ties into
> what follows:
>
> Another problem might be RAM-related.
>
> If your java heap is very large, or just a little bit too small, there
> can be major performance issues from garbage collection.  Based on the
> fact that the earlier version performed well, a too-small heap is more
> likely than a very large heap.
>
> If your index size is such that it can't be effectively cached by the
> amount of total RAM on the machine (minus the java heap assigned to
> Solr), that can cause performance problems.  Your index size is likely
> to be several gigabytes, and might even reach double-digit gigabytes.
> Can you relate those numbers -- index size, java heap size, and total
> system RAM?  If you can, it would also be a good idea to share your
> solrconfig.xml.
>
> Here's a wiki page that goes into more detail about possible performance
> issues.  It doesn't mention the possible compression problem:
>
> http://wiki.apache.org/solr/SolrPerformanceProblems
>
> Thanks,
> Shawn
>


statuscode list

2014-09-05 Thread Jan Verweij - Reeleez
Hi,

If I'm correct, you will get a statuscode="0" in the response if you
use XML messages for updating the Solr index.
Is there a list of other possible status codes you can receive in case
anything fails, and what these error codes mean?

THNX,

Jan.


Re: Solr API for getting shard's leader/replica status

2014-09-05 Thread manohar211
Thanks for the comments!!
I found the solution for getting the replica's state. Here's the
piece of code:
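(This assumes iter is an Iterator<Slice> obtained from the cluster state, e.g.
via clusterState.getSlices(collectionName).iterator(); the names are
illustrative.)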

while (iter.hasNext()) {
    Slice slice = iter.next();
    for (Replica replica : slice.getReplicas()) {
        System.out.println("replica state for " + replica.getStr("core")
                + " : " + replica.getStr("state"));
        System.out.println(slice.getName());
        System.out.println(slice.getState());
    }
}





Re: Solr add document over 20 times slower after upgrade from 4.0 to 4.9

2014-09-05 Thread Mikhail Khludnev
On Fri, Sep 5, 2014 at 3:22 PM, Alexandre Rafalovitch 
wrote:

> Why do one big commit? You could do hard commits along the way, keep the
> existing searcher open, and not see the changes until the end.
>

Alexandre,
I don't think that can happen in Solr: the next search picks up the
new searcher.

Ryan,
Generally, commit frequency is driven by application requirements, i.e. when
updates need to become visible. Memory consumption is governed by
ramBufferSizeMB and maxIndexingThreads. Exceeding the buffer causes a flush to
disk, but doesn't trigger a commit.
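For reference, both knobs live in the <indexConfig> section of solrconfig.xml;
the values below are just the 4.x defaults, not a recommendation:

<indexConfig>
  <ramBufferSizeMB>100</ramBufferSizeMB>
  <maxIndexingThreads>8</maxIndexingThreads>
</indexConfig>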


> Obviously a separate issue from memory consumption discussion, but thought
> I'll add it anyway.
>
> Regards,
>  Alex
> On 05/09/2014 3:30 am, "Li, Ryan"  wrote:
>
> > Hi Shawn,
> >
> > Thanks for your reply.
> >
> > The memory setting of my Solr box is
> >
> > 12G physically memory.
> > 4G for java (-Xmx4096m)
> > The index size is around 4G in Solr 4.9, I think it was over 6G in Solr
> > 4.0.
> >
> > I do think the RAM size of java is one of the reasons for this slowness.
> > I'm doing one big commit and when the ingestion process finished 50%, I
> can
> > see the solr server already used over 90% of full memory.
> >
> > I'll try to assign more RAM to Solr Java. But from your experience, does
> > 4G sounds like a good number for Java heap size for my scenario? Is there
> > any way to reduce memory usage during index time? (One thing I know is
> do a
> > few commits instead of one commit. )  My concern is providing I have 12 G
> > in total, If I assign too much to Solr server, I may not have enough for
> > the OS to cache Solr index file.
> >
> > I had a look to solr config file, but couldn't find anything that
> > obviously wrong, Just wondering which part of that config file would
> impact
> > the index time?
> >
> > Thanks,
> > Ryan
> >
> >
> >
> >
> >
> > One possible source of problems with that particular upgrade is the fact
> > that stored field compression was added in 4.1, and termvector
> > compression was added in 4.2.  They are on by default and cannot be
> > turned off.  The compression is typically fast, but with very large
> > documents like yours, it might result in pretty major computational
> > overhead.  It can also require additional java heap, which ties into
> > what follows:
> >
> > Another problem might be RAM-related.
> >
> > If your java heap is very large, or just a little bit too small, there
> > can be major performance issues from garbage collection.  Based on the
> > fact that the earlier version performed well, a too-small heap is more
> > likely than a very large heap.
> >
> > If your index size is such that it can't be effectively cached by the
> > amount of total RAM on the machine (minus the java heap assigned to
> > Solr), that can cause performance problems.  Your index size is likely
> > to be several gigabytes, and might even reach double-digit gigabytes.
> > Can you relate those numbers -- index size, java heap size, and total
> > system RAM?  If you can, it would also be a good idea to share your
> > solrconfig.xml.
> >
> > Here's a wiki page that goes into more detail about possible performance
> > issues.  It doesn't mention the possible compression problem:
> >
> > http://wiki.apache.org/solr/SolrPerformanceProblems
> >
> > Thanks,
> > Shawn
> >
>



-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics





Re: FAST-like document vector data structures in Solr?

2014-09-05 Thread Jack Krupansky
For reference:

“Item Similarity Vector Reference

This property represents a similarity reference when searching for similar 
items. This is a similarity vector representation that is returned for each 
item in the query result in the docvector managed property.

The value is a string formatted according to the following format:

[string1,weight1][string2,weight2]...[stringN,weightN]

When performing a find similar query, the SimilarTo element should contain a 
string parameter with the value of the docvector managed property of the item 
that is to be used as the similarity reference. The similarity vector consists 
of a set of "term,weight" expressions, indicating the most important terms or 
concepts in the item and the corresponding perceived importance (weight). Terms 
can be single words or phrases.

The weight is a float value between 0 and 1, where 1 indicates the highest 
relevance.

The similarity vector is created during item processing and indicates the most 
important terms or concepts in the item and the corresponding weight.”

See:
http://msdn.microsoft.com/en-us/library/office/ff521597(v=office.14).aspx

-- Jack Krupansky

From: "Jürgen Wagner (DVT)" 
Sent: Friday, September 5, 2014 7:03 AM
To: solr-user@lucene.apache.org 
Subject: Re: FAST-like document vector data structures in Solr?

Hello Jim,
  yes, I am aware of the TermVector and MoreLikeThis stuff. I am presently 
mapping docvectors to these mechanisms and create term vectors myself from 
third-party text mining components.

However, it's not quite like the FAST docvectors. Particularly, the 
performance of MoreLikeThis queries based on TermVectors is suboptimal on large 
document sets, so more efficient support for such retrievals in the Lucene 
kernel would be preferred.

Cheers,
--Jürgen

On 05.09.2014 10:55, jim ferenczi wrote:

Hi,
Something like this?:
https://cwiki.apache.org/confluence/display/solr/The+Term+Vector+Component
And just to show some impressive search functionality of the wiki: ;)
https://cwiki.apache.org/confluence/dosearchsite.action?where=solr&spaceSearch=true&queryString=document+vectors

Cheers,
Jim


2014-09-05 9:44 GMT+02:00 "Jürgen Wagner (DVT)" 

Re: FAST-like document vector data structures in Solr?

2014-09-05 Thread Mikhail Khludnev
Jürgen,

I can't get it. Can you tell more about this feature or point to the doc?
Thanks


On Fri, Sep 5, 2014 at 11:44 AM, "Jürgen Wagner (DVT)" <
juergen.wag...@devoteam.com> wrote:

> Hello all,
>   as the migration from FAST to Solr is a relevant topic for several of
> our customers, there is one issue that does not seem to be addressed by
> Lucene/Solr: document vectors FAST-style. These document vectors are
> used to form metrics of similarity, i.e., they may be used as a
> "semantic fingerprint" of documents to define similarity relations. I
> can think of several ways of approximating a mapping of this mechanism
> to Solr, but there are always drawbacks - mostly performance-wise.
>
> Has anybody else encountered and possibly approached this challenge so far?
>
> Is there anything in the roadmap of Solr that has not revealed itself to
> me, addressing this issue?
>
> Your input is greatly appreciated!
>
> Cheers,
> --Jürgen
>
>


-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics





How to implement multilingual word components fields schema?

2014-09-05 Thread Ilia Sretenskii
Hello.
We have documents with multilingual words which consist of parts in different
languages, and search queries of the same complexity. It is an online
application used worldwide, so users generate content in all the possible
world languages.

For example:
言語-aware
Løgismose-alike
ຄໍາຮ້ອງສະຫມັກ-dependent

So I guess our schema requires a single field with universal analyzers.

Luckily, there exist ICUTokenizer and ICUFoldingFilter for that.
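So far the field type would look roughly like this sketch (the field type name
is made up; the ICU factories require the analysis-extras contrib jars):

<fieldType name="text_intl" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
  </analyzer>
</fieldType>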

But then it requires stemming and lemmatization.

How to implement a schema with universal stemming/lemmatization which would
probably utilize the ICU generated token script attribute?
http://lucene.apache.org/core/4_10_0/analyzers-icu/org/apache/lucene/analysis/icu/tokenattributes/ScriptAttribute.html

By the way, I have already examined the Basistech schema of their
commercial plugins and it defines tokenizer/filter language per field type,
which is not a universal solution for such complex multilingual texts.

Please advise how to address this task.

Sincerely, Ilia Sretenskii.


Re: Solr add document over 20 times slower after upgrade from 4.0 to 4.9

2014-09-05 Thread Alexandre Rafalovitch
On Fri, Sep 5, 2014 at 9:55 AM, Mikhail Khludnev
 wrote:
>> Why do one big commit? You could do hard commits along the way, keep the
>> existing searcher open, and not see the changes until the end.
>>
>
> Alexandre,
>> I don't think that can happen in Solr: the next search picks up the
>> new searcher.

Why not? Isn't that what the Solr example configuration is doing at:
https://github.com/apache/lucene-solr/blob/lucene_solr_4_10_0/solr/example/solr/collection1/conf/solrconfig.xml#L386
?
Hard commit does not reopen the searcher. The soft commit does
(further down), but that can be disabled to get the effect I am
proposing.
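Concretely, the pattern I mean looks roughly like this (numbers from the
example config, tune as needed):

<autoCommit>
  <maxTime>15000</maxTime>
  <openSearcher>false</openSearcher>
</autoCommit>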

What am I missing?

Regards,
   Alex.

Personal: http://www.outerthoughts.com/ and @arafalov
Solr resources and newsletter: http://www.solr-start.com/ and @solrstart
Solr popularizers community: https://www.linkedin.com/groups?gid=6713853


Re: How to implement multilingual word components fields schema?

2014-09-05 Thread Jack Krupansky
It comes down to how you personally want to value compromises between 
conflicting requirements, such as relative weighting of false positives and 
false negatives. Provide a few use cases that illustrate the boundary cases 
that you care most about. For example field values that have snippets in one 
language embedded within larger values in a different language. And, whether 
your fields are always long or sometimes short - the former can work well 
for language detection, but not the latter, unless all fields of a given 
document are always in the same language.


Otherwise simply index the same source text in multiple fields, one for each 
language. You can then do a dismax query on that set of fields.
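As a rough sketch of that second approach, with made-up field names (the 
per-language types here come from the stock example schema):

<field name="text_en" type="text_en" indexed="true" stored="false"/>
<field name="text_ja" type="text_cjk" indexed="true" stored="false"/>
<copyField source="text" dest="text_en"/>
<copyField source="text" dest="text_ja"/>

and then query with q=...&defType=edismax&qf=text_en+text_ja.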


-- Jack Krupansky

-Original Message- 
From: Ilia Sretenskii

Sent: Friday, September 5, 2014 10:06 AM
To: solr-user@lucene.apache.org
Subject: How to implement multilingual word components fields schema?

Hello.
We have documents with multilingual words which consist of parts in different
languages, and search queries of the same complexity. It is an online
application used worldwide, so users generate content in all the possible
world languages.

For example:
言語-aware
Løgismose-alike
ຄໍາຮ້ອງສະຫມັກ-dependent

So I guess our schema requires a single field with universal analyzers.

Luckily, there exist ICUTokenizer and ICUFoldingFilter for that.

But then it requires stemming and lemmatization.

How to implement a schema with universal stemming/lemmatization which would
probably utilize the ICU generated token script attribute?
http://lucene.apache.org/core/4_10_0/analyzers-icu/org/apache/lucene/analysis/icu/tokenattributes/ScriptAttribute.html

By the way, I have already examined the Basistech schema of their
commercial plugins and it defines tokenizer/filter language per field type,
which is not a universal solution for such complex multilingual texts.

Please advise how to address this task.

Sincerely, Ilia Sretenskii. 



Is there any sentence tokenizers in sold 4.9.0?

2014-09-05 Thread Sandeep B A
Hi,

I was looking for a default sentence tokenizer in Solr but could not find one.
Has anyone used one, or integrated tokenizers from another language ecosystem
(e.g. Python) into Solr? Please let me know.


Thanks and regards,
Sandeep


Re: FAST-like document vector data structures in Solr?

2014-09-05 Thread Jürgen Wagner (DVT)
Thanks for posting this. I was just about to send off a message of
similar content :-)

Important to add:

- In FAST ESP, you could have more than one such docvector associated
with a document, in order to reflect different metrics.

- Term weights in docvectors are document-relative, not absolute.

- Processing is done in the search processor (close to the index), not
in the QR server (providing transformations on the result list).

This docvector could be used for unsupervised clustering,
related-to/similarity search, tag clouds or more weird stuff like
identifying experts on topics contained in a particular document.

With Solr, it seems I have to handcraft the term vectors to reflect the
right weights, to approximate the effect of FAST docvectors, e.g., by
normalizing them to [0...1). Processing performance would still be
different from the classical FAST docvectors. The space consumption may
become ugly for a shard in the 200+ GB range; however, FAST has also been
quite generous with disk space anyway.

So, the interesting question is whether there is a more canonical way of
handling this in Solr/Lucene, or if something like it is planned for 5.0+.
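One approximation I can sketch (an assumption-laden sketch, not a FAST
replacement) is to carry the weights as payloads:

<fieldType name="docvector" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.DelimitedPayloadTokenFilterFactory"
            encoder="float" delimiter="|"/>
  </analyzer>
</fieldType>

indexing values like "airline|0.92 crash|0.85". The catch is that scoring on
those payloads still needs custom query/similarity code on the Lucene side
(e.g. PayloadTermQuery), which is exactly the part FAST provided out of the box.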

Best regards,
--Jürgen

On 05.09.2014 16:02, Jack Krupansky wrote:
> For reference:
>
> “Item Similarity Vector Reference
>
> This property represents a similarity reference when searching for similar 
> items. This is a similarity vector representation that is returned for each 
> item in the query result in the docvector managed property.
>
> The value is a string formatted according to the following format:
>
> [string1,weight1][string2,weight2]...[stringN,weightN]
>
> When performing a find similar query, the SimilarTo element should contain a 
> string parameter with the value of the docvector managed property of the item 
> that is to be used as the similarity reference. The similarity vector 
> consists of a set of "term,weight" expressions, indicating the most important 
> terms or concepts in the item and the corresponding perceived importance 
> (weight). Terms can be single words or phrases.
>
> The weight is a float value between 0 and 1, where 1 indicates the highest 
> relevance.
>
> The similarity vector is created during item processing and indicates the 
> most important terms or concepts in the item and the corresponding weight.”
>
> See:
> http://msdn.microsoft.com/en-us/library/office/ff521597(v=office.14).aspx
>
> -- Jack Krupansky



Re: FAST-like document vector data structures in Solr?

2014-09-05 Thread Jack Krupansky
Sounds like a great feature to add to Solr, especially if it would facilitate 
more automatic relevancy enhancement. LucidWorks Search has a feature called 
"unsupervised feedback" that does that, but something like a docvector might 
make it a more realistic default.


-- Jack Krupansky

-Original Message- 
From: "Jürgen Wagner (DVT)"

Sent: Friday, September 5, 2014 10:29 AM
To: solr-user@lucene.apache.org
Subject: Re: FAST-like document vector data structures in Solr?

Thanks for posting this. I was just about to send off a message of
similar content :-)

Important to add:

- In FAST ESP, you could have more than one such docvector associated
with a document, in order to reflect different metrics.

- Term weights in docvectors are document-relative, not absolute.

- Processing is done in the search processor (close to the index), not
in the QR server (providing transformations on the result list).

This docvector could be used for unsupervised clustering,
related-to/similarity search, tag clouds or more weird stuff like
identifying experts on topics contained in a particular document.

With Solr, it seems I have to handcraft the term vectors to reflect the
right weights, to approximate the effect of FAST docvectors, e.g., by
normalizing them to [0...1). Processing performance would still be
different from the classical FAST docvectors. The space consumption may
become ugly for a shard in the 200+ GB range; however, FAST has also been
quite generous with disk space anyway.

So, the interesting question is whether there is a more canonical way of
handling this in Solr/Lucene, or if something like it is planned for 5.0+.

Best regards,
--Jürgen

On 05.09.2014 16:02, Jack Krupansky wrote:

For reference:

“Item Similarity Vector Reference

This property represents a similarity reference when searching for similar 
items. This is a similarity vector representation that is returned for 
each item in the query result in the docvector managed property.


The value is a string formatted according to the following format:

[string1,weight1][string2,weight2]...[stringN,weightN]

When performing a find similar query, the SimilarTo element should contain 
a string parameter with the value of the docvector managed property of the 
item that is to be used as the similarity reference. The similarity vector 
consists of a set of "term,weight" expressions, indicating the most 
important terms or concepts in the item and the corresponding perceived 
importance (weight). Terms can be single words or phrases.


The weight is a float value between 0 and 1, where 1 indicates the highest 
relevance.


The similarity vector is created during item processing and indicates the 
most important terms or concepts in the item and the corresponding 
 weight.”


See:
http://msdn.microsoft.com/en-us/library/office/ff521597(v=office.14).aspx

-- Jack Krupansky




Re: Is there any sentence tokenizers in sold 4.9.0?

2014-09-05 Thread Sandeep B A
Sorry for the typo: it is Solr 4.9.0, not "sold 4.9.0".
 On Sep 5, 2014 7:48 PM, "Sandeep B A"  wrote:

> Hi,
>
> I was looking out the options for sentence tokenizers default in solr but
> could not find it. Does any one used? Integrated from any other language
> tokenizers to solr. Example python etc.. Please let me know.
>
>
> Thanks and regards,
> Sandeep
>


Re: Query ReRanking question

2014-09-05 Thread Ravi Solr
Thank you very much for responding. I want to do exactly the opposite of
what you said. I want to sort the relevant docs in reverse chronology. If
you sort by date beforehand then the relevancy is lost. So I want to get
Top N relevant results and then rerank those Top N to achieve relevant
reverse chronological results.

If you ask Why would I want to do that ??

Let's take an example: the Malaysian airline crash. Several articles might
have been published over a period of time. When I search for "malaysia
airline crash blackbox", I would want to see "relevant" results but would
also like to see the recent developments at the top, i.e. effectively a
reverse chronological order within the relevant results, like telling a
story over a period of time.

Hope I am clear. Thanks for your help.

Thanks

Ravi Kiran Bhaskar


On Thu, Sep 4, 2014 at 5:08 PM, Joel Bernstein  wrote:

> If you want the main query to be sorted by date then the top N docs
> reranked by a query, that should work. Try something like this:
>
> q=foo&sort=date+desc&rq={!rerank reRankDocs=1000
> reRankQuery=$myquery}&myquery=blah
>
>
> Joel Bernstein
> Search Engineer at Heliosearch
>
>
> On Thu, Sep 4, 2014 at 4:25 PM, Ravi Solr  wrote:
>
> > Can the ReRanking API be used to sort within docs retrieved by a date
> field
> > ? Can somebody help me understand how to write such a query ?
> >
> > Thanks
> >
> > Ravi Kiran Bhaskar
> >
>


RE: Query ReRanking question

2014-09-05 Thread Markus Jelsma
Hi - You can already achieve this by boosting on the document's recency. The 
result set won't be exactly ordered by date but you will get the most relevant 
and recent documents on top.

Markus 

-Original message-
> From: Ravi Solr <ravis...@gmail.com>
> Sent: Friday 5th September 2014 18:06
> To: solr-user@lucene.apache.org  
> Subject: Re: Query ReRanking question
> 
> Thank you very much for responding. I want to do exactly the opposite of
> what you said. I want to sort the relevant docs in reverse chronology. If
> you sort by date beforehand then the relevancy is lost. So I want to get
> Top N relevant results and then rerank those Top N to achieve relevant
> reverse chronological results.
> 
> If you ask Why would I want to do that ??
> 
> Let's take an example: the Malaysian airline crash. Several articles might
> have been published over a period of time. When I search for "malaysia
> airline crash blackbox", I would want to see "relevant" results but would
> also like to see the recent developments at the top, i.e. effectively a
> reverse chronological order within the relevant results, like telling a
> story over a period of time.
> 
> Hope I am clear. Thanks for your help.
> 
> Thanks
> 
> Ravi Kiran Bhaskar
> 
> 
> On Thu, Sep 4, 2014 at 5:08 PM, Joel Bernstein wrote:
> 
> > If you want the main query to be sorted by date then the top N docs
> > reranked by a query, that should work. Try something like this:
> >
> > q=foo&sort=date+desc&rq={!rerank reRankDocs=1000
> > reRankQuery=$myquery}&myquery=blah
> >
> >
> > Joel Bernstein
> > Search Engineer at Heliosearch
> >
> >
> > > On Thu, Sep 4, 2014 at 4:25 PM, Ravi Solr wrote:
> >
> > > Can the ReRanking API be used to sort within docs retrieved by a date
> > field
> > > ? Can somebody help me understand how to write such a query ?
> > >
> > > Thanks
> > >
> > > Ravi Kiran Bhaskar
> > >
> >
> 



Re: Solr add document over 20 times slower after upgrade from 4.0 to 4.9

2014-09-05 Thread Erick Erickson
Alexandre:

It Depends (tm) of course. It all hinges on the setting in <autoCommit>,
whether <openSearcher> is true or false.

In the former case, you, well, open a new searcher. In the latter you don't.

I agree, though, this is all tangential to the memory consumption issue since
the RAM buffer will be flushed regardless of these settings.

FWIW,
Erick

On Fri, Sep 5, 2014 at 7:11 AM, Alexandre Rafalovitch
 wrote:
> On Fri, Sep 5, 2014 at 9:55 AM, Mikhail Khludnev
>  wrote:
>>> Why do one big commit? You could do hard commits along the way, keep the
>>> existing searcher open, and not see the changes until the end.
>>>
>>
>> Alexandre,
>> I don't think that can happen in Solr: the next search picks up the
>> new searcher.
>
> Why not? Isn't that what the Solr example configuration doing at:
> https://github.com/apache/lucene-solr/blob/lucene_solr_4_10_0/solr/example/solr/collection1/conf/solrconfig.xml#L386
> ?
> Hard commit does not reopen the searcher. The soft commit does
> (further down), but that can be disabled to get the effect I am
> proposing.
>
> What am I missing?
>
> Regards,
>Alex.
>
> Personal: http://www.outerthoughts.com/ and @arafalov
> Solr resources and newsletter: http://www.solr-start.com/ and @solrstart
> Solr popularizers community: https://www.linkedin.com/groups?gid=6713853


Re: Query ReRanking question

2014-09-05 Thread Erick Erickson
OK, why can't you switch the clauses from Joel's suggestion?

Something like:
q=Malaysia plane crash&rq={!rerank reRankDocs=1000
reRankQuery=$myquery}&myquery=*:*&sort=date+desc

(haven't tried this yet, but you get the idea).

Best,
Erick

On Fri, Sep 5, 2014 at 9:33 AM, Markus Jelsma
 wrote:
> Hi - You can already achieve this by boosting on the document's recency. The 
> result set won't be exactly ordered by date but you will get the most 
> relevant and recent documents on top.
>
> Markus
>
> -Original message-
>> From: Ravi Solr <ravis...@gmail.com>
>> Sent: Friday 5th September 2014 18:06
>> To: solr-user@lucene.apache.org 
>> Subject: Re: Query ReRanking question
>>
>> Thank you very much for responding. I want to do exactly the opposite of
>> what you said. I want to sort the relevant docs in reverse chronology. If
>> you sort by date beforehand then the relevancy is lost. So I want to get
>> Top N relevant results and then rerank those Top N to achieve relevant
>> reverse chronological results.
>>
>> If you ask Why would I want to do that ??
>>
>> Let's take an example: the Malaysian airline crash. Several articles might
>> have been published over a period of time. When I search for "malaysia
>> airline crash blackbox", I would want to see "relevant" results but would
>> also like to see the recent developments at the top, i.e. effectively a
>> reverse chronological order within the relevant results, like telling a
>> story over a period of time.
>>
>> Hope I am clear. Thanks for your help.
>>
>> Thanks
>>
>> Ravi Kiran Bhaskar
>>
>>
>> On Thu, Sep 4, 2014 at 5:08 PM, Joel Bernstein wrote:
>>
>> > If you want the main query to be sorted by date then the top N docs
>> > reranked by a query, that should work. Try something like this:
>> >
>> > q=foo&sort=date+desc&rq={!rerank reRankDocs=1000
>> > reRankQuery=$myquery}&myquery=blah
>> >
>> >
>> > Joel Bernstein
>> > Search Engineer at Heliosearch
>> >
>> >
>> > On Thu, Sep 4, 2014 at 4:25 PM, Ravi Solr wrote:
>> >
>> > > Can the ReRanking API be used to sort within docs retrieved by a date
>> > field
>> > > ? Can somebody help me understand how to write such a query ?
>> > >
>> > > Thanks
>> > >
>> > > Ravi Kiran Bhaskar
>> > >
>> >
>>
>


Re: Query ReRanking question

2014-09-05 Thread Walter Underwood
Boosting on recency is probably a better approach. A fixed re-ranking horizon 
will always be a compromise, a guess at the precision of the query. It will 
give poor results for queries that are more or less specific than the 
assumption.

Think of the recency boost as a tie-breaker. When documents are similar in 
relevance, show the most recent. This can work over a wide range of queries.

For “malaysian airlines crash”, there are two sets of relevant documents, one 
set on MH 370 starting six months ago, and one set on MH 17, two months ago. 
But four hours ago, The Guardian published a “six months on” article on MH 370. 
A recency boost will handle that complexity.
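For reference, the usual way to express that tie-breaker with (e)dismax is a 
reciprocal-of-age boost, e.g. (field name "date" assumed):

boost=recip(ms(NOW/HOUR,date),3.16e-11,1,1)

where 3.16e-11 is roughly 1/(milliseconds in a year), so a year-old document 
gets about half the multiplier of a fresh one.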

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/


On Sep 5, 2014, at 10:23 AM, Erick Erickson  wrote:

> OK, why can't you switch the clauses from Joel's suggestion?
> 
> Something like:
> q=Malaysia plane crash&rq={!rerank reRankDocs=1000
> reRankQuery=$myquery}&myquery=*:*&sort=date+desc
> 
> (haven't tried this yet, but you get the idea).
> 
> Best,
> Erick
> 
> On Fri, Sep 5, 2014 at 9:33 AM, Markus Jelsma
>  wrote:
>> Hi - You can already achieve this by boosting on the document's recency. The 
>> result set won't be exactly ordered by date but you will get the most 
>> relevant and recent documents on top.
>> 
>> Markus
>> 
>> -Original message-
>>> From: Ravi Solr <ravis...@gmail.com>
>>> Sent: Friday 5th September 2014 18:06
>>> To: solr-user@lucene.apache.org 
>>> Subject: Re: Query ReRanking question
>>> 
>>> Thank you very much for responding. I want to do exactly the opposite of
>>> what you said. I want to sort the relevant docs in reverse chronology. If
>>> you sort by date beforehand then the relevancy is lost. So I want to get
>>> Top N relevant results and then rerank those Top N to achieve relevant
>>> reverse chronological results.
>>> 
>>> If you ask Why would I want to do that ??
>>> 
>>> Let's take an example: the Malaysian airline crash. Several articles might
>>> have been published over a period of time. When I search for "malaysia
>>> airline crash blackbox", I would want to see "relevant" results but would
>>> also like to see the recent developments at the top, i.e. effectively a
>>> reverse chronological order within the relevant results, like telling a
>>> story over a period of time.
>>> 
>>> Hope I am clear. Thanks for your help.
>>> 
>>> Thanks
>>> 
>>> Ravi Kiran Bhaskar
>>> 
>>> 
>>> On Thu, Sep 4, 2014 at 5:08 PM, Joel Bernstein wrote:
>>> 
>>>> If you want the main query to be sorted by date then the top N docs
>>>> reranked by a query, that should work. Try something like this:
>>>>
>>>> q=foo&sort=date+desc&rq={!rerank reRankDocs=1000
>>>> reRankQuery=$myquery}&myquery=blah
>>>>
>>>> Joel Bernstein
>>>> Search Engineer at Heliosearch
>>>>
>>>> On Thu, Sep 4, 2014 at 4:25 PM, Ravi Solr wrote:
>>>>
>>>>> Can the ReRanking API be used to sort within docs retrieved by a date
>>>>> field? Can somebody help me understand how to write such a query?
>>>>>
>>>>> Thanks
>>>>>
>>>>> Ravi Kiran Bhaskar
>>>>
>>> 
>> 



Re: Edismax mm and efficiency

2014-09-05 Thread Walter Underwood
Great!

We have some very long queries, where students paste entire homework problems. 
One of them was 1051 words. Many of them are over 100 words. This could help.

In the Jira discussion, I saw some comments about handling the most sparse 
lists first. We did something like that in the Infoseek Ultra engine about 
twenty years ago. Short termlists (documents matching a term) were processed 
first, which kept the in-memory lists of matching docs small. It also allowed 
early short-circuiting for no-hits queries.

What would be a high mm value, 75%?
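(For reference, mm also accepts conditional specs such as mm=3<90%, meaning 
require all terms for queries of up to 3 clauses and 90% of them above that, 
which may be the practical answer when queries range from 2 to 1000 words.)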

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/


On Sep 4, 2014, at 11:52 PM, Mikhail Khludnev  
wrote:

> indeed https://issues.apache.org/jira/browse/LUCENE-4571
> my feeling is that it gives a significant gain at high mm values.
> 
> 
> 
> On Fri, Sep 5, 2014 at 3:01 AM, Walter Underwood 
> wrote:
> 
>> Are there any speed advantages to using “mm”? I can imagine pruning the
>> set of matching documents early, which could help, but is that (or
>> something else) done?
>> 
>> wunder
>> Walter Underwood
>> wun...@wunderwood.org
>> http://observer.wunderwood.org/
>> 
>> 
>> 
> 
> 
> -- 
> Sincerely yours
> Mikhail Khludnev
> Principal Engineer,
> Grid Dynamics
> 
> 
> 



Re: SolrJ 4.10.0 errors

2014-09-05 Thread Shawn Heisey
On 9/5/2014 3:50 AM, Guido Medina wrote:
> Sorry, I didn't give enough information, so I'm adding to it. The SolrJ
> client is in our webapp and the documents are getting indexed properly
> into Solr. The only problem we are seeing is that with SolrJ 4.10, once
> the Solr server response comes back, the SolrJ client doesn't seem to know
> what to do with it and reports the exception I mentioned. I then
> downgraded the SolrJ client to 4.9 and the exception is now gone. I'm
> using the following relevant libraries:
> 
> Java 7u67 64-bit on both the webapp client side and Jetty's side
> HTTP client/mime 4.3.5
> HTTP core 4.3.2
> 
> Here is a list of my modified Solr war lib folder; I usually don't stay
> with the standard jars because I believe most of them are out of date if
> you are running JDK 7u55+:

You're in uncharted territory if you're going to modify the jars
included with Solr itself.  We do upgrade these from time to time, and
usually it's completely harmless, but we also run all the tests when we
do it, to make sure that nothing will get broken.  Some of the
components are on specific versions because upgrading them isn't as
simple as just changing the jar.

What happens if you return Solr to what's in the release war?

Thanks,
Shawn



RE: How to implement multilingual word components fields schema?

2014-09-05 Thread Susheel Kumar
I agree with the approach Jack suggested: use the same source text in multiple 
fields, one per language, and then do a dismax query. Would love to hear if it 
works for you.

Thanks,
Susheel

-Original Message-
From: Jack Krupansky [mailto:j...@basetechnology.com]
Sent: Friday, September 05, 2014 10:21 AM
To: solr-user@lucene.apache.org
Subject: Re: How to implement multilingual word components fields schema?

It comes down to how you personally want to value compromises between 
conflicting requirements, such as relative weighting of false positives and 
false negatives. Provide a few use cases that illustrate the boundary cases 
that you care most about. For example field values that have snippets in one 
language embedded within larger values in a different language. And, whether 
your fields are always long or sometimes short - the former can work well for 
language detection, but not the latter, unless all fields of a given document 
are always in the same language.

Otherwise simply index the same source text in multiple fields, one for each 
language. You can then do a dismax query on that set of fields.

-- Jack Krupansky

-Original Message-
From: Ilia Sretenskii
Sent: Friday, September 5, 2014 10:06 AM
To: solr-user@lucene.apache.org
Subject: How to implement multilingual word components fields schema?

Hello.
We have documents with multilingual words which consist of parts in different 
languages, and search queries of the same complexity. It is an online 
application used worldwide, so users generate content in all the possible 
world languages.

For example:
言語-aware
Løgismose-alike
ຄໍາຮ້ອງສະຫມັກ-dependent

So I guess our schema requires a single field with universal analyzers.

Luckily, there exist ICUTokenizer and ICUFoldingFilter for that.

But then it requires stemming and lemmatization.

How to implement a schema with universal stemming/lemmatization which would 
probably utilize the ICU generated token script attribute?
http://lucene.apache.org/core/4_10_0/analyzers-icu/org/apache/lucene/analysis/icu/tokenattributes/ScriptAttribute.html

By the way, I have already examined the Basistech schema of their commercial 
plugins and it defines tokenizer/filter language per field type, which is not a 
universal solution for such complex multilingual texts.

Please advise how to address this task.

Sincerely, Ilia Sretenskii.



Re: How to implement multilingual word components fields schema?

2014-09-05 Thread Tom Burton-West
Hi Ilia,

I don't know if it would be helpful but below I've listed  some academic
papers on this issue of how best to deal with mixed language/mixed script
queries and documents.  They are probably taking a more complex approach
than you will want to use, but perhaps they will help to think about the
various ways of approaching the problem.

We haven't tackled this problem yet. We have over 200 languages.  Currently
we are using the ICUTokenizer and ICUFolding filter but don't do any
stemming due to a concern with overstemming (we have very high recall, so
don't want to hurt precision by stemming)  and the difficulty of correct
language identification of short queries.

If some of your scripts map to only one language, however, you might be able
to do much more. I'm not sure if I'm remembering correctly, but I believe some
of the stemmers, such as the Greek stemmer, will pass through any strings that
don't contain characters in the Greek script. So it might be possible to at
least do stemming on some of your languages/scripts.
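
As a rough sketch of that idea (untested, and assuming the Greek stemmer really
does pass non-Greek tokens through unchanged; the field type name is made up):

<fieldType name="text_icu_el" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- script-aware segmentation of mixed-script text -->
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <!-- case folding and diacritic normalization -->
    <filter class="solr.ICUFoldingFilterFactory"/>
    <!-- stems Greek tokens; tokens in other scripts should be left alone -->
    <filter class="solr.GreekStemFilterFactory"/>
  </analyzer>
</fieldType>

How the ICU folding interacts with Greek diacritics before stemming is worth
verifying on the analysis page before relying on it.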

 I'll be very interested to learn what approach you end up using.

Tom

--

Some papers:

Mohammed Mustafa, Izzedin Osman, and Hussein Suleman. 2011. Indexing and
weighting of multilingual and mixed documents. In *Proceedings of the South
African Institute of Computer Scientists and Information Technologists
Conference on Knowledge, Innovation and Leadership in a Diverse,
Multidisciplinary Environment* (SAICSIT '11). ACM, New York, NY, USA,
161-170. DOI=10.1145/2072221.2072240
http://doi.acm.org/10.1145/2072221.2072240

That paper and some others are here:
http://www.husseinsspace.com/research/students/mohammedmustafaali.html

There is also some code from this article:

Parth Gupta, Kalika Bali, Rafael E. Banchs, Monojit Choudhury, and Paolo
Rosso. 2014. Query expansion for mixed-script information retrieval.
In *Proceedings
of the 37th international ACM SIGIR conference on Research & development in
information retrieval* (SIGIR '14). ACM, New York, NY, USA, 677-686.
DOI=10.1145/2600428.2609622 http://doi.acm.org/10.1145/2600428.2609622

Code:
http://users.dsic.upv.es/~pgupta/mixed-script-ir.html

Tom Burton-West
Information Retrieval Programmer
Digital Library Production Service
University of Michigan Library
tburt...@umich.edu
http://www.hathitrust.org/blogs/large-scale-search


On Fri, Sep 5, 2014 at 10:06 AM, Ilia Sretenskii 
wrote:

> Hello.
> We have documents with multilingual words whose parts come from different
> languages, and search queries of the same complexity. It is a worldwide
> online application, so users generate content in all possible world
> languages.
>
> For example:
> 言語-aware
> Løgismose-alike
> ຄໍາຮ້ອງສະຫມັກ-dependent
>
> So I guess our schema requires a single field with universal analyzers.
>
> Luckily, there exist ICUTokenizer and ICUFoldingFilter for that.
>
> But then it requires stemming and lemmatization.
>
> How to implement a schema with universal stemming/lemmatization which would
> probably utilize the ICU generated token script attribute?
>
> http://lucene.apache.org/core/4_10_0/analyzers-icu/org/apache/lucene/analysis/icu/tokenattributes/ScriptAttribute.html
>
> By the way, I have already examined the Basistech schema of their
> commercial plugins and it defines tokenizer/filter language per field type,
> which is not a universal solution for such complex multilingual texts.
>
> Please advise how to address this task.
>
> Sincerely, Ilia Sretenskii.
>


RE: Is there any sentence tokenizers in sold 4.9.0?

2014-09-05 Thread Susheel Kumar
There is SmartChineseSentenceTokenizerFactory (SentenceTokenizer), which is 
being deprecated and replaced with HMMChineseTokenizer. I am not aware of other 
sentence tokenizers, but you could either build your own, similar to 
SentenceTokenizer, or employ an external sentence detector/recognizer and build 
a Solr tokenizer on top of it.

I don't know how complex your use case is, but I would suggest looking at 
SentenceTokenizer and creating a similar tokenizer.
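
If you do go the external-detection route, the JDK's BreakIterator is a cheap
way to prototype the sentence-splitting step before writing the actual
Tokenizer (a standalone sketch only, not a Solr class):

import java.text.BreakIterator;
import java.util.Locale;

public class SentenceSplitDemo {
    public static void main(String[] args) {
        String text = "Solr indexes documents. Each sentence could become one token!";
        // Locale-aware sentence boundary detection built into the JDK.
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        it.setText(text);
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            System.out.println(text.substring(start, end).trim());
        }
    }
}

A real tokenizer would run the same boundary scan inside incrementToken() and
emit one token per sentence with matching offsets.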

Thanks,
Susheel

-Original Message-
From: Sandeep B A [mailto:belgavi.sand...@gmail.com]
Sent: Friday, September 05, 2014 10:40 AM
To: solr-user@lucene.apache.org
Subject: Re: Is there any sentence tokenizers in sold 4.9.0?

Sorry for the typo, it is Solr 4.9.0 instead of "sold 4.9.0". On Sep 5, 2014 7:48 PM, 
"Sandeep B A"  wrote:

> Hi,
>
> I was looking at the options for a default sentence tokenizer in Solr
> but could not find one. Has anyone used one, or integrated a sentence
> tokenizer from another ecosystem (Python, for example) into Solr? Please let me know.
>
>
> Thanks and regards,
> Sandeep
>


Re: Query ReRanking question

2014-09-05 Thread Ravi Solr
Erick, I believe when you apply sort this way it runs the query and sort
first and then tries to rerank... so basically it has already lost the true
relevancy, because the sort takes precedence. Am I making sense?

Ravi Kiran Bhaskar


On Fri, Sep 5, 2014 at 1:23 PM, Erick Erickson 
wrote:

> OK, why can't you switch the clauses from Joel's suggestion?
>
> Something like:
> q=Malaysia plane crash&rq={!rerank reRankDocs=1000
> reRankQuery=$myquery}&myquery=*:*&sort=date+desc
>
> (haven't tried this yet, but you get the idea).
>
> Best,
> Erick
>
> On Fri, Sep 5, 2014 at 9:33 AM, Markus Jelsma
>  wrote:
> > Hi - You can already achieve this by boosting on the document's recency.
> The result set won't be exactly ordered by date but you will get the most
> relevant and recent documents on top.
> >
> > Markus
> >
> > -Original message-
> >> From: Ravi Solr <ravis...@gmail.com>
> >> Sent: Friday 5th September 2014 18:06
> >> To: solr-user@lucene.apache.org 
> >> Subject: Re: Query ReRanking question
> >>
> >> Thank you very much for responding. I want to do exactly the opposite of
> >> what you said. I want to sort the relevant docs in reverse chronology.
> If
> >> you sort by date beforehand then the relevancy is lost. So I want to
> get
> >> Top N relevant results and then rerank those Top N to achieve relevant
> >> reverse chronological results.
> >>
> >> If you ask Why would I want to do that ??
> >>
> >> Let's take an example: the Malaysian airline crash. Several articles
> might
> >> have been published over a period of time. When I search for - malaysia
> >> airline crash blackbox - I would want to see "relevant" results but
> would
> >> also like to see the recent developments on the top i.e.
> effectively a
> >> reverse chronological order within the relevant results, like telling a
> >> story over a period of time
> >>
> >> Hope I am clear. Thanks for your help.
> >>
> >> Thanks
> >>
> >> Ravi Kiran Bhaskar
> >>
> >>
> >> On Thu, Sep 4, 2014 at 5:08 PM, Joel Bernstein wrote:
> >>
> >> > If you want the main query to be sorted by date then the top N docs
> >> > reranked by a query, that should work. Try something like this:
> >> >
> >> > q=foo&sort=date+desc&rq={!rerank reRankDocs=1000
> >> > reRankQuery=$myquery}&myquery=blah
> >> >
> >> >
> >> > Joel Bernstein
> >> > Search Engineer at Heliosearch
> >> >
> >> >
> >> > On Thu, Sep 4, 2014 at 4:25 PM, Ravi Solr wrote:
> >> >
> >> > > Can the ReRanking API be used to sort within docs retrieved by a
> date
> >> > field
> >> > > ? Can somebody help me understand how to write such a query ?
> >> > >
> >> > > Thanks
> >> > >
> >> > > Ravi Kiran Bhaskar
> >> > >
> >> >
> >>
> >
>


Re: Query ReRanking question

2014-09-05 Thread Ravi Solr
Walter, thank you for the valuable insight. The problem I am facing is that
between the term frequencies, mm, date boost, and stemming, the results can
become very inconsistent... Look at the following examples.

Here the chronology is all over the place because of what I mentioned above
http://www.washingtonpost.com/pb/newssearch/?query=malaysian+airline+crash

Now take the instance of an old topic/news story which was covered a while ago
for a period of time but not actively updated recently... In this case, the
date boosting predominantly takes over because of common terms and we get a
rash of irrelevant content:

http://www.washingtonpost.com/pb/newssearch/?query=faces+of+the+fallen

This has become such a balancing act and hence I was looking to see if
reRanking might help

Thanks

Ravi Kiran Bhaskar





On Fri, Sep 5, 2014 at 1:32 PM, Walter Underwood 
wrote:

> Boosting on recency is probably a better approach. A fixed re-ranking
> horizon will always be a compromise, a guess at the precision of the query.
> It will give poor results for queries that are more or less specific than
> the assumption.
>
> Think of the recency boost as a tie-breaker. When documents are similar in
> relevance, show the most recent. This can work over a wide range of queries.
>
> For “malaysian airlines crash”, there are two sets of relevant documents,
> one set on MH 370 starting six months ago, and one set on MH 17, two months
> ago. But four hours ago, The Guardian published a “six months on” article
> on MH 370. A recency boost will handle that complexity.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/
>
>
> On Sep 5, 2014, at 10:23 AM, Erick Erickson 
> wrote:
>
> > OK, why can't you switch the clauses from Joel's suggestion?
> >
> > Something like:
> > q=Malaysia plane crash&rq={!rerank reRankDocs=1000
> > reRankQuery=$myquery}&myquery=*:*&sort=date+desc
> >
> > (haven't tried this yet, but you get the idea).
> >
> > Best,
> > Erick
> >
> > On Fri, Sep 5, 2014 at 9:33 AM, Markus Jelsma
> >  wrote:
> >> Hi - You can already achieve this by boosting on the document's
> recency. The result set won't be exactly ordered by date but you will get
> the most relevant and recent documents on top.
> >>
> >> Markus
> >>
> >> -Original message-
> >>> From: Ravi Solr <ravis...@gmail.com>
> >>> Sent: Friday 5th September 2014 18:06
> >>> To: solr-user@lucene.apache.org 
> >>> Subject: Re: Query ReRanking question
> >>>
> >>> Thank you very much for responding. I want to do exactly the opposite
> of
> >>> what you said. I want to sort the relevant docs in reverse chronology.
> If
> >>> you sort by date beforehand then the relevancy is lost. So I want to
> get
> >>> Top N relevant results and then rerank those Top N to achieve relevant
> >>> reverse chronological results.
> >>>
> >>> If you ask Why would I want to do that ??
> >>>
> >>> Let's take an example: the Malaysian airline crash. Several articles
> might
> >>> have been published over a period of time. When I search for - malaysia
> >>> airline crash blackbox - I would want to see "relevant" results but
> would
> >>> also like to see the recent developments on the top i.e.
> effectively a
> >>> reverse chronological order within the relevant results, like telling a
> >>> story over a period of time
> >>>
> >>> Hope I am clear. Thanks for your help.
> >>>
> >>> Thanks
> >>>
> >>> Ravi Kiran Bhaskar
> >>>
> >>>
> >>> On Thu, Sep 4, 2014 at 5:08 PM, Joel Bernstein wrote:
> >>>
>  If you want the main query to be sorted by date then the top N docs
>  reranked by a query, that should work. Try something like this:
> 
>  q=foo&sort=date+desc&rq={!rerank reRankDocs=1000
>  reRankQuery=$myquery}&myquery=blah
> 
> 
>  Joel Bernstein
>  Search Engineer at Heliosearch
> 
> 
>  On Thu, Sep 4, 2014 at 4:25 PM, Ravi Solr wrote:
> 
> > Can the ReRanking API be used to sort within docs retrieved by a date
>  field
> > ? Can somebody help me understand how to write such a query ?
> >
> > Thanks
> >
> > Ravi Kiran Bhaskar
> >
> 
> >>>
> >>
>
>


How to solve?

2014-09-05 Thread William Bell
We have a core with each document as a person.

We want to boost based on the sweater color, but if the person has sweaters
in their closet from the same manufacturer, we want to boost even more by
adding those values together.

Peter Smit - Sweater: Blue = 1 : Nike, Sweater: Red = 2 : Nike, Sweater: Blue = 1 : Polo
Tony S - Sweater: Red = 2 : Nike
Bill O - Sweater: Red = 2 : Polo, Sweater: Blue = 1 : Polo

Scores:

Peter Smit - 1 + 2 = 3
Tony S - 2
Bill O - 2 + 1 = 3

I thought about using payloads.

sweaters_payload
Blue: Nike: 1
Red: Nike: 2
Blue: Polo: 1

How do I query this?

http://localhost:8983/solr/persons?q=*:*&sort=??

Ideas?
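
One payload-free direction, if the intended score is the best same-manufacturer
total (which matches the numbers above): at index time, sum the values per
manufacturer into dynamic float fields (the names here are made up), e.g.
sweater_nike_f=3 and sweater_polo_f=1 for Peter, then sort on a function query:

http://localhost:8983/solr/persons?q=*:*&sort=max(sweater_nike_f,sweater_polo_f)+desc

That gives Peter 3, Tony 2, Bill 3, but it only works when the set of
manufacturers is known and small, and multi-argument max() over fields needs a
reasonably recent 4.x release.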




-- 
Bill Bell
billnb...@gmail.com
cell 720-256-8076


Re: Query ReRanking question

2014-09-05 Thread Joel Bernstein
You can probably use the FunctionQParserPlugin in conjunction with Query
ReRanking to achieve what you're trying to do.

q=foo&rq={!rerank reRankDocs=1000 reRankQuery=$qq}&qq={!func}someFunction()

What this is going to do is rerank the docs based on a function query.

Your function query will need to return a float, because the query reranker
is expecting a score, which is a float. So you'll have to devise function
query logic that transforms your date into a float.
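
For example (an untested sketch, assuming a trie date field named "date"), the
classic recency function fits this pattern:

q=foo&rq={!rerank reRankDocs=1000 reRankQuery=$qq}&qq={!func}recip(ms(NOW,date),3.16e-11,1,1)

recip(ms(NOW,date),3.16e-11,1,1) is close to 1.0 for brand-new documents and
decays with age (roughly 0.5 at one year), so reranking the top documents by it
approximates reverse chronology while the first pass still selects by
relevancy.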





Joel Bernstein
Search Engineer at Heliosearch


On Fri, Sep 5, 2014 at 7:06 PM, Ravi Solr  wrote:

> Walter, thank you for the valuable insight. The problem I am facing is that
> between the term frequencies, mm, date boost and stemming the results can
> become very inconsistent... Look at the following examples.
>
> Here the chronology is all over the place because of what I mentioned above
> http://www.washingtonpost.com/pb/newssearch/?query=malaysian+airline+crash
>
> Now take the instance of an old topic/news story which was covered a while ago
> for a period of time but not actively updated recently... In this case, the
> date boosting predominantly takes over because of common terms and we get a
> rash of irrelevant content
>
> http://www.washingtonpost.com/pb/newssearch/?query=faces+of+the+fallen
>
> This has become such a balancing act and hence I was looking to see if
> reRanking might help
>
> Thanks
>
> Ravi Kiran Bhaskar
>
>
>
>
>
> On Fri, Sep 5, 2014 at 1:32 PM, Walter Underwood 
> wrote:
>
> > Boosting on recency is probably a better approach. A fixed re-ranking
> > horizon will always be a compromise, a guess at the precision of the
> query.
> > It will give poor results for queries that are more or less specific than
> > the assumption.
> >
> > Think of the recency boost as a tie-breaker. When documents are similar
> in
> > relevance, show the most recent. This can work over a wide range of
> queries.
> >
> > For “malaysian airlines crash”, there are two sets of relevant documents,
> > one set on MH 370 starting six months ago, and one set on MH 17, two
> months
> > ago. But four hours ago, The Guardian published a “six months on” article
> > on MH 370. A recency boost will handle that complexity.
> >
> > wunder
> > Walter Underwood
> > wun...@wunderwood.org
> > http://observer.wunderwood.org/
> >
> >
> > On Sep 5, 2014, at 10:23 AM, Erick Erickson 
> > wrote:
> >
> > > OK, why can't you switch the clauses from Joel's suggestion?
> > >
> > > Something like:
> > > q=Malaysia plane crash&rq={!rerank reRankDocs=1000
> > > reRankQuery=$myquery}&myquery=*:*&sort=date+desc
> > >
> > > (haven't tried this yet, but you get the idea).
> > >
> > > Best,
> > > Erick
> > >
> > > On Fri, Sep 5, 2014 at 9:33 AM, Markus Jelsma
> > >  wrote:
> > >> Hi - You can already achieve this by boosting on the document's
> > recency. The result set won't be exactly ordered by date but you will get
> > the most relevant and recent documents on top.
> > >>
> > >> Markus
> > >>
> > >> -Original message-
> > >>> From: Ravi Solr <ravis...@gmail.com>
> > >>> Sent: Friday 5th September 2014 18:06
> > >>> To: solr-user@lucene.apache.org 
> > >>> Subject: Re: Query ReRanking question
> > >>>
> > >>> Thank you very much for responding. I want to do exactly the opposite
> > of
> > >>> what you said. I want to sort the relevant docs in reverse
> chronology.
> > If
> > >>> you sort by date beforehand then the relevancy is lost. So I want to
> > get
> > >>> Top N relevant results and then rerank those Top N to achieve
> relevant
> > >>> reverse chronological results.
> > >>>
> > >>> If you ask Why would I want to do that ??
> > >>>
> > >>> Let's take an example: the Malaysian airline crash. Several articles
> > might
> > >>> have been published over a period of time. When I search for -
> malaysia
> > >>> airline crash blackbox - I would want to see "relevant" results but
> > would
> > >>> also like to see the recent developments on the top i.e.
> > effectively a
> > >>> reverse chronological order within the relevant results, like
> telling a
> > >>> story over a period of time
> > >>>
> > >>> Hope I am clear. Thanks for your help.
> > >>>
> > >>> Thanks
> > >>>
> > >>> Ravi Kiran Bhaskar
> > >>>
> > >>>
> > >>> On Thu, Sep 4, 2014 at 5:08 PM, Joel Bernstein wrote:
> > >>>
> >  If you want the main query to be sorted by date then the top N docs
> >  reranked by a query, that should work. Try something like this:
> > 
> >  q=foo&sort=date+desc&rq={!rerank reRankDocs=1000
> >  reRankQuery=$myquery}&myquery=blah
> > 
> > 
> >  Joel Bernstein
> >  Search Engineer at Heliosearch
> > 
> > 
> >  On Thu, Sep 4, 2014 at 4:25 PM, Ravi Solr wrote:
> > 
> > > Can the ReRanking API be used to sort within docs retrieved by a
> date
> >  field
> > > ? Can somebody help me understand how to write