weak documents

2013-11-27 Thread Thomas Scheffler

Hi,

I am relatively new to SOLR and I am looking for a neat way to implement 
weak documents with SOLR.


Whenever a document is updated or deleted, all its dependent documents 
should be removed from the index. In other words, they exist only as long 
as the document they referred to at indexing time exists - in that 
specific version. On "update" they will be indexed after their master 
document.


I would like to have some kind of "dependsOn" field that carries the 
uniqueKey value of the master document.


Can this be done efficiently with SOLR?

I need this technique because on update and on delete I don't know how 
many dependent documents exist in the SOLR index. Especially for batch 
index processes, I need a more efficient way than querying before every 
update or delete.


kind regards,

Thomas


Persist solr cache

2013-11-27 Thread Prasi S
Hi all,
Is there any way the Solr caches (document / field / query) can be
persisted on disk? In case of a system crash, can I make the new cache load
from the persisted cache?

Thanks,
Prasi


Re: weak documents

2013-11-27 Thread Upayavira
Just a guess, I haven't investigated them fully yet, but I wonder if
block joins could serve you here, as they involve creating docs in a
parent child relationship.

Or, you could easily fake it:

  <doc>
    <field name="id">abcd</field>
    <field name="dependsOn">parent:abcd</field>
  </doc>

Not sure if that syntax is completely right, but using that sort of
thing would get you there. For deletes, think delete-by-query.
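
A minimal sketch of that delete, with the field name from above
(untested):

  <delete>
    <query>dependsOn:"parent:abcd"</query>
  </delete>

so you never need to know beforehand how many dependents exist.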

There isn't yet an update by query (batch update) feature, one that
would be very useful.

Upayavira

On Wed, Nov 27, 2013, at 08:13 AM, Thomas Scheffler wrote:
> Hi,
> 
> I am relatively new to SOLR and I am looking for a neat way to implement 
> weak documents with SOLR.
> 
> Whenever a document is updated or deleted, all its dependent documents 
> should be removed from the index. In other words, they exist only as long 
> as the document they referred to at indexing time exists - in that 
> specific version. On "update" they will be indexed after their master 
> document.
> 
> I would like to have some kind of "dependsOn" field that carries the 
> uniqueKey value of the master document.
> 
> Can this be done efficiently with SOLR?
> 
> I need this technique because on update and on delete I don't know how 
> many dependent documents exist in the SOLR index. Especially for batch 
> index processes, I need a more efficient way than querying before every 
> update or delete.
> 
> kind regards,
> 
> Thomas


Re: Solr Autowarmed queries on jvm crash

2013-11-27 Thread michael.boom
As Shawn stated above, when you start up Solr there will be no such thing as
caches or old searchers.
If you want to warm up, you can only rely on firstSearcher and newSearcher
queries. 
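
For example, in solrconfig.xml (the queries themselves are just
placeholders):

  <listener event="firstSearcher" class="solr.QuerySenderListener">
    <arr name="queries">
      <lst>
        <str name="q">solr</str>
        <str name="sort">price asc</str>
      </lst>
    </arr>
  </listener>

A similar listener can be registered for the newSearcher event.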

/"What would happen to the autowarmed queries , cache , old searcher now ?"/
They're all gone.



-
Thanks,
Michael
--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-Autowarmed-queries-on-jvm-crash-tp4103451p4103466.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Persist solr cache

2013-11-27 Thread michael.boom
Caches are only valid as long as the Index Searcher is valid. So, if you make
a commit that opens a new searcher then the caches will be invalidated.
However, in this scenario you can configure your caches so that the new
searcher will keep a certain number of cache entries from the previous one
(autowarmCount). 
That's the only cache "persistence" Solr can offer. On restarting/crash you
can't reuse caches.
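
For illustration, autowarming is configured per cache in solrconfig.xml,
e.g. (the sizes are just examples):

  <filterCache class="solr.FastLRUCache"
               size="512"
               initialSize="512"
               autowarmCount="128"/>

With autowarmCount="128" the new searcher regenerates the top 128 entries
of the old searcher's filter cache before it starts serving requests.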

Why do you need to persist caches in case of a crash? What's your usage
scenario?
Do you have problems with performance after startup?

You can read more at http://wiki.apache.org/solr/SolrCaching#Overview



-
Thanks,
Michael
--
View this message in context: 
http://lucene.472066.n3.nabble.com/Persist-solr-cache-tp4103463p4103469.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: weak documents

2013-11-27 Thread Paul Libbrecht
Thomas,

our experience with Curriki.org is that evaluating what I call the "related 
documents" is a procedure that needs access to the complete content and thus is 
run at the DB level and not the Solr level.

For example, if a user changes a part of their name, we need to reindex all of 
their resources. Sure, we could try to run a Solr query for this, and maybe add 
index fields for it, but we felt it better to run this on the index-trigger 
side: the thing in our (XWiki) wiki which listens to changes and requests the 
reindexing of a few documents (including deletions).

The same issue has appeared for maintenance operations: if the indexer, the 
listener or Solr has been down for a few minutes or hours, we'd need to 
reindex not only all changed documents but also their related documents.

If you are able to work out a solution that is Solr-only - writing down all 
depends-on relations at index time - it means you would index-update all 
"inverse related" documents every time something changes. For the relation 
above (documents of a user), it means the user document needs reindexing every 
time a new document is added. I wonder if this makes a scale difference.

Paul


Le 27 nov. 2013 à 09:13, Thomas Scheffler  a 
écrit :

> Hi,
> 
> I am relatively new to SOLR and I am looking for a neat way to implement weak 
> documents with SOLR.
> 
> Whenever a document is updated or deleted, all its dependent documents should 
> be removed from the index. In other words, they exist only as long as the 
> document they referred to at indexing time exists - in that specific version. 
> On "update" they will be indexed after their master document.
> 
> I would like to have some kind of "dependsOn" field that carries the 
> uniqueKey value of the master document.
> 
> Can this be done efficiently with SOLR?
> 
> I need this technique because on update and on delete I don't know how many 
> dependent documents exist in the SOLR index. Especially for batch index 
> processes, I need a more efficient way than querying before every update or 
> delete.
> 
> kind regards,
> 
> Thomas





Re: syncronization between replicas

2013-11-27 Thread adfel70
I'm sorry, I forgot to write the problem.


adfel70 wrote
> 1. take one of the replicas of shard1 down(it doesn't matter which one)
> 2. continue indexing documents(that's important for this scenario)
> 3. take down the second replica of shard1(now the shard is down and we
> cannot index anymore)
> 4. take the replica from step 1 up(that's important that this replica will
> go up first)
> 5. take the replica from step 3 up

after the second replica is up, it has data that the first replica doesn't
have (in step 2 we continued to index while the first replica was down). I need
to know if there is a way for the second replica to tell the first one that it
has data to sync with it...




--
View this message in context: 
http://lucene.472066.n3.nabble.com/syncronization-between-replicas-tp4103046p4103477.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr 3.6.1 stalling with high CPU and blocking on field cache

2013-11-27 Thread Guido Medina
We have a webapp running with a very high heap size (24GB) and we have 
no problems with it AFTER we enabled the new GC that is meant to replace 
the CMS GC sometime in the future. You have to be on a recent Java 6 update 
(I couldn't find the exact number, but the latest should cover it) to be 
able to use it:

1. Remove all GC options you have and...
2. Replace them with "-XX:+UseG1GC -XX:MaxGCPauseMillis=50"

As a test, of course. You can read more in the following (and 
interesting) article; we also have Solr running with these options, no 
more pauses or heap size hitting the sky.

Don't get bored reading the 1st (and small) introduction page of the 
article; pages 2 and 3 will make a lot of sense: 
http://www.drdobbs.com/jvm/g1-javas-garbage-first-garbage-collector/219401061


HTH,

Guido.

On 26/11/13 21:59, Patrick O'Lone wrote:

We do perform a lot of sorting - on multiple fields in fact. We have
different kinds of Solr configurations - our news searches do little
with regards to faceting, but heavily sort. We provide classified ad
searches and that heavily uses faceting. I might try reducing the JVM
memory some and amount of perm generation as suggested earlier. It feels
like a GC issue and loading the cache just happens to be the victim of a
stop-the-world event at the worse possible time.


My gut instinct is that your heap size is way too high. Try decreasing it to 
like 5-10G. I know you say it uses more than that, but that just seems bizarre 
unless you're doing something like faceting and/or sorting on every field.

-Michael

-Original Message-
From: Patrick O'Lone [mailto:pol...@townnews.com]
Sent: Tuesday, November 26, 2013 11:59 AM
To: solr-user@lucene.apache.org
Subject: Solr 3.6.1 stalling with high CPU and blocking on field cache

I've been tracking a problem in our Solr environment for a while: periodic 
stalls of Solr 3.6.1. I'm running into a wall on ideas to try and thought I 
might get some insight from some others on this list.

The load on the server is normally anywhere between 1-3. It's an 8-core machine 
with 40GB of RAM. I have about 25GB of index data that is replicated to this 
server every 5 minutes. It's taking about 200 connections per second and 
roughly every 5-10 minutes it will stall for about 30 seconds to a minute. The 
stall causes the load to go to as high as 90. It is all CPU bound in user space 
- all cores go to 99% utilization (spinlock?). When doing a thread dump, the 
following line is blocked in all running Tomcat threads:

org.apache.lucene.search.FieldCacheImpl$Cache.get (FieldCacheImpl.java:230)

Looking at the source code in 3.6.1, that is a synchronized() call, which 
blocks all threads and causes the backlog. I've tried to 
correlate these events to the replication events - but even with replication 
disabled - this still happens. We run multiple data centers using Solr and I 
was comparing garbage collection processes between and noted that the old 
generation is collected very differently on this data center versus others. The 
old generation is collected as a massive collect event (several gigabytes 
worth) - the other data center is more saw toothed and collects only in 
500MB-1GB at a time. Here's my parameters to java (the same in all 
environments):

/usr/java/jre/bin/java \
-verbose:gc \
-XX:+PrintGCDetails \
-server \
-Dcom.sun.management.jmxremote \
-XX:+UseConcMarkSweepGC \
-XX:+UseParNewGC \
-XX:+CMSIncrementalMode \
-XX:+CMSParallelRemarkEnabled \
-XX:+CMSIncrementalPacing \
-XX:NewRatio=3 \
-Xms30720M \
-Xmx30720M \
-Djava.endorsed.dirs=/usr/local/share/apache-tomcat/endorsed \
-classpath /usr/local/share/apache-tomcat/bin/bootstrap.jar \
-Dcatalina.base=/usr/local/share/apache-tomcat \
-Dcatalina.home=/usr/local/share/apache-tomcat \
-Djava.io.tmpdir=/tmp \
org.apache.catalina.startup.Bootstrap start

I've tried a few GC option changes from this (been running this way for a 
couple of years now) - primarily removing CMS Incremental mode as we have 8 
cores and remarks on the internet suggest that it is only for smaller SMP 
setups. Removing CMS did not fix anything.

I've considered that the heap is way too large (30GB from 40GB) and may not 
leave enough memory for mmap operations (MMap appears to be used in the field 
cache). Based on active memory utilization in Java, seems like I might be able 
to reduce down to 22GB safely - but I'm not sure if that will help with the CPU 
issues.

I think field cache is used for sorting and faceting. I've started to 
investigate facet.method, but from what I can tell, this doesn't seem to 
influence sorting at all - only facet queries. I've tried setting 
useFilterForSortQuery, and it seems to require less field cache but doesn't 
address the stalling issues.

Is there something I am overlooking? Perhaps the system is becoming 
oversubscribed in terms of resources? Thanks for any help that is offered.

--
Patrick O'Lone
Director of Software Development
TownNews.com

E-mail ... pol...@

Re: Persist solr cache

2013-11-27 Thread Prasi S
Currently, once Solr is started, we run a batch that fires queries at
Solr (just like the firstSearcher does). Once this is done, the
users start using search.

In case the server is restarted or anything crashes, I have to run this
batch again, which I cannot control. That's why I asked if there is any way
we can persist the caches.

This was only for our business scenario.



Thanks,
Prasi


On Wed, Nov 27, 2013 at 2:05 PM, michael.boom  wrote:

> Caches are only valid as long as the Index Searcher is valid. So, if you
> make a commit that opens a new searcher then the caches will be invalidated.
> However, in this scenario you can configure your caches so that the new
> searcher will keep a certain number of cache entries from the previous one
> (autowarmCount).
> That's the only cache "persistence" Solr can offer. On restarting/crash you
> can't reuse caches.
>
> Why do you need to persist caches in case of a crash? What's your usage
> scenario?
> Do you have problems with performance after startup?
>
> You can read more at http://wiki.apache.org/solr/SolrCaching#Overview
>
>
>
> -
> Thanks,
> Michael
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Persist-solr-cache-tp4103463p4103469.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: Multivalued true Error?

2013-11-27 Thread Furkan KAMACI
Thanks Sujit, I got the problem and fixed it.


2013/11/26 Sujit Pal 

> Hi Furkan,
>
> In the stock definition of the payload field:
>
> http://svn.apache.org/viewvc/lucene/dev/trunk/solr/example/solr/collection1/conf/schema.xml?view=markup
>
> the analyzer for payloads field type is a WhitespaceTokenizerFactory
> followed by a DelimitedPayloadTokenFilterFactory. So if you send it a
> string "foo$score1 bar$score2 ..." where foo and bar are string tokens and
> score[12] are payload scores and "$" is your delimiter, the analyzer will
> tokenize it into multiple payloads and you should be able to run the tests
> in the blog post. So you shouldn't make it multiValued AFAIK.
>
> -sujit
>
>
>
> On Tue, Nov 26, 2013 at 8:44 AM, Furkan KAMACI  >wrote:
>
> > Hi;
> >
> > I've ported this example from Scala into Java:
> > http://sujitpal.blogspot.com/2013/07/porting-payloads-to-solr4.html#!
> >
> > However, should the field be multiValued=true in that example?
> >
> > PS: I use Solr 4.5.1
> >
> > Thanks;
> > Furkan KAMACI
> >
>


Re: weak documents

2013-11-27 Thread Thomas Scheffler

Am 27.11.2013 09:58, schrieb Paul Libbrecht:

Thomas,

our experience with Curriki.org is that evaluating what I call the
"related documents" is a procedure that needs access to the complete
content and thus is run at the DB level and not the Solr level.

For example, if a user changes a part of their name, we need to reindex
all of their resources. Sure, we could try to run a Solr query for this,
and maybe add index fields for it, but we felt it better to run this
on the index-trigger side: the thing in our (XWiki) wiki which
listens to changes and requests the reindexing of a few documents
(including deletions).

The same issue has appeared for maintenance operations: if
the indexer, the listener or Solr has been down for a few minutes or
hours, we'd need to reindex not only all changed documents but also
their related documents.

If you are able to work out a solution that is Solr-only - writing
down all depends-on relations at index time - it means you would
index-update all "inverse related" documents every time something
changes. For the relation above (documents of a user), it means the
user document needs reindexing every time a new document is added. I
wonder if this makes a scale difference.


I think both use-cases differ a bit. At index time of my master document 
I have all the information about the dependent documents ready. So instead 
of committing one document I commit - let's say - four.


In your case you have to query to get all documents of a user first.

Here is a more detailed use-case. I have metadata in 1 to n languages to 
describe a document (e.g. journal article).


I commit a master document in a specified default language to SOLR and 
one document for every language I have metadata for. If a user adds or 
removes metadata (e.g. an abstract in French) there is one document more or 
one document less in SOLR. So their number changes, and I don't want stale 
data to be kept in the index.


A similar use case: I have article documents with authors. I create 
"author" documents for every article. If someone adds or removes an 
author I need to track that change. These "dumb" author documents are 
used for an alphabetical person index and hold a unique field that is 
used to group them, but these documents exist only as long as their 
master documents do.


My two use-cases are quite similar, so I would like this "weak" 
documents functionality somehow.


SOLR knows that if a document is added with id=foo it has to replace a 
document that matches id:"foo". If I can change this behavior to 
dependsOn:"foo" I am done. :-D


regards

Thomas


TrimFilterFactory and IllegalArgumentException with Solr4.6

2013-11-27 Thread Bernd Fehling
Now this is strange,
while using TrimFilterFactory with attribute "updateOffsets=true" as described 
in

http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.TrimFilterFactory
and
https://cwiki.apache.org/confluence/display/solr/Filter+Descriptions#FilterDescriptions-TrimFilter

I get "java.lang.IllegalArgumentException: updateOffsets=true is not supported 
anymore as of Lucene 4.4".

So the documentation is outdated?

What is now the behavior of TrimFilterFactory - always "updateOffsets=true"?
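
For reference, this is the kind of analyzer definition that now fails
(the field type name is just an example):

  <fieldType name="text_trim" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.KeywordTokenizerFactory"/>
      <filter class="solr.TrimFilterFactory" updateOffsets="true"/>
    </analyzer>
  </fieldType>

Dropping the updateOffsets attribute makes the exception go away.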

Regards
Bernd


News url is not showing correct

2013-11-27 Thread Vishal GUPTA
Hi

I am facing a problem in solr with the tt_news URL. Every time it's showing all 
news from one detail page.

For ex: I have two category of news

1.   Corporate

2.   Human

So the URL for corporate should be formed like: 
domainname/pagename/corporate/detail/article/newsheading
And for human it should be like: 
domainname/pagename/human/detail/article/newsheading

But every time it is showing: 
domainname/pagename/corporate/detail/article/newsheading

I have checked that it is not taking the singlePid value in the following code 
(this is what I use for tt_news):

plugin.tx_solr.index.queue {
table = tt_news
// enables indexing of tt_news reocrds
tt_news = 1
tt_news {
fields {
abstract = short
author = author
description = short
title = title

// the special SOLR_CONTENT content object cleans HTML and RTE 
fields
content = SOLR_CONTENT
content {
field = bodytext
}

// the special SOLR_RELATION content object resolves relations
category_stringM = SOLR_RELATION
category_stringM {
localField = category
multiValue = 1
}

// the special SOLR_MULTIVALUE content object allows to index 
multivalue fields
keywords = SOLR_MULTIVALUE
keywords {
field = keywords
}

// build the URL through typolink, make sure to use returnLast = url
url = TEXT
url {
typolink.parameter = {$plugin.tt_news.singlePid}
typolink.additionalParams = 
&tx_ttnews[tt_news]={field:uid}&L={field:__solr_index_language}
typolink.additionalParams.insertData = 1
typolink.returnLast = url
typolink.useCacheHash = 1
}

sortAuthor_stringS = author
sortTitle_stringS  = title
}
}
}

So I define singlePid in the home page TypoScript section, and this id is that 
of the corporate news page.

Can anyone tell me how I can configure it for multiple categories, so that it 
shows the correct URL?

Regards
Vishal Gupta
CUG: 830 4659
Email: vishal.gu...@steria.co.in

This email and any attachments may contain confidential information and 
intellectual property (including copyright material). It is only for the use of 
the addressee(s) in accordance with any instructions contained within it. If 
you are not the addressee, you are prohibited from copying, forwarding, 
disclosing, saving or otherwise using it in any way. If you receive this email 
in error, please immediately advise the sender and delete it. Steria may 
monitor the content of emails within its network to ensure compliance with its 
policies and procedures. Emails are susceptible to alteration and their 
integrity (including origin) cannot be assured. Steria shall not be liable for 
any modification to a message, or for messages falsely sent.


LanguageIdentifierUpdateProcessor uses only firstValue() on multivalued fields

2013-11-27 Thread Müller, Stephan
Hello,

this is a repost. This message was originally posted on the 'general' list but 
it was suggested, that the 'user' list might be a better place to ask.

 Original Message 
Hi,

we are passing a multivalued field to the LanguageIdentifierUpdateProcessor. 
This multivalued field contains arbitrary types (Integer, String, Date).

Now, the LanguageIdentifierUpdateProcessor.concatFields(SolrInputDocument doc, 
String[] fields), which btw does not use the parameter fields, is unable to 
parse all fields of the/a multivalued field. The call "Object content = 
doc.getFieldValue(fieldName);" does not care what type the field is and just 
delegates to SolrInputDocument which in turn calls getFirstValue.

So, two issues:
First - if the first value of the multivalued field is not of type String, the 
field is ignored completely.

Second - the concat method does not concat all values of a multivalued field.

While http://www.mail-archive.com/solr-user@lucene.apache.org/msg90530.html 
states: "The feature is designed to detect exactly one language per field. In 
case of multivalued, it will concatenate all values before detection." But as 
far as I can see, the code is unable to do this at all for multivalued fields.

This behavior was found in 4.3 but the code is still the same for current trunk 
(as of 2013-11-26)

Is this a bug? Is this a special design decision? Did we miss a certain 
configuration, that would allow the Language identification to use all values 
of a multivalued field?

We are about to write our own 
LangDetectLanguageIdentifierUpdateProcessorFactory (why is the getInstance 
hardcoded to return LanguageIdentifierUpdateProcessor?) and overwrite 
LanguageIdentifierUpdateProcessor to handle all values of a multivalued field, 
ignoring non-string values.



Please see configuration below.

I hope I was able to make myself clear. I'd like to hear your thoughts on this, 
before I go off and file a bug report.

Regards,
Stephan


A little background:
We are using a 3rd-party CMS framework which pulls in some magic SOLR 
configuration (namely the textbody field).

The textbody field is defined as follows:

  <field name="textbody" ... multiValued="true"/>

As you can see, it is also used as a search field, therefore we want to have 
the actual datatypes on the values.
The field itself is generated by a processor, prior to calling the language 
identification (see processor chain).



The processor chain:

  <updateRequestProcessorChain name="...">
    ...
    <processor class="...">
      <str name="...">name</str>
      <str name="...">name_tokenized</str>
    </processor>
    <processor class="org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessorFactory">
      <str name="langid.fl">textbody,name_tokenized</str>
      <str name="langid.langField">language</str>
      <str name="langid.fallback">en</str>
    </processor>
    <processor class="3rdpartypackage.solr.update.processor.LanguageDependentFieldsProcessorFactory">
      <str name="...">language</str>
      <str name="...">textbody,name_tokenized</str>
    </processor>
    ...
  </updateRequestProcessorChain>



Re: Persist solr cache

2013-11-27 Thread michael.boom
You could just add the queries you have set up in your batch script to the
firstSearcher queries. That way, you wouldn't need to run the script
every time you restart Solr.

As for crash protection and immediate action, that's outside the scope of
the Solr mailing list. You could setup a watchdog that will restart Solr if
it crashes, or something like that. 

Or you could use SolrCloud with replicas on multiple machine. This would
remove the SPOF from your system.



-
Thanks,
Michael
--
View this message in context: 
http://lucene.472066.n3.nabble.com/Persist-solr-cache-tp4103463p4103487.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: syncronization between replicas

2013-11-27 Thread Daniel Collins
I think when a replica becomes leader, it tries to sync *from* all the
other replicas to see if anyone else is more up to date than it is, then it
syncs back out *to* the replicas.  But that probably won't happen in your
case, since when replica1 comes back (step 4) it is the only contender, so
it can't sync then.

So I know Solr has support for 2-way sync, but whether it will happen in
step 5 (when the other replica comes back up), I'm not honestly sure...
Would need to delve into the code to check.


On 27 November 2013 09:19, adfel70  wrote:

> I'm sorry, I forgot to write the problem.
>
>
> adfel70 wrote
> > 1. take one of the replicas of shard1 down(it doesn't matter which one)
> > 2. continue indexing documents(that's important for this scenario)
> > 3. take down the second replica of shard1(now the shard is down and we
> > cannot index anymore)
> > 4. take the replica from step 1 up(that's important that this replica
> will
> > go up first)
> > 5. take the replica from step 3 up
>
> after the second replica is up, it has data that the first replica doesn't
> have(step 2, we continued to index while the first replica was down), I
> need
> to know if there is a way that the second replica tell the first one that
> it
> has data to sync with him...
>
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/syncronization-between-replicas-tp4103046p4103477.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Which patch for which solr version

2013-11-27 Thread Ramo Karahasan
Hi,

 

on https://issues.apache.org/jira/browse/SOLR-3583

 

there are some patches listed. I currently can't really figure out for
which Solr version each patch is valid, since the issue listed there is
still open and should be fixed in version 4.6.

 

I'm wondering if this patch can be applied to a 4.5.1 version of Solr.

 

Best,

Ramo



RE: Need help on Joining and sorting syntax and limitations between multiple documents in solr-4.4.0

2013-11-27 Thread Sukanta Dey
Hi Team,

As per the latest updates in the support ticket in the Lucid portal, we have 
some concerns, as below:


1.   The join key ids seem to have to be integers. It says they require 
longs, but I am having trouble with anything but an integer as the "from" and 
"to" key values.

--regarding the above comment, we need to have these fields as non-numeric 
instead of numeric values, which was discussed in the first call with you.



2.   You have to separate this into two collections. This doesn't work the 
same way a join does; it is a function returning a value. That means that 
documents that don't match just get a value of zero and still return. That 
means all four documents will return if they are all in the same collection.

--starting from the requirement discussion we have been emphasizing the fact 
that the join needs to be performed between documents which reside in the same 
core, not in different cores.

3.   You need a "join" cache in your joined collection.

--could you please explain a bit more on the above, like what we need to do 
to have this implemented on our side and what the utility/feature of the join 
cache is?

Also, we tried the vjoin operation with the syntax given by Greg in the 
ticket, but it is not working as per our expectations.

Thanks,
Sukanta

From: Sukanta Dey
Sent: Tuesday, November 26, 2013 3:20 PM
To: 'Colm Pruvot'; Yann Yu; 'Greg Harris'
Cc: 'solr-user@lucene.apache.org'; Sukanta Dey; Souvik Mazumder
Subject: RE: Need help on Joining and sorting syntax and limitations between 
multiple documents in solr-4.4.0

Hi Team,

Attaching the updated files as per the comments in the ticket. You can now try 
the VJOIN operation on the updated files.
It would be also helpful for us if you send the correct VJOIN syntax with the 
inputs from the updated files.

Thanks,
Sukanta

From: Sukanta Dey
Sent: Friday, November 22, 2013 1:46 PM
To: 'Colm Pruvot'; Yann Yu; 'Greg Harris'
Cc: 'solr-user@lucene.apache.org'; Sukanta Dey
Subject: RE: Need help on Joining and sorting syntax and limitations between 
multiple documents in solr-4.4.0

Hi Team,

I am attaching all the required files we are using to get the VJOIN 
functionality along with the actual requirement statement.
Hope this would help you understand better the requirement for VJOIN 
functionality.

Thanks,
Sukanta

From: Sukanta Dey
Sent: Wednesday, September 04, 2013 1:50 PM
To: 'solr-user@lucene.apache.org'
Cc: Sukanta Dey
Subject: Need help on Joining and sorting syntax and limitations between 
multiple documents in solr-4.4.0

Hi Team,

In my project I am going to use the Apache Solr 4.4.0 version for searching. 
While doing that, I need to join multiple Solr documents within the same core 
on one of the fields common across the documents.
Though I successfully join the documents using the Solr 4.4.0 join syntax and 
it returns the expected result, my next requirement is to sort the returned 
result on the fields from the documents involved in the join condition's 
"from" clause, which I was not able to get working. Let me explain the problem 
in detail along with the files I am using ...


1)  Files being used:

a.   Picklist_1.xml
     (field values: t1324838, 7, 956, 130712901, Draft, Draoft)

b.   Picklist_2.xml
     (field values: t1324837, 7, 87749, 130712901, New, Neuo)

c.   AssetID_1.xml
     (field values: t1324837, a180894808, 1, true, 2013-09-02T09:28:18Z, 
     130713716, 130712901)

d.   AssetID_2.xml
     (field values: t1324838, a171658357, 1, 130713716, 2283961, 2290309, 
     7, 7, 13503796, 15485964, 38052, 41133, 130712901)

2)  Requirements:

i.   It needs a join between the files using the "def14227_picklist" field 
from AssetID_1.xml and AssetID_2.xml and the "describedObjectId" field from 
Picklist_1.xml and Picklist_2.xml.

ii.  After joining we need to have all the fields from the AssetID_*.xml 
files and the "en" and "gr" fields from the Picklist_*.xml files.

iii. While joining we also need to sort the result based on the "en" field 
value.



3)  I was trying the "q={!join from=inner_id to=outer_id}zzz:vvv" syntax, 
but no luck.
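
With the fields above it would presumably have to be something like

  q={!join from=describedObjectId to=def14227_picklist}en:New

though the standard join parser only returns documents from the "to" side, 
and sorting on a "from"-side field such as "en" is not supported by it.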

Any help/suggestion would be appreciated.

Thanks,
Sukanta Dey






Re: SOLR Master-Slave Repeater with Load balancer

2013-11-27 Thread Erick Erickson
Yes. This is going to hurt you a lot. The intent of M/S is that
you should be indexing to one, and only one machine, the
master. All slaves pull their indexes from the master. Frankly
I don't know quite what will happen in the configuration you're
talking about. I strongly recommend you do not do this.

HA is not at all difficult in a M/S situation. Just configure
as many slaves as you need, all pointing to the same master.
Then have your LB point to the slaves. And have your indexing
process(es) point to the master and only the master.

Bringing up a new slave is as simple as configuring a new machine,
pointing it to the master, waiting for replication to complete then
letting the LB know about the new machine.

What you don't get is fail-over if the master goes down. You can
promote one of the slaves to be a master but you have to put
a mechanism in place that allows you to "catch up" that index,
which is often just re-indexing everything from before the master
went down to the new master.

You said you don't want to go to SolrCloud, and that's up to
you. But automatic, robust HA/DR is hard. If your goal is to have
a fault-tolerant, automated recovery system I really urge you to
reconsider SolrCloud. The _point_ of SolrCloud is exactly
that. If you try to re-invent that process you'll be in for a lot of work.

There, rant finished :)
Erick


On Tue, Nov 26, 2013 at 1:47 PM, kondamudims  wrote:

> We are trying to set up the Solr Master-Slave repeater model, where we will
> have two Solr servers, say S1 and S2, and a load balancer LB to route all the
> requests to either S1 or S2. S1 and S2 acts as both Master and
> Slave(Repeater). In both the solr server configurations, in the
> solrconfig.xml file for master url property if we provide Load balancer
> host-name and port number then at any point there will be a self polling,
> i.e. if LB is configured in such a way that all its requests will be routed
> to S1, then while polling S1-->LB-->S1 and S2-->LB-->S1. Do you see any
> issue with self polling (S1-->LB-->S1)? We are mainly trying to achieve high
> availability as we don't want to use SolrCloud. Thanks in advance
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/SOLR-Master-Slave-Repeater-with-Load-balancer-tp4103363.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: TermVectorComponent NullPointerException

2013-11-27 Thread Erick Erickson
Please review: http://wiki.apache.org/solr/UsingMailingLists

You've given us almost no information to go on here.

Best,
Erick


On Tue, Nov 26, 2013 at 2:21 PM, GOYAL, ANKUR  wrote:

> Hi,
>
> I am working on using term vector component with solr 4.2.1. If I use solr
> in a multicore environment, then I am getting a Null Pointer exception.
> However, if I use single core as is mentioned at :-
>
> http://wiki.apache.org/solr/TermVectorComponent
>
> then I do not get any exception. However, the response that I get does not
> contain any term information. So, did anybody else also faced this issue ?
>
> With Regards,
> Ankur
>
>
>
>


Re: LanguageIdentifierUpdateProcessor uses only firstValue() on multivalued fields

2013-11-27 Thread Jack Krupansky
I suspect that it is an oversight for a use case that was not considered. I 
mean, it should probably either ignore or convert non text/string values. 
Hmmm... are you using JSON input? I mean, how are the types being set? Solr 
XML doesn't have a way to set the value types.


You could workaround it with an update processor that copied the field and 
massaged the multiple values into what you really want the language 
detection to see. You could even implement that processor as a JavaScript 
script with the stateless script update processor.
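
Wiring that in might look like this, ahead of the language identifier in 
the chain (the script file name is hypothetical):

  <processor class="solr.StatelessScriptUpdateProcessorFactory">
    <str name="script">concat-for-langid.js</str>
  </processor>

where the script copies the multivalued field's string values into a plain 
text field that the language identifier is then pointed at.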


-- Jack Krupansky

-Original Message- 
From: Müller, Stephan

Sent: Wednesday, November 27, 2013 5:02 AM
To: solr-user@lucene.apache.org
Subject: LanguageIdentifierUpdateProcessor uses only firstValue() on 
multivalued fields


Hello,

this is a repost. This message was originally posted on the 'general' list 
but it was suggested, that the 'user' list might be a better place to ask.


 Original Message 
Hi,

we are passing a multivalued field to the LanguageIdentifierUpdateProcessor. 
This multivalued field contains arbitrary types (Integer, String, Date).


Now, the LanguageIdentifierUpdateProcessor.concatFields(SolrInputDocument 
doc, String[] fields), which btw does not use the parameter fields, is 
unable to parse all fields of the/a multivalued field. The call "Object 
content = doc.getFieldValue(fieldName);" does not care what type the field 
is and just delegates to SolrInputDocument which in turn calls 
getFirstValue.


So, two issues:
First - if the first value of the multivalued field is not of type String, 
the field is ignored completely.


Second - the concat method does not concat all values of a multivalued 
field.


While http://www.mail-archive.com/solr-user@lucene.apache.org/msg90530.html 
states: "The feature is designed to detect exactly one language per field. 
In case of multivalued, it will concatenate all values before detection." 
But as far as I can see, the code is unable to do this at all for 
multivalued fields.


This behavior was found in 4.3 but the code is still the same for current 
trunk (as of 2013-11-26)


Is this a bug? Is this a special design decision? Did we miss a certain 
configuration, that would allow the Language identification to use all 
values of a multivalued field?


We are about to write our own 
LangDetectLanguageIdentifierUpdateProcessorFactory (why is the getInstance 
hardcoded to return LanguageIdentifierUpdateProcessor?) and overwrite 
LanguageIdentifierUpdateProcessor to handle all values of a multivalued 
field, ignoring non-string values.




Please see configuration below.

I hope I was able to make myself clear. I'd like to hear your thoughts on 
this, before I go off and file a bug report.


Regards,
Stephan


A little background:
We are using a 3rd-party CMS framework which pulls in some magic SOLR 
configuration (namely the textbody field).


The textbody field is defined as follows:

  <field name="textbody" ... multiValued="true"/>


As you can see, it is also used as a search field, therefore we want to have 
the actual datatypes on the values.
The field itself is generated by a processor, prior to calling the language 
identification (see processor chain).




The processor chain:

  <updateRequestProcessorChain name="...">
    ...
    <processor class="...">
      <str name="...">name</str>
      <str name="...">name_tokenized</str>
    </processor>
    <processor class="org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessorFactory">
      <str name="langid.fl">textbody,name_tokenized</str>
      <str name="langid.langField">language</str>
      <str name="langid.fallback">en</str>
    </processor>
    <processor class="3rdpartypackage.solr.update.processor.LanguageDependentFieldsProcessorFactory">
      <str name="...">language</str>
      <str name="...">textbody,name_tokenized</str>
    </processor>
    ...
  </updateRequestProcessorChain>



Re: Client-side proxy for Solr 4.5.0

2013-11-27 Thread Reyes, Mark
What about using some JSONP techniques since the results in the Solr
instance rest as key/value pairs?


On 11/26/13, 10:53 AM, "Markus Jelsma"  wrote:

>I don't think you mean client-side proxy. You need a server side layer
>such as a normal web application or good proxy. We use Nginx, it is very
>fast and very feature rich. Its config scripting is usually enough to
>restrict access and limit input parameters. We also use Nginx's embedded
>Perl and Lua scripting besides its config scripting to implement more
>difficult logic.
>
> 
> 
>-Original message-
>> From:Reyes, Mark 
>> Sent: Tuesday 26th November 2013 19:27
>> To: solr-user@lucene.apache.org
>> Subject: Client-side proxy for Solr 4.5.0
>> 
>> Are there any GOOD client-side solutions to proxy a Solr 4.5.0 instance
>>so that the end-user can see  their queries w/o being able to directly
>>access :8983?
>> 
>> Applications/frameworks used:
>> - Solr 4.5.0
>> - AJAX Solr (javascript library)
>> 
>> Thank you,
>> Mark
>> 
>> IMPORTANT NOTICE: This e-mail message is intended to be received only
>>by persons entitled to receive the confidential information it may
>>contain. E-mail messages sent from Bridgepoint Education may contain
>>information that is confidential and may be legally privileged. Please
>>do not read, copy, forward or store this message unless you are an
>>intended recipient of it. If you received this transmission in error,
>>please notify the sender by reply e-mail and delete the message and any
>>attachments.


IMPORTANT NOTICE: This e-mail message is intended to be received only by 
persons entitled to receive the confidential information it may contain. E-mail 
messages sent from Bridgepoint Education may contain information that is 
confidential and may be legally privileged. Please do not read, copy, forward 
or store this message unless you are an intended recipient of it. If you 
received this transmission in error, please notify the sender by reply e-mail 
and delete the message and any attachments.

Re: Is it possible to have only fq in my solr query?

2013-11-27 Thread Erick Erickson
The sense of "fq" clauses is "for all the docs that
match my primary query, only show the ones that
match the fq clause". There's no primary query to
work with.

If you really need this capability, you can add this to
the <lst name="defaults"> section of your request handler in
solrconfig.xml:

  <str name="q">*:*</str>

The oob request handler is
<requestHandler name="/select" class="solr.SearchHandler">.
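
With that in place, a request carrying only an fq, e.g.

  /select?fq=inStock:true

behaves as if q=*:* had been passed explicitly (inStock is just an example 
field from the stock schema).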

Best,
Erick


On Tue, Nov 26, 2013 at 10:41 PM, RadhaJayalakshmi <
rlakshminaraya...@inautix.co.in> wrote:

> Hi,
> I am preparing a Solr query in which I am only giving the fq parameter; I
> don't give any q parameter.
> If I execute such a query, where it only has an fq, it is not returning
> any docs, in the sense that it is returning 0 docs.
> So, is it always mandatory to have a q parameter in a Solr query?
> If so, then I think I should have something like
> q=*:* and fq=field:value
>
>
> Please explain
>
> Thanks
> Radha
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Is-it-possible-to-have-only-fq-in-my-solr-query-tp4103429.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: weak documents

2013-11-27 Thread Jack Krupansky
Just bite the bullet and do the query at your application level. I mean, 
Solr/Lucene would have to do the same amount of work internally anyway. If 
the perceived performance overhead is too great, get beefier hardware.


-- Jack Krupansky

-Original Message- 
From: Thomas Scheffler

Sent: Wednesday, November 27, 2013 3:13 AM
To: SOLR User
Subject: weak documents

Hi,

I am relatively new to SOLR and I am looking for a neat way to implement
weak documents with SOLR.

Whenever a document is updated or deleted, all its dependent documents
should be removed from the index. In other words, they exist only as long as
the document they referred to at indexing time exists - in that
specific version. On "update" they will be indexed after their master
document.

I would like to have some kind of "dependsOn" field that carries the
uniqueKey value of the master document.

Can this be done efficiently with SOLR?

I need this technique because on update and on delete I don't know how
many dependent documents exist in the SOLR index. Especially for batch
index processes, I need a more efficient way than querying before every
update or delete.

kind regards,

Thomas 



Re: News url is not showing correct

2013-11-27 Thread Erick Erickson
My _guess_, and it's only a guess since you haven't shown us
anything about your Solr setup, is that all your documents
are getting indexed with the same ID so you only have one
live document.

You might review:
http://wiki.apache.org/solr/UsingMailingLists

Best,
Erick


On Wed, Nov 27, 2013 at 5:03 AM, Vishal GUPTA wrote:

> Hi
>
> I am facing a problem in solr with the tt_news URL. Every time it's showing
> all news from one detail page.
>
> For ex: I have two category of news
>
> 1.   Corporate
>
> 2.   Human
>
> So the URL for corporate should be formed like:
> domainname/pagename/corporate/detail/article/newsheading
> And for human it should be like:
> domainname/pagename/human/detail/article/newsheading
>
> But every time it is showing:
> domainname/pagename/corporate/detail/article/newsheading
>
> I have checked that it is not taking the singlePid value in the following
> code (this is what I use for tt_news):
>
> plugin.tx_solr.index.queue {
> table = tt_news
> // enables indexing of tt_news records
> tt_news = 1
> tt_news {
> fields {
> abstract = short
> author = author
> description = short
> title = title
>
> // the special SOLR_CONTENT content object cleans HTML and RTE
> fields
> content = SOLR_CONTENT
> content {
> field = bodytext
> }
>
> // the special SOLR_RELATION content object resolves relations
> category_stringM = SOLR_RELATION
> category_stringM {
> localField = category
> multiValue = 1
> }
>
> // the special SOLR_MULTIVALUE content object allows to index
> multivalue fields
> keywords = SOLR_MULTIVALUE
> keywords {
> field = keywords
> }
>
> // build the URL through typolink, make sure to use returnLast
> = url
> url = TEXT
> url {
> typolink.parameter = {$plugin.tt_news.singlePid}
> typolink.additionalParams =
> &tx_ttnews[tt_news]={field:uid}&L={field:__solr_index_language}
> typolink.additionalParams.insertData = 1
> typolink.returnLast = url
> typolink.useCacheHash = 1
> }
>
> sortAuthor_stringS = author
> sortTitle_stringS  = title
> }
> }
> }
>
> So I define singlePid in the home page TypoScript section, and this id is
> that of the corporate news page.
>
> Can anyone tell me how I can configure it for multiple categories, so that
> it shows the correct URL?
>
> Regards
> Vishal Gupta
> CUG: 830 4659
> Email: vishal.gu...@steria.co.in
>
> This email and any attachments may contain confidential information and
> intellectual property (including copyright material). It is only for the
> use of the addressee(s) in accordance with any instructions contained
> within it. If you are not the addressee, you are prohibited from copying,
> forwarding, disclosing, saving or otherwise using it in any way. If you
> receive this email in error, please immediately advise the sender and
> delete it. Steria may monitor the content of emails within its network to
> ensure compliance with its policies and procedures. Emails are susceptible
> to alteration and their integrity (including origin) cannot be assured.
> Steria shall not be liable for any modification to a message, or for
> messages falsely sent.
>


Re: syncronization between replicas

2013-11-27 Thread Erick Erickson
As Daniel says, there's no information available
in step 4 for that node to know it's out of date.

"Don't do that" isn't very helpful. I think the only
recovery strategy I can think of offhand is to
reindex from some time T prior to step <1>...

Best,
Erick


On Wed, Nov 27, 2013 at 6:07 AM, Daniel Collins wrote:

> I think when a replica becomes leader, it tries to sync *from* all the
> other replicas to see if anyone else is more up to date than it is, then it
> syncs back out *to* the replicas.  But that probably won't happen in your
> case, since when replica1 comes back (step 4) it is the only contender, so
> it can't sync then.
>
> So I know Solr has support for 2-way sync, but whether it will happen in
> step 5 (when the other replica comes back up), I'm not honestly sure...
> Would need to delve into the code to check.
>
>
> On 27 November 2013 09:19, adfel70  wrote:
>
> > I'm sorry, I forgot to write the problem.
> >
> >
> > adfel70 wrote
> > > 1. take one of the replicas of shard1 down(it doesn't matter which one)
> > > 2. continue indexing documents(that's important for this scenario)
> > > 3. take down the second replica of shard1(now the shard is down and we
> > > cannot index anymore)
> > > 4. take the replica from step 1 up(that's important that this replica
> > will
> > > go up first)
> > > 5. take the replica from step 3 up
> >
> > after the second replica is up, it has data that the first replica
> doesn't
> > have(step 2, we continued to index while the first replica was down), I
> > need
> > to know if there is a way that the second replica tell the first one that
> > it
> > has data to sync with him...
> >
> >
> >
> >
> > --
> > View this message in context:
> >
> http://lucene.472066.n3.nabble.com/syncronization-between-replicas-tp4103046p4103477.html
> > Sent from the Solr - User mailing list archive at Nabble.com.
> >
>


Term Vector Component Question

2013-11-27 Thread Jamie Johnson
I am interested in retrieving the tf for terms that matched the query, not
all terms in the document.  Is this possible?  Looking at the example when
I search for the word cable I get the response that is shown below, ideally
I'd like to see only the tf for the word cable.  Is this possible or would
I need to write a custom query component to do this?





[response snipped: the result lists docs with features
"32MB SD card, USB cable, AV cable, battery", "USB cable" and
"earbud headphones, USB cable", and the termVectors section (uniqueKey field
"id") gives a tf value of 1 or 2 for every term of the field in documents
IW-02, 9885A004, 3007WFP and MA147LL/A, not just for "cable"]


Re: Which patch for which solr version

2013-11-27 Thread Erick Erickson
"Try it and see". Not really helpful, but the best we I can do.
There's no formal method for insuring that a patch will work
with an arbitrary version. At least you're trying to apply it
to a version newer than it was created on.

Not much help, but
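
Roughly, and assuming the patch file name from the JIRA attachment, the
usual dance is:

  svn checkout http://svn.apache.org/repos/asf/lucene/dev/branches/lucene_solr_4_5
  cd lucene_solr_4_5
  patch -p0 --dry-run -i SOLR-3583.patch
  ant compile

If the dry run applies cleanly, drop --dry-run and rebuild.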

If you _do_ apply it to 4.5.1, and if you have
to make changes to the code to get it to work,
please consider uploading a new patch.

Best,
Erick


On Wed, Nov 27, 2013 at 8:10 AM, Ramo Karahasan <
ramo.karaha...@googlemail.com> wrote:

> Hi,
>
>
>
> on https://issues.apache.org/jira/browse/SOLR-3583
>
>
>
> there are some patches listed. I currently can't really figure out, for
> which solr version this patch is valid, since the issue listed there is
> still open and should be fixed for version 4.6.
>
>
>
> I'm wondering if this patch can be applied on a 4.5.1 version of solr
>
>
>
> Best,
>
> Ramo
>
>


Re: Term Vector Component Question

2013-11-27 Thread Erick Erickson
Would it serve to return the tf or ttf? You'd have to
tack on clauses like
fl=*,tf(name,drive)
or
fl=*,ttf(name,drive)

Which implies that you'd have to do some work
on the query side to add the tf or ttf clauses.

See:
http://wiki.apache.org/solr/FunctionQuery#tf
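
For example, against the stock example data, something like

  /select?q=features:cable&fl=*,score,tf(features,'cable')

should return the per-document term frequency of "cable" alongside each 
result (untested; field name from the example schema).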

Best,
Erick


On Wed, Nov 27, 2013 at 9:32 AM, Jamie Johnson  wrote:

> I am interested in retrieving the tf for terms that matched the query, not
> all terms in the document.  Is this possible?  Looking at the example when
> I search for the word cable I get the response that is shown below, ideally
> I'd like to see only the tf for the word cable.  Is this possible or would
> I need to write a custom query component to do this?
> [...]


Re: Term Vector Component Question

2013-11-27 Thread Jack Krupansky

That information would be included in the debugQuery output as well.
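
For example:

  /select?q=features:cable&debugQuery=true

The explain section then shows the tf value for each matched term in each 
hit's score breakdown.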

-- Jack Krupansky

-Original Message- 
From: Jamie Johnson 
Sent: Wednesday, November 27, 2013 9:32 AM 
To: solr-user@lucene.apache.org 
Subject: Term Vector Component Question 


I am interested in retrieving the tf for terms that matched the query, not
all terms in the document.  Is this possible?  Looking at the example when
I search for the word cable I get the response that is shown below, ideally
I'd like to see only the tf for the word cable.  Is this possible or would
I need to write a custom query component to do this?





[...]

solr as a service for multiple projects in the same environment

2013-11-27 Thread adfel70
Hi
I have various Solr-related projects in a single environment.
These projects are not related to one another.

I'm thinking of building a solr architecture so that all the projects will
use different solr collections in the same cluster, as opposed to having a
solr cluster for each project.

1. As I understand it, I can separate the configs of each collection in
ZooKeeper. Is that correct?
2. Are there any Solr operations that can be performed on collection A and
somehow affect collection B?
3. Is the Solr cache separate for each collection?
4. I assume that I'll encounter a problem with the OS cache when the
different indices compete for the same memory, right? How severe is this
issue?
5. Any other advice on building such an architecture? Does the maintenance
overhead of maintaining multiple clusters in production really outweigh the
problems and risks of using the same cluster for multiple systems?

thanks.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/solr-as-a-service-for-multiple-projects-in-the-same-environment-tp4103523.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Client-side proxy for Solr 4.5.0

2013-11-27 Thread Guido Medina
Why complicate it? I think the simplest solution to the poster's question 
is either a transparent proxy, or proxying Jetty (or Tomcat) via the Apache 
Web Server.

I don't think there will be any difference between the two, only in how 
easy one or the other is to implement.


HTH,

Guido.

On 27/11/13 14:13, Reyes, Mark wrote:

What about using some JSONP techniques since the results in the Solr
instance rest as key/value pairs?


On 11/26/13, 10:53 AM, "Markus Jelsma"  wrote:


I don't think you mean client-side proxy. You need a server side layer
such as a normal web application or good proxy. We use Nginx, it is very
fast and very feature rich. Its config scripting is usually enough to
restrict access and limit input parameters. We also use Nginx's embedded
Perl and Lua scripting besides its config scripting to implement more
difficult logic.



-Original message-

From:Reyes, Mark 
Sent: Tuesday 26th November 2013 19:27
To: solr-user@lucene.apache.org
Subject: Client-side proxy for Solr 4.5.0

Are there any GOOD client-side solutions to proxy a Solr 4.5.0 instance
so that the end-user can see  their queries w/o being able to directly
access :8983?

Applications/frameworks used:
- Solr 4.5.0
- AJAX Solr (javascript library)

Thank you,
Mark





Re: Client-side proxy for Solr 4.5.0

2013-11-27 Thread Guido Medina

Mark,

As a second thought: maybe I was just focusing on what I thought you 
needed initially, which is to allow the client to query Solr while at the 
same time restricting specific request parameters. Both Apache and any rich 
transparent proxy can do that job easily; Apache can rewrite the URL and 
map only what you want to expose, same as a transparent proxy. For me, it 
is easier to use the Apache web server.
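
A minimal Apache sketch (host, port and core name are just examples):

  # expose only the select handler, nothing else
  ProxyPass        /search http://localhost:8983/solr/collection1/select
  ProxyPassReverse /search http://localhost:8983/solr/collection1/select

Everything outside /search, including the admin UI and the update 
handlers, stays unreachable from the outside.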


HTH,

Guido.

On 27/11/13 14:46, Guido Medina wrote:
Why complicate it?, I think the simplest solution to the poster 
question is either a transparent proxy or proxy jetty (or tomcat) via 
Apache Web Server.


I don't think there will be any difference between either, only how 
easy one or the other are to implement.


HTH,

Guido.

On 27/11/13 14:13, Reyes, Mark wrote:

What about using some JSONP techniques since the results in the Solr
instance rest as key/value pairs?


On 11/26/13, 10:53 AM, "Markus Jelsma"  
wrote:



I don't think you mean client-side proxy. You need a server side layer
such as a normal web application or good proxy. We use Nginx, it is 
very

fast and very feature rich. Its config scripting is usually enough to
restrict access and limit input parameters. We also use Nginx's 
embedded

Perl and Lua scripting besides its config scripting to implement more
difficult logic.



-Original message-

From:Reyes, Mark 
Sent: Tuesday 26th November 2013 19:27
To: solr-user@lucene.apache.org
Subject: Client-side proxy for Solr 4.5.0

Are there any GOOD client-side solutions to proxy a Solr 4.5.0 
instance

so that the end-user can see  their queries w/o being able to directly
access :8983?

Applications/frameworks used:
- Solr 4.5.0
- AJAX Solr (javascript library)

Thank you,
Mark







SolR vs large PDF

2013-11-27 Thread Marcello Lorenzi

Hi All,
on our test environment we have implemented a new search engine based on 
Solr 4.3 with 2 instances hosted on different servers and 1 shard 
present on each servlet container.


During some stress tests we noticed a bottleneck in the crawling of large 
PDF files that blocks the serving of query results from the collections.


Is it possible to limit or mitigate the overhead created by PDFBox 
during the crawling?


Thanks,
Marcello


RE: LanguageIdentifierUpdateProcessor uses only firstValue() on multivalued fields

2013-11-27 Thread Müller , Stephan
> I suspect that it is an oversight for a use case that was not considered.
> I mean, it should probably either ignore or convert non text/string
> values.
OK, I'll see that I provide a patch against trunk. It actually 
ignores non-string values, but is unable to check the remaining values
of a multivalued field.

> Hmmm... are you using JSON input? I mean, how are the types being set?
> Solr XML doesn't have a way to set the value types.
> 
No. It's a field with multiValued=true. That results in a SolrInputField 
whose value (which is defined to be Object) actually holds a List.
This list is populated with Integer, String, Date, you name it; 
I'm talking about the actual Java datatypes. The values in the list are
probably set by this third-party text-body processor.

Now the language processor just asks for field.getValue().
This is delegated to the SolrInputField, which in turn calls firstValue().
Interestingly enough, SolrInputField is already able to handle a Collection 
as its value, but firstValue() just returns the first element of a collection.

> You could workaround it with an update processor that copied the field and
> massaged the multiple values into what you really want the language
> detection to see. You could even implement that processor as a JavaScript
> script with the stateless script update processor.
>
Our workaround would be to not feed the multivalued field but only the 
String fields (which are also included in the multivalued field).
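
A processor along the lines Jack suggested might look roughly like this 
(a sketch only; the class name, the textbody_all target field, and the 
choice to skip non-String values are my assumptions):

import java.io.IOException;
import java.util.Collection;

import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.update.AddUpdateCommand;
import org.apache.solr.update.processor.UpdateRequestProcessor;

public class ConcatTextFieldProcessor extends UpdateRequestProcessor {

  public ConcatTextFieldProcessor(UpdateRequestProcessor next) {
    super(next);
  }

  @Override
  public void processAdd(AddUpdateCommand cmd) throws IOException {
    SolrInputDocument doc = cmd.getSolrInputDocument();
    Collection<Object> values = doc.getFieldValues("textbody");
    if (values != null) {
      StringBuilder sb = new StringBuilder();
      for (Object v : values) {
        if (v instanceof String) {        // skip Integer, Date, ...
          sb.append((String) v).append(' ');
        }
      }
      // language detection then runs on this single-valued copy
      doc.setField("textbody_all", sb.toString().trim());
    }
    super.processAdd(cmd);
  }
}

It would be wired in via a small UpdateRequestProcessorFactory before the 
language step, with langid.fl pointing at textbody_all instead of textbody.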


Filing a bug/feature request and providing the patch will take some time,
as I haven't set up a fully working trunk in my IDEA installation.
But I'm eager to do it :)

Regards,
Stephan

 
> -- Jack Krupansky
> 
> -Original Message-
> From: Müller, Stephan
> Sent: Wednesday, November 27, 2013 5:02 AM
> To: solr-user@lucene.apache.org
> Subject: LanguageIdentifierUpdateProcessor uses only firstValue() on
> multivalued fields
> 
> Hello,
> 
> this is a repost. This message was originally posted on the 'general' list
> but it was suggested, that the 'user' list might be a better place to ask.
> 
>  Original Message 
> Hi,
> 
> we are passing a multivalued field to the
> LanguageIdentifierUpdateProcessor.
> This multivalued field contains arbitrary types (Integer, String, Date).
> 
> Now, the LanguageIdentifierUpdateProcessor.concatFields(SolrInputDocument
> doc, String[] fields), which btw does not use the parameter fields, is
> unable to parse all fields of the/a multivalued field. The call "Object
> content = doc.getFieldValue(fieldName);" does not care what type the field
> is and just delegates to SolrInputDocument which in turn calls
> getFirstValue.
> 
> So, two issues:
> First - if the first value of the multivalued field is not of type String,
> the field is ignored completely.
> 
> Second - the concat method does not concat all values of a multivalued
> field.
> 
> While http://www.mail-archive.com/solr-
> u...@lucene.apache.org/msg90530.html
> states: "The feature is designed to detect exactly one language per field.
> In case of multivalued, it will concatenate all values before detection."
> But as far as I can see, the code is unable to do this at all for
> multivalued fields.
> 
> This behavior was found in 4.3 but the code is still the same for current
> trunk (as of 2013-11-26)
> 
> Is this a bug? Is this a special design decision? Did we miss a certain
> configuration, that would allow the Language identification to use all
> values of a multivalued field?
> 
> We are about to write our own
> LangDetectLanguageIdentifierUpdateProcessorFactory (why is the getInstance
> hardcoded to return LanguageIdentifierUpdateProcessor?) and overwrite
> LanguageIdentifierUpdateProcessor to handle all values of a multivalued
> field, ignoring non-string values.
> 
> 
> 
> Please see configuration below.
> 
> I hope I was able to make myself clear. I'd like to hear your thoughts on
> this, before I go off and file a bug report.
> 
> Regards,
> Stephan
> 
> 
> A little background:
> We are using a 3rd-party CMS framework which pulls in some magic SOLR
> configuration (namely the textbody field).
> 
> The textbody field is defined as follows:
> 
>  multiValued="true"/>
> 
> As you can see, it is also used as search field, therefor we want to have
> the actual datatypes on the values.
> The field itself is generated by a processor, prior to calling the
> language identification (see processor chain).
> 
> 
> 
> The processor chain (the XML markup was mostly stripped by the mail
> archive; what survived shows a few leading processors, one of them
> mapping name to name_tokenized, then
> org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessorFactory
> configured with the fields textbody,name_tokenized, the language field
> "language" and the fallback "en", followed by
> 3rdpartypackage.solr.update.processor.LanguageDependentFieldsProcessorFactory
> with "language" and "textbody,name_tokenized").
>



Re: weak documents

2013-11-27 Thread Walter Underwood
Right. Delete by query "id:foo OR dependsOn:foo".  --wunder
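
In update-XML form that would be something like this (the "foo" uniqueKey 
values are placeholders):

<delete><query>id:foo OR dependsOn:foo</query></delete>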

On Nov 27, 2013, at 6:23 AM, "Jack Krupansky"  wrote:

> Just bite the bullet and do the query at your application level. I mean, 
> Solr/Lucene would have to do the same amount of work internally anyway. If 
> the perceived performance overhead is too great, get beefier hardware.
> 
> -- Jack Krupansky
> 
> -Original Message- From: Thomas Scheffler
> Sent: Wednesday, November 27, 2013 3:13 AM
> To: SOLR User
> Subject: weak documents
> 
> Hi,
> 
> I am relatively new to SOLR and I am looking for a neat way to implement
> weak documents with SOLR.
> 
> Whenever a document is updated or deleted all it's dependent documents
> should be removed from the index. In other words they exist as long as
> the document exist they refer to when they were indexed - in that
> specific version. On "update" they will be indexed after their master
> document.
> 
> I could like to have some kind of "dependsOn" field that carries the
> uniqueKey value of the master document.
> 
> Can this be done efficiently with SOLR?
> 
> I need this technique because on update and on delete I don't know how
> many dependent documents exists in the SOLR index. Especially for batch
> index processes, I need a more efficient way than query before every
> update or delete.
> 
> kind regards,
> 
> Thomas 

--
Walter Underwood
wun...@wunderwood.org





Re: Multivalued true Error?

2013-11-27 Thread Furkan KAMACI
Hi Sujit;

Your example has that line:

 override def decodeNormValue(b: Byte) = 1.0F


However it is a final class. Do you have any idea to handle it?



2013/11/27 Furkan KAMACI 

> Thanks Sujit, I got the problem and fixed it.
>
>
> 2013/11/26 Sujit Pal 
>
>> Hi Furkan,
>>
>> In the stock definition of the payload field:
>>
>> http://svn.apache.org/viewvc/lucene/dev/trunk/solr/example/solr/collection1/conf/schema.xml?view=markup
>>
>> the analyzer for payloads field type is a WhitespaceTokenizerFactory
>> followed by a DelimitedPayloadTokenFilterFactory. So if you send it a
>> string "foo$score1 bar$score2 ..." where foo and bar are string tokens and
>> score[12] are payload scores and "$" is your delimiter, the analyzer will
>> tokenize it into multiple payloads and you should be able to run the tests
>> in the blog post. So you shouldn't make it multiValued AFAIK.
>>
>> -sujit
>>
>>
>>
>> On Tue, Nov 26, 2013 at 8:44 AM, Furkan KAMACI > >wrote:
>>
>> > Hi;
>> >
>> > I've ported this example from Scala into Java:
>> > http://sujitpal.blogspot.com/2013/07/porting-payloads-to-solr4.html#!
>> >
>> > However does field should be multivalued true at that example?
>> >
>> > PS: I use Solr 4.5.1
>> >
>> > Thanks;
>> > Furkan KAMACI
>> >
>>
>
>


Re: Multivalued true Error?

2013-11-27 Thread Furkan KAMACI
"it is a final *method*". Can not be overrided at Solr 4.5.1?


2013/11/27 Furkan KAMACI 

> Hi Sujit;
>
> Your example has that line:
>
>  override def decodeNormValue(b: Byte) = 1.0F
>
>
> However it is a final class. Do you have any idea to handle it?
>
>
>
> 2013/11/27 Furkan KAMACI 
>
>> Thanks Sujit, I got the problem and fixed it.
>>
>>
>> 2013/11/26 Sujit Pal 
>>
>>> Hi Furkan,
>>>
>>> In the stock definition of the payload field:
>>>
>>> http://svn.apache.org/viewvc/lucene/dev/trunk/solr/example/solr/collection1/conf/schema.xml?view=markup
>>>
>>> the analyzer for payloads field type is a WhitespaceTokenizerFactory
>>> followed by a DelimitedPayloadTokenFilterFactory. So if you send it a
>>> string "foo$score1 bar$score2 ..." where foo and bar are string tokens
>>> and
>>> score[12] are payload scores and "$" is your delimiter, the analyzer will
>>> tokenize it into multiple payloads and you should be able to run the
>>> tests
>>> in the blog post. So you shouldn't make it multiValued AFAIK.
>>>
>>> -sujit
>>>
>>>
>>>
>>> On Tue, Nov 26, 2013 at 8:44 AM, Furkan KAMACI >> >wrote:
>>>
>>> > Hi;
>>> >
>>> > I've ported this example from Scala into Java:
>>> > http://sujitpal.blogspot.com/2013/07/porting-payloads-to-solr4.html#!
>>> >
>>> > However does field should be multivalued true at that example?
>>> >
>>> > PS: I use Solr 4.5.1
>>> >
>>> > Thanks;
>>> > Furkan KAMACI
>>> >
>>>
>>
>>
>


Error when creating collection in Solr 4.6

2013-11-27 Thread lansing
Hi,
I am using Solr 4.6 with external ZooKeeper 3.4.5:
5 nodes, 5 shards, 3 replicas.
I uploaded the collection configuration to ZooKeeper.
I am using the new core discovery mode.

I get this issue when I try to create a collection with this call:

http://10.0.5.227:8101/solr/admin/collections?action=CREATE&name=Current1&numShards=5&replicationFactor=3&maxShardsPerNode=3&collection.configName=Current1

I get this response:
status=0, QTime=576, followed by one
org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException per core:

Error CREATEing SolrCore 'Current1_shard3_replica1': 10.0.5.227:8101_solr_Current1_shard3_replica1 is removed
Error CREATEing SolrCore 'Current1_shard1_replica2': 10.0.5.227:8101_solr_Current1_shard1_replica2 is removed
Error CREATEing SolrCore 'Current1_shard4_replica3': 10.0.5.227:8101_solr_Current1_shard4_replica3 is removed
Error CREATEing SolrCore 'Current1_shard5_replica1': 10.0.5.229:8101_solr_Current1_shard5_replica1 is removed
Error CREATEing SolrCore 'Current1_shard1_replica3': 10.0.5.229:8101_solr_Current1_shard1_replica3 is removed
Error CREATEing SolrCore 'Current1_shard3_replica2': 10.0.5.229:8101_solr_Current1_shard3_replica2 is removed
Error CREATEing SolrCore 'Current1_shard5_replica3': 10.0.5.230:8101_solr_Current1_shard5_replica3 is removed
Error CREATEing SolrCore 'Current1_shard4_replica1': 10.0.5.230:8101_solr_Current1_shard4_replica1 is removed
Error CREATEing SolrCore 'Current1_shard2_replica2': 10.0.5.230:8101_solr_Current1_shard2_replica2 is removed
Error CREATEing SolrCore 'Current1_shard5_replica2': 10.0.5.228:8101_solr_Current1_shard5_replica2 is removed
Error CREATEing SolrCore 'Current1_shard2_replica1': 10.0.5.228:8101_solr_Current1_shard2_replica1 is removed
Error CREATEing SolrCore 'Current1_shard3_replica3': 10.0.5.228:8101_solr_Current1_shard3_replica3 is removed
Error CREATEing SolrCore 'Current1_shard1_replica1': 10.0.5.231:8101_solr_Current1_shard1_replica1 is removed
Error CREATEing SolrCore 'Current1_shard2_replica3': 10.0.5.231:8101_solr_Current1_shard2_replica3 is removed
Error CREATEing SolrCore 'Current1_shard4_replica2': 10.0.5.231:8101_solr_Current1_shard4_replica2 is removed


The clusterstate.json in ZooKeeper:
{"Current1":{
"shards":{
  "shard1":{
"range":"8000-b332",
"state":"active",
"replicas":{}},
  "shard2":{
"range":"b333-e665",
"state":"active",
"replicas":{}},
  "shard3":{
"range":"e666-1998",
"state":"active",
"replicas":{}},
  "shard4":{
"range":"1999-4ccb",
"state":"active",
"replicas":{}},
  "shard5":{
"range":"4ccc-7fff",
"state":"active",
"replicas":{}}},
"maxShardsPerNode":"3",
"router":{"name":"compositeId"},
"replicationFactor":"3"}}

Note: This setup worked fine under Solr 4.5.1.




--
View this message in context: 
http://lucene.472066.n3.nabble.com/Error-when-creating-collection-in-Solr-4-6-tp4103536.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: solr as a service for multiple projects in the same environment

2013-11-27 Thread michael.boom
Hi,

There's nothing unusual in what you are trying to do; this scenario is very
common.

To answer your questions:
> 1. as I understand I can separate the configs of each collection in
> zookeeper. is it correct? 
Yes, that's correct. You'll have to upload your configs to ZK and use the
CollectionAPI to create your collections.
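For example (hosts, paths, and names are illustrative):

./cloud-scripts/zkcli.sh -zkhost zk1:2181 -cmd upconfig -confdir /path/to/confA -confname configA
curl 'http://solr1:8983/solr/admin/collections?action=CREATE&name=collectionA&numShards=2&replicationFactor=2&collection.configName=configA'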

>2.are there any solr operations that can be performed on collection A and
somehow affect collection B? 
No, I can't think of any cross-collection operation. Here you can find a
list of collection related operations:
https://cwiki.apache.org/confluence/display/solr/Collections+API

>3. is the solr cache separated for each collection? 
Yes, separate and configurable in solrconfig.xml for each collection.
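For example, inside the <query> section of each collection's solrconfig.xml
(the sizes here are illustrative, not recommendations):

<filterCache class="solr.FastLRUCache" size="512" initialSize="512" autowarmCount="128"/>
<queryResultCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="64"/>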

>4. I assume that I'll encounter a problem with the os cache, when the
different indices will compete on the same memory, right? how severe is this
issue? 
Hardware can be a bottleneck. If all your collections will face the same load,
you should try to give Solr a RAM amount equal to the index size (all
indexes combined).

>5. any other advice on building such an architecture? does the maintenance
overhead of maintaining multiple clusters in production really overwhelm the
problems and risks of using the same cluster for multiple systems? 
I was in the same situation as you, and putting everything in multiple
collections in just one cluster made sense for me: it's easier to manage
and has no obvious downside. As for the "risks of using the same cluster for
multiple systems", they are pretty much the same in both scenarios, only
that with multiple clusters you'll have many more machines to manage.



-
Thanks,
Michael
--
View this message in context: 
http://lucene.472066.n3.nabble.com/solr-as-a-service-for-multiple-projects-in-the-same-environment-tp4103523p4103537.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Error when creating collection in Solr 4.6

2013-11-27 Thread Yago Riveiro
Lansing, 

I ran the command without any issue

http://localhost:8983/solr/admin/collections?action=CREATE&name=Current1&numShards=5&replicationFactor=3&maxShardsPerNode=15&collection.configName=default

The only difference was that I have only one box and used the default config 
from the example folder. 

-- 
Yago Riveiro
Sent with Sparrow (http://www.sparrowmailapp.com/?sig)


On Wednesday, November 27, 2013 at 3:43 PM, lansing wrote:

> http://10.0.5.227:8101/solr/admin/collections?action=CREATE&name=Current1&numShards=5&replicationFactor=3&maxShardsPerNode=3&collection.configName=Current1
> 



What is the right way to list top terms for a given field?

2013-11-27 Thread Dave Seltzer
Hello,

I'm trying to get a list of top terms for a field called "Tags".

One way to do this would be to query all data *:* and then facet by the
Tags column:
/solr/collection/admin/select?q=*:*&rows=0&facet=true&facet.field=Tags

I've noticed another way to do this is using the luke interface like this:
/solr/collection/admin/luke?fl=Tags&numTerms=20

One problem I see with the luke interface is that it's inside the /admin/
path, which to me means that my users shouldn't be able to access it.

What's the most SOLR-y way to do this?

Thanks!

-D


Solr 4.3.1 :: Error loading class 'solr.ICUFoldingFilterFactory'

2013-11-27 Thread Raheel Hasan
Hi,

I've got a new issue now. I have Solr 4.3.0 running just fine. However, on Solr
4.3.1 it won't load. I get this error:


{msg=SolrCore 'mycore' is not available due to init failure: Plugin
init failure for [schema.xml] fieldType "text_ws": Plugin init failure
for [schema.xml] analyzer/filter: Error loading class
'solr.ICUFoldingFilterFactory',trace=org.apache.solr.common.SolrException:
SolrCore 'mycore' is not available due to init failure: Plugin init
failure for [schema.xml] fieldType "text_ws": Plugin init failure for
[schema.xml] analyzer/filter: Error loading class
'solr.ICUFoldingFilterFactory'


Here is solr.xml (the tag markup was stripped by the mail archive, so the
cores configuration did not come through).

-- 
Regards,
Raheel Hasan


Re: Term Vector Component Question

2013-11-27 Thread Jamie Johnson
Jack,

I'm not following; are you suggesting to turn on debug and then parse the
explain?  Seems very roundabout if that is the case, no?


On Wed, Nov 27, 2013 at 9:40 AM, Jack Krupansky wrote:

> That information would be included in the debugQuery output as well.
>
> -- Jack Krupansky
>
> -Original Message- From: Jamie Johnson Sent: Wednesday, November
> 27, 2013 9:32 AM To: solr-user@lucene.apache.org Subject: Term Vector
> Component Question
> I am interested in retrieving the tf for terms that matched the query, not
> all terms in the document.  Is this possible?  Looking at the example when
> I search for the word cable I get the response that is shown below, ideally
> I'd like to see only the tf for the word cable.  Is this possible or would
> I need to write a custom query component to do this?
>
> [termVectors response snipped; the XML markup was stripped by the mail
> archive. What survived: status=0, QTime=2; "features" values for the
> matching docs ("32MB SD card, USB cable, AV cable, battery", "USB cable",
> "earbud headphones, USB cable"); uniqueKeyFieldName=id; and per-document
> term statistics (tf values, mostly 1 with one 2) for IW-02, 9885A004,
> 3007WFP and MA147LL/A, i.e. tf for every term in the field, not just
> "cable".]


Re: What is the right way to list top terms for a given field?

2013-11-27 Thread Alexandre Rafalovitch
You can always expose the admin handler on a non-admin URL. That's all just
definitions in solrconfig.xml.
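For example, something like this in solrconfig.xml should work (the handler
name is up to you; this just re-registers the stock Luke handler):

<requestHandler name="/topterms" class="solr.admin.LukeRequestHandler" />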

Personal website: http://www.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all at
once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD book)


On Wed, Nov 27, 2013 at 11:22 PM, Dave Seltzer  wrote:

> Hello,
>
> I'm trying to get a list of top terms for a field called "Tags".
>
> One way to do this would be to query all data *:* and then facet by the
> Tags column:
> /solr/collection/admin/select?q=*:*&rows=0&facet=true&facet.field=Tags
>
> I've noticed another way to do this is using the luke interface like this:
> /solr/collection/admin/luke?fl=Tags&numTerms=20
>
> One problem I see with the luke interface is that its inside the /admin/
> path, which to me means that my users shouldn't be able to access it.
>
> Whats the most SOLRy way to do this?
>
> Thanks!
>
> -D
>


Re: What is the right way to list top terms for a given field?

2013-11-27 Thread Stefan Matheis
Since your users shouldn't be allowed at any time to access Solr directly, it's 
up to you to implement that on the client side anyway?

I can't tell if there is a technical difference between the two calls you 
named, but I'd guess that the second might be a more direct way to access this 
information (and probably a bit faster?).

-Stefan 


On Wednesday, November 27, 2013 at 5:22 PM, Dave Seltzer wrote:

> Hello,
> 
> I'm trying to get a list of top terms for a field called "Tags".
> 
> One way to do this would be to query all data *:* and then facet by the
> Tags column:
> /solr/collection/admin/select?q=*:*&rows=0&facet=true&facet.field=Tags
> 
> I've noticed another way to do this is using the luke interface like this:
> /solr/collection/admin/luke?fl=Tags&numTerms=20
> 
> One problem I see with the luke interface is that its inside the /admin/
> path, which to me means that my users shouldn't be able to access it.
> 
> Whats the most SOLRy way to do this?
> 
> Thanks!
> 
> -D 



Re: Term Vector Component Question

2013-11-27 Thread Jamie Johnson
I definitely want tf, the number of times the matched term appears in the
document; the key is that I want only the term that was searched for, not
all terms.

Looking at the tf function, this is close, except it needs the exact
term; I really need it to work on the user-entered text.  So for instance if the
user said q=tests, I'd like the tf to be for any terms that tests got
analyzed to.  So if I had a stemming analyzer I'd expect the user search
for tests to match test, and I'd like to get the number of times test
appeared in the document of interest.  Does that make sense?


On Wed, Nov 27, 2013 at 9:40 AM, Erick Erickson wrote:

> Would it serve to return the tf or ttf? You'd have to
> tack on clauses like
> fl=*,ttf(name,drive)
> or
> fl=*.ttf(name,drive)
>
> Which implies that you'd have to do some work
> on the query side to add the tf or ttf clauses.
>
> See:
> http://wiki.apache.org/solr/FunctionQuery#tf
>
> Best,
> Erick
>
>
> On Wed, Nov 27, 2013 at 9:32 AM, Jamie Johnson  wrote:
>
> > I am interested in retrieving the tf for terms that matched the query,
> not
> > all terms in the document.  Is this possible?  Looking at the example
> when
> > I search for the word cable I get the response that is shown below,
> ideally
> > I'd like to see only the tf for the word cable.  Is this possible or
> would
> > I need to write a custom query component to do this?
> >
> > [termVectors response snipped; same stripped response as quoted earlier
> > in the thread]
>


Re: Term Vector Component Question

2013-11-27 Thread Jack Krupansky
There is an XML version of explain as well, if parsing the structured text 
is too difficult for your application. The point is that debug "explain" 
details precisely the term vector values for actual query terms.


Don't let the "debug" moniker throw you - this parameter is simply giving 
you access to detail information that you might find of value in your 
application.


As Erick explained, the function query approach ("tf(query-term)") also 
works, kind of, sort of, at least where all query terms must be matched, but 
when the "OR" operator is used, it won't tell you which term matched - 
although a tf value of 0 basically tells you that.


-- Jack Krupansky

-Original Message- 
From: Jamie Johnson

Sent: Wednesday, November 27, 2013 11:38 AM
To: solr-user@lucene.apache.org
Subject: Re: Term Vector Component Question

Jack,

I'm not following, are you suggesting to turn on debug and then parse the
explain?  Seems very round about if that is the case, no?


On Wed, Nov 27, 2013 at 9:40 AM, Jack Krupansky 
wrote:



That information would be included in the debugQuery output as well.

-- Jack Krupansky

-Original Message- From: Jamie Johnson Sent: Wednesday, November
27, 2013 9:32 AM To: solr-user@lucene.apache.org Subject: Term Vector
Component Question
I am interested in retrieving the tf for terms that matched the query, not
all terms in the document.  Is this possible?  Looking at the example when
I search for the word cable I get the response that is shown below, 
ideally

I'd like to see only the tf for the word cable.  Is this possible or would
I need to write a custom query component to do this?





[termVectors response snipped; same stripped response as shown earlier in
the thread]

Re: What is the right way to list top terms for a given field?

2013-11-27 Thread Dave Seltzer
It certainly seems to be faster (in my limited testing).

I just don't want to base my software on the Luke scripts if they're
prone to changing in the future.

And yes, I realize there are ways to make this secure. I just wanted
to know if it's something I should avoid doing (perhaps for reasons
beyond my comprehension.)

Thanks!

-D

> On Nov 27, 2013, at 11:46 AM, Stefan Matheis  wrote:
>
> Since your users shouldn't be allowed at any time to access Solr directly, 
> it's up to you to implement that on the client side anyway?
>
> I can't tell if there is a technical difference between the two calls you 
> named, but i'd guess that the second might be a more direct way to access 
> this information (and probably a bit faster?).
>
> -Stefan
>
>
>> On Wednesday, November 27, 2013 at 5:22 PM, Dave Seltzer wrote:
>>
>> Hello,
>>
>> I'm trying to get a list of top terms for a field called "Tags".
>>
>> One way to do this would be to query all data *:* and then facet by the
>> Tags column:
>> /solr/collection/admin/select?q=*:*&rows=0&facet=true&facet.field=Tags
>>
>> I've noticed another way to do this is using the luke interface like this:
>> /solr/collection/admin/luke?fl=Tags&numTerms=20
>>
>> One problem I see with the luke interface is that its inside the /admin/
>> path, which to me means that my users shouldn't be able to access it.
>>
>> Whats the most SOLRy way to do this?
>>
>> Thanks!
>>
>> -D
>


Re: SolR vs large PDF

2013-11-27 Thread Erick Erickson
I'm assuming you're using the ExtractingRequestHandler. Offloading
all of that work onto a Solr box that is also serving queries
and indexing is not going to scale well.

Consider using Tika/SolrJ (Tika is what the ERH uses anyway) to
offload the PDF parsing amongst as many clients as you can afford.
Here's a way to get started:

http://searchhub.org/2012/02/14/indexing-with-solrj/
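
For instance, a minimal client-side sketch (the URL, file path, and field
names are assumptions, and error handling is omitted):

import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

public class PdfIndexer {
  public static void main(String[] args) throws Exception {
    SolrServer server = new HttpSolrServer("http://solrhost:8983/solr/collection1");
    File pdf = new File("/data/docs/big.pdf");

    // Parse the PDF on the client, not on the Solr box
    AutoDetectParser parser = new AutoDetectParser();
    BodyContentHandler handler = new BodyContentHandler(-1); // no write limit
    Metadata metadata = new Metadata();
    try (InputStream in = new FileInputStream(pdf)) {
      parser.parse(in, handler, metadata, new ParseContext());
    }

    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", pdf.getAbsolutePath());
    doc.addField("title", metadata.get("title"));
    doc.addField("text", handler.toString());
    server.add(doc);
    server.commit();
  }
}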

Best,
Erick


On Wed, Nov 27, 2013 at 10:00 AM, Marcello Lorenzi wrote:

> Hi All,
> on our test environment we have implemented a new search engine based on
> Solr 4.3 with 2 instances hosted on different servers and 1 shard present
> on each servlet container.
>
> During some stress test we noticed a bottleneck into crawling of large PDF
> file that blocks the serving of results from queries to the collections.
>
> Is it possible to boost or mitigate the overhead created by PDFBOX during
> the crawling?
>
> Thanks,
> Marcello
>


SolrCloud and 2MB Synonym file

2013-11-27 Thread Puneet Pawaia
Hi

I am trying to set up a test SolrCloud 4.5.1 implementation. My synonym file
is about 1.6 MB. When I try to add the collection config to ZooKeeper 3.4.5 on
Ubuntu 12.04, it fails because of ZooKeeper's 1MB limit. Has anyone any
experience with using such synonym files? Can I store them in some location
other than the config folder, since the config folder is loaded
into ZooKeeper?

TIA

Puneet Pawaia


Re: SolrCloud and 2MB Synonym file

2013-11-27 Thread Yago Riveiro
You can use the jute.maxbuffer > 1M as a workaround. 

You must set -Djute.maxbuffer in zookeeper and solr to work properly
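
For example, for a 2MB synonym file you could allow ~3MB (the value is in
bytes; 3145728 is just an illustration, and the same value must be set on
every ZooKeeper and Solr node):

# ZooKeeper side (picked up by zkServer.sh via zkEnv.sh)
export JVMFLAGS="-Djute.maxbuffer=3145728"

# Solr side
java -Djute.maxbuffer=3145728 -jar start.jar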

-- 
Yago Riveiro
Sent with Sparrow (http://www.sparrowmailapp.com/?sig)


On Wednesday, November 27, 2013 at 5:15 PM, Puneet Pawaia wrote:

> Hi
> 
> I am trying to setup a test SolrCloud 4.5.1 implementation. My synonym file
> is about 1.6 MB. When I try to add collection to ZooKeeper 3.4.5 on Ubuntu
> 12.4, it fails because of the 1MB limit of ZooKeeper. Has anyone any
> experience with using such synonym files? Can I store them in some other
> location other than the Config Folder since the config folder is loaded
> into ZooKeeper.
> 
> TIA
> 
> Puneet Pawaia 



Re: Term Vector Component Question

2013-11-27 Thread Jamie Johnson
Thanks, I'm looking at this now; debug seems pretty close to what I want.
Is there a way to exclude information from the debug response? For
instance, I don't need idf, fieldnorm, timing information, etc.  Again,
thanks.


On Wed, Nov 27, 2013 at 11:49 AM, Jack Krupansky wrote:

> There is an XML version of explain as well, if parsing the structured text
> is too difficult for your application. The point is that debug "explain"
> details precisely the term vector values for actual query terms.
>
> Don't let the "debug" moniker throw you - this parameter is simply giving
> you access to detail information that you might find of value in your
> application.
>
> As Erick explained, the function query approach ("tf(query-term)") also
> works, kind of, sort of, at least where all query terms must be matched,
> but when the "OR" operator is used, it won't tell you which term matched -
> although a tf value of 0 basically tells you that.
>
>
> -- Jack Krupansky
>
> -Original Message- From: Jamie Johnson
> Sent: Wednesday, November 27, 2013 11:38 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Term Vector Component Question
>
>
> Jack,
>
> I'm not following, are you suggesting to turn on debug and then parse the
> explain?  Seems very round about if that is the case, no?
>
>
> On Wed, Nov 27, 2013 at 9:40 AM, Jack Krupansky 
> wrote:
>
>  That information would be included in the debugQuery output as well.
>>
>> -- Jack Krupansky
>>
>> -Original Message- From: Jamie Johnson Sent: Wednesday, November
>> 27, 2013 9:32 AM To: solr-user@lucene.apache.org Subject: Term Vector
>> Component Question
>> I am interested in retrieving the tf for terms that matched the query, not
>> all terms in the document.  Is this possible?  Looking at the example when
>> I search for the word cable I get the response that is shown below,
>> ideally
>> I'd like to see only the tf for the word cable.  Is this possible or would
>> I need to write a custom query component to do this?
>>
>> [termVectors response snipped; same stripped response as quoted earlier
>> in the thread]
>


Re: Term Vector Component Question

2013-11-27 Thread Jamie Johnson
A little more reading gave me it.  I can just do debug=results, but that
still includes idf and fieldnorm.  Much less though, so it's a step ;)  If
there is any way to get just tf that would be great; otherwise no big deal.
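
For reference, that's just the standard debug parameter on a normal request,
e.g. (collection name assumed):

/solr/collection1/select?q=cable&debug=results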


On Wed, Nov 27, 2013 at 12:18 PM, Jamie Johnson  wrote:

> thanks I'm looking at this now, debug seems pretty close to what I want.
>  Is there a way to exclude information from the debug response, for
> instance I don't need idf, fieldnorm, timing information, etc.  Again
> thanks.
>
>
> On Wed, Nov 27, 2013 at 11:49 AM, Jack Krupansky 
> wrote:
>
>> There is an XML version of explain as well, if parsing the structured
>> text is too difficult for your application. The point is that debug
>> "explain" details precisely the term vector values for actual query terms.
>>
>> Don't let the "debug" moniker throw you - this parameter is simply giving
>> you access to detail information that you might find of value in your
>> application.
>>
>> As Erick explained, the function query approach ("tf(query-term)") also
>> works, kind of, sort of, at least where all query terms must be matched,
>> but when the "OR" operator is used, it won't tell you which term matched -
>> although a tf value of 0 basically tells you that.
>>
>>
>> -- Jack Krupansky
>>
>> -Original Message- From: Jamie Johnson
>> Sent: Wednesday, November 27, 2013 11:38 AM
>> To: solr-user@lucene.apache.org
>> Subject: Re: Term Vector Component Question
>>
>>
>> Jack,
>>
>> I'm not following, are you suggesting to turn on debug and then parse the
>> explain?  Seems very round about if that is the case, no?
>>
>>
>> On Wed, Nov 27, 2013 at 9:40 AM, Jack Krupansky 
>> wrote:
>>
>>  That information would be included in the debugQuery output as well.
>>>
>>> -- Jack Krupansky
>>>
>>> -Original Message- From: Jamie Johnson Sent: Wednesday, November
>>> 27, 2013 9:32 AM To: solr-user@lucene.apache.org Subject: Term Vector
>>> Component Question
>>> I am interested in retrieving the tf for terms that matched the query,
>>> not
>>> all terms in the document.  Is this possible?  Looking at the example
>>> when
>>> I search for the word cable I get the response that is shown below,
>>> ideally
>>> I'd like to see only the tf for the word cable.  Is this possible or
>>> would
>>> I need to write a custom query component to do this?
>>>
>>> [termVectors response snipped; same stripped response as quoted earlier
>>> in the thread]
>>
>


Re: SolrCloud and 2MB Synonym file

2013-11-27 Thread Puneet Pawaia
Yago, not sure if this is a good idea. Docs say this is dangerous stuff.

Anyway, not being a Linux or Java expert, I would appreciate it if you could
point me to an implementation of this.

Regards
Puneet Pawaia
On 27 Nov 2013 22:54, "Yago Riveiro"  wrote:

> You can use the jute.maxbuffer > 1M as a workaround.
>
> You must set -Djute.maxbuffer in zookeeper and solr to work properly
>
> --
> Yago Riveiro
> Sent with Sparrow (http://www.sparrowmailapp.com/?sig)
>
>
> On Wednesday, November 27, 2013 at 5:15 PM, Puneet Pawaia wrote:
>
> > Hi
> >
> > I am trying to setup a test SolrCloud 4.5.1 implementation. My synonym
> file
> > is about 1.6 MB. When I try to add collection to ZooKeeper 3.4.5 on
> Ubuntu
> > 12.4, it fails because of the 1MB limit of ZooKeeper. Has anyone any
> > experience with using such synonym files? Can I store them in some other
> > location other than the Config Folder since the config folder is loaded
> > into ZooKeeper.
> >
> > TIA
> >
> > Puneet Pawaia
>
>


Re: Term Vector Component Question

2013-11-27 Thread Jack Krupansky
To be honest, this kind of question comes up so often that it is probably 
worth a Jira to have a more customized or parameterized "explain".


Function queries in the "fl" list give you a lot more control, but not at 
the level of actual terms that matched.


-- Jack Krupansky

-Original Message- 
From: Jamie Johnson

Sent: Wednesday, November 27, 2013 12:18 PM
To: solr-user@lucene.apache.org
Subject: Re: Term Vector Component Question

thanks I'm looking at this now, debug seems pretty close to what I want.
Is there a way to exclude information from the debug response, for
instance I don't need idf, fieldnorm, timing information, etc.  Again
thanks.


On Wed, Nov 27, 2013 at 11:49 AM, Jack Krupansky 
wrote:



There is an XML version of explain as well, if parsing the structured text
is too difficult for your application. The point is that debug "explain"
details precisely the term vector values for actual query terms.

Don't let the "debug" moniker throw you - this parameter is simply giving
you access to detail information that you might find of value in your
application.

As Erick explained, the function query approach ("tf(query-term)") also
works, kind of, sort of, at least where all query terms must be matched,
but when the "OR" operator is used, it won't tell you which term matched -
although a tf value of 0 basically tells you that.


-- Jack Krupansky

-Original Message- From: Jamie Johnson
Sent: Wednesday, November 27, 2013 11:38 AM
To: solr-user@lucene.apache.org
Subject: Re: Term Vector Component Question


Jack,

I'm not following, are you suggesting to turn on debug and then parse the
explain?  Seems very round about if that is the case, no?


On Wed, Nov 27, 2013 at 9:40 AM, Jack Krupansky 
wrote:

 That information would be included in the debugQuery output as well.


-- Jack Krupansky

-Original Message- From: Jamie Johnson Sent: Wednesday, November
27, 2013 9:32 AM To: solr-user@lucene.apache.org Subject: Term Vector
Component Question
I am interested in retrieving the tf for terms that matched the query, 
not
all terms in the document.  Is this possible?  Looking at the example 
when

I search for the word cable I get the response that is shown below,
ideally
I'd like to see only the tf for the word cable.  Is this possible or 
would

I need to write a custom query component to do this?





[termVectors response snipped; same stripped response as shown earlier in
the thread]





Re: Solr 4.3.1 :: Error loading class 'solr.ICUFoldingFilterFactory'

2013-11-27 Thread Shawn Heisey

On 11/27/2013 9:37 AM, Raheel Hasan wrote:

I got a new issue now. I have Solr 4.3.0 running just fine. However on Solr
4.3.1, it wont load. I get this issue:


{msg=SolrCore 'mycore' is not available due to init failure: Plugin
init failure for [schema.xml] fieldType "text_ws": Plugin init failure
for [schema.xml] analyzer/filter: Error loading class
'solr.ICUFoldingFilterFactory',trace=org.apache.solr.common.SolrException:
SolrCore 'mycore' is not available due to init failure: Plugin init
failure for [schema.xml] fieldType "text_ws": Plugin init failure for
[schema.xml] analyzer/filter: Error loading class
'solr.ICUFoldingFilterFactory'


The jars required for that analysis chain component are not available to 
Solr.  Jars can be loaded in one of two ways: 1) by using lib 
directives in solrconfig.xml, or 2) by putting them all in 
${solr.solr.home}/lib, with ${solr.solr.home} as the location where 
solr.xml lives.  The latter is a far better option. Since you are using 
4.3.1, don't use the sharedLib attribute in solr.xml, or you'll run into 
SOLR-4852.


The extra jars required for ICUFoldingFilterFactory on Solr 4.3.1 are:

icu4j-49.1.jar
lucene-analyzers-icu-4.3.1.jar

You can find these in the download under contrib/analysis-extras.
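
If you go the lib-directive route instead, something like this in
solrconfig.xml should work (the dir paths are assumptions that depend on
where you copy the jars):

<lib dir="../../contrib/analysis-extras/lib" regex="icu4j-.*\.jar" />
<lib dir="../../contrib/analysis-extras/lucene-libs" regex="lucene-analyzers-icu-.*\.jar" />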

Thanks,
Shawn



Custom Relevancy Using Field Payloads

2013-11-27 Thread Furkan KAMACI
I have a payload field in my schema (Solr 4.5.1). When a user searches for a
keyword I will calculate the usual score, and if a match occurs on that
payload field I will add the payload to the overall score (payload *
normalization coefficient).

How can I do that? Custom payload similarity class or custom function
query?

I've followed here:
http://sujitpal.blogspot.com/2013/07/porting-payloads-to-solr4.html#! but
decodeNormValue is a final method now. What about this:
http://www.solrtutorial.com/custom-solr-functionquery.html

Any ideas about my aim?


Re: SolrCloud and 2MB Synonym file

2013-11-27 Thread Yago Riveiro
How are you launching Solr?  

Do you have an ensemble, or are you running ZooKeeper embedded?  

Yes, the docs say that jute.maxbuffer is dangerous, but without it you can 
store nothing larger than 1M in ZooKeeper … and at some point you can have a 
clusterstate.json with a size greater than 1M  
--  
Yago Riveiro
Sent with Sparrow (http://www.sparrowmailapp.com/?sig)


On Wednesday, November 27, 2013 at 5:31 PM, Puneet Pawaia wrote:

> Yago, not sure if this is a good idea. Docs say this is dangerous stuff.
>  
> Anyway, not being a linux or java expert, I would appreciate if you could
> point me to an implementation of this.
>  
> Regards
> Puneet Pawaia
> On 27 Nov 2013 22:54, "Yago Riveiro"  (mailto:yago.rive...@gmail.com)> wrote:
>  
> > You can use the jute.maxbuffer > 1M as a workaround.
> >  
> > You must set -Djute.maxbuffer in zookeeper and solr to work properly
> >  
> > --
> > Yago Riveiro
> > Sent with Sparrow (http://www.sparrowmailapp.com/?sig)
> >  
> >  
> > On Wednesday, November 27, 2013 at 5:15 PM, Puneet Pawaia wrote:
> >  
> > > Hi
> > >  
> > > I am trying to setup a test SolrCloud 4.5.1 implementation. My synonym
> > file
> > > is about 1.6 MB. When I try to add collection to ZooKeeper 3.4.5 on
> >  
> > Ubuntu
> > > 12.4, it fails because of the 1MB limit of ZooKeeper. Has anyone any
> > > experience with using such synonym files? Can I store them in some other
> > > location other than the Config Folder since the config folder is loaded
> > > into ZooKeeper.
> > >  
> > > TIA
> > >  
> > > Puneet Pawaia  



Re: Function query matching

2013-11-27 Thread Peter Keegan
Hi,

So, this query does just what I want, but it's typically 3 times slower
than the edismax query without the functions:

select?qq={!edismax v='news' qf='title^2 body'}&scaledQ=scale(product(
query($qq),1),0,1)&q={!func}sum(product(0.75,$scaledQ),
product(0.25,field(myfield)))&fq={!query v=$qq}

Is there any way to speed this up? Would writing a custom function query
that compiled all the function queries together be any faster?

Thanks,
Peter


On Mon, Nov 11, 2013 at 1:31 PM, Peter Keegan wrote:

> Thanks
>
>
> On Mon, Nov 11, 2013 at 11:46 AM, Yonik Seeley wrote:
>
>> On Mon, Nov 11, 2013 at 11:39 AM, Peter Keegan 
>> wrote:
>> > fq=$qq
>> >
>> > What is the proper syntax?
>>
>> fq={!query v=$qq}
>>
>> -Yonik
>> http://heliosearch.com -- making solr shine
>>
>
>


Re: SolrCloud and 2MB Synonym file

2013-11-27 Thread Mark Miller
They are just trying to keep users from using ZK in a bad way. Storing and 
accessing a ton of huge files is not what ZooKeeper was designed for. A 1MB 
limit is a fairly arbitrary limiter to make sure you don’t shoot yourself in 
the foot and store lots of large files. With modern networks and hardware, 
setting it to 3MB and uploading your 2MB syn file is not going to be a problem. 
Solr doesn’t read and write those files often, nor use ZooKeeper much at all in 
a stable state. Upping that limit and putting in a few config files that are a 
few MB is not going to break anything.

- Mark

On Nov 27, 2013, at 12:31 PM, Puneet Pawaia  wrote:

> Yago, not sure if this is a good idea. Docs say this is dangerous stuff.
> 
> Anyway,  not being a linux or java expert,  I would appreciate if you could
> point me to an implementation of this.
> 
> Regards
> Puneet Pawaia
> On 27 Nov 2013 22:54, "Yago Riveiro"  wrote:
> 
>> You can use the jute.maxbuffer > 1M as a workaround.
>> 
>> You must set -Djute.maxbuffer in zookeeper and solr to work properly
>> 
>> --
>> Yago Riveiro
>> Sent with Sparrow (http://www.sparrowmailapp.com/?sig)
>> 
>> 
>> On Wednesday, November 27, 2013 at 5:15 PM, Puneet Pawaia wrote:
>> 
>>> Hi
>>> 
>>> I am trying to setup a test SolrCloud 4.5.1 implementation. My synonym
>> file
>>> is about 1.6 MB. When I try to add collection to ZooKeeper 3.4.5 on
>> Ubuntu
>>> 12.4, it fails because of the 1MB limit of ZooKeeper. Has anyone any
>>> experience with using such synonym files? Can I store them in some other
>>> location other than the Config Folder since the config folder is loaded
>>> into ZooKeeper.
>>> 
>>> TIA
>>> 
>>> Puneet Pawaia
>> 
>> 



Re: Function query matching

2013-11-27 Thread Chris Hostetter

: So, this query does just what I want, but it's typically 3 times slower
: than the edismax query  without the functions:

that's because the scale() function is inherently slow (it has to 
compute the min & max value for every document in order to know how to 
scale them)

what you are seeing is the price you have to pay to get that query with a 
"normalized" 0-1 value.

(You might be able to save a little bit of time by eliminating that 
no-op multiply by 1: "product(query($qq),1)" ... but I doubt you'll even 
notice much of a change given that scale function.)

: Is there any way to speed this up? Would writing a custom function query
: that compiled all the function queries together be any faster?

If you can find a faster implementation for scale(), then by all means let 
us know, and we can fold it back into Solr.


-Hoss


Re: Function query matching

2013-11-27 Thread Peter Keegan
Although the 'scale' is a big part of it, here's a closer breakdown. Here
are 4 queries with increasing functions, and their response times (caching
turned off in solrconfig):

100 msec:
select?q={!edismax v='news' qf='title^2 body'}

135 msec:
select?qq={!edismax v='news' qf='title^2
body'}&q={!func}product(field(myfield),query($qq))&fq={!query v=$qq}

200 msec:
select?qq={!edismax v='news' qf='title^2
body'}&q={!func}sum(product(0.75,query($qq)),product(0.25,field(myfield)))&fq={!query
v=$qq}

320 msec:
select?qq={!edismax v='news' qf='title^2
body'}&scaledQ=scale(product(query($qq),1),0,1)&q={!func}sum(product(0.75,$scaledQ),product(0.25,field(myfield)))&fq={!query
v=$qq}

Btw, that no-op product is necessary, else you get this exception:

org.apache.lucene.search.BooleanQuery$BooleanWeight cannot be cast to
org.apache.lucene.queries.function.valuesource.ScaleFloatFunction$ScaleInfo

thanks,

peter



On Wed, Nov 27, 2013 at 1:30 PM, Chris Hostetter
wrote:

>
> : So, this query does just what I want, but it's typically 3 times slower
> : than the edismax query  without the functions:
>
> that's because the scale() function is inhernetly slow (it has to
> compute the min & max value for every document in order to know how to
> scale them)
>
> what you are seeing is the price you have to pay to get that query with a
> "normalized" 0-1 value.
>
> (you might be able to save a little bit of time by eliminating that
> no-Op multiply by 1: "product(query($qq),1)" ... but i doubt you'll even
> notice much of a chnage given that scale function.
>
> : Is there any way to speed this up? Would writing a custom function query
> : that compiled all the function queries together be any faster?
>
> If you can find a faster implementation for scale() then by all means let
> us konw, and we can fold it back into Solr.
>
>
> -Hoss
>


Re: SolrCloud and 2MB Synonym file

2013-11-27 Thread Timothy Potter
I'm curious how much compression you get with your synonym file using
something basic like gzip? If significant, would it make sense to
store the compressed syn file in ZooKeeper (or any other metadata you
need to distribute around the cluster)? This would require the code
that reads the syn file from ZooKeeper to be able to de-compress it.
Seems like this would be a nice-to-have in SolrCloud in general - the
ability to read / write files to ZooKeeper in compressed format.

Tim

On Wed, Nov 27, 2013 at 11:29 AM, Mark Miller  wrote:
> They are just trying to keep users from using ZK in a bad way. Storing and 
> accessing a ton of huge files is not what ZooKeeper was designed for. A 1MB 
> limit is a fairly arbitrary limiter to make sure you don’t shoot yourself in 
> the foot and store lots of large files. With modern networks and hardware, 
> setting it to 3MB and uploading your 2MB syn file is not going to be a 
> problem. Solr doesn’t read and write those files often, nor use ZooKeeper 
> much at all in a stable state. Upping that limit and putting in a few config 
> files that are a few MB is not going to break anything.
>
> - Mark
>
> On Nov 27, 2013, at 12:31 PM, Puneet Pawaia  wrote:
>
>> Yago, not sure if this is a good idea. Docs say this is dangerous stuff.
>>
>> Anyway,  not being a linux or java expert,  I would appreciate if you could
>> point me to an implementation of this.
>>
>> Regards
>> Puneet Pawaia
>> On 27 Nov 2013 22:54, "Yago Riveiro"  wrote:
>>
>>> You can use the jute.maxbuffer > 1M as a workaround.
>>>
>>> You must set -Djute.maxbuffer in zookeeper and solr to work properly
>>>
>>> --
>>> Yago Riveiro
>>> Sent with Sparrow (http://www.sparrowmailapp.com/?sig)
>>>
>>>
>>> On Wednesday, November 27, 2013 at 5:15 PM, Puneet Pawaia wrote:
>>>
 Hi

 I am trying to setup a test SolrCloud 4.5.1 implementation. My synonym
>>> file
 is about 1.6 MB. When I try to add collection to ZooKeeper 3.4.5 on
>>> Ubuntu
 12.4, it fails because of the 1MB limit of ZooKeeper. Has anyone any
 experience with using such synonym files? Can I store them in some other
 location other than the Config Folder since the config folder is loaded
 into ZooKeeper.

 TIA

 Puneet Pawaia
>>>
>>>
>


Re: What is the right way to list top terms for a given field?

2013-11-27 Thread Timothy Potter
Hi Dave,

Have you looked at the TermsComponent?
http://wiki.apache.org/solr/TermsComponent It is easy to wire into an
existing request handler and allows you to return the top terms for a
field. Example server even includes an example request handler that
uses it:

  <searchComponent name="terms" class="solr.TermsComponent"/>

  <requestHandler name="/terms" class="solr.SearchHandler" startup="lazy">
    <lst name="defaults">
      <bool name="terms">true</bool>
      <bool name="distrib">false</bool>
    </lst>
    <arr name="components">
      <str>terms</str>
    </arr>
  </requestHandler>
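
A request against that handler then looks like this (collection name
assumed):

/solr/collection1/terms?terms.fl=Tags&terms.limit=20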

Cheers,
Tim

On Wed, Nov 27, 2013 at 10:07 AM, Dave Seltzer  wrote:
> It's certainly seems to be faster (in my limited testing).
>
> I just don't want to base my software on the Luke scripts if they're
> prone to changing in the future.
>
> And yes, I realize there are ways to make this secure. I just wanted
> to know if it's something I should avoid doing (perhaps for reasons
> beyond my comprehension.)
>
> Thanks!
>
> -D
>
>> On Nov 27, 2013, at 11:46 AM, Stefan Matheis  
>> wrote:
>>
>> Since your users shouldn't be allowed at any time to access Solr directly, 
>> it's up to you to implement that on the client side anyway?
>>
>> I can't tell if there is a technical difference between the two calls you 
>> named, but i'd guess that the second might be a more direct way to access 
>> this information (and probably a bit faster?).
>>
>> -Stefan
>>
>>
>>> On Wednesday, November 27, 2013 at 5:22 PM, Dave Seltzer wrote:
>>>
>>> Hello,
>>>
>>> I'm trying to get a list of top terms for a field called "Tags".
>>>
>>> One way to do this would be to query all data *:* and then facet by the
>>> Tags column:
>>> /solr/collection/admin/select?q=*:*&rows=0&facet=true&facet.field=Tags
>>>
>>> I've noticed another way to do this is using the luke interface like this:
>>> /solr/collection/admin/luke?fl=Tags&numTerms=20
>>>
>>> One problem I see with the luke interface is that its inside the /admin/
>>> path, which to me means that my users shouldn't be able to access it.
>>>
>>> Whats the most SOLRy way to do this?
>>>
>>> Thanks!
>>>
>>> -D
>>


Re: Term Vector Component Question

2013-11-27 Thread Jamie Johnson
Thanks Jack, I'll see if I can find anything on Jira about this and if not
I'll create a ticket for it.


On Wed, Nov 27, 2013 at 12:28 PM, Jack Krupansky wrote:

> To be honest, this kind of question comes up so often, that it probably is
> worth a Jira to have a more customized or parameterized "explain".
>
> Function queries in the "fl" list give you a lot more control, but not at
> the level of actual terms that matched.
>
>
> -- Jack Krupansky
>
> -Original Message- From: Jamie Johnson
> Sent: Wednesday, November 27, 2013 12:18 PM
>
> To: solr-user@lucene.apache.org
> Subject: Re: Term Vector Component Question
>
> thanks I'm looking at this now, debug seems pretty close to what I want.
> Is there a way to exclude information from the debug response, for
> instance I don't need idf, fieldnorm, timing information, etc.  Again
> thanks.
>
>
> On Wed, Nov 27, 2013 at 11:49 AM, Jack Krupansky 
> wrote:
>
>  There is an XML version of explain as well, if parsing the structured text
>> is too difficult for your application. The point is that debug "explain"
>> details precisely the term vector values for actual query terms.
>>
>> Don't let the "debug" moniker throw you - this parameter is simply giving
>> you access to detail information that you might find of value in your
>> application.
>>
>> As Erick explained, the function query approach ("tf(query-term)") also
>> works, kind of, sort of, at least where all query terms must be matched,
>> but when the "OR" operator is used, it won't tell you which term matched -
>> although a tf value of 0 basically tells you that.
>>
>>
>> -- Jack Krupansky
>>
>> -Original Message- From: Jamie Johnson
>> Sent: Wednesday, November 27, 2013 11:38 AM
>> To: solr-user@lucene.apache.org
>> Subject: Re: Term Vector Component Question
>>
>>
>> Jack,
>>
>> I'm not following, are you suggesting to turn on debug and then parse the
>> explain?  Seems very round about if that is the case, no?
>>
>>
>> On Wed, Nov 27, 2013 at 9:40 AM, Jack Krupansky 
>> wrote:
>>
>>  That information would be included in the debugQuery output as well.
>>
>>>
>>> -- Jack Krupansky
>>>
>>> -Original Message- From: Jamie Johnson Sent: Wednesday, November
>>> 27, 2013 9:32 AM To: solr-user@lucene.apache.org Subject: Term Vector
>>> Component Question
>>> I am interested in retrieving the tf for terms that matched the query,
>>> not
>>> all terms in the document.  Is this possible?  Looking at the example
>>> when
>>> I search for the word cable I get the response that is shown below,
>>> ideally
>>> I'd like to see only the tf for the word cable.  Is this possible or
>>> would
>>> I need to write a custom query component to do this?
>>>
>>> [termVectors response snipped; same stripped response as quoted earlier
>>> in the thread]
>>>
>>>
>>
>


Re: Term Vector Component Question

2013-11-27 Thread Jamie Johnson
I didn't see anything so I created this

https://issues.apache.org/jira/browse/SOLR-5511


On Wed, Nov 27, 2013 at 2:35 PM, Jamie Johnson  wrote:

> Thanks Jack, I'll see if I can find anything on Jira about this and if not
> I'll create a ticket for it.
>
>
> On Wed, Nov 27, 2013 at 12:28 PM, Jack Krupansky 
> wrote:
>
>> To be honest, this kind of question comes up so often, that it probably
>> is worth a Jira to have a more customized or parameterized "explain".
>>
>> Function queries in the "fl" list give you a lot more control, but not at
>> the level of actual terms that matched.
>>
>>
>> -- Jack Krupansky
>>
>> -Original Message- From: Jamie Johnson
>> Sent: Wednesday, November 27, 2013 12:18 PM
>>
>> To: solr-user@lucene.apache.org
>> Subject: Re: Term Vector Component Question
>>
>> thanks I'm looking at this now, debug seems pretty close to what I want.
>> Is there a way to exclude information from the debug response, for
>> instance I don't need idf, fieldnorm, timing information, etc.  Again
>> thanks.
>>
>>
>> On Wed, Nov 27, 2013 at 11:49 AM, Jack Krupansky > >wrote:
>>
>>  There is an XML version of explain as well, if parsing the structured
>>> text
>>> is too difficult for your application. The point is that debug "explain"
>>> details precisely the term vector values for actual query terms.
>>>
>>> Don't let the "debug" moniker throw you - this parameter is simply giving
>>> you access to detail information that you might find of value in your
>>> application.
>>>
>>> As Erick explained, the function query approach ("tf(query-term)") also
>>> works, kind of, sort of, at least where all query terms must be matched,
>>> but when the "OR" operator is used, it won't tell you which term matched
>>> -
>>> although a tf value of 0 basically tells you that.
>>>
>>>
>>> -- Jack Krupansky
>>>
>>> -Original Message- From: Jamie Johnson
>>> Sent: Wednesday, November 27, 2013 11:38 AM
>>> To: solr-user@lucene.apache.org
>>> Subject: Re: Term Vector Component Question
>>>
>>>
>>> Jack,
>>>
>>> I'm not following, are you suggesting to turn on debug and then parse the
>>> explain?  Seems very roundabout if that is the case, no?
>>>
>>>
>>> On Wed, Nov 27, 2013 at 9:40 AM, Jack Krupansky >> >
>>> wrote:
>>>
>>>  That information would be included in the debugQuery output as well.
>>>

 -- Jack Krupansky

 -Original Message- From: Jamie Johnson Sent: Wednesday, November
 27, 2013 9:32 AM To: solr-user@lucene.apache.org Subject: Term Vector
 Component Question
 I am interested in retrieving the tf for terms that matched the query,
 not
 all terms in the document.  Is this possible?  Looking at the example
 when
 I search for the word cable I get the response that is shown below,
 ideally
 I'd like to see only the tf for the word cable.  Is this possible or
 would
 I need to write a custom query component to do this?

 

 

 [quoted term vector XML, stripped by the mail archive, trimmed]

 

 

 

 

 



>>>
>>
>


Re: SolR vs large PDF

2013-11-27 Thread Marcello Lorenzi

Hi Erick,
On our architecture we use Apache ManifoldCF to schedule crawl jobs from 
the ManifoldCF web application, and the ManifoldCF agent pushes the PDF 
files from the filesystem to the Solr instances. Is it possible to redirect 
specific ManifoldCF jobs to a SolrJ-based indexer instead?


Thanks,
Marcello

On 11/27/2013 06:14 PM, Erick Erickson wrote:

I'm assuming you're using the ExtractingRequestHandler. Putting all of
that extraction work on a Solr box that is also serving queries
and indexing is not going to scale well.

Consider using Tika/SolrJ (Tika is what the ERH uses anyway) to
offload the PDF parsing amongst as many clients as you can afford.
Here's a way to get started:

http://searchhub.org/2012/02/14/indexing-with-solrj/

Best,
Erick
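
To make that concrete, here is a rough sketch of such a client-side indexer, in
the spirit of the article above (SolrJ and Tika APIs from the 4.x line; the URL,
paths and field names are illustrative, and error handling is elided):

  import java.io.File;
  import java.io.FileInputStream;
  import java.io.InputStream;
  import org.apache.solr.client.solrj.impl.HttpSolrServer;
  import org.apache.solr.common.SolrInputDocument;
  import org.apache.tika.metadata.Metadata;
  import org.apache.tika.parser.AutoDetectParser;
  import org.apache.tika.sax.BodyContentHandler;

  public class PdfIndexer {
    public static void main(String[] args) throws Exception {
      HttpSolrServer server = new HttpSolrServer("http://solrhost:8983/solr/collection1");
      AutoDetectParser parser = new AutoDetectParser(); // Tika picks the PDF parser itself
      // assumes the directory exists and contains only files
      for (File pdf : new File("/data/pdfs").listFiles()) {
        BodyContentHandler text = new BodyContentHandler(-1); // -1 = no write limit
        Metadata metadata = new Metadata();
        try (InputStream in = new FileInputStream(pdf)) {
          // the expensive PDF parsing happens here, off the Solr box
          parser.parse(in, text, metadata);
        }
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", pdf.getAbsolutePath());
        doc.addField("text", text.toString());
        server.add(doc);
      }
      server.commit();
    }
  }

Spreading that loop over several client machines moves the PDFBox/Tika cost off
the boxes that are serving queries.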


On Wed, Nov 27, 2013 at 10:00 AM, Marcello Lorenzi wrote:


Hi All,
on our test environment we have implemented a new search engine based on
Solr 4.3 with 2 instances hosted on different servers and 1 shard present
on each servlet container.

During some stress tests we noticed a bottleneck in the crawling of large PDF
files that blocks the serving of results from queries to the collections.

Is it possible to reduce or mitigate the overhead created by PDFBox during
the crawling?

Thanks,
Marcello





NOTE: comments currently disabled for most users in Solr Ref Guide

2013-11-27 Thread Chris Hostetter


jca recently pointed out on the #solr IRC channel that normal (ie: 
non-committer) confluence-users are not able to post comments on any 
pages of the Solr Ref Guide.


This is evidently due to a change made by Infra that was mentioned in an 
email to all PMC members on Oct 1 -- but I didn't realize the ramifications 
at the time (I thought it was specifically about creating *pages*; I didn't 
realize it also affected comments)


Changing this back for just the ref guide wiki space would be fairly easy 
-- but I don't want to do that until I have a chance to talk to Infra 
about it.


In the meantime, if you have any comments/suggestions about ref guide 
pages please send an email to solr-user@lucene with a link to the page in 
question.


Sorry about this.

-Hoss


Caches contain deleted docs (?)

2013-11-27 Thread Roman Chyla
Hi,
I'd like to check - there is something I don't understand about the cache, and
I don't know if it is a bug or a feature.

the following calls return a cache

FieldCache.DEFAULT.getTerms(reader, idField);
FieldCache.DEFAULT.getInts(reader, idField, false);


the resulting arrays *will* contain entries for deleted docs, so to filter
them out, one has to manually check liveDocs. Is this the expected
behaviour? I don't understand why the cache would bother to load data
for deleted docs. This is on Solr 4.0.

Thanks!

  roman


Can't post comment on Confluence pages under "Apache Solr Reference Guide"

2013-11-27 Thread Julien Canquelain

Hi,

I would like to post a comment about the problem below on the Solr Confluence documentation, but comments are disabled right now for confluence-users (at least at the time I'm writing this - it was confirmed on IRC a minute ago).

The page I would like to comment on is: https://cwiki.apache.org/confluence/display/solr/Result+Grouping

It seems to me that there is a minor mistake in the following sentence: "Grouped faceting only supports facet.field for string based fields that are not tokenized and are not multivalued."

The point is: grouped faceting DOES support multivalued fields. Indeed, as can be read in the "request parameter" table on the same page: "Grouped faceting supports single and multivalued fields"

I did many tests today that confirm that multivalued fields are supported for grouped faceting.

If someone can confirm this and has the rights to modify the documentation (or to post a comment), it would be great.

Many thanks in advance.

-- Julien Canquelain
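
For reference, a request of the kind used in those tests looks roughly like this
(names are illustrative; group.facet=true is what switches the facet counts to
grouped counts):

  http://localhost:8983/solr/collection1/select?q=*:*&group=true&group.field=productId&group.facet=true&facet=true&facet.field=tags&rows=0

Here "tags" would be the multivalued facet field and "productId" the field being
grouped on.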


Re: NOTE: comments currently disabled for most users in Solr Ref Guide

2013-11-27 Thread Chris Hostetter


FYI: https://issues.apache.org/jira/browse/INFRA-7058

: Changing this back for just the ref guide wiki psace would be fairly easy -- 
: but i don't want to do that until i have a chance to talk to Infra about it.



-Hoss


Re: What is the right way to list top terms for a given field?

2013-11-27 Thread Dave Seltzer
Thanks Tim,

That seems to be exactly what I'm looking for!

-Dave



> On Nov 27, 2013, at 2:34 PM, Timothy Potter  wrote:
>
> Hi Dave,
>
> Have you looked at the TermsComponent?
> http://wiki.apache.org/solr/TermsComponent It is easy to wire into an
> existing request handler and allows you to return the top terms for a
> field. The example server even includes a request handler that
> uses it:
>
>  <searchComponent name="terms" class="solr.TermsComponent"/>
>
>  <requestHandler name="/terms" class="solr.SearchHandler" startup="lazy">
>    <lst name="defaults">
>      <bool name="terms">true</bool>
>      <bool name="distrib">false</bool>
>    </lst>
>    <arr name="components">
>      <str>terms</str>
>    </arr>
>  </requestHandler>
>
> Cheers,
> Tim
>
>> On Wed, Nov 27, 2013 at 10:07 AM, Dave Seltzer  wrote:
>> It certainly seems to be faster (in my limited testing).
>>
>> I just don't want to base my software on the Luke scripts if they're
>> prone to changing in the future.
>>
>> And yes, I realize there are ways to make this secure. I just wanted
>> to know if it's something I should avoid doing (perhaps for reasons
>> beyond my comprehension.)
>>
>> Thanks!
>>
>> -D
>>
>>> On Nov 27, 2013, at 11:46 AM, Stefan Matheis  
>>> wrote:
>>>
>>> Since your users shouldn't be allowed at any time to access Solr directly, 
>>> it's up to you to implement that on the client side anyway?
>>>
>>> I can't tell if there is a technical difference between the two calls you 
>>> named, but i'd guess that the second might be a more direct way to access 
>>> this information (and probably a bit faster?).
>>>
>>> -Stefan
>>>
>>>
 On Wednesday, November 27, 2013 at 5:22 PM, Dave Seltzer wrote:

 Hello,

 I'm trying to get a list of top terms for a field called "Tags".

 One way to do this would be to query all data *:* and then facet by the
 Tags column:
 /solr/collection/select?q=*:*&rows=0&facet=true&facet.field=Tags

 I've noticed another way to do this is using the luke interface like this:
 /solr/collection/admin/luke?fl=Tags&numTerms=20

 One problem I see with the luke interface is that it's inside the /admin/
 path, which to me means that my users shouldn't be able to access it.

 What's the most Solr-y way to do this?

 Thanks!

 -D
>>>
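
For the record, once a handler like the one in Tim's snippet is wired into
solrconfig.xml, the top terms can be fetched without touching the /admin/ path
at all - e.g. (limit illustrative):

  http://localhost:8983/solr/collection/terms?terms.fl=Tags&terms.limit=20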


Re: Term Vector Component Question

2013-11-27 Thread Erick Erickson
Jamie:

Before jumping into using debug, do take a bit to test
the performance! I've seen the debug component take
up to 80% of the query time. Admittedly, that was, I
think, 3.6 or something so it may be much different now.

But I should have asked first: "Why do you care?" What is
your use case? Of course, I'm really asking whether this is an
XY problem.

Best,
Erick


On Wed, Nov 27, 2013 at 2:52 PM, Jamie Johnson  wrote:

> I didn't see anything so I created this
>
> https://issues.apache.org/jira/browse/SOLR-5511
>
>
> On Wed, Nov 27, 2013 at 2:35 PM, Jamie Johnson  wrote:
>
> > Thanks Jack, I'll see if I can find anything on Jira about this and if
> not
> > I'll create a ticket for it.
> >
> >
> > On Wed, Nov 27, 2013 at 12:28 PM, Jack Krupansky <
> j...@basetechnology.com>wrote:
> >
> >> To be honest, this kind of question comes up so often, that it probably
> >> is worth a Jira to have a more customized or parameterized "explain".
> >>
> >> Function queries in the "fl" list give you a lot more control, but not
> at
> >> the level of actual terms that matched.
> >>
> >>
> >> -- Jack Krupansky
> >>
> >> -Original Message- From: Jamie Johnson
> >> Sent: Wednesday, November 27, 2013 12:18 PM
> >>
> >> To: solr-user@lucene.apache.org
> >> Subject: Re: Term Vector Component Question
> >>
> >> thanks I'm looking at this now, debug seems pretty close to what I want.
> >> Is there a way to exclude information from the debug response, for
> >> instance I don't need idf, fieldnorm, timing information, etc.  Again
> >> thanks.
> >>
> >>
> >> On Wed, Nov 27, 2013 at 11:49 AM, Jack Krupansky <
> j...@basetechnology.com
> >> >wrote:
> >>
> >>  There is an XML version of explain as well, if parsing the structured
> >>> text
> >>> is too difficult for your application. The point is that debug
> "explain"
> >>> details precisely the term vector values for actual query terms.
> >>>
> >>> Don't let the "debug" moniker throw you - this parameter is simply
> giving
> >>> you access to detail information that you might find of value in your
> >>> application.
> >>>
> >>> As Erick explained, the function query approach ("tf(query-term)") also
> >>> works, kind of, sort of, at least where all query terms must be
> matched,
> >>> but when the "OR" operator is used, it won't tell you which term
> matched
> >>> -
> >>> although a tf value of 0 basically tells you that.
> >>>
> >>>
> >>> -- Jack Krupansky
> >>>
> >>> -Original Message- From: Jamie Johnson
> >>> Sent: Wednesday, November 27, 2013 11:38 AM
> >>> To: solr-user@lucene.apache.org
> >>> Subject: Re: Term Vector Component Question
> >>>
> >>>
> >>> Jack,
> >>>
> >>> I'm not following, are you suggesting to turn on debug and then parse
> the
> >>> explain?  Seems very roundabout if that is the case, no?
> >>>
> >>>
> >>> On Wed, Nov 27, 2013 at 9:40 AM, Jack Krupansky <
> j...@basetechnology.com
> >>> >
> >>> wrote:
> >>>
> >>>  That information would be included in the debugQuery output as well.
> >>>
> 
>  -- Jack Krupansky
> 
>  -Original Message- From: Jamie Johnson Sent: Wednesday,
> November
>  27, 2013 9:32 AM To: solr-user@lucene.apache.org Subject: Term Vector
>  Component Question
>  I am interested in retrieving the tf for terms that matched the query,
>  not
>  all terms in the document.  Is this possible?  Looking at the example
>  when
>  I search for the word cable I get the response that is shown below,
>  ideally
>  I'd like to see only the tf for the word cable.  Is this possible or
>  would
>  I need to write a custom query component to do this?
> 
> [quoted term vector XML, stripped by the mail archive, trimmed]
> 
> 
> >>>
> >>
> >
>


Re: Caches contain deleted docs (?)

2013-11-27 Thread Erick Erickson
Yep, it's expected. Segments are write-once. It's been
a long-standing design that deleted data will be
reclaimed on segment merge, but not before. It's
pretty expensive to alter the loaded terms on the
fly to respect deleted documents' removed data.

Best,
Erick
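
A minimal sketch of the manual liveDocs check being discussed, against the
Lucene 4.0 FieldCache API (AtomicReader, FieldCache and Bits from
org.apache.lucene.*; the field name is illustrative):

  AtomicReader reader = ...; // one segment, e.g. from an AtomicReaderContext
  int[] ids = FieldCache.DEFAULT.getInts(reader, "id", false);
  Bits liveDocs = reader.getLiveDocs(); // null means this reader has no deletions
  for (int docID = 0; docID < reader.maxDoc(); docID++) {
    if (liveDocs != null && !liveDocs.get(docID)) {
      continue; // the cache still holds a value here, but the doc is deleted
    }
    int id = ids[docID]; // safe to use: the doc is live
  }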


On Wed, Nov 27, 2013 at 4:07 PM, Roman Chyla  wrote:

> Hi,
> I'd like to check - there is something I don't understand about cache - and
> I don't know if it is a bug, or feature
>
> the following calls return a cache
>
> FieldCache.DEFAULT.getTerms(reader, idField);
> FieldCache.DEFAULT.getInts(reader, idField, false);
>
>
> the resulting arrays *will* contain entries for deleted docs, so to filter
> them out, one has to manually check livedocs. Is this the expected
> behaviour? I don't understand why the cache would be bothering to load data
> for deleted docs. This is on SOLR4.0
>
> Thanks!
>
>   roman
>


Re: Error when creating collection in Solr 4.6

2013-11-27 Thread Erick Erickson
Are you using old-style XML files with a <cores> tag
and maybe <core> tags as well? If so, see:
https://issues.apache.org/jira/browse/SOLR-5510

Short form: you may have better luck if you're using
old-style solr.xml files by adding:
 genericCoreNodeNames="${genericCoreNodeNames:true}"
to your <cores> tag, something like:

<cores adminPath="/admin/cores" ... genericCoreNodeNames="${genericCoreNodeNames:true}">

But really, I'd use the new-style discovery instead. That's "the new way".

Best,
Erick


On Wed, Nov 27, 2013 at 11:03 AM, Yago Riveiro wrote:

> Lansing,
>
> I ran the command without any issue
>
>
> http://localhost:8983/solr/admin/collections?action=CREATE&name=Current1&numShards=5&replicationFactor=3&maxShardsPerNode=15&collection.configName=default
>
> The only difference was that I have only one box and used the default
> config from the example folder.
>
> --
> Yago Riveiro
> Sent with Sparrow (http://www.sparrowmailapp.com/?sig)
>
>
> On Wednesday, November 27, 2013 at 3:43 PM, lansing wrote:
>
> >
> http://10.0.5.227:8101/solr/admin/collections?action=CREATE&name=Current1&numShards=5&replicationFactor=3&maxShardsPerNode=3&collection.configName=Current1
>
>


Re: Caches contain deleted docs (?)

2013-11-27 Thread Roman Chyla
I understand that changes would be expensive, but shouldn't the cache
simply skip the deleted docs, in the same way as the cache for multivalued
fields (which accepts a liveDocs Bits instance)?
Thanks,

  roman


On Wed, Nov 27, 2013 at 6:26 PM, Erick Erickson wrote:

> Yep, it's expected. Segments are write-once. It's been
> a long standing design that deleted data will be
> reclaimed on segment merge, but not before. It's
> pretty expensive to change the terms loaded on the
> fly to respect deleted document's removed data.
>
> Best,
> Erick
>
>
> On Wed, Nov 27, 2013 at 4:07 PM, Roman Chyla 
> wrote:
>
> > Hi,
> > I'd like to check - there is something I don't understand about cache -
> and
> > I don't know if it is a bug, or feature
> >
> > the following calls return a cache
> >
> > FieldCache.DEFAULT.getTerms(reader, idField);
> > FieldCache.DEFAULT.getInts(reader, idField, false);
> >
> >
> > the resulting arrays *will* contain entries for deleted docs, so to
> filter
> > them out, one has to manually check livedocs. Is this the expected
> > behaviour? I don't understand why the cache would be bothering to load
> data
> > for deleted docs. This is on SOLR4.0
> >
> > Thanks!
> >
> >   roman
> >
>


Re: NOTE: comments currently disabled for most users in Solr Ref Guide

2013-11-27 Thread Chris Hostetter

FYI: comments should now be working for all registered users.

If comment spam becomes a problem too unwieldy to manage by deleting after 
the fact, we'll have to consider going the same route as we do with MoinMoin: 
an explicit whitelist of users.


: Date: Wed, 27 Nov 2013 14:46:11 -0700 (MST)
: From: Chris Hostetter 
: To: solr-user@lucene.apache.org
: Subject: Re: NOTE: comments currently disabled for most users in Solr Ref
: Guide
: 
: 
: 
: FYI: https://issues.apache.org/jira/browse/INFRA-7058
: 
: : Changing this back for just the ref guide wiki psace would be fairly easy 
-- 
: : but i don't want to do that until i have a chance to talk to Infra about it.
: 
: 
: 
: -Hoss
: 

-Hoss


(info) how view lucene merge process

2013-11-27 Thread Jacky.J.Wang (mis.cnsh04.Newegg) 41361
Hello Lucene list,
How can I view the Lucene merge process?


Re: (info) how view lucene merge process

2013-11-27 Thread Jack Krupansky

What do you really want to do/accomplish? I mean, for what purpose?

You can turn on the Lucene infostream for logging of index writing.

See:
https://cwiki.apache.org/confluence/display/solr/IndexConfig+in+SolrConfig

Set <infoStream> to "true".
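
For reference, a sketch of how it appears in the 4.x example solrconfig.xml,
inside the <indexConfig> section - with the value flipped to enable it:

  <indexConfig>
    <infoStream file="INFOSTREAM.txt">true</infoStream>
  </indexConfig>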

There are some examples in my e-book.

-- Jack Krupansky

-Original Message- 
From: Jacky.J.Wang (mis.cnsh04.Newegg) 41361 
Sent: Wednesday, November 27, 2013 7:41 PM 
To: solr-user@lucene.apache.org 
Subject: (info) how view lucene merge process 


Hello Lucene list,
How can I view the Lucene merge process?


Re: SolrCloud and 2MB Synonym file

2013-11-27 Thread Puneet Pawaia
I am running an ensemble.
Can I get some examples of how to use the option? There don't seem to be many
examples of the exact usage available.

Regards
Puneet
On 27 Nov 2013 23:23, "Yago Riveiro"  wrote:

> How are you launching Solr?
>
> Do you have an ensemble, or are you running ZooKeeper embedded?
>
> Yes, the docs say that jute.maxbuffer is dangerous, but without it you
> cannot store anything larger than 1 MB in ZooKeeper … and at some point
> you may have a clusterstate.json larger than 1 MB.
> --
> Yago Riveiro
> Sent with Sparrow (http://www.sparrowmailapp.com/?sig)
>
>
> On Wednesday, November 27, 2013 at 5:31 PM, Puneet Pawaia wrote:
>
> > Yago, not sure if this is a good idea. Docs say this is dangerous stuff.
> >
> > Anyway, not being a Linux or Java expert, I would appreciate it if you could
> > point me to an implementation of this.
> >
> > Regards
> > Puneet Pawaia
> > On 27 Nov 2013 22:54, "Yago Riveiro" <yago.rive...@gmail.com> wrote:
> >
> > > You can set jute.maxbuffer > 1M as a workaround.
> > >
> > > You must set -Djute.maxbuffer on both ZooKeeper and Solr for it to work properly.
> > >
> > > --
> > > Yago Riveiro
> > > Sent with Sparrow (http://www.sparrowmailapp.com/?sig)
> > >
> > >
> > > On Wednesday, November 27, 2013 at 5:15 PM, Puneet Pawaia wrote:
> > >
> > > > Hi
> > > >
> > > > I am trying to set up a test SolrCloud 4.5.1 implementation. My synonym
> > > > file is about 1.6 MB. When I try to add a collection to ZooKeeper 3.4.5 on
> > > > Ubuntu 12.4, it fails because of the 1 MB limit of ZooKeeper. Has anyone any
> > > > experience with using such synonym files? Can I store them in some other
> > > > location than the config folder, since the config folder is loaded
> > > > into ZooKeeper?
> > > >
> > > > TIA
> > > >
> > > > Puneet Pawaia
>
>
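
For reference, a sketch of how the flag is typically applied on both sides; the
value must be the same everywhere, and the size here (4 MB) is illustrative.
For a standalone ZooKeeper ensemble, e.g. via JVMFLAGS in conf/java.env on each
ZooKeeper node:

  JVMFLAGS="$JVMFLAGS -Djute.maxbuffer=4194304"

and on each Solr node at startup:

  java -Djute.maxbuffer=4194304 -jar start.jar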


Re: Can't post comment on Confluence pages under "Apache Solr Reference Guide"

2013-11-27 Thread Ahmet Arslan
Hi Julien,

Please see : http://search-lucene.com/m/MTRUH1cyNGZ1 and 
https://issues.apache.org/jira/browse/INFRA-7058



On Wednesday, November 27, 2013 11:19 PM, Julien Canquelain 
 wrote:
 
 


Hi,

I would like to post a comment about the problem below on Solr Confluence 
documentation, but comments are disabled right now for confluence-users (at 
least at the time I'm writing this - it was confirmed on IRC a minute ago).

The page I would like to comment on is : 
https://cwiki.apache.org/confluence/display/solr/Result+Grouping

It seems to me that there is a minor mistake in the following sentence :
"Grouped faceting only supports facet.field for string based fields that are 
not tokenized and are not multivalued."

The point is: grouped faceting DOES support multivalued fields. Indeed, as 
can be read in the "request parameter" table on the same page:
"Grouped faceting supports single and multivalued fields"

I did many tests today that confirm that multivalued fields are 
supported for grouped faceting.

If someone can confirm this and has the rights to modify the documentation (or 
to post a comment), it would be great.

Many thanks in advance.


-- 
Julien Canquelain

Request for Contributors Group

2013-11-27 Thread Ahmet Arslan
Hello all,

Please add my username ( iorixxx ) to the Contributors Group. With this, will I 
be able to edit Confluence too?

Request for Contributors Group

2013-11-27 Thread Shinichiro Abe
Hi,

Please add my username ( shinichiro ) to the Contributors Group.

Thanks in advance,
Shinichiro Abe


Re: Error when creating collection in Solr 4.6

2013-11-27 Thread lansing
Thank you for your replies,
I am using new-style core discovery.
It worked after adding this setting to solr.xml:
<bool name="genericCoreNodeNames">${genericCoreNodeNames:true}</bool>
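
For anyone else hitting this: in a discovery-style solr.xml that setting sits
in the <solrcloud> section - a sketch, with the other settings elided:

  <solr>
    <solrcloud>
      ...
      <bool name="genericCoreNodeNames">${genericCoreNodeNames:true}</bool>
    </solrcloud>
  </solr>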





--
View this message in context: 
http://lucene.472066.n3.nabble.com/Error-when-creating-collection-in-Solr-4-6-tp4103536p4103696.html
Sent from the Solr - User mailing list archive at Nabble.com.