Re: Question About Solr Cores

2009-07-11 Thread Shalin Shekhar Mangar
On Fri, Jul 10, 2009 at 11:22 PM, danben  wrote:

>
> What I have seen, however, is that the number of open FDs steadily
> increases
> with the number of cores opened and files indexed, until I hit whatever
> upper bound happens to be set (currently 100k).  Raising machine-imposed
> limits, using the compound file format, etc are only holdovers.  I was
> thinking it would be nice if I could keep some kind of MRU cache of cores
> such that Solr only keeps open resources for the cores in the cache, but
> I'm
> not sure if this is allowed.  I saw that SolrCore has a close() function,
> but if my understanding is correct, that isn't exposed to the client.
>
> Would anyone know if there are any ways to de/reallocate resources for
> different cores at runtime?
>

We are currently working on a similar use-case. We have added a lazy-startup
option for cores, together with LRU-based core loading/unloading, implemented
by modifying CoreContainer and extending CoreAdminHandler. This feature is
marked for 1.5. We plan to submit a patch as soon as the code for 1.4 is
branched off. Some related changes are already in trunk, e.g. SOLR-943,
SOLR-1121, SOLR-921, SOLR-1108, SOLR-920.

The pending issues are:
https://issues.apache.org/jira/browse/SOLR-919
https://issues.apache.org/jira/browse/SOLR-1028
https://issues.apache.org/jira/browse/SOLR-880
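
The LRU loading/unloading idea can be sketched with a plain LinkedHashMap in
access order. This is only an illustration, not the actual patch: the real
work modifies CoreContainer, so the Core class and its close() hook below are
stand-ins for SolrCore.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class CoreLruCache {
    // Stand-in for a SolrCore; only the close() hook matters in this sketch.
    static class Core {
        final String name;
        boolean closed;
        Core(String name) { this.name = name; }
        void close() { closed = true; } // real code would release searchers/FDs
    }

    private final Map<String, Core> cache;

    CoreLruCache(final int maxOpenCores) {
        // accessOrder=true turns the LinkedHashMap into an LRU structure.
        cache = new LinkedHashMap<String, Core>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Core> eldest) {
                if (size() > maxOpenCores) {
                    eldest.getValue().close(); // free the evicted core's resources
                    return true;               // and drop it from the map
                }
                return false;
            }
        };
    }

    Core getCore(String name) {
        Core core = cache.get(name);        // get() refreshes LRU order
        if (core == null) {
            core = new Core(name);          // lazily "load" on first access
            cache.put(name, core);
        }
        return core;
    }

    public static void main(String[] args) {
        CoreLruCache cc = new CoreLruCache(2);
        Core a = cc.getCore("a");
        Core b = cc.getCore("b");
        cc.getCore("a");   // touch "a" so "b" becomes the eldest entry
        cc.getCore("c");   // exceeds capacity: evicts and closes "b"
        System.out.println(a.closed + " " + b.closed); // prints "false true"
    }
}
```

With a bound like this, open file descriptors scale with the cache capacity
rather than with the total number of cores ever opened.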

-- 
Regards,
Shalin Shekhar Mangar.


Re: Aggregating/Grouping Document Search Results on a Field

2009-07-11 Thread Shalin Shekhar Mangar
On Sat, Jul 11, 2009 at 12:01 AM, Bradford Stephens <
bradfordsteph...@gmail.com> wrote:

> Does the facet aggregation take place on the Solr search server, or
> the Solr client?
>
> It's pretty slow for me -- on a machine with 8 cores/ 8 GB RAM, 50
> million document index (about 36M unique values in the "author"
> field), a query that returns 131,000 hits takes about 20 seconds to
> calculate the top 50 authors. The query I'm running is this:
>
>
> http://dttest10:8983/solr/select/select?q=java&facet=true&facet.field=authorname
>
>
Is the author field tokenized? Is it multi-valued? It is best to facet on
untokenized fields.

Solr 1.4 has huge improvements in faceting performance, so you can try that
and see if it helps. See Yonik's blog post about this:
http://yonik.wordpress.com/2008/11/25/solr-faceted-search-performance-improvements/
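
A common pattern is to keep the tokenized field for searching and facet on an
untokenized copy. A hedged schema.xml sketch (field and type names assumed,
not taken from your setup):

```xml
<!-- "string" is an untokenized type: facet on this copy, search the other -->
<field name="authorname" type="text" indexed="true" stored="true"/>
<field name="authorname_exact" type="string" indexed="true" stored="false"/>
<copyField source="authorname" dest="authorname_exact"/>
```

You would then facet with facet.field=authorname_exact.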

-- 
Regards,
Shalin Shekhar Mangar.


Re: solr jmx connection

2009-07-11 Thread Shalin Shekhar Mangar
On Sat, Jul 11, 2009 at 8:56 AM, J G  wrote:

>
> I have a SOLR JMX connection issue. I am running my JMX MBeanServer through
> Tomcat, meaning I am using Tomcat's MBeanServer rather than any other
> MBeanServer implemenation.
> I am having a hard time trying to figure out the correct JMX Service URL on
> my localhost for the accessing the SOLR MBeans. My current configuration
> consists of the following:
>
> JMX Service url = localhost:9000/jmxrmi
>
> So I have configured JMX to run on port 9000 on tomcat on my localhost and
> using the above service url i can access the tomcat jmx MBeanServer and get
> related JVM object information(e.g. I can access the MemoryMXBean object)
>
> However, I am having a harder time trying to access the SOLR MBeans. First,
> I could have the wrong service URL. Second, I'm confused as to which MBeans
> SOLR provides.
>
>
The service URL is of the form
"service:jmx:rmi:///jndi/rmi://localhost:<port>/solr", where <port> is the
RMI registry port (9000 in your setup). The following code snippet is taken
from the TestJmxMonitoredMap unit test:

String url = "service:jmx:rmi:///jndi/rmi://localhost:<port>/solr";
JMXServiceURL u = new JMXServiceURL(url);
connector = JMXConnectorFactory.connect(u);
mbeanServer = connector.getMBeanServerConnection();

Solr exposes many MBeans; there is one named "searcher" which always refers
to the live SolrIndexSearcher. You can connect with jconsole once to see all
the MBeans.
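
Once connected, you can also list MBeans programmatically instead of browsing
with jconsole. The sketch below queries the local platform MBeanServer just
to show the query API; against Solr you would use the MBeanServerConnection
from the snippet above and query the "solr" domain with new ObjectName("solr:*").

```java
import java.lang.management.ManagementFactory;
import java.util.Set;
import javax.management.MBeanServer;
import javax.management.ObjectName;

public class ListMBeans {
    public static void main(String[] args) throws Exception {
        // Local demo server; with Solr, substitute the remote
        // MBeanServerConnection obtained from the JMXConnector.
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();

        // Wildcard query: every MBean registered in the java.lang domain
        Set<ObjectName> names = server.queryNames(new ObjectName("java.lang:*"), null);
        for (ObjectName name : names) {
            System.out.println(name);
        }
    }
}
```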

Hope that helps.

-- 
Regards,
Shalin Shekhar Mangar.


Solr tika and extracting formatting info

2009-07-11 Thread S.Selvam
Hi all,

I am using Solr with Tika to index various file formats. I have used
ExtractingRequestHandler to get the data and render it in a GUI using VB.NET.
Now my requirement is to render the file as it is (with all formatting, e.g.
tables), or at least close to the look of the original file. So I need to
receive all the formatting information of the file posted to Tika, not only
the data. Is that possible with Tika, or do I need to use another module?

I would like to get your suggestions regarding this.


-- 
Yours,
S.Selvam


Re: Preparing the ground for a real multilang index

2009-07-11 Thread Jan Høydahl

Michael, you're of course right, copyfield would copy from source.
The lack of built-in language awareness in Solr is unfortunate :(
I have not tried Lucid's BasisTech lemmatizer implementation, but check
with them whether they can support multi languages in the same field.

--
Jan Høydahl
On 8. juli. 2009, at 16.32, Paul Libbrecht wrote:


Can't the copy field use a different analyzer? Both for query and indexing?
Otherwise you need to craft your own analyzer which reads the language from
the field-name... there are several classes ready for this.

paul

Le 08-juil.-09 à 02:36, Michael Lackhoff a écrit :

On 08.07.2009 00:50 Jan Høydahl wrote:

> itself and do not need to know the query language. You may then want
> to do a copyfield from all your text_* -> text for convenient
> one-field-to-rule-them-all search.

Would that really help? As I understand it, copyfield takes the raw, not
yet analyzed field value. I cannot see yet the advantage of this
"text"-field over the current situation with no text_* fields at all.
The copied-to text field has to be language agnostic with no stemming at
all, so it would miss many hits. Or is there a way to combine many
differently stemmed variants into one field to be able to search against
all of them at once? That would be great indeed!

-Michael
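
An analyzer that "reads the language from the field-name" amounts to
dispatching on the field-name suffix. A toy, library-free sketch of that
dispatch follows; the "stemmers" here are crude placeholder rules, while in
Solr/Lucene you would map suffixes to real analyzers (e.g. with
PerFieldAnalyzerWrapper).

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.function.UnaryOperator;

public class FieldNameAnalyzerDispatch {
    // Placeholder "stemmers" keyed by the language suffix of the field name.
    private static final Map<String, UnaryOperator<String>> STEMMERS = new HashMap<>();
    static {
        STEMMERS.put("en", t -> t.replaceAll("(ing|ed|s)$", "")); // crude English stub
        STEMMERS.put("de", t -> t.replaceAll("(en|er)$", ""));    // crude German stub
    }

    // Pick the "analyzer" from the field name, e.g. "text_en" -> "en".
    static String analyze(String fieldName, String token) {
        int i = fieldName.lastIndexOf('_');
        String lang = i >= 0 ? fieldName.substring(i + 1) : "";
        UnaryOperator<String> stemmer =
                STEMMERS.getOrDefault(lang, UnaryOperator.identity());
        return stemmer.apply(token.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        System.out.println(analyze("text_en", "Searching")); // "search"
        System.out.println(analyze("text_de", "Häuser"));    // "häus"
        System.out.println(analyze("text", "Unchanged"));    // no suffix: "unchanged"
    }
}
```

The catch Michael raises still applies: a single copied-to field holds one
token stream, so combining differently stemmed variants would mean indexing
each variant into that field, not merely switching analyzers at query time.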






Re: Using curl comparing with using WebService::Solr

2009-07-11 Thread Noble Paul നോബിള്‍ नोब्ळ्
I am not familiar with Perl, so I cannot help you do it better in Perl. The
pseudo-code should help.

You can do faster indexing if you post in multiple threads. If you know
Java, use StreamingHttpSolrServer (in the SolrJ client).

On Fri, Jul 10, 2009 at 4:28 PM, Shalin Shekhar Mangar wrote:
> On Fri, Jul 10, 2009 at 1:17 PM, Francis Yakin  wrote:
>
>> How you batching all documents in one curl call? Do you have a sample, so I
>> can modify my script and try it again.
>>
>> Right now I do curl on each documents( I have 1000 docs on each folder and
>> I have 1000 folders) using :
>>
>>  curl http://localhost:7001/solr/update --data-binary @abc.xml -H
>> 'Content-type:text/plain; charset=utf-8'
>>
>> abc.xml is one doc; we have another 999 files ending with ".xml".
>>
>> Please advise.
>>
>
> You'll need to combine the multiple add xmls you have into one. See Noble's
> suggestion on how to do that. Basically, your script will read a number of
> files, combine them into one and send them with one curl call. However, I
> just noticed that you are posting to localhost only, so may not be that
> expensive to have one curl call per document.
>
> --
> Regards,
> Shalin Shekhar Mangar.
>
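
The combining step described above amounts to merging the <doc> elements
under a single <add> envelope. A sketch of what the batched update message
would look like (field names illustrative):

```xml
<add>
  <doc>
    <field name="id">1</field>
    <field name="title">first document</field>
  </doc>
  <doc>
    <field name="id">2</field>
    <field name="title">second document</field>
  </doc>
</add>
```

One curl call then posts the whole batch, e.g.
curl http://localhost:7001/solr/update --data-binary @batch.xml -H
'Content-type:text/xml; charset=utf-8'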



-- 
-
Noble Paul | Principal Engineer| AOL | http://aol.com


Re: Solr tika and extracting formatting info

2009-07-11 Thread Grant Ingersoll


On Jul 11, 2009, at 4:23 AM, S.Selvam wrote:


Hi all,

I am using solr tika to index various file formats. I have used
ExtractingRequestHandler to get the data and render it in GUI using VB.NET.
Now my requirement is to render the file as it is (with all formatting, for
eg. tables) or almost a similar look of the original file. So i need to
receive all the formatting information of the file posted to Tika, not only
the data.

Is that possible with Tika? or do i need use any other module ?



Are you saying you want the original file back? If so, then I believe making
it a stored field should work, although I haven't verified it, and a part of
me wonders whether Solr is going to store that data as binary. Otherwise, I
don't have any suggestions, as neither Tika nor Solr hangs on to any
formatting information.


--
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
using Solr/Lucene:

http://www.lucidimagination.com/search



Boosting certain documents dynamically at query-time

2009-07-11 Thread Michael Lugassy
Hi guys --

Using solr 1.4 functions at query-time, can I dynamically boost
certain documents which are: a) not on the same range, i.e. have very
different document ids, b) have different boost values, c) part of a
long list (can be around 1,000 different document ids with 50
different boost values)?

Overall I'm trying to influence ranking scores on a user-by-user basis; each
user carries a list of historical documents that they have already voted on.

Thanks!

-- Michael


Select tika output for extract-only?

2009-07-11 Thread Peter Wolanin
I had been assuming that I could choose among possible tika output
formats when using the extracting request handler in extract-only mode
as if from the CLI with the tika jar:

-x or --xmlOutput XHTML content (default)
-h or --html   Output HTML content
-t or --text   Output plain text content
-m or --metadata   Output only metadata

However, looking at the docs and source, it seems that only the xml
option is available (hard-coded) in ExtractingDocumentLoader:

serializer = new XMLSerializer(writer, new OutputFormat("XML", "UTF-8", true));

In addition, it seems that the metadata is always appended to the response.

Are there any open issues relating to this, or opinions on whether
adding additional flexibility to the response format would be of
interest for 1.4?

Thanks,

Peter

-- 
Peter M. Wolanin, Ph.D.
Momentum Specialist,  Acquia. Inc.
peter.wola...@acquia.com


Caching per segmentReader?

2009-07-11 Thread Jason Rutherglen
Are we planning on implementing caching (docsets, documents, results) per
segment reader, and is this something that's going to be in 1.4?


A question about SolrJ range query?

2009-07-11 Thread huenzhao

We can use Solr range queries like:
http://localhost:8983/solr/select?q=queryStr&fq=x:[10 TO 100] AND y:[20 TO 300]
or:
http://localhost:8983/solr/select?q=queryStr&fq=x:[10 TO 100]&fq=y:[20 TO 300]

My question:
How can I make these range queries using SolrJ? Does anybody know?
enzhao...@gmail.com  thanks!
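
In SolrJ the usual route is a SolrQuery plus addFilterQuery for each range
(hedged sketch: query.setQuery("queryStr"); query.addFilterQuery("x:[10 TO 100]");
query.addFilterQuery("y:[20 TO 300]")). As a library-free illustration of
what that sends over the wire, the equivalent request URL can be built with
the JDK alone, which also shows how the brackets and spaces get encoded:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class RangeQueryUrl {
    // Builds the select URL with one fq parameter per range, mirroring
    // what SolrQuery.setQuery(...) plus addFilterQuery(...) would send.
    static String buildUrl(String base, String q, String... filterQueries)
            throws UnsupportedEncodingException {
        StringBuilder url = new StringBuilder(base)
                .append("?q=").append(URLEncoder.encode(q, "UTF-8"));
        for (String fq : filterQueries) {
            url.append("&fq=").append(URLEncoder.encode(fq, "UTF-8"));
        }
        return url.toString();
    }

    public static void main(String[] args) throws Exception {
        String url = buildUrl("http://localhost:8983/solr/select",
                "queryStr", "x:[10 TO 100]", "y:[20 TO 300]");
        System.out.println(url);
        // http://localhost:8983/solr/select?q=queryStr&fq=x%3A%5B10+TO+100%5D&fq=y%3A%5B20+TO+300%5D
    }
}
```

Using two fq parameters (rather than one fq with AND) lets Solr cache each
range filter independently.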

-- 
View this message in context: 
http://www.nabble.com/A-question-about-SolrJ-range-query--tp24445471p24445471.html
Sent from the Solr - User mailing list archive at Nabble.com.