Re: Question About Solr Cores
On Fri, Jul 10, 2009 at 11:22 PM, danben wrote:

> What I have seen, however, is that the number of open FDs steadily increases
> with the number of cores opened and files indexed, until I hit whatever
> upper bound happens to be set (currently 100k). Raising machine-imposed
> limits, using the compound file format, etc. are only stopgaps. I was
> thinking it would be nice if I could keep some kind of MRU cache of cores,
> such that Solr only keeps open resources for the cores in the cache, but
> I'm not sure if this is allowed. I saw that SolrCore has a close() function,
> but if my understanding is correct, that isn't exposed to the client.
>
> Would anyone know if there are any ways to de/reallocate resources for
> different cores at runtime?

We are currently working on a similar use-case. We have added a lazy startup option for cores, with LRU-based core loading/unloading, by modifying CoreContainer and extending CoreAdminHandler. This feature is marked for 1.5. We plan to submit a patch as soon as the code for 1.4 is branched off. Some related changes are already in trunk, e.g. SOLR-943, SOLR-1121, SOLR-921, SOLR-1108, SOLR-920. The pending issues are:

https://issues.apache.org/jira/browse/SOLR-919
https://issues.apache.org/jira/browse/SOLR-1028
https://issues.apache.org/jira/browse/SOLR-880

--
Regards,
Shalin Shekhar Mangar.
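For anyone hitting the same FD problem before that patch lands, a minimal sketch of the MRU/LRU idea danben describes, assuming you are managing cores yourself (the class is hypothetical, and real code would also have to guard against closing a core that is still serving requests):

    import java.util.LinkedHashMap;
    import java.util.Map;

    import org.apache.solr.core.SolrCore;

    /** Hypothetical LRU cache that closes SolrCores as they are evicted. */
    public class CoreLruCache extends LinkedHashMap<String, SolrCore> {
        private final int maxOpenCores;

        public CoreLruCache(int maxOpenCores) {
            super(16, 0.75f, true); // access-order: get() refreshes recency
            this.maxOpenCores = maxOpenCores;
        }

        @Override
        protected boolean removeEldestEntry(Map.Entry<String, SolrCore> eldest) {
            if (size() > maxOpenCores) {
                eldest.getValue().close(); // frees the core's searcher and file descriptors
                return true;
            }
            return false;
        }
    }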
Re: Aggregating/Grouping Document Search Results on a Field
On Sat, Jul 11, 2009 at 12:01 AM, Bradford Stephens <bradfordsteph...@gmail.com> wrote:

> Does the facet aggregation take place on the Solr search server, or
> the Solr client?
>
> It's pretty slow for me -- on a machine with 8 cores / 8 GB RAM and a
> 50 million document index (about 36M unique values in the "author"
> field), a query that returns 131,000 hits takes about 20 seconds to
> calculate the top 50 authors. The query I'm running is this:
>
> http://dttest10:8983/solr/select/select?q=java&facet=true&facet.field=authorname

Facet counts are computed on the Solr search server. Is the author field tokenized? Is it multi-valued? It is best to facet on untokenized fields. Solr 1.4 has huge improvements in faceting performance, so you can try that and see if it helps. See Yonik's blog post about this:
http://yonik.wordpress.com/2008/11/25/solr-faceted-search-performance-improvements/

--
Regards,
Shalin Shekhar Mangar.
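For reference, a minimal schema.xml sketch of an untokenized field to facet on (the field name mirrors the query above; the searchable-copy field and its type are hypothetical):

    <!-- "string" is the stock untokenized type: one term per document -->
    <field name="authorname" type="string" indexed="true" stored="true"/>

    <!-- if the name must also be full-text searchable, keep a tokenized
         copy and run facet.field against the string field instead -->
    <field name="authorname_t" type="text" indexed="true" stored="false"/>
    <copyField source="authorname" dest="authorname_t"/>

Faceting then counts one term per document rather than one per token.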
Re: solr jmx connection
On Sat, Jul 11, 2009 at 8:56 AM, J G wrote:

> I have a SOLR JMX connection issue. I am running my JMX MBeanServer through
> Tomcat, meaning I am using Tomcat's MBeanServer rather than any other
> MBeanServer implementation.
>
> I am having a hard time trying to figure out the correct JMX service URL on
> my localhost for accessing the SOLR MBeans. My current configuration
> consists of the following:
>
> JMX service URL = localhost:9000/jmxrmi
>
> So I have configured JMX to run on port 9000 on Tomcat on my localhost, and
> using the above service URL I can access the Tomcat JMX MBeanServer and get
> related JVM object information (e.g. I can access the MemoryMXBean object).
>
> However, I am having a harder time trying to access the SOLR MBeans. First,
> I could have the wrong service URL. Second, I'm confused as to which MBeans
> SOLR provides.

The service URL is of the form "service:jmx:rmi:///jndi/rmi://localhost:<port>/solr". The following code snippet is taken from the TestJmxMonitoredMap unit test:

    String url = "service:jmx:rmi:///jndi/rmi://localhost:<port>/solr";
    JMXServiceURL u = new JMXServiceURL(url);
    connector = JMXConnectorFactory.connect(u);
    mbeanServer = connector.getMBeanServerConnection();

Solr exposes many MBeans; there's one named "searcher" which always refers to the live SolrIndexSearcher. You can connect with jconsole once to see all the MBeans. Hope that helps.

--
Regards,
Shalin Shekhar Mangar.
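A self-contained sketch of that connection, in case it helps (this assumes JMX is enabled in solrconfig.xml and that the service URL matches your setup; the URL, port, and class name below are illustrative):

    import javax.management.MBeanServerConnection;
    import javax.management.ObjectName;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    public class SolrJmxDump {
        public static void main(String[] args) throws Exception {
            JMXServiceURL u = new JMXServiceURL(
                    "service:jmx:rmi:///jndi/rmi://localhost:9000/solr");
            JMXConnector connector = JMXConnectorFactory.connect(u);
            try {
                MBeanServerConnection mbeans = connector.getMBeanServerConnection();
                // Print every registered MBean; Solr's beans, including the
                // "searcher" bean mentioned above, will be among them.
                for (ObjectName name : mbeans.queryNames(null, null)) {
                    System.out.println(name);
                }
            } finally {
                connector.close();
            }
        }
    }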
Solr tika and extracting formatting info
Hi all,

I am using Solr's Tika integration to index various file formats. I have used ExtractingRequestHandler to get the data and render it in a GUI using VB.NET. Now my requirement is to render the file as-is (with all of its formatting, e.g. tables), or at least in a close approximation of the original file's look. So I need to receive all the formatting information of the file posted to Tika, not only the extracted data. Is that possible with Tika, or do I need to use some other module? I would like to get your suggestions regarding this.

--
Yours,
S.Selvam
Re: Preparing the ground for a real multilang index
Michael, you're of course right, copyField would copy from the source. The lack of built-in language awareness in Solr is unfortunate :( I have not tried Lucid's BasisTech lemmatizer implementation, but check with them whether they can support multiple languages in the same field.

--
Jan Høydahl

On 8 July 2009, at 16:32, Paul Libbrecht wrote:

> Can't the copy field use a different analyzer? Both for query and indexing?
> Otherwise you need to craft your own analyzer which reads the language from
> the field name... there are several classes ready for this.
>
> paul
>
> On 8 July 2009, at 02:36, Michael Lackhoff wrote:
>
>> On 08.07.2009 00:50 Jan Høydahl wrote:
>>
>>> itself and do not need to know the query language. You may then want to
>>> do a copyField from all your text_* -> text for a convenient
>>> one-field-to-rule-them-all search.
>>
>> Would that really help? As I understand it, copyField takes the raw, not
>> yet analyzed field value. I cannot yet see the advantage of this
>> "text" field over the current situation with no text_* fields at all. The
>> copied-to text field has to be language-agnostic with no stemming at all,
>> so it would miss many hits. Or is there a way to combine many differently
>> stemmed variants into one field, to be able to search against all of them
>> at once? That would be great indeed!
>>
>> -Michael
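To make the copyField behavior concrete, a small schema.xml sketch (the field and type names are hypothetical): copyField feeds the raw source value to the destination field, which then applies its own analyzer, not the source field's:

    <!-- per-language fields, each with its own stemmer -->
    <field name="text_en" type="text_english" indexed="true" stored="false"/>
    <field name="text_de" type="text_german" indexed="true" stored="false"/>

    <!-- catch-all field; its own (unstemmed) analyzer is what gets applied -->
    <field name="text" type="text_general" indexed="true" stored="false"/>

    <copyField source="text_en" dest="text"/>
    <copyField source="text_de" dest="text"/>

This is why the copied-to field misses stemmed matches: the per-language stemming never reaches it.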
Re: Using curl comparing with using WebService::Solr
I am not familiar with Perl, so I cannot help you do it better in Perl. The pseudo-code should help. You can do faster indexing if you post in multiple threads. If you know Java, use StreamingUpdateSolrServer (in the SolrJ client).

On Fri, Jul 10, 2009 at 4:28 PM, Shalin Shekhar Mangar wrote:

> On Fri, Jul 10, 2009 at 1:17 PM, Francis Yakin wrote:
>
>> How are you batching all documents in one curl call? Do you have a sample,
>> so I can modify my script and try it again?
>>
>> Right now I do curl on each document (I have 1000 docs in each folder and
>> I have 1000 folders) using:
>>
>> curl http://localhost:7001/solr/update --data-binary @abc.xml -H
>> 'Content-type:text/plain; charset=utf-8'
>>
>> Abc.xml is one doc; we have another 999 files ending with ".xml".
>>
>> Please advise.
>
> You'll need to combine the multiple add XMLs you have into one. See Noble's
> suggestion on how to do that. Basically, your script will read a number of
> files, combine them into one, and send them with one curl call. However, I
> just noticed that you are posting to localhost only, so it may not be that
> expensive to have one curl call per document.
>
> --
> Regards,
> Shalin Shekhar Mangar.

--
- Noble Paul | Principal Engineer | AOL | http://aol.com
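To make the batching concrete, a sketch (file contents and field names are illustrative): put all the <doc> elements from the individual files under one <add> root and post the combined file once:

    <add>
      <doc>
        <field name="id">1</field>
        <field name="title">first document</field>
      </doc>
      <doc>
        <field name="id">2</field>
        <field name="title">second document</field>
      </doc>
    </add>

    curl http://localhost:7001/solr/update --data-binary @batch.xml -H 'Content-type: text/xml; charset=utf-8'

A single <commit/> posted at the end then makes the whole batch visible to searches.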
Re: Solr tika and extracting formatting info
On Jul 11, 2009, at 4:23 AM, S.Selvam wrote:

> I am using Solr's Tika integration to index various file formats. I have
> used ExtractingRequestHandler to get the data and render it in a GUI using
> VB.NET. Now my requirement is to render the file as-is (with all of its
> formatting, e.g. tables), or at least in a close approximation of the
> original file's look. So I need to receive all the formatting information
> of the file posted to Tika, not only the extracted data. Is that possible
> with Tika, or do I need to use some other module?

Are you saying you want the original file back? If so, then I believe making it a stored field should work, although I haven't verified it, and a part of me wonders whether Solr is going to store that data as binary. Otherwise I don't have any suggestions, as neither Tika nor Solr hangs on to any formatting information.

--
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using Solr/Lucene:
http://www.lucidimagination.com/search
Boosting certain documents dynamically at query-time
Hi guys,

Using Solr 1.4 functions at query time, can I dynamically boost certain documents which are:

a) not in the same range, i.e. have very different document ids,
b) boosted by different values,
c) part of a long list (around 1,000 different document ids with 50 distinct boost values)?

Overall, I'm trying to influence ranking scores on a user-by-user basis: each user carries a list of historical documents that he has already voted on.

Thanks!
-- Michael
Select tika output for extract-only?
I had been assuming that I could choose among the possible Tika output formats when using the extracting request handler in extract-only mode, just as from the CLI with the Tika jar:

    -x or --xml        Output XHTML content (default)
    -h or --html       Output HTML content
    -t or --text       Output plain text content
    -m or --metadata   Output only metadata

However, looking at the docs and source, it seems that only the XML option is available (hard-coded) in ExtractingDocumentLoader:

    serializer = new XMLSerializer(writer, new OutputFormat("XML", "UTF-8", true));

In addition, it seems that the metadata is always appended to the response. Are there any open issues relating to this, or opinions on whether adding additional flexibility to the response format would be of interest for 1.4?

Thanks,
Peter

--
Peter M. Wolanin, Ph.D.
Momentum Specialist, Acquia. Inc.
peter.wola...@acquia.com
Caching per segmentReader?
Are we planning to implement caching (docsets, documents, results) per segment reader, or is that something that's already going to be in 1.4?
A question about SolrJ range query?
We can use a Solr range query like:

http://localhost:8983/solr/select?q=queryStr&fq=x:[10 TO 100] AND y:[20 TO 300]

or:

http://localhost:8983/solr/select?q=queryStr&fq=x:[10 TO 100]&fq=y:[20 TO 300]

My question: how do I build this range query using SolrJ? Does anybody know?

Thanks!
enzhao...@gmail.com
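For reference, a minimal SolrJ sketch of the second form above (the server URL and class name are illustrative; each addFilterQuery call becomes its own fq parameter):

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class RangeQueryExample {
        public static void main(String[] args) throws Exception {
            SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

            SolrQuery query = new SolrQuery("queryStr");
            query.addFilterQuery("x:[10 TO 100]"); // first fq
            query.addFilterQuery("y:[20 TO 300]"); // second fq

            QueryResponse response = server.query(query);
            System.out.println("hits: " + response.getResults().getNumFound());
        }
    }

For the first (single-fq) form, pass the whole boolean expression in one call: query.addFilterQuery("x:[10 TO 100] AND y:[20 TO 300]").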