checking the size of the index using solrj APIs
Hi, I need to monitor the index for the following information:

1. Size of the index
2. Last time the index was updated

Although I did an extensive search of the APIs, I can't find anything that does this. Please help.
Re: checking the size of the index using solrj APIs
> I need to monitor the index for the following information:
>
> 1. Size of the index
> 2. Last time the index was updated
>
> Although I did an extensive search of the APIs, I can't find
> anything that does this.

solr/admin/stats.jsp is actually XML converted to HTML with stats.xsl. It includes information about when the last commit happened, etc.: Fri Apr 02 17:07:03 EEST 2010

The LukeRequestHandler also shows the last-modified time, in UTC: solr/admin/luke?wt=xml&numTerms=0 returns 2010-04-02T14:07:07Z

I am not sure about the size. I can see it in stats.jsp (226.86 MB) because I have the ReplicationHandler registered.
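For the SolrJ side of the question, a minimal sketch along these lines should pull the same information out of the LukeRequestHandler. The URL and the exact response keys are assumptions based on the /admin/luke output quoted above, and the index size on disk is not part of that response, so the stats.jsp / ReplicationHandler route mentioned above is still needed for that:

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.request.LukeRequest;
import org.apache.solr.client.solrj.response.LukeResponse;
import org.apache.solr.common.util.NamedList;

public class IndexInfo {
    public static void main(String[] args) throws Exception {
        // Solr URL is a placeholder for whatever instance is being monitored
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

        LukeRequest luke = new LukeRequest();   // issues a request to /admin/luke
        luke.setNumTerms(0);                    // index-level info only, no per-field term stats

        LukeResponse rsp = luke.process(server);
        NamedList<?> index = (NamedList<?>) rsp.getResponse().get("index");

        System.out.println("numDocs:      " + index.get("numDocs"));
        System.out.println("lastModified: " + index.get("lastModified"));
    }
}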
Experience with indexing billions of documents?
We are currently indexing 5 million books in Solr, scaling up over the next few years to 20 million. However, we are using the entire book as a Solr document. We are evaluating the possibility of indexing individual pages, as there are some use cases where users want the most relevant pages regardless of what book they occur in. However, we estimate that we are talking about somewhere between 1 and 6 billion pages and have concerns over whether Solr will scale to this level.

Does anyone have experience using Solr with 1-6 billion Solr documents?

The Lucene file format document (http://lucene.apache.org/java/3_0_1/fileformats.html#Limitations) mentions a limit of about 2 billion document ids. I assume this is the Lucene internal document id and would therefore be a per-index/per-shard limit. Is this correct?

Tom Burton-West.
Re: Index db data
Hello trueman,

here are some helpful pages from the wiki:

DataImportHandler: http://wiki.apache.org/solr/DataImportHandler

And if you run into trouble, you may find an answer here: http://wiki.apache.org/solr/DataImportHandlerFaq

You can find an example data-config.xml in the example directory of your Solr download. Look at example/example-DIH/solr/db/conf; that's where you will find a db-data-config.xml. If you read through the wiki first, I think you will have no problems setting up your own DB import for Solr.

Hope this helps
- Mitch
Re: Index db data
In addition to my first post: the wiki gives an HTTP request for full-import. I haven't worked with SolrJ yet, but I think you need to copy the parts of the URL that reflect the directory structure of your Solr instance. For the example I suggested having a look at, I think it will look like this, if your DataImportHandler is called "yourDataImportHandler": /example/example-DIH/solr/db/yourDataImportHandler?command=full-import

Searching for "SolrJ" you may find some examples of a SolrJ client application.
Re: Index db data
No HTTP call. That's a misunderstanding. For an HTTP call you need a URL like this: http://<host>:<port>/solr/dataimport?command=full-import

For the SolrJ client I *think* your query only needs to look like this: /solr/dataimport?command=full-import

However, I have never worked with the SolrJ client, so maybe I am wrong.
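In case it helps, here is a rough sketch of what triggering that command from a SolrJ client might look like. It is untested here, and the /dataimport handler name and the Solr URL are assumptions based on the default example configuration:

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.request.QueryRequest;
import org.apache.solr.common.params.ModifiableSolrParams;
import org.apache.solr.common.util.NamedList;

public class TriggerFullImport {
    public static void main(String[] args) throws Exception {
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

        ModifiableSolrParams params = new ModifiableSolrParams();
        params.set("command", "full-import");

        QueryRequest req = new QueryRequest(params);
        req.setPath("/dataimport");   // must match the handler name registered in solrconfig.xml

        // DataImportHandler answers immediately with its current status;
        // the import itself runs asynchronously on the server.
        NamedList<Object> response = server.request(req);
        System.out.println(response);
    }
}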
Re: Experience with indexing billions of documents?
My guess is that you will need to take advantage of Solr 1.5's upcoming cloud/cluster renovations and use multiple indexes to comfortably achieve those numbers. Hypothetically, in that case, you won't be limited by the single-index docid limitations of Lucene.

> We are currently indexing 5 million books in Solr, scaling up over the
> next few years to 20 million. However, we are using the entire book as a
> Solr document. We are evaluating the possibility of indexing individual
> pages as there are some use cases where users want the most relevant pages
> regardless of what book they occur in. However, we estimate that we are
> talking about somewhere between 1 and 6 billion pages and have concerns
> over whether Solr will scale to this level.
>
> Does anyone have experience using Solr with 1-6 billion Solr documents?
>
> The Lucene file format document
> (http://lucene.apache.org/java/3_0_1/fileformats.html#Limitations)
> mentions a limit of about 2 billion document ids. I assume this is the
> Lucene internal document id and would therefore be a per-index/per-shard
> limit. Is this correct?
>
> Tom Burton-West.
Re: Experience with indexing billions of documents?
You can do this today with multiple indexes, replication and distributed searching. SolrCloud/clustering will certainly make life easier when it comes to managing these, but with distributed searches over multiple indexes, you're limited only by how much hardware you can throw at it.

On Fri, Apr 2, 2010 at 6:17 PM, wrote:
> My guess is that you will need to take advantage of Solr 1.5's upcoming
> cloud/cluster renovations and use multiple indexes to comfortably achieve
> those numbers. Hypothetically, in that case, you won't be limited by the
> single-index docid limitations of Lucene.
>
> > We are currently indexing 5 million books in Solr, scaling up over the
> > next few years to 20 million. However, we are using the entire book as a
> > Solr document. We are evaluating the possibility of indexing individual
> > pages as there are some use cases where users want the most relevant pages
> > regardless of what book they occur in. However, we estimate that we are
> > talking about somewhere between 1 and 6 billion pages and have concerns
> > over whether Solr will scale to this level.
> >
> > Does anyone have experience using Solr with 1-6 billion Solr documents?
> >
> > The Lucene file format document
> > (http://lucene.apache.org/java/3_0_1/fileformats.html#Limitations)
> > mentions a limit of about 2 billion document ids. I assume this is the
> > Lucene internal document id and would therefore be a per-index/per-shard
> > limit. Is this correct?
> >
> > Tom Burton-West.
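As a small illustration of the distributed-search side of that, a query spanning several shards might look roughly like this in SolrJ; the hostnames, core paths and the query field are placeholders, and any node (or a separate aggregator) can receive the request:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class ShardedSearch {
    public static void main(String[] args) throws Exception {
        // Any one of the shards (or a dedicated aggregator node) receives the request
        SolrServer server = new CommonsHttpSolrServer("http://shard1:8983/solr");

        SolrQuery query = new SolrQuery("ocr_text:whale");   // placeholder page-level query
        // The shards parameter lists every index that should be searched and merged
        query.set("shards", "shard1:8983/solr,shard2:8983/solr,shard3:8983/solr");

        QueryResponse rsp = server.query(query);
        System.out.println("total hits across all shards: " + rsp.getResults().getNumFound());
    }
}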
Re: Experience with indexing billions of documents?
A colleague of mine is using native Lucene + some home-grown patches/optimizations to index over 13B small documents in a 32-shard environment, which is around 406M docs per shard. If there's a 2B doc id limitation in Lucene then I assume he's patched it himself.

On Fri, Apr 2, 2010 at 1:17 PM, wrote:
> My guess is that you will need to take advantage of Solr 1.5's upcoming
> cloud/cluster renovations and use multiple indexes to comfortably achieve
> those numbers. Hypothetically, in that case, you won't be limited by the
> single-index docid limitations of Lucene.
>
> > We are currently indexing 5 million books in Solr, scaling up over the
> > next few years to 20 million. However, we are using the entire book as a
> > Solr document. We are evaluating the possibility of indexing individual
> > pages as there are some use cases where users want the most relevant pages
> > regardless of what book they occur in. However, we estimate that we are
> > talking about somewhere between 1 and 6 billion pages and have concerns
> > over whether Solr will scale to this level.
> >
> > Does anyone have experience using Solr with 1-6 billion Solr documents?
> >
> > The Lucene file format document
> > (http://lucene.apache.org/java/3_0_1/fileformats.html#Limitations)
> > mentions a limit of about 2 billion document ids. I assume this is the
> > Lucene internal document id and would therefore be a per-index/per-shard
> > limit. Is this correct?
> >
> > Tom Burton-West.
Re: MoreLikeThis function queries
Bueller? Anyone? :)
Re: MoreLikeThis function queries
It's Friday, dude. Give it a couple days. ;)

On Fri, 2010-04-02 at 11:50 -0800, Blargy wrote:
> Bueller? Anyone? :)
highlighter issue
Hello *, I have a field that indexes the string "the ex-girlfriend" as these tokens: [the, exgirlfriend, ex, girlfriend]. They are then passed to the EdgeNGram filter. This allows me to match different user spellings and allows for partial highlighting; however, a token like 'ex' gets generated twice, which should be fine, except the highlighter seems to highlight that token twice even though it has the same offsets (4,6).

Is there a way to make the highlighter not highlight the same token twice, or do I have to create a token filter that would dump tokens with equal text and offsets?

Basically, what's happening now is that if I search for 'the e', I get: 'SeinfeldThe EEx-Girlfriend'

For 'the ex', I get: 'SeinfeldThe ExEx-Girlfriend'

and so on.

Thanks much

--joe
Re: Search across more than one field (dismax) ignored
Hoss, thank you for responding. This behaviour was caused by unexpected behaviour of the ResourceLoader, triggered by a UTF-8 BOM-encoded file. I mentioned this in another thread on the mailing list; sorry for forgetting to say it here as well.

Kind regards
- Mitch
Re: highlighter issue
Will adding the RemoveDuplicatesTokenFilter(Factory) do the trick here?

Erik

On Apr 2, 2010, at 4:13 PM, Joe Calderon wrote:

Hello *, I have a field that indexes the string "the ex-girlfriend" as these tokens: [the, exgirlfriend, ex, girlfriend]. They are then passed to the EdgeNGram filter. This allows me to match different user spellings and allows for partial highlighting; however, a token like 'ex' gets generated twice, which should be fine, except the highlighter seems to highlight that token twice even though it has the same offsets (4,6).

Is there a way to make the highlighter not highlight the same token twice, or do I have to create a token filter that would dump tokens with equal text and offsets?

Basically, what's happening now is that if I search for 'the e', I get: 'SeinfeldThe EEx-Girlfriend'

For 'the ex', I get: 'SeinfeldThe ExEx-Girlfriend'

and so on.

Thanks much

--joe
Re: highlighter issue
I had tried it earlier with no effect. When I looked at the source, it doesn't look at offsets at all, just position increments, so short of somebody finding a better way I'm going to create a similar filter that compares offsets...

On Fri, Apr 2, 2010 at 2:07 PM, Erik Hatcher wrote:
> Will adding the RemoveDuplicatesTokenFilter(Factory) do the trick here?
>
> Erik
>
> On Apr 2, 2010, at 4:13 PM, Joe Calderon wrote:
>
>> Hello *, I have a field that indexes the string "the ex-girlfriend"
>> as these tokens: [the, exgirlfriend, ex, girlfriend]. They are then
>> passed to the EdgeNGram filter. This allows me to match different
>> user spellings and allows for partial highlighting; however, a token
>> like 'ex' gets generated twice, which should be fine, except the
>> highlighter seems to highlight that token twice even though it has
>> the same offsets (4,6).
>>
>> Is there a way to make the highlighter not highlight the same token
>> twice, or do I have to create a token filter that would dump tokens
>> with equal text and offsets?
>>
>> Basically, what's happening now is that if I search for
>>
>> 'the e', I get:
>> 'Seinfeld The EEx-Girlfriend'
>>
>> For 'the ex', I get:
>> 'Seinfeld The ExEx-Girlfriend'
>>
>> and so on.
>>
>> Thanks much
>>
>> --joe
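Since nobody in the thread had a ready-made filter, here is a rough sketch of the kind of filter described above: one that drops a token when it has the same text and the same offsets as the token just emitted. The class name is made up, the sketch is untested, and it only compares against the immediately preceding token (which matches the back-to-back duplicates described here), so treat it as a starting point rather than a drop-in:

import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

/** Drops a token whose text, start offset and end offset all match the previous token. */
public final class RemoveDuplicateOffsetsFilter extends TokenFilter {

    private final TermAttribute termAtt = addAttribute(TermAttribute.class);
    private final OffsetAttribute offsetAtt = addAttribute(OffsetAttribute.class);

    private String lastTerm = null;
    private int lastStart = -1;
    private int lastEnd = -1;

    public RemoveDuplicateOffsetsFilter(TokenStream input) {
        super(input);
    }

    @Override
    public boolean incrementToken() throws IOException {
        while (input.incrementToken()) {
            String term = termAtt.term();
            int start = offsetAtt.startOffset();
            int end = offsetAtt.endOffset();

            boolean duplicate = term.equals(lastTerm) && start == lastStart && end == lastEnd;
            lastTerm = term;
            lastStart = start;
            lastEnd = end;

            if (!duplicate) {
                return true;   // keep this token
            }
            // otherwise skip it and pull the next token from the stream
        }
        return false;          // end of stream
    }

    @Override
    public void reset() throws IOException {
        super.reset();
        lastTerm = null;
        lastStart = -1;
        lastEnd = -1;
    }
}

To use it from schema.xml it would also need a matching TokenFilterFactory, placed after the EdgeNGram filter in the analyzer chain.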
Unable to load MailEntityProcessor or org.apache.solr.handler.dataimport.MailEntityProcessor
Hi

I am experimenting with Solr to index my Gmail and am experiencing an error: 'Unable to load MailEntityProcessor or org.apache.solr.handler.dataimport.MailEntityProcessor'

I downloaded a fresh 1.4 tgz, extracted it and added the following to example/solr/config/solrconfig.xml: /home/andrew/bin/apache-solr-1.5-dev/example/solr/conf/email-data-config.xml

email-data-config.xml contained the following:

Whenever I try to import data using /dataimport?command=full-import I see the error below:

Apr 2, 2010 10:14:51 PM org.apache.solr.handler.dataimport.DataImporter doFullImport
SEVERE: Full Import failed
org.apache.solr.handler.dataimport.DataImportHandlerException: Unable to load EntityProcessor implementation for entity:11418758786959 Processing Document # 1
        at org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:72)
        at org.apache.solr.handler.dataimport.DocBuilder.getEntityProcessor(DocBuilder.java:805)
        at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:536)
        at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:261)
        at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:185)
        at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:333)
        at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:391)
        at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:372)
Caused by: java.lang.ClassNotFoundException: Unable to load MailEntityProcessor or org.apache.solr.handler.dataimport.MailEntityProcessor
        at org.apache.solr.handler.dataimport.DocBuilder.loadClass(DocBuilder.java:966)
        at org.apache.solr.handler.dataimport.DocBuilder.getEntityProcessor(DocBuilder.java:802)
        ... 6 more
Caused by: org.apache.solr.common.SolrException: Error loading class 'MailEntityProcessor'
        at org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:373)
        at org.apache.solr.handler.dataimport.DocBuilder.loadClass(DocBuilder.java:956)
        ... 7 more
Caused by: java.lang.ClassNotFoundException: MailEntityProcessor
        at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:307)
        at java.net.FactoryURLClassLoader.loadClass(URLClassLoader.java:592)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:252)
        at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:320)
        at java.lang.Class.forName0(Native Method)
        at java.lang.Class.forName(Class.java:247)
        at org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:357)
        ... 8 more
Apr 2, 2010 10:14:51 PM org.apache.solr.update.DirectUpdateHandler2 rollback
INFO: start rollback
Apr 2, 2010 10:14:51 PM org.apache.solr.update.DirectUpdateHandler2 rollback
INFO: end_rollback

Am I missing a step somewhere? I have tried this with the standard Apache 1.4, a nightly of 1.5 and also the LucidWorks release, and get the same issue with each. The wiki isn't very detailed either. My background isn't in Java, so a lot of this is new to me.

Regards
Andrew McCombe
Re: MoreLikeThis function queries
Fair enough :)
Related terms/combined terms
Not sure of the exact vocabulary I am looking for, so I'll try to explain myself. Given a search term, is there any way to return a list of related/grouped keywords (based on the current state of the index) for that term?

For example, say I have a sports catalog and I search for "Callaway". Is there anything that could give me back "Callaway Driver", "Callaway Golf Balls", "Callaway Hat", "Callaway Glove", since these words are always grouped together/related?

Not sure if something like this is even possible. Thanks
Solr caches and nearly static indexes
My index has a number of shards that are nearly static, each with about 7 million documents. By nearly static, I mean that the only changes that normally happen to them are document deletions, done with the XML update handler. The process that does these deletions runs once every two minutes, and does them with a query on a field other than the one that's used for uniqueKey. Once a day, I will be adding data to these indexes with the DIH delta-import. One of my shards gets all new data once every two minutes, but it is less than 5% the size of the others.

The problem that I'm running into is that every time a delete is committed, my caches are suddenly invalid and I seem to have two options: spend a lot of time and I/O rewarming them, or suffer slow (3 seconds or longer) search times.

Is there any way to have the index keep its caches when the only thing that happens is deletions, then invalidate them when it's time to actually add data? It would have to be something I can dynamically change when switching between deletions and the daily import.

Thanks,
Shawn
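For reference, the delete step described above looks roughly like this from SolrJ; the URL, field name and query are placeholders, and this sketch only mirrors the workflow in the message (it is the commit that opens a new searcher and invalidates the caches), it does not change the cache behaviour:

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

public class PurgeDeletedDocs {
    public static void main(String[] args) throws Exception {
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

        // Delete by a query on a non-uniqueKey field, as described above
        server.deleteByQuery("deleted_flag:true");

        // The commit is what opens a new searcher and throws away the warmed caches
        server.commit();
    }
}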