Solr1.3 / MySql / Tomcat55 multiple delta-import inside a big full-import
Hi, I would like to know whether it takes much longer to run a limited full-import followed by multiple delta-imports to index the whole database. If I fire a full-import without a LIMIT of 4M in my query, it runs out of memory (OOM) because I have 8.5M documents. If I fire a full-import without a limit and with batchSize=-1, I monopolize the database and other requests stack up for 10 hours, but it does work. Do you have any advice? Thanks a lot -- View this message in context: http://www.nabble.com/Solr1.3---MySql---Tomcat55--multiple-delta-import-inside-a-big-full-import-tp20262801p20262801.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Performance Lucene / Solr
Hey, I think it will have the disadvantage of being a lot slower though... How were you handling things with Lucene? You must have used Java then? If you even want to get close to that performance I think you need to use non-HTTP embedded Solr. I am using this: - I wrote a Java JSP file to get an EmbeddedSolrServer - Now I call this JSP file from my PHP script and the JSP makes my search request to Solr - after that I generate a CSV file out of the JSP and read it from PHP It's the same way I did it with the prior Lucene engine I used. But now the performance is 10% of the prior Lucene speed :-( Greets -Ralf-
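For reference, getting an EmbeddedSolrServer from Java takes only a few lines; the following is a rough sketch against the Solr 1.3 SolrJ API, with the Solr home path as a placeholder and the core name "core_de" borrowed from the log output later in this thread:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.embedded.EmbeddedSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.core.CoreContainer;

public class EmbeddedSearch {
    public static void main(String[] args) throws Exception {
        // Point SolrJ at the Solr home directory (placeholder path).
        System.setProperty("solr.solr.home", "/path/to/solr/home");
        CoreContainer.Initializer initializer = new CoreContainer.Initializer();
        CoreContainer container = initializer.initialize();
        EmbeddedSolrServer server = new EmbeddedSolrServer(container, "core_de");

        // Same query the JSP would run, but in-process (no HTTP hop).
        SolrQuery query = new SolrQuery("Tools");
        query.setStart(0);
        query.setRows(30);
        QueryResponse rsp = server.query(query);
        for (SolrDocument doc : rsp.getResults()) {
            System.out.println(doc.getFieldValue("id"));
        }
        container.shutdown();
    }
}

Whether this beats the JSP-plus-CSV detour depends on where the time is actually going, but it does remove the extra serialization step.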
Re: DataImportHandler running out of memory
Hi Grant, How did you finally manage it? I have the same problem with less data (8.5M documents): if I put batchSize=-1, I slow the database down a lot, which is not good for the website, and requests stack up. What did you do? Thanks, Grant Ingersoll-6 wrote: > > I think it's a bit different. I ran into this exact problem about two > weeks ago on a 13 million record DB. MySQL doesn't honor the fetch > size for its v5 JDBC driver. > > See > http://www.databasesandlife.com/reading-row-by-row-into-java-from-mysql/ > or do a search for MySQL fetch size. > > You actually have to do setFetchSize(Integer.MIN_VALUE) (-1 doesn't > work) in order to get streaming in MySQL. > > -Grant > > On Jun 24, 2008, at 10:35 PM, Shalin Shekhar Mangar wrote: > >> Setting the batchSize to 1 would mean that the JDBC driver will keep >> that many rows in memory *for each entity* which uses that data source (if >> correctly implemented by the driver). Not sure how well the SQL Server >> driver implements this. Also keep in mind that Solr also needs memory to >> index documents. You can probably try setting the batch size to a lower >> value. >> >> The regular memory tuning stuff should apply here too -- try disabling >> autoCommit and turning off autowarming and see if it helps. >> >> On Wed, Jun 25, 2008 at 5:53 AM, wojtekpia <[EMAIL PROTECTED]> >> wrote: >> >>> I'm trying to load ~10 million records into Solr using the >>> DataImportHandler. >>> I'm running out of memory (java.lang.OutOfMemoryError: Java heap space) as >>> soon as I try loading more than about 5 million records. >>> >>> Here's my configuration: >>> I'm connecting to a SQL Server database using the sqljdbc driver. I've given >>> my Solr instance 1.5 GB of memory. I have set the dataSource batchSize to >>> 1. My SQL query is "select top XXX field1, ... from table1". I have >>> about 40 fields in my Solr schema. >>> >>> I thought the DataImportHandler would stream data from the DB rather than >>> loading it all into memory at once. Is that not the case? Any thoughts on >>> how to get around this (aside from getting a machine with more memory)? >>> >>> -- >>> View this message in context: >>> http://www.nabble.com/DataImportHandler-running-out-of-memory-tp18102644p18102644.html >>> Sent from the Solr - User mailing list archive at Nabble.com. >> >> -- >> Regards, >> Shalin Shekhar Mangar. > > -- > Grant Ingersoll > http://www.lucidimagination.com > > Lucene Helpful Hints: > http://wiki.apache.org/lucene-java/BasicsOfPerformance > http://wiki.apache.org/lucene-java/LuceneFAQ -- View this message in context: http://www.nabble.com/DataImportHandler-running-out-of-memory-tp18102644p20263146.html Sent from the Solr - User mailing list archive at Nabble.com.
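For anyone hitting the same wall outside of DIH, here is a minimal sketch of the streaming trick Grant describes for MySQL Connector/J; the connection URL, credentials and query are placeholders, and the key detail is setFetchSize(Integer.MIN_VALUE) on a forward-only, read-only statement (DIH's batchSize="-1" is intended to trigger the same behaviour internally):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class StreamingRead {
    public static void main(String[] args) throws Exception {
        Class.forName("com.mysql.jdbc.Driver");
        // Placeholder URL and credentials.
        Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost/mydb", "user", "password");

        // MySQL only streams rows when the statement is forward-only,
        // read-only and the fetch size is Integer.MIN_VALUE; otherwise the
        // driver buffers the whole result set in the client JVM.
        Statement stmt = conn.createStatement(ResultSet.TYPE_FORWARD_ONLY,
                ResultSet.CONCUR_READ_ONLY);
        stmt.setFetchSize(Integer.MIN_VALUE);

        ResultSet rs = stmt.executeQuery("select id, title from documents");
        while (rs.next()) {
            // Hand each row off for indexing instead of accumulating it.
            System.out.println(rs.getString("id") + " " + rs.getString("title"));
        }
        rs.close();
        stmt.close();
        conn.close();
    }
}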
Re: Changing mergeFactor in mid-stream?
Otis Gospodnetic wrote: Yes, you can change the mergeFactor. More important than the mergeFactor is this: <ramBufferSizeMB>32</ramBufferSizeMB> Pump it up as much as your hardware/JVM allows. And use appropriate -Xmx, of course. Is that true? I thought there was a sweet spot for the RAM buffer (and not as high as you'd think)? You might want to test that out a bit before riding it too high...
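For context, both knobs live in the <indexDefaults> section of solrconfig.xml; the values below are just the stock defaults, not a recommendation:

<indexDefaults>
  <!-- number of segments merged at once; lower means more, smaller merges -->
  <mergeFactor>10</mergeFactor>
  <!-- flush the in-memory index buffer once it reaches this many MB;
       for indexing throughput this usually matters more than mergeFactor -->
  <ramBufferSizeMB>32</ramBufferSizeMB>
</indexDefaults>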
Re: Performance Lucene / Solr
Hi, Thx a lot for the tip! But when I try it I got > HTTP/1.1 500 null java.lang.NullPointerException at org.apache.solr.common.util.StrUtils.splitSmart(StrUtils.java:37) My request is: INFO: [core_de] webapp=/solr path=/select/ params={wt=phps&query=Tools&records=30&start_record=0} status=500 QTime=1 Exception in SOLR: SCHWERWIEGEND: java.lang.NullPointerException
at org.apache.solr.common.util.StrUtils.splitSmart(StrUtils.java:37)
at org.apache.solr.search.OldLuceneQParser.parse(LuceneQParserPlugin.java:104)
at org.apache.solr.search.QParser.getQuery(QParser.java:88)
at org.apache.solr.handler.component.QueryComponent.prepare(QueryComponent.java:82)
at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:148)
at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1204)
at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:303)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:232)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:202)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:173)
at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:213)
at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:178)
at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:126)
at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:105)
at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:107)
at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:148)
at org.apache.coyote.http11.Http11AprProcessor.process(Http11AprProcessor.java:833)
at org.apache.coyote.http11.Http11AprProtocol$Http11ConnectionHandler.process(Http11AprProtocol.java:639)
at org.apache.tomcat.util.net.AprEndpoint$Worker.run(AprEndpoint.java:1285)
at java.lang.Thread.run(Thread.java:595)
Greets -Ralf-
Re: Performance Lucene / Solr
On Fri, Oct 31, 2008 at 5:10 PM, Kraus, Ralf | pixelhouse GmbH < [EMAIL PROTECTED]> wrote: > > > HTTP/1.1 500 null java.lang.NullPointerException at > org.apache.solr.common.util.StrUtils.splitSmart(StrUtils.java:37) > > My Request is : > INFO: [core_de] webapp=/solr path=/select/ > params={wt=phps&query=Tools&records=30&start_record=0} status=500 QTime=1 > > The parameter name should be "q" instead of "query". -- Regards, Shalin Shekhar Mangar.
Re: Using Solrj
On Fri, Oct 31, 2008 at 4:32 PM, Raghunandan Rao < [EMAIL PROTECTED]> wrote: > I am doing that but the API is in experimental stage. Not sure to use it or > not. BTW can you also let me know how clustering works on Windows OS cos I > saw clustering scripts for Unix OS bundled out with Solr release. > > Which API is in experimental stage? By clustering, I think you mean replication. Until the last release (1.3), we had support only for *nix platforms for replication. In the next release we have a Java based replication coming which works for Windows also. If you want, you can try it with one of the nightly (un-released) builds. http://wiki.apache.org/solr/SolrReplication -- Regards, Shalin Shekhar Mangar.
RE: Using Solrj
Thank you. I was talking about DataImportHandler API. -Original Message- From: Shalin Shekhar Mangar [mailto:[EMAIL PROTECTED] Sent: Friday, October 31, 2008 5:19 PM To: solr-user@lucene.apache.org Subject: Re: Using Solrj On Fri, Oct 31, 2008 at 4:32 PM, Raghunandan Rao < [EMAIL PROTECTED]> wrote: > I am doing that but the API is in experimental stage. Not sure to use it or > not. BTW can you also let me know how clustering works on Windows OS cos I > saw clustering scripts for Unix OS bundled out with Solr release. > > Which API is in experimental stage? By clustering, I think you mean replication. Until the last release (1.3), we had support only for *nix platforms for replication. In the next release we have a Java based replication coming which works for Windows also. If you want, you can try it with one of the nightly (un-released) builds. http://wiki.apache.org/solr/SolrReplication -- Regards, Shalin Shekhar Mangar.
Re: Solr1.3 / MySql / Tomcat55 multiple delta-import inside a big full-import
Sorry, I wasn't clear. The requests don't stack up on the Solr index or its queries; they stack up on our main MySQL database. When I do a full-import to create the Solr index, MySQL honors it and won't drive Solr OOM, but with batchSize=-1 it uses MySQL memory, which leaves less memory for the rest of the requests on the database (insert, update, delete ...). :) Thanks for your answer. Shalin Shekhar Mangar wrote: > > On Fri, Oct 31, 2008 at 3:27 PM, sunnyfr <[EMAIL PROTECTED]> wrote: > >> I would like to know whether it takes much longer to run a limited full-import >> and multiple delta-imports to index the whole database. >> If I fire a full-import without a LIMIT of 4M in my query, it runs out of memory (OOM) >> because I have 8.5M documents. >> If I fire a full-import without a limit and with batchSize=-1, I will monopolize the >> database and other requests stack up for 10 hours, but it will work. >> > > A full-import should not stack up other requests, though response times will be > higher because of the heavy processing. > > Most users have a dedicated Master instance used only for indexing and many > slaves dedicated for serving search requests. Maybe you can try a > master-slave architecture. > > -- > Regards, > Shalin Shekhar Mangar. > > -- View this message in context: http://www.nabble.com/Solr1.3---MySql---Tomcat55--multiple-delta-import-inside-a-big-full-import-tp20262801p20264431.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Performance Lucene / Solr
On Oct 31, 2008, at 6:14 AM, Kraus, Ralf | pixelhouse GmbH wrote: Hey, I think it will have the disadvantage of being a lot slower though... How were you handling things with Lucene? You must have used Java then? If you even want to get close to that performance I think you need to use non-HTTP embedded Solr. I am using this: - I wrote a Java JSP file to get an EmbeddedSolrServer - Now I call this JSP file from my PHP script and the JSP makes my search request to Solr - after that I generate a CSV file out of the JSP and read it from PHP It's the same way I did it with the prior Lucene engine I used. But now the performance is 10% of the prior Lucene speed :-( No need to involve JSP at all to get Solr results in PHP. Rather than those hoops, simply use one of the PHP response writers. Look in the example Solr config, uncomment this: <queryResponseWriter name="phps" class="org.apache.solr.request.PHPSerializedResponseWriter"/> Then in PHP, hit Solr directly like this: $response = unserialize(file_get_contents($url)); Where $url is something like http://localhost:8983/solr/select?q=*:* Erik
Re: Using Solrj
On Fri, Oct 31, 2008 at 5:21 PM, Raghunandan Rao < [EMAIL PROTECTED]> wrote: > Thank you. > I was talking about DataImportHandler API. > > Most likely, you will not need to use the API. DataImportHandler will let you index your database without writing code -- you just need an XML configuration file. Only if you need to do some custom tasks, will you need to touch the API. The API is marked as experimental just because it is new and we're not sure of the use-cases that might come up in the near future and therefore we'd like the freedom of modifying it. As always, we strive to maintain backwards-compatibility as much as possible. I know that DataImportHandler is being used in production in a few high traffic websites. -- Regards, Shalin Shekhar Mangar.
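As a rough illustration of how little is needed, a minimal data-config.xml for a JDBC import might look like the sketch below (driver, URL, credentials, table and field names are all placeholders); the handler is then registered in solrconfig.xml and invoked with command=full-import:

<dataConfig>
  <dataSource driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost/mydb"
              user="user" password="password"/>
  <document>
    <!-- one Solr document per row returned by the query -->
    <entity name="item" query="select id, name, description from item">
      <field column="id" name="id"/>
      <field column="name" name="name"/>
      <field column="description" name="description"/>
    </entity>
  </document>
</dataConfig>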
RE: Using Solrj
I am doing that but the API is in an experimental stage. Not sure whether to use it or not. BTW, can you also let me know how clustering works on Windows? I saw clustering scripts for Unix bundled with the Solr release. -Original Message- From: Noble Paul നോബിള് नोब्ळ् [mailto:[EMAIL PROTECTED] Sent: Friday, October 31, 2008 11:37 AM To: solr-user@lucene.apache.org Subject: Re: Using Solrj First of all you need to index your data in Solr. I suggest DataImportHandler because it can help you join multiple tables and index the data. On Fri, Oct 31, 2008 at 10:20 AM, Raghunandan Rao <[EMAIL PROTECTED]> wrote: > Thank you so much. > > Here goes my use case: > > I need to search the database for a collection of input parameters which > touches 'n' tables. The data is very huge. The search query itself > is very dynamic. I use a lot of views for the same search. How do I make use of Solr > in this case? > > -Original Message- > From: Erick Erickson [mailto:[EMAIL PROTECTED] > Sent: Thursday, October 30, 2008 7:01 PM > To: solr-user@lucene.apache.org > Subject: Re: Using Solrj > > Generally, you need to get your head out of the database world and into > the search world to be successful with Lucene. For instance, one > of the cardinal tenets of database design is to normalize your > data. It goes against every instinct to *denormalize* your data when > creating a Lucene index explicitly so you do NOT have to think > in terms of joins or sub-queries. Whenever I start thinking this > way, I try to back up and think again. > > Both your posts indicate to me that you're thinking in database > terms. There are no views in Lucene, for instance. You refer > to tables. There are no tables in Lucene, there are only documents > with various numbers of fields. You could conceivably make your index > look like a database by creatively naming your document fields. But > that doesn't play to the strengths of Lucene *or* the database. > > In fact, there is NO requirement that documents have the *same* fields. > Which is really difficult to get into when thinking like a DBA. > > Lucene is designed to search text. Fast and well. It is NOT intended to > efficiently manipulate relationships *between* documents. There > are various hybrid solutions that people have used. That is, put the > data you really need to do text searching on in a Lucene index, > along with enough data to be able to get the *rest* of what you need > from your database. But it all depends upon the problem you're trying to > solve. > > But as Noble says, all this is too general to be really useful; you need > to provide quite a bit more detail about the problem you're trying to > solve to get useful recommendations. > > Best > Erick > > On Thu, Oct 30, 2008 at 8:50 AM, Raghunandan Rao < > [EMAIL PROTECTED]> wrote: > >> Thanks Noble. >> >> So you mean to say that I need to create a view according to my query and >> then index on the view and fetch? >> >> -Original Message- >> From: Noble Paul നോബിള് नोब्ळ् [mailto:[EMAIL PROTECTED] >> Sent: Thursday, October 30, 2008 6:16 PM >> To: solr-user@lucene.apache.org >> Subject: Re: Using Solrj >> >> hi, >> There are two sides to this. >> 1. indexing (getting data into Solr): SolrJ or DataImportHandler can be >> used for this >> 2. querying: getting data out of Solr. Here you do not have the choice >> of joining multiple tables. There is only one index for Solr. >> >> >> >> On Thu, Oct 30, 2008 at 5:34 PM, Raghunandan Rao >> <[EMAIL PROTECTED]> wrote: >> > Hi, >> > >> > I am trying to use Solrj for my web application.
I am indexing a table >> > using the @Field annotation tag. Now I need to index or query multiple >> > tables. Like, get all the employees who are managers in Finance >> > department (interacting with 3 entities). How do I do that? >> > >> > >> > >> > Does anyone have any idea? >> > >> > >> > >> > Thanks >> > >> > >> >> >> >> -- >> --Noble Paul >> > -- --Noble Paul
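To make Erick's denormalization point concrete with the @Field annotation already mentioned in this thread, here is a hedged SolrJ sketch; the bean, field names and query are invented for illustration. The idea is that each Solr document already carries the employee, manager flag and department, so the "join" happens at indexing time rather than query time:

import java.util.Arrays;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.beans.Field;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

public class EmployeeDoc {
    @Field String id;
    @Field String name;
    @Field String department;            // denormalized from the department table
    @Field("is_manager") boolean manager;

    public static void main(String[] args) throws Exception {
        SolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");

        EmployeeDoc doc = new EmployeeDoc();
        doc.id = "emp-42";
        doc.name = "Jane Doe";
        doc.department = "Finance";
        doc.manager = true;

        solr.addBeans(Arrays.asList(doc));   // or addBean(doc) for a single one
        solr.commit();
        // The original question then becomes a flat query:
        // q=department:Finance AND is_manager:true
    }
}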
Re: Performance Lucene / Solr
On Oct 31, 2008, at 7:42 AM, Shalin Shekhar Mangar wrote: On Fri, Oct 31, 2008 at 5:10 PM, Kraus, Ralf | pixelhouse GmbH < [EMAIL PROTECTED]> wrote: HTTP/1.1 500 null java.lang.NullPointerException at org.apache.solr.common.util.StrUtils.splitSmart(StrUtils.java:37) My Request is : INFO: [core_de] webapp=/solr path=/select/ params={wt=phps&query=Tools&records=30&start_record=0} status=500 QTime=1 The parameter name should be "q" instead of "query". And rows instead of records, and start instead of start_record. :) Erik
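So the failing request above, rewritten with the standard parameter names, would look something like this (host and port as in the stock example):

http://localhost:8983/solr/select?q=Tools&rows=30&start=0&wt=phps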
Re: Solr1.3 / MySql / Tomcat55 multiple delta-import inside a big full-import
On Fri, Oct 31, 2008 at 3:27 PM, sunnyfr <[EMAIL PROTECTED]> wrote: > > I would like to know whether it takes much longer to run a limited full-import and > multiple delta-imports to index the whole database. > If I fire a full-import without a LIMIT of 4M in my query, it runs out of memory (OOM) > because I have 8.5M documents. > If I fire a full-import without a limit and with batchSize=-1, I will monopolize the > database and other requests stack up for 10 hours, but it will work. > A full-import should not stack up other requests, though response times will be higher because of the heavy processing. Most users have a dedicated Master instance used only for indexing and many slaves dedicated for serving search requests. Maybe you can try a master-slave architecture. -- Regards, Shalin Shekhar Mangar.
Re: Performance Lucene / Solr
Hi, And rows instead of records, and start instead of start_record. :) Erik You're my man :-) Greets -Ralf-
Re: Performance Lucene / Solr
Hi, <queryResponseWriter name="phps" class="org.apache.solr.request.PHPSerializedResponseWriter"/> Then in PHP, hit Solr directly like this: $response = unserialize(file_get_contents($url)); Where $url is something like http://localhost:8983/solr/select?q=*:* Now Solr is 2 times faster than Lucene => Strike! Hello weekend, I am coming :-) Greets -Ralf-
Re: DataImportHandler running out of memory
I've moved the FAQ to a new page: http://wiki.apache.org/solr/DataImportHandlerFaq The DIH page is too big and editing has become harder. On Thu, Jun 26, 2008 at 6:07 PM, Shalin Shekhar Mangar <[EMAIL PROTECTED]> wrote: > I've added a FAQ section to the DataImportHandler wiki page which captures the > questions on out-of-memory exceptions with both the MySQL and MS SQL Server > drivers. > > http://wiki.apache.org/solr/DataImportHandler#faq > > On Thu, Jun 26, 2008 at 9:36 AM, Noble Paul നോബിള് नोब्ळ् > <[EMAIL PROTECTED]> wrote: >> We must document this information in the wiki. We never had a chance >> to play w/ MS SQL Server. >> --Noble >> >> On Thu, Jun 26, 2008 at 12:38 AM, wojtekpia <[EMAIL PROTECTED]> wrote: >>> >>> It looks like that was the problem. With responseBuffering=adaptive, I'm able >>> to load all my data using the sqljdbc driver. >>> -- >>> View this message in context: >>> http://www.nabble.com/DataImportHandler-running-out-of-memory-tp18102644p18119732.html >>> Sent from the Solr - User mailing list archive at Nabble.com. >> >> -- >> --Noble Paul > > -- > Regards, > Shalin Shekhar Mangar. -- --Noble Paul
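For MS SQL Server the fix mentioned above sits on the JDBC connection URL rather than on the fetch size; a hedged sketch of the corresponding DIH dataSource line, with server, database and credentials as placeholders:

<dataSource driver="com.microsoft.sqlserver.jdbc.SQLServerDriver"
            url="jdbc:sqlserver://localhost;databaseName=mydb;responseBuffering=adaptive"
            user="user" password="password"/>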
Re: DIH and rss feeds
Is that right? I find the wording of "clean" a little confusing. I would have thought this is what I had needed earlier, but the topic came up regarding the fact that you cannot deleteByQuery for an entity you want to flush w/ delta-import. I just noticed that the original JIRA request says it was implemented recently ... https://issues.apache.org/jira/browse/SOLR-801 I'm assuming this means your war needs to come from a trunk copy? Does this patch affect that param at all? - Jon On Oct 31, 2008, at 2:05 AM, Noble Paul നോബിള് नोब्ळ् wrote: run full-import with clean=false for full-import clean is set to true by default and for delta-import clean is false by default. On Fri, Oct 31, 2008 at 9:16 AM, Lance Norskog <[EMAIL PROTECTED]> wrote: I have a DataImportHandler configured to index from an RSS feed. It is a "latest stuff" feed. It reads the feed and indexes the 100 documents harvested from the feed. So far, works great. Now: a few hours later there are a different 100 "latest" documents. How do I add those to the index so I will have 200 documents? 'full-import' throws away the first 100. 'delta-import' is not implemented. What is the special trick here? I'm using the Solr-1.3.0 release. Thanks, Lance Norskog -- --Noble Paul
What are the ways to update / delete Solr data?
Hello, I'm trying to find the best way to update / delete data according to my project (developed with JavaScript and Rails). I would like to do something like this: http://localhost:8983/solr/update/?q=id:1&rate=4 and http://localhost:8983/solr/delete/?q=id:1 Is that possible? But I found only these two ways: http://localhost:8983/solr/update/csv?commit=true&separator=%09&escape=\&stream.file=/tmp/result.text or using an XML file like this: <delete><query>load_id:20070424150841</query></delete> The last possibility is to use the solr-ruby library. Is there any way I forgot? Thanks, Vincent -- View this message in context: http://www.nabble.com/What-are-the-way-to-update---delete-solr-datas--tp20268507p20268507.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: DIH and rss feeds
The "clean" parameter is there in the 1.3 release. The full-import is by definition "full" so we delete all existing documents at the start. If you don't want to clean the index, you can pass clean=false and DIH will just add them. On Fri, Oct 31, 2008 at 8:58 PM, Jon Baer <[EMAIL PROTECTED]> wrote: > Is that right? I find the wording of "clean" a little confusing. I would > have thought this is what I had needed earlier but the topic came up > regarding the fact that you can not deleteByQuery for an entity you want to > flush w/ delta-import. > > I just noticed that the original JIRA request says it was implemented > recently ... > > https://issues.apache.org/jira/browse/SOLR-801 > > Im assuming this means your war needs to come from trunk copy? Does this > patch affect that param @ all? > > - Jon > > > On Oct 31, 2008, at 2:05 AM, Noble Paul നോബിള് नोब्ळ् wrote: > > run full-import with clean=false >> >> for full-import clean is set to true by default and for delta-import >> clean is false by default. >> >> On Fri, Oct 31, 2008 at 9:16 AM, Lance Norskog <[EMAIL PROTECTED]> wrote: >> >>> I have a DataImportHandler configured to index from an RSS feed. It is a >>> "latest stuff" feed. It reads the feed and indexes the 100 documents >>> harvested from the feed. So far, works great. >>> >>> Now: a few hours later there are a different 100 "lastest" documents. How >>> do >>> I add those to the index so I will have 200 documents? 'full-import' >>> throws >>> away the first 100. 'delta-import' is not implemented. What is the >>> special >>> trick here? I'm using the Solr-1.3.0 release. >>> >>> Thanks, >>> >>> Lance Norskog >>> >>> >> >> >> -- >> --Noble Paul >> > > -- Regards, Shalin Shekhar Mangar.
Re: What are the ways to update / delete Solr data?
On Oct 31, 2008, at 11:40 AM, Vincent Pérès wrote: The last possibility is to use the solr-ruby library. If you're using Ruby, that's what I'd use. Were your other proposals to still do those calls from Ruby, but with the HTTP library directly? Erik
Re: What are the ways to update / delete Solr data?
Thanks for your quick answer. I'm using only HTTP to display my results, which is why I would like to continue that way. If I can use HTTP instead of the solr-ruby library, it will be better for me. Erik Hatcher wrote: > > > On Oct 31, 2008, at 11:40 AM, Vincent Pérès wrote: >> The last possibility is to use the solr-ruby library. > > If you're using Ruby, that's what I'd use. Were your other proposals > to still do those calls from Ruby, but with the HTTP library directly? > > Erik > > > -- View this message in context: http://www.nabble.com/What-are-the-way-to-update---delete-solr-datas--tp20268507p20268773.html Sent from the Solr - User mailing list archive at Nabble.com.
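For the record, the plain-HTTP route is to POST small XML messages to the update handler (e.g. http://localhost:8983/solr/update with Content-Type text/xml). A sketch using the id and rate values from the original question, bearing in mind that Solr replaces a whole document rather than updating a single field, so the <add> must contain every field of the document:

<add>
  <doc>
    <field name="id">1</field>
    <field name="rate">4</field>
  </doc>
</add>

<delete><id>1</id></delete>
<delete><query>id:1</query></delete>

<commit/>

Each message is a separate POST, and nothing is visible to searchers until the <commit/>.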
Re: date range query performance
: Concrete example, this query just took 18s: : : instance:client\-csm.symplicity.com AND dt:[2008-10-01T04:00:00Z TO : 2008-10-30T03:59:59Z] AND label_facet:"Added to Position" : I saw a thread from Apr 2008 which explains the problem being due to too much : precision on the DateField type, and the range expansion leading to far too : many elements being checked. Proposed solution appears to be a hack where you : index date fields as strings and hack together date functions to generate : proper queries/format results. for the record, you don't need to index as a "StrField" to get this benefit, you can still index using DateField, you just need to round your dates to some less granular level .. if you always want to round down, you don't even need to do the rounding yourself, just add "/SECOND" or "/MINUTE" or "/HOUR" to each of your dates before sending them to solr. (SOLR-741 proposes adding a config option to DateField to let this be done server side) your example query seems to be happy with hour resolution, but in theory if sometimes you needed hour resolution when doing "big ranges" but more precise resolution when doing "small ranges" you could even in theory have a "coarse" date field that you round to an hour, and redundantly index the same data in a "fine" date field with minute or second resolution. Also: if you frequently reuse the same ranges over and over (ie: you have a form widget people pick from so on any given day there are only N discrete ranges being used) putting them in an "fq" param will let them be cached separately from the main query string (instance:client\-csm.symplicity.com) so different searches using the same date ranges will be faster. ditto for your label_facet:"Added to Position" clause. -Hoss
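Putting those two suggestions together, the example query might be reworked along these lines (hour rounding picked arbitrarily; the values would be URL-encoded in a real request):

q=instance:client\-csm.symplicity.com
&fq=dt:[2008-10-01T04:00:00Z/HOUR TO 2008-10-30T03:59:59Z/HOUR]
&fq=label_facet:"Added to Position"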
Re: date range query performance
We have implemented the suggested reduction in granularity by dropping time altogether and simply disallowing time filtering. This, in light of the other search filters we have provided, should prove sufficient for our user base. We did keep the fine-granularity field, not for filtering, but for sorting. We definitely need the log entries to be presented in chronological order, so the finer-resolution date field is useful for that at least. Thanks for the detailed response. Alok On Oct 31, 2008, at 2:16 PM, Chris Hostetter wrote: : Concrete example, this query just took 18s: : : instance:client\-csm.symplicity.com AND dt:[2008-10-01T04:00:00Z TO : 2008-10-30T03:59:59Z] AND label_facet:"Added to Position" : I saw a thread from Apr 2008 which explains the problem being due to too much : precision on the DateField type, and the range expansion leading to far too : many elements being checked. Proposed solution appears to be a hack where you : index date fields as strings and hack together date functions to generate : proper queries/format results. for the record, you don't need to index as a "StrField" to get this benefit, you can still index using DateField, you just need to round your dates to some less granular level .. if you always want to round down, you don't even need to do the rounding yourself, just add "/SECOND" or "/MINUTE" or "/HOUR" to each of your dates before sending them to solr. (SOLR-741 proposes adding a config option to DateField to let this be done server side) your example query seems to be happy with hour resolution, but in theory if sometimes you needed hour resolution when doing "big ranges" but more precise resolution when doing "small ranges" you could even in theory have a "coarse" date field that you round to an hour, and redundantly index the same data in a "fine" date field with minute or second resolution. Also: if you frequently reuse the same ranges over and over (ie: you have a form widget people pick from so on any given day there are only N discrete ranges being used) putting them in an "fq" param will let them be cached separately from the main query string (instance:client\-csm.symplicity.com) so different searches using the same date ranges will be faster. ditto for your label_facet:"Added to Position" clause. -Hoss
Re: corrupt solr index on ec2
Bill Graham wrote: Then it seemed to run well for about an hour and I saw this: Oct 28, 2008 10:38:51 PM org.apache.solr.update.DirectUpdateHandler2 commit INFO: start commit(optimize=false,waitFlush=true,waitSearcher=true) Oct 28, 2008 10:38:51 PM org.apache.solr.common.SolrException log SEVERE: java.lang.RuntimeException: after flush: fdx size mismatch: 1156 docs vs 0 length in bytes of _2rv.fdx
at org.apache.lucene.index.StoredFieldsWriter.closeDocStore(StoredFieldsWriter.java:94)
at org.apache.lucene.index.DocFieldConsumers.closeDocStore(DocFieldConsumers.java:83)
at org.apache.lucene.index.DocFieldProcessor.closeDocStore(DocFieldProcessor.java:47)
at org.apache.lucene.index.DocumentsWriter.closeDocStore(DocumentsWriter.java:367)
at org.apache.lucene.index.IndexWriter.flushDocStores(IndexWriter.java:1774)
This particular exception is very spooky -- it really looks like something is removing the index files (such as accidentally opening a 2nd writer on the index). Mike
TermVectorComponent for tag generation?
Hi, So I'm looking to either use this or build a component which might do what I'm looking for. I'd like to figure out if it's possible to use a single doc to get tag generation based on the matches within that document, for example: 1 News Doc -> contains 5 Players and 8 Teams (show them as possible tags for this article). In this case Players and Teams are also docs. It's almost like I want to use MoreLikeThis w/ a different filter query than what I'm using. Is there any easy hack to get this going? Thanks. - Jon
Re: TermVectorComponent for tag generation?
Hey Jon, Not following how the TVC (TermVectorComp) would help here. I suppose you could use the "most important" terms, as defined by TF-IDF, as suggested tags. The MLT (MoreLikeThis) uses this to generate query terms. However, I'm not following the different filter query piece. Can you provide a bit more detail? One thing you did make me think, though, is it might be interesting to extend TermVectorMapper so that it can output a NamedList and then allow people to implement their own SolrTermVectorMapper and have it customize the TV output... Thanks, Grant On Oct 31, 2008, at 5:20 PM, Jon Baer wrote: Hi, So I'm looking to either use this or build a component which might do what I'm looking for. I'd like to figure out if it's possible to use a single doc to get tag generation based on the matches within that document, for example: 1 News Doc -> contains 5 Players and 8 Teams (show them as possible tags for this article). In this case Players and Teams are also docs. It's almost like I want to use MoreLikeThis w/ a different filter query than what I'm using. Is there any easy hack to get this going? Thanks. - Jon -- Grant Ingersoll Lucene Boot Camp Training Nov. 3-4, 2008, ApacheCon US New Orleans. http://www.lucenebootcamp.com Lucene Helpful Hints: http://wiki.apache.org/lucene-java/BasicsOfPerformance http://wiki.apache.org/lucene-java/LuceneFAQ
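For anyone who wants to try the TF-IDF route Grant mentions, a sketch of wiring the (trunk-era) TermVectorComponent into solrconfig.xml follows; the handler name is arbitrary and the field being inspected needs termVectors="true" in schema.xml:

<searchComponent name="tvComponent"
                 class="org.apache.solr.handler.component.TermVectorComponent"/>

<requestHandler name="/tvrh" class="org.apache.solr.handler.component.SearchHandler">
  <lst name="defaults">
    <bool name="tv">true</bool>
  </lst>
  <arr name="last-components">
    <str>tvComponent</str>
  </arr>
</requestHandler>

A request along the lines of /tvrh?q=id:1234&tv.fl=body&tv.tf_idf=true then returns per-term statistics for that document, from which the highest-scoring terms can be offered as candidate tags.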
Re: TermVectorComponent for tag generation?
Well, for example, in any given text (which is a field on a document): "While suitable for any application which requires full text indexing and searching capability, Lucene has been widely recognized for its utility in the implementation of Internet search engines and local, single-site searching. At the core of Lucene's logical architecture is the idea of a document containing fields of text. This flexibility allows Lucene's API to be independent of file format. Text from PDFs, HTML, Microsoft Word documents, as well as many others can all be indexed so long as their textual information can be extracted." I'd like to be able to say the tags for this article should be [Lucene, PDF, HTML, Microsoft Word] because they appear in field values from other documents. Basically, how do I generate tags from just a single document based on other documents' field values? - Jon On Oct 31, 2008, at 6:17 PM, Grant Ingersoll wrote: Hey Jon, Not following how the TVC (TermVectorComp) would help here. I suppose you could use the "most important" terms, as defined by TF-IDF, as suggested tags. The MLT (MoreLikeThis) uses this to generate query terms. However, I'm not following the different filter query piece. Can you provide a bit more detail? One thing you did make me think, though, is it might be interesting to extend TermVectorMapper so that it can output a NamedList and then allow people to implement their own SolrTermVectorMapper and have it customize the TV output... Thanks, Grant On Oct 31, 2008, at 5:20 PM, Jon Baer wrote: Hi, So I'm looking to either use this or build a component which might do what I'm looking for. I'd like to figure out if it's possible to use a single doc to get tag generation based on the matches within that document, for example: 1 News Doc -> contains 5 Players and 8 Teams (show them as possible tags for this article). In this case Players and Teams are also docs. It's almost like I want to use MoreLikeThis w/ a different filter query than what I'm using. Is there any easy hack to get this going? Thanks. - Jon -- Grant Ingersoll Lucene Boot Camp Training Nov. 3-4, 2008, ApacheCon US New Orleans. http://www.lucenebootcamp.com Lucene Helpful Hints: http://wiki.apache.org/lucene-java/BasicsOfPerformance http://wiki.apache.org/lucene-java/LuceneFAQ
RE: DIH and rss feeds
Thanks all. I knew there had to be something :) Perhaps I should read the complete wiki page over and over again some more. It is a complex tool. Lance -Original Message- From: Shalin Shekhar Mangar [mailto:[EMAIL PROTECTED] Sent: Friday, October 31, 2008 8:42 AM To: solr-user@lucene.apache.org Subject: Re: DIH and rss feeds The "clean" parameter is there in the 1.3 release. The full-import is by definition "full" so we delete all existing documents at the start. If you don't want to clean the index, you can pass clean=false and DIH will just add them. On Fri, Oct 31, 2008 at 8:58 PM, Jon Baer <[EMAIL PROTECTED]> wrote: > Is that right? I find the wording of "clean" a little confusing. I > would have thought this is what I had needed earlier but the topic > came up regarding the fact that you can not deleteByQuery for an > entity you want to flush w/ delta-import. > > I just noticed that the original JIRA request says it was implemented > recently ... > > https://issues.apache.org/jira/browse/SOLR-801 > > Im assuming this means your war needs to come from trunk copy? Does > this patch affect that param @ all? > > - Jon > > > On Oct 31, 2008, at 2:05 AM, Noble Paul നോബിള് नोब्ळ् wrote: > > run full-import with clean=false >> >> for full-import clean is set to true by default and for delta-import >> clean is false by default. >> >> On Fri, Oct 31, 2008 at 9:16 AM, Lance Norskog <[EMAIL PROTECTED]> wrote: >> >>> I have a DataImportHandler configured to index from an RSS feed. It >>> is a "latest stuff" feed. It reads the feed and indexes the 100 >>> documents harvested from the feed. So far, works great. >>> >>> Now: a few hours later there are a different 100 "lastest" >>> documents. How do I add those to the index so I will have 200 >>> documents? 'full-import' >>> throws >>> away the first 100. 'delta-import' is not implemented. What is the >>> special trick here? I'm using the Solr-1.3.0 release. >>> >>> Thanks, >>> >>> Lance Norskog >>> >>> >> >> >> -- >> --Noble Paul >> > > -- Regards, Shalin Shekhar Mangar.
DIH Http input bug - problem with two-level RSS walker
I wrote a nested HttpDataSource RSS poller. The outer loop reads an RSS feed which contains N links to other RSS feeds. The nested loop then reads each one of those to create documents. (Yes, this is an obnoxious thing to do.) Let's say the outer RSS feed gives 10 items. Both feeds use the same structure: /rss/channel with a <title> node and then N <item> nodes inside the channel. This should create two separate XML streams with two separate XPath iterators, right? [data-config entity definitions were stripped from the archived message] This does indeed walk each url from the outer feed and then fetch the inner RSS feed. Bravo! However, I found two separate problems in XPath iteration. They may be related. The first problem is that it only stores the first document from each "inner" feed. Each feed has several documents with different title fields but it only grabs the first. The other is an off-by-one bug. The outer loop iterates through the 10 items and then tries to pull an 11th. It then gives this exception trace: INFO: Created URL to: [inner url] Oct 31, 2008 11:21:20 PM org.apache.solr.handler.dataimport.HttpDataSource getData SEVERE: Exception thrown while getting data java.net.MalformedURLException: no protocol: null/account.rss
at java.net.URL.<init>(URL.java:567)
at java.net.URL.<init>(URL.java:464)
at java.net.URL.<init>(URL.java:413)
at org.apache.solr.handler.dataimport.HttpDataSource.getData(HttpDataSource.java:90)
at org.apache.solr.handler.dataimport.HttpDataSource.getData(HttpDataSource.java:47)
at org.apache.solr.handler.dataimport.DebugLogger$2.getData(DebugLogger.java:183)
at org.apache.solr.handler.dataimport.XPathEntityProcessor.initQuery(XPathEntityProcessor.java:210)
at org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow(XPathEntityProcessor.java:180)
at org.apache.solr.handler.dataimport.XPathEntityProcessor.nextRow(XPathEntityProcessor.java:160)
at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:285)
...
Oct 31, 2008 11:21:20 PM org.apache.solr.handler.dataimport.DocBuilder buildDocument SEVERE: Exception while processing: album document : SolrInputDocumnt[{name=name(1.0)={Groups of stuff}}] org.apache.solr.handler.dataimport.DataImportHandlerException: Exception in invoking url null Processing Document # 11
at org.apache.solr.handler.dataimport.HttpDataSource.getData(HttpDataSource.java:115)
at org.apache.solr.handler.dataimport.HttpDataSource.getData(HttpDataSource.java:47)
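The archive stripped the actual data-config out of the message above; purely as an illustration of the nested two-level setup being described (not Lance's real config; URLs, entity and field names are invented), it would look something like:

<dataConfig>
  <dataSource type="HttpDataSource"/>
  <document>
    <!-- outer feed: one row per item, capturing the link to an inner feed -->
    <entity name="outer"
            processor="XPathEntityProcessor"
            url="http://example.com/feeds.rss"
            forEach="/rss/channel/item">
      <field column="link" xpath="/rss/channel/item/link"/>

      <!-- inner feed: fetched once per outer row, one row per item -->
      <entity name="inner"
              processor="XPathEntityProcessor"
              url="${outer.link}"
              forEach="/rss/channel/item">
        <field column="title" xpath="/rss/channel/item/title"/>
        <field column="description" xpath="/rss/channel/item/description"/>
      </entity>
    </entity>
  </document>
</dataConfig>

Note that, as the reply below explains, nested entities are joined into one Solr document per outer row, which is consistent with only the first inner item appearing to stick.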
Re: date range query performance
On 31.10.2008 19:16 Chris Hostetter wrote: > for the record, you don't need to index as a "StrField" to get this > benefit, you can still index using DateField, you just need to round your > dates to some less granular level .. if you always want to round down, you > don't even need to do the rounding yourself, just add "/SECOND" > or "/MINUTE" or "/HOUR" to each of your dates before sending them to solr. > (SOLR-741 proposes adding a config option to DateField to let this be done > server side) Is this also possible for the timestamp that is automatically added to all new/updated docs? I would like to be able to search (quickly) for everything that was added within the last week or month or whatever. And because I update the index only once a day, a granularity of /DAY (if that exists) would be fine. - Michael
Re: date range query performance
On Nov 1, 2008, at 1:07 AM, Michael Lackhoff wrote: On 31.10.2008 19:16 Chris Hostetter wrote: for the record, you don't need to index as a "StrField" to get this benefit, you can still index using DateField, you just need to round your dates to some less granular level .. if you always want to round down, you don't even need to do the rounding yourself, just add "/SECOND" or "/MINUTE" or "/HOUR" to each of your dates before sending them to solr. (SOLR-741 proposes adding a config option to DateField to let this be done server side) Is this also possible for the timestamp that is automatically added to all new/updated docs? I would like to be able to search (quickly) for everything that was added within the last week or month or whatever. And because I update the index only once a day, a granularity of /DAY (if that exists) would be fine. Yeah, this should work fine: <field name="timestamp" type="date" indexed="true" stored="true" default="NOW/DAY" multiValued="false"/> Erik
Re: DIH Http input bug - problem with two-level RSS walker
On Sat, Nov 1, 2008 at 10:30 AM, Lance Norskog <[EMAIL PROTECTED]> wrote: > I wrote a nested HttpDataSource RSS poller. The outer loop reads an rss > feed > which contains N links to other rss feeds. The nested loop then reads each > one of those to create documents. (Yes, this is an obnoxious thing to do.) > Let's say the outer RSS feed gives 10 items. Both feeds use the same > structure: /rss/channel with a node and then N nodes inside > the channel. This should create two separate XML streams with two separate > Xpath iterators, right? > > > > > > > > > > > This does indeed walk each url from the outer feed and then fetch the inner > rss feed. Bravo! > > However, I found two separate problems in xpath iteration. They may be > related. The first problem is that it only stores the first document from > each "inner" feed. Each feed has several documents with different title > fields but it only grabs the first. > The idea behind nested entities is to join them together so that one Solr document is created for each root entity and the child entities provide more fields which are added to the parent document. I guess you want to create separate Solr documents from the root entity as well as the child entities. I don't think that is possible with nested entities. Essentially, you are trying to crawl feeds, not join them. Probably an integration with Apache Droids can be thought about. http://incubator.apache.org/projects/droids.html http://people.apache.org/~thorsten/droids/ If you are going to crawl only one level, there may be a workaround. However, it may be easier to implement all this with your own Java program and just post results to Solr as usual. > The other is an off-by-one bug. The outer loop iterates through the 10 > items > and then tries to pull an 11th. It then gives this exception trace: > > INFO: Created URL to: [inner url] > Oct 31, 2008 11:21:20 PM org.apache.solr.handler.dataimport.HttpDataSource > getData > SEVERE: Exception thrown while getting data > java.net.MalformedURLException: no protocol: null/account.rss >at java.net.URL.(URL.java:567) >at java.net.URL.(URL.java:464) >at java.net.URL.(URL.java:413) >at > > org.apache.solr.handler.dataimport.HttpDataSource.getData(HttpDataSource.jav > a:90) >at > > org.apache.solr.handler.dataimport.HttpDataSource.getData(HttpDataSource.jav > a:47) >at > > org.apache.solr.handler.dataimport.DebugLogger$2.getData(DebugLogger.java:18 > 3) >at > > org.apache.solr.handler.dataimport.XPathEntityProcessor.initQuery(XPathEntit > yProcessor.java:210) >at > > org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow(XPathEn > tityProcessor.java:180) >at > > org.apache.solr.handler.dataimport.XPathEntityProcessor.nextRow(XPathEntityP > rocessor.java:160) >at > > org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java: > 285) > ... > Oct 31, 2008 11:21:20 PM org.apache.solr.handler.dataimport.DocBuilder > buildDocument > SEVERE: Exception while processing: album document : > SolrInputDocumnt[{name=name(1.0)={Groups of stuff}}] > org.apache.solr.handler.dataimport.DataImportHandlerException: Exception in > invoking url null Processing Document # 11 >at > > org.apache.solr.handler.dataimport.HttpDataSource.getData(HttpDataSource.jav > a:115) >at > > org.apache.solr.handler.dataimport.HttpDataSource.getData(HttpDataSource.jav > a:47) > > > > > > -- Regards, Shalin Shekhar Mangar.
Re: date range query performance
On 01.11.2008 06:10 Erik Hatcher wrote: > Yeah, this should work fine: > > default="NOW/DAY" multiValued="false"/> Wow, that was fast, thanks! -Michael
Re: TermVectorComponent for tag generation?
Hi Jon, Isn't it similar to what Grant just said: the top-most terms (after removing the stop words)? You would need to get how many terms there are and their respective frequencies, and any term which is beyond a certain threshold you would mark as a member of the tag set. One can also build a set of related entities or terms which follow the current term, and then decide which of them can become part of the tag set. Is that the requirement, or am I missing something here? -- Thanks and Regards Vaijanath N. Rao Jon Baer wrote: Well, for example, in any given text (which is a field on a document): "While suitable for any application which requires full text indexing and searching capability, Lucene has been widely recognized for its utility in the implementation of Internet search engines and local, single-site searching. At the core of Lucene's logical architecture is the idea of a document containing fields of text. This flexibility allows Lucene's API to be independent of file format. Text from PDFs, HTML, Microsoft Word documents, as well as many others can all be indexed so long as their textual information can be extracted." I'd like to be able to say the tags for this article should be [Lucene, PDF, HTML, Microsoft Word] because they appear in field values from other documents. Basically, how do I generate tags from just a single document based on other documents' field values? - Jon On Oct 31, 2008, at 6:17 PM, Grant Ingersoll wrote: Hey Jon, Not following how the TVC (TermVectorComp) would help here. I suppose you could use the "most important" terms, as defined by TF-IDF, as suggested tags. The MLT (MoreLikeThis) uses this to generate query terms. However, I'm not following the different filter query piece. Can you provide a bit more detail? One thing you did make me think, though, is it might be interesting to extend TermVectorMapper so that it can output a NamedList and then allow people to implement their own SolrTermVectorMapper and have it customize the TV output... Thanks, Grant On Oct 31, 2008, at 5:20 PM, Jon Baer wrote: Hi, So I'm looking to either use this or build a component which might do what I'm looking for. I'd like to figure out if it's possible to use a single doc to get tag generation based on the matches within that document, for example: 1 News Doc -> contains 5 Players and 8 Teams (show them as possible tags for this article). In this case Players and Teams are also docs. It's almost like I want to use MoreLikeThis w/ a different filter query than what I'm using. Is there any easy hack to get this going? Thanks. - Jon -- Grant Ingersoll Lucene Boot Camp Training Nov. 3-4, 2008, ApacheCon US New Orleans. http://www.lucenebootcamp.com Lucene Helpful Hints: http://wiki.apache.org/lucene-java/BasicsOfPerformance http://wiki.apache.org/lucene-java/LuceneFAQ