LucidWorks Solr
Just wanted to know if anyone has used LucidWorks Solr.

- How does it compare to the standard Apache Solr?
- The non-blocking IO of LucidWorks Solr -- is that for network IO or disk IO, and what are its effects?
- The LucidWorks website also mentions "significantly improved faceting performance" -- what are those improvements, and how large are they?

Would you recommend using it? Thanks.
Re: LucidWorks Solr
Thanks for asking, I am interested as well in reading the responses to your questions.

Paolo

Andy wrote:
> Just wanted to know if anyone has used LucidWorks Solr.
> - How does it compare to the standard Apache Solr?
> - The non-blocking IO of LucidWorks Solr -- is that for network IO or disk IO, and what are its effects?
> - The LucidWorks website also mentions "significantly improved faceting performance" -- what are those improvements, and how large are they?
> Would you recommend using it? Thanks.
Autofill 'id' field with the URL of files posted to Solr?
Hi,

I need to submit thousands of online PDF/html files to Solr. I can submit one file using SolrJ (StreamingUpdateSolrServer and ..solr.common.util.ContentStreamBase.URLStream), setting the literal.id parameter to the url. I can't do the same with a batch of multiple files, as their 'id' should be unique (set to their urls).

I couldn't get this to work. Is there a way to get the 'id' field set automatically to the url of the files posted to Solr (something like 'stream_name')? How would I set this in solrconfig.xml or schema.xml, or any other way?

If their url can be put in some other field (like 'url' itself), that would also serve my purpose.

Thanks for your help.
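For reference, a minimal, untested sketch of the single-file flow described above against the Solr 1.4 SolrJ API; the server URL and the document URL are placeholders, not values from this thread:

    import java.net.URL;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer;
    import org.apache.solr.client.solrj.request.AbstractUpdateRequest;
    import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;
    import org.apache.solr.common.util.ContentStreamBase;

    public class SingleFilePost {
        public static void main(String[] args) throws Exception {
            // Placeholder URL for a Solr instance running the extract handler.
            SolrServer server = new StreamingUpdateSolrServer("http://localhost:8080/solr", 10, 2);

            String url = "http://example.com/docs/sample.pdf"; // hypothetical document URL
            ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/extract");
            req.addContentStream(new ContentStreamBase.URLStream(new URL(url)));
            // literal.id applies to the whole request, so every stream in a
            // batch would get the same id -- exactly the limitation above.
            req.setParam("literal.id", url);
            req.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
            server.request(req);
        }
    }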
Re: Facet count problem
I am using text for type, which is static. For example, type is a field and I am using it for categorization: for the news type I use 'news' and for blogs I use 'blog'. type is a text field.

On Apr 17, 2010 8:38 PM, "Ahmet Arslan" wrote:
> > I am facing problem to get facet result count. I must be wrong somewhere. I am getting proper ...
>
> Are you faceting on a tokenized field? What is the fieldType of your field?
Solr throws TikaException while parsing sample PDF
Hi,

While posting a sample pdf (one that comes with the Solr distribution) to Solr, I'm getting a TikaException. I'm using Solr 1.4 and SolrJ (StreamingUpdateSolrServer) to post the pdf. Other sample pdfs can be parsed and indexed successfully. I'm getting the same error with some other pdfs as well (but Adobe Reader can open them fine, so I don't think they are malformed or corrupt). Here is the trace:

found uploaded file : C:\solr_1.4.0\docs\Installing Solr in Tomcat.pdf :: size=286242
Apr 18, 2010 10:31:34 PM org.apache.solr.update.processor.LogUpdateProcessor finish
INFO: {} 0 640
Apr 18, 2010 10:31:34 PM org.apache.solr.common.SolrException log
SEVERE: org.apache.solr.common.SolrException: org.apache.tika.exception.TikaException: Unable to extract PDF content
        at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:211)
        at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54)
        at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
        at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:233)
        at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
        at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
        at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:215)
        at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:188)
        at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:213)
        at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:172)
        at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
        at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:117)
        at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:108)
        at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:174)
        at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:873)
        at org.apache.coyote.http11.Http11BaseProtocol$Http11ConnectionHandler.processConnection(Http11BaseProtocol.java:665)
        at org.apache.tomcat.util.net.PoolTcpEndpoint.processSocket(PoolTcpEndpoint.java:528)
        at org.apache.tomcat.util.net.LeaderFollowerWorkerThread.runIt(LeaderFollowerWorkerThread.java:81)
        at org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:689)
        at java.lang.Thread.run(Thread.java:595)
Caused by: org.apache.tika.exception.TikaException: Unable to extract PDF content
        at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:58)
        at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:51)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:119)
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:105)
        at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:190)
        ... 20 more
Caused by: java.util.zip.ZipException: incorrect header check
        at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:140)
        at org.pdfbox.filter.FlateFilter.decode(FlateFilter.java:97)
        at org.pdfbox.cos.COSStream.doDecode(COSStream.java:290)
        at org.pdfbox.cos.COSStream.doDecode(COSStream.java:235)
        at org.pdfbox.cos.COSStream.getUnfilteredStream(COSStream.java:170)
        at org.pdfbox.pdfparser.PDFStreamParser.<init>(PDFStreamParser.java:101)
        at org.pdfbox.cos.COSStream.getStreamTokens(COSStream.java:132)
        at org.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:202)
        at org.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:174)
        at org.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:336)
        at org.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:259)
        at org.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:216)
        at org.pdfbox.util.PDFTextStripper.getText(PDFTextStripper.java:149)
        at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:53)
        ... 24 more
Apr 18, 2010 10:31:34 PM org.apache.solr.core.SolrCore execute
INFO: [] webapp=/solr path=/update/extract params={wt=javabin&waitFlush=true&literal.indexDate=2010-04-18+&commit=true&waitSearcher=true&version=1&literal.id=C%253A%255Csolr_1.4.0%255Cdocs%255CInstalling%2BSolr%2Bin%2BTomcat.pdf} status=500 QTime=640
Exception in handling an uploaded file: C:\solr_1.4.0\docs\Installing Solr in Tomcat.pdf : Internal Server Error
Internal Server Error
request: http://localhost:8080/solr/update/extract?literal.id=
Re: Solr Schema Question
Thanks everyone, it works! I have successfully indexed them. Thanks again!

I have a couple more questions regarding Solr, if you don't mind.

1) As I said before, the text files are quite large, between 100kb and 10mb, but I need to store them as well for highlighting, together with their title, description, and tags (I concatenate the tags while fetching from the db and treat them as one row). For the search results page, I also have to get these columns:

username (string)
lang (string)
cat (string)
view_count (int)
imgid (int)
thumbs_up (int)
thumbs_down (int)

These columns are not used for indexing, just for storing. Do you think it is a better idea to store these columns as well and not query the database? Or I can just get the ids and query the database myself. Which approach is better from a memory-usage and performance perspective? I was using Sphinx for full-text searching on my production websites, so I am not used to this format, as Sphinx only returns document IDs.

2) I was using Sphinx for other purposes as well, like the "browse" section on the website (along the lines of http://www.youtube.com/videos). It gives better performance on large datasets (sorting, ordering etc). I know some people also use Solr (Lucene) for this, but I have not seen any website that uses Solr on its "browse" section without using facets. So, even if I don't use facets, is it still useful to use Solr for that section? I will be storing a large amount of data in Solr, and expect to have 1 TB of data after 6-8 months.

3) I will be using the http://wiki.apache.org/solr/MoreLikeThis option too. As I said, the text files are large. Do you have any suggestions regarding this feature?

Thanks again,

On Sun, Apr 18, 2010 at 7:53 AM, Lance Norskog wrote:
> Man you people are fast!
>
> There is a bug in Solr/Lucene. It keeps memory around from previous
> fields, so giant text files might run out of memory when they should
> not. This bug is fixed in the trunk.
>
> On 4/17/10, Lance Norskog wrote:
>> The DataImportHandler can let you fetch the file name from the
>> database record, and then load the file as a field and process the
>> text with Tika.
>>
>> It will not be easy :) but it is possible.
>>
>> http://wiki.apache.org/solr/DataImportHandler
>>
>> On 4/17/10, Serdar Sahin wrote:
>>> Hi,
>>>
>>> I am rather new to Solr and have a question.
>>>
>>> We have around 200,000 txt files which are placed in the file cloud.
>>> The file path is something similar to this:
>>>
>>> file/97/8f/840/fa4-1.txt
>>> file/a6/9d/ab0/ca2-2.txt etc.
>>>
>>> We also store the metadata (like title, description, tags etc) about
>>> these files in the mysql server. What I want to do is to index title,
>>> description, tags and other data from mysql, and also get the txt file
>>> from the file server, and link them as one record for searching, but I
>>> could not figure out how to automate this process. I can give the path
>>> from the sql query, like Select id, title, description, file_path, and
>>> then Solr can use this path to retrieve the txt file, but I don't know
>>> whether that is possible or not.
>>>
>>> What is the best way to index these files with their tags, title and
>>> description without coding in Java (Perl is ok)? These txt files are
>>> large, between 100kb and 10mb, so storing them in the database is the
>>> last resort.
>>>
>>> Thanks,
>>>
>>> Serdar
>>
>> --
>> Lance Norskog
>> goks...@gmail.com
>
> --
> Lance Norskog
> goks...@gmail.com
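For illustration of option 1 above (serving results straight from Solr's stored fields rather than going back to the database per hit), a rough SolrJ sketch; the field names are taken from the list in the question, while the server URL and query string are placeholders:

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.common.SolrDocument;

    public class StoredFieldSearch {
        public static void main(String[] args) throws Exception {
            SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr"); // placeholder

            SolrQuery q = new SolrQuery("some search terms");
            // Request only the small stored display columns, not the large
            // text body, so responses stay light even though the body is
            // stored too (for highlighting).
            q.setFields("id", "username", "lang", "cat", "view_count",
                        "imgid", "thumbs_up", "thumbs_down");
            QueryResponse rsp = server.query(q);
            for (SolrDocument doc : rsp.getResults()) {
                System.out.println(doc.getFieldValue("username") + " / " + doc.getFieldValue("cat"));
            }
        }
    }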
Re: Solr throws TikaException while parsing sample PDF
Can you extract content from this using Tika's standalone command line tool? PDFs are notorious for extraction problems. To me, it looks like a bug in PDFBox. I would try to isolate it down to there and then send, if possible, the sample document to PDFBox and see if they can come up with a fix.

-Grant

On Apr 18, 2010, at 1:12 PM, pk wrote:
> Hi,
>
> While posting a sample pdf (one that comes with the Solr distribution) to
> Solr, I'm getting a TikaException. I'm using Solr 1.4 and SolrJ
> (StreamingUpdateSolrServer) to post the pdf. Other sample pdfs can be
> parsed and indexed successfully.
> [...]
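For anyone wanting to run that isolation test programmatically rather than via the command-line tool Grant mentions, a hedged sketch using Tika's Java API; it assumes a Tika version whose parse() takes a ParseContext, and the path is the failing file from the trace above:

    import java.io.FileInputStream;
    import java.io.InputStream;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.sax.BodyContentHandler;

    public class TikaPdfCheck {
        public static void main(String[] args) throws Exception {
            // The failing sample document from the trace above.
            InputStream in = new FileInputStream("C:\\solr_1.4.0\\docs\\Installing Solr in Tomcat.pdf");
            try {
                BodyContentHandler text = new BodyContentHandler(-1); // -1 removes the output size limit
                new AutoDetectParser().parse(in, text, new Metadata(), new ParseContext());
                // If this prints, extraction works and the bug is elsewhere;
                // if it throws, the failure is inside Tika/PDFBox, not Solr.
                System.out.println(text.toString());
            } finally {
                in.close();
            }
        }
    }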
Re: LucidWorks Solr
On Apr 18, 2010, at 3:53 AM, Andy wrote:
> Just wanted to know if anyone has used LucidWorks Solr.
>
> - How does it compare to the standard Apache Solr?

We take a release of Solr. We wrap it with an installer, tomcat/jetty, our reference guide, Luke, etc. We also add in an optimized version of KStem. Finally, we apply certain patches that came in after the release and so didn't make it into it (we usually delay our release by a few weeks). Many of the things we package simply cannot be in an ASF release because of ASF policies; others are there for convenience, so that people don't have to go all over the web to get them.

> - The non-blocking IO of LucidWorks Solr -- is that for network IO or disk IO, and what are its effects?

I think this is a legacy from the 1.3 CD on our website. I believe what this is referring to is in Solr 1.4, as it was a patch that was applied to trunk after 1.3 was released. I'll let our web team know to update that.

> - The LucidWorks website also mentions "significantly improved faceting performance" -- what are those improvements, and how large are they?

Same as the previous issue. I'll let our web team know to update that.

> Would you recommend using it?

Sure, but I'm biased. ;-) Hopefully you will find it useful, but choose the one that best fits your needs (and let me know if you need help assessing that).

--
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem using Solr/Lucene: http://www.lucidimagination.com/search
Re: geometric distance
AFAIK, there are no columns per se. But in the past I've just used UTM values for each lat/lon pair and basic numeric operators (>, <) to search within a bounding geographic region. Add them as numeric fields, though. Easy.

There is new support for spatial searching; however, I'm not sure how it compares to what I described, which works great. It probably does some automatic conversions or something. Check the wiki.

On Sat, 2010-04-17 at 18:39 -0700, Dennis Gearon wrote:
> How does solr/lucene do geometric distances?
>
> Does it use a GEOS point datum, or two columns, one for latitude and one
> for longitude?
>
> Dennis Gearon
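A hedged SolrJ sketch of the bounding-box idea above; the field names lat/lon, their numeric types, and the box coordinates are all illustrative assumptions:

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

    public class BoundingBoxQuery {
        public static void main(String[] args) throws Exception {
            SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr"); // placeholder

            // Rough box around New York City. The same idea works with UTM
            // easting/northing values; the fields just need a numeric type
            // so range queries compare numerically.
            SolrQuery q = new SolrQuery("*:*");
            q.addFilterQuery("lat:[40.5 TO 41.0]", "lon:[-74.3 TO -73.6]");
            System.out.println("hits: " + server.query(q).getResults().getNumFound());
        }
    }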
Re: Autofill 'id' field with the URL of files posted to Solr?
The DataImportHandler has a tool for doing PDF extraction: the TikaEntityProcessor. It allows you to create new fields, process multiple files, and supply lists of locations from which to fetch those files.

http://wiki.apache.org/solr/TikaEntityProcessor

On Sun, Apr 18, 2010 at 9:52 AM, pk wrote:
> Hi,
>
> I need to submit thousands of online PDF/html files to Solr. I can submit
> one file using SolrJ (StreamingUpdateSolrServer and
> ..solr.common.util.ContentStreamBase.URLStream), setting the literal.id
> parameter to the url. I can't do the same with a batch of multiple files,
> as their 'id' should be unique (set to their urls).
> [...]

--
Lance Norskog
goks...@gmail.com
Re: Solr Schema Question
Highlighting is a complex topic. A field has to be stored to be highlighted. It does not have to be indexed, but if it is not, highlighting analyzes it just as if it were indexed in order to highlight it.

http://www.lucidimagination.com/search/document/CDRG_ch07_7.9?q=highlighting
http://www.lucidimagination.com/blog/2009/02/17/highlighting-highlighter-thoughts/

On Sun, Apr 18, 2010 at 10:12 AM, Serdar Sahin wrote:
> Thanks everyone, it works! I have successfully indexed them. Thanks again!
>
> I have a couple more questions regarding Solr, if you don't mind.
>
> 1) As I said before, the text files are quite large, between 100kb and
> 10mb, but I need to store them as well for highlighting, together with
> their title, description, and tags.
> [...]

--
Lance Norskog
goks...@gmail.com
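As a starting point, a minimal SolrJ sketch of requesting highlighting; the field name 'content' and the server URL are assumptions, not values from this thread:

    import java.util.List;
    import java.util.Map;
    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class HighlightExample {
        public static void main(String[] args) throws Exception {
            SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr"); // placeholder

            SolrQuery q = new SolrQuery("search terms");
            q.setHighlight(true);
            q.addHighlightField("content"); // must be a stored field
            q.setHighlightSnippets(2);      // up to two fragments per document
            QueryResponse rsp = server.query(q);
            // Snippets come back keyed by uniqueKey, then by field name.
            Map<String, Map<String, List<String>>> hl = rsp.getHighlighting();
            System.out.println(hl);
        }
    }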
Re: Facet count problem
Can we see the actual field definitions from your schema file? Ahmet's question is vital and is best answered if you'll copy/paste the relevant configuration entries. But based on what you *have* posted, I'd guess you're trying to facet on tokenized fields, which is not recommended.

You might take a look at http://wiki.apache.org/solr/UsingMailingLists; it'll help you frame your questions in a manner that gets you your answers as fast as possible.

Best
Erick

On Sun, Apr 18, 2010 at 12:59 PM, Ranveer Kumar wrote:
> I am using text for type, which is static. For example, type is a field
> and I am using it for categorization: for the news type I use 'news' and
> for blogs I use 'blog'. type is a text field.
>
> On Apr 17, 2010 8:38 PM, "Ahmet Arslan" wrote:
> > > I am facing problem to get facet result count. I must be wrong somewhere. I am getting proper ...
> >
> > Are you faceting on a tokenized field? What is the fieldType of your field?
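For comparison, a small SolrJ faceting sketch; it assumes the 'type' field from the thread has been reindexed as an untokenized string field, per the advice above, and the server URL is a placeholder:

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.client.solrj.response.FacetField;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class TypeFacets {
        public static void main(String[] args) throws Exception {
            SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr"); // placeholder

            SolrQuery q = new SolrQuery("*:*");
            q.setFacet(true);
            // Counts come out per whole value ("news", "blog") only if 'type'
            // is an untokenized field such as the "string" type; a tokenized
            // text field would facet on individual terms instead.
            q.addFacetField("type");
            QueryResponse rsp = server.query(q);
            for (FacetField.Count c : rsp.getFacetField("type").getValues()) {
                System.out.println(c.getName() + " -> " + c.getCount());
            }
        }
    }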
Re: DIH dataimport.properties with
Because there is a lot of data, and for scalability reasons we want all non-write operations to happen from a slave - we don't want to be using the master unless necessary.

On 17/04/10 08:28, Otis Gospodnetic wrote:
> Hm, why not just go to the MySQL master then?
>
> Otis
> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> Hadoop ecosystem search :: http://search-hadoop.com/
>
> ----- Original Message -----
> From: Michael Tibben
> To: solr-user@lucene.apache.org
> Sent: Thu, April 15, 2010 10:15:14 PM
> Subject: DIH dataimport.properties with
>
> Hi,
>
> I am using the DIH to import data from a mysql slave. However, the slave
> sometimes runs behind the master. The delay is variable: most of the time
> it is in sync, but sometimes it can run behind by a few minutes.
>
> This is a problem, because DIH uses dataimport.properties to determine the
> last_index_time for delta updates. This last_index_time does not correspond
> to the position of the slave, and so documents are being missed.
>
> What I need to be able to do is tell DIH what the last_index_time should
> be. Or alternatively, be able to specify another property in
> dataimport.properties, perhaps called datasource_version or similar. Is
> this possible?
>
> I have thought of a sneaky way to hack around the issue. Just before the
> delta update is run, I will switch the system time to the mysql slave's
> replication time. The system is used for nothing but the solr master, so I
> think this should work OK. Any thoughts?
>
> Regards,
>
> Michael
Re: DIH dataimport.properties with
I don't really understand how this will help. Can you elaborate? Do you mean that the last_index_time can be imported from somewhere outside solr? But I need to be able to *set* what last_index_time is stored in dataimport.properties, not get properties from somewhere else.

On 18/04/10 10:02, Lance Norskog wrote:
> The SolrEntityProcessor allows you to query a Solr instance and use the
> results as DIH properties. You would have to create your own regular query
> to do the delta-import instead of using the delta-import feature.
>
> https://issues.apache.org/jira/browse/SOLR-1499
>
> On 4/16/10, Otis Gospodnetic wrote:
>> Hm, why not just go to the MySQL master then?
>>
>> Otis
>>
>> On Thu, April 15, 2010, Michael Tibben wrote:
>>> Hi,
>>>
>>> I am using the DIH to import data from a mysql slave. However, the slave
>>> sometimes runs behind the master, and DIH's last_index_time does not
>>> correspond to the position of the slave, so documents are being missed.
>>> [...]
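One possible shape for the "set last_index_time yourself" approach: rewrite dataimport.properties before triggering the delta-import. The file path and the date format (yyyy-MM-dd HH:mm:ss) are assumptions to verify against your install, and fetchSlaveReplicationTime() is a hypothetical helper:

    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import java.text.SimpleDateFormat;
    import java.util.Date;
    import java.util.Properties;

    public class SetLastIndexTime {
        public static void main(String[] args) throws Exception {
            // Assumed location of the file inside the core's conf directory.
            String path = "solr/conf/dataimport.properties";

            Properties props = new Properties();
            FileInputStream in = new FileInputStream(path);
            props.load(in);
            in.close();

            // Record the slave's replication position instead of the wall clock.
            Date slavePosition = fetchSlaveReplicationTime(); // hypothetical helper
            props.setProperty("last_index_time",
                    new SimpleDateFormat("yyyy-MM-dd HH:mm:ss").format(slavePosition));

            FileOutputStream out = new FileOutputStream(path);
            props.store(out, null);
            out.close();
            // Then trigger /dataimport?command=delta-import as usual.
        }

        // Placeholder: would read the replication position from the slave,
        // e.g. derived from SHOW SLAVE STATUS.
        private static Date fetchSlaveReplicationTime() {
            return new Date();
        }
    }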
Re: Facet count problem
Hi Erick,

My schema configuration is as follows.

On Mon, Apr 19, 2010 at 6:22 AM, Erick Erickson wrote:
> Can we see the actual field definitions from your schema file?
> Ahmet's question is vital and is best answered if you'll
> copy/paste the relevant configuration entries. But based
> on what you *have* posted, I'd guess you're trying to
> facet on tokenized fields, which is not recommended.
> [...]
Re: LucidWorks Solr
--- On Sun, 4/18/10, Grant Ingersoll wrote:
> Sure, but I'm biased. ;-) Hopefully you will find it
> useful, but choose the one that best fits your needs (and
> let me know if you need help assessing that).

Thanks for the explanation, Grant.

What is the advantage of KStem over the standard Solr stemmer? On your website it was mentioned that KStem only works for English. What would happen if some of my documents are in other languages? What about the standard Solr stemmer -- does it also work on English only? Is there a stemmer that's sort of "universal" and works on multiple languages?
Re: Autofill 'id' field with the URL of files posted to Solr?
Lance,

I can submit and extract pdf contents using Solr and SolrJ, as I indicated earlier. I've made 'id' a mandatory field, and I had to submit its value while submitting (request.addParams("literal.id", url)). If I put multiple files/streams in the request, then I can't set 'id' this way, as the params are common to all files/streams, which is not what I want. If I could somehow map the stream_name/url of the files to the 'id' field, that's all I need.

Thanks.
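A hedged workaround sketch for exactly that limitation: send one extract request per URL, so each request carries its own literal.id. The server URL and URL list are placeholders, and this is untested against a live install:

    import java.net.URL;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer;
    import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;
    import org.apache.solr.common.util.ContentStreamBase;

    public class PerFileIds {
        public static void main(String[] args) throws Exception {
            SolrServer server = new StreamingUpdateSolrServer("http://localhost:8080/solr", 20, 4); // placeholder

            String[] urls = { "http://example.com/a.pdf", "http://example.com/b.pdf" }; // hypothetical
            for (String u : urls) {
                // One extract request per file, so each request can carry its
                // own literal.id even though params are shared per request.
                ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/extract");
                req.addContentStream(new ContentStreamBase.URLStream(new URL(u)));
                req.setParam("literal.id", u);
                server.request(req);
            }
            server.commit(); // one commit after the whole batch
        }
    }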
Query regarding "copyField"
Hello,

Is it a problem if I use *copyField* for some fields and not for others? In my query, I have both kinds of fields: the ones mentioned in copyField and the ones that are not copied to a common destination. Will this cause an anomaly in my search results? I am seeing some weird behavior.

Thanks,
Sandhya