Re: Aliases for fields
>What could possibly be a use case for such a need?

I would love to see such a feature. I have a multi-core Solr setup with
each core holding utterly different content. Each core has its own
"custom search app" that exploits nuances specific to a particular data
set, and the field names are chosen as best fits that data set. However
I would also like to have one or two general search features that span
all cores. This is a crude, one-size-fits-all type of search:

  One core has fields:  author     title    text
  Another has           sender     subject  message
  Another has           placename  description

I either need to rename all fields within some of the "custom search
apps" to account for the needs of the global search, or perform lots of
copyFields, or construct really nasty queries. I currently use the
copyFields approach. I think aliases would allow for far more efficient
indexes and clearer code.

Regards Fergus.

>Cheers
>Avlesh
>
>2009/8/18 Licinio Fernández Maurelo
>
>> Hello everybody,
>>
>> can i set an alias for a field? Something like :
>>
>>  stored="true" multiValued="false" termVectors="false"
>>  alias="source.date"/>
>>
>> is there any jira issue related?
>>
>> Thx
>>
>> --
>> Lici

--
===
Fergus McMenemie    Email:fer...@twig.me.uk
Techmore Ltd        Phone:(UK) 07721 376021
Unix/Mac/Intranets  Analyst Programmer
===
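For anyone hitting the same multi-core problem, a minimal sketch of the
copyField workaround described above. The per-core field names are the
ones from this mail; the catch-all "g_" field names and the "text" type
are invented for illustration and assume a suitable fieldType exists in
the schema:

  <!-- common catch-all fields, declared in every core's schema.xml -->
  <field name="g_title" type="text" indexed="true" stored="false" multiValued="true"/>
  <field name="g_text"  type="text" indexed="true" stored="false" multiValued="true"/>

  <!-- core 1: author / title / text -->
  <copyField source="title"  dest="g_title"/>
  <copyField source="author" dest="g_text"/>
  <copyField source="text"   dest="g_text"/>

  <!-- core 2: sender / subject / message -->
  <copyField source="subject" dest="g_title"/>
  <copyField source="sender"  dest="g_text"/>
  <copyField source="message" dest="g_text"/>

The global search app then only ever queries g_title and g_text,
whatever core it is pointed at.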
Netbeans and Solr : Whac-A-Mole
Hello all,

I would appreciate help from somebody who has set up Solr within
NetBeans. I want to do more work with DIH, and particularly its
XPathEntityProcessor stuff. I wish to perform the following from within
the IDE:

   ant -Dtestcase=TestXPathRecordReader.java test

I have spent a few hours playing Whac-A-Mole with classpath and source
settings. In the end I got it down to zero flags, but I then added some
test cases and the scanner thing went off and flagged dozens of files
with undefined classes. I removed my change but the rescan did not
remove the dozens of flagged files.

PS: I am a total NetBeans newbie.

--
===
Fergus McMenemie    Email:fer...@twig.me.uk
Techmore Ltd        Phone:(UK) 07721 376021
Unix/Mac/Intranets  Analyst Programmer
===
Re: Netbeans and Solr : Whac-A-Mole
>We've set-up NetBeans with Solr but we are using command-line for most of
>the stuff except for editing the code.
>
>Does your code build from NetBeans? If not what errors do you see?

The code builds and runs from NetBeans because the underlying build.xml
file is being used. But when you want to run test cases... are you doing
that from the command line? Are you only using the IDE as an editor?

>Regards
>Rajan
>
>On Mon, Sep 7, 2009 at 3:26 PM, Fergus McMenemie wrote:
>
>> Hello all,
>>
>> I would appreciate help from somebody who has set up Solr within
>> netbeans, I am wanting to do more work with DIH and particularly its
>> XpathEntityProcessor stuff. I wish to preform the following from
>> within the IDE
>>
>> ant -Dtestcase=TestXPathRecordReader.java test
>>
>> I have spent a few hours playing Whac-A-Mole with classpath and source
>> settings. In the end I got it down to zero flags, but I then added
>> some test cases and the scanner thing then went off and flagged dozens
>> files with undefined classes I removed my change but the rescan did not
>> remove the dozens of flagged files.
>>
>> PS: I am a total netbeans newbie.
>>
>> --
>> ===
>> Fergus McMenemie    Email:fer...@twig.me.uk
>> Techmore Ltd        Phone:(UK) 07721 376021
>>
>> Unix/Mac/Intranets  Analyst Programmer
>> ===

--
===
Fergus McMenemie    Email:fer...@twig.me.uk
Techmore Ltd        Phone:(UK) 07721 376021
Unix/Mac/Intranets  Analyst Programmer
===
Re: Netbeans and Solr : Whac-A-Mole
>This testcase is quite independent of anything in Solr. It is a >standalone utility and the only dependency is stax. >discalimer (I run these testcases from Intellij and command line) >BTW are you using XpathRecordReader outside of DIH? Nobel, Is there a better way to test and play with XPathRecordReader.java other than ant -Dtestcase=TestXPathRecordReader test Which takes 8secs to run here? I am not using XpathRecordReader outside of DIH, but looking to see how I would add support for xpaths such as //a. Fergus. > >On Mon, Sep 7, 2009 at 3:26 PM, Fergus McMenemie wrote: >> Hello all, >> >> I would appreciate help from somebody who has set up Solr within >> netbeans, I am wanting to do more work with DIH and particularly its >> XpathEntityProcessor stuff. I wish to preform the following from >> within the IDE >> >> ant -Dtestcase=TestXPathRecordReader.java test >> >> I have spent a few hours playing Whac-A-Mole with classpath and source >> settings. In the end I got it down to zero flags, but I then added >> some test cases and the scanner thing then went off and flagged dozens >> files with undefined classes I removed my change but the rescan did not >> remove the dozens of flagged files. >> >> PS: I am a total netbeans newbie. >> >> -- >> >> === >> Fergus McMenemie Email:fer...@twig.me.uk >> Techmore Ltd Phone:(UK) 07721 376021 >> >> Unix/Mac/Intranets Analyst Programmer >> === >> > > > >-- >----- >Noble Paul | Principal Engineer| AOL | http://aol.com -- === Fergus McMenemie Email:fer...@twig.me.uk Techmore Ltd Phone:(UK) 07721 376021 Unix/Mac/Intranets Analyst Programmer ===
Re: Netbeans and Solr : Whac-A-Mole
>On Mon, Sep 7, 2009 at 5:58 PM, Fergus McMenemie wrote: > >> >This testcase is quite independent of anything in Solr. It is a >> >standalone utility and the only dependency is stax. >> >discalimer (I run these testcases from Intellij and command line) >> >BTW are you using XpathRecordReader outside of DIH? >> >> Nobel, >> >> Is there a better way to test and play with XPathRecordReader.java >> other than >> >> ant -Dtestcase=TestXPathRecordReader test >> >> Which takes 8secs to run here? I am not using XpathRecordReader >> outside of DIH, but looking to see how I would add support for >> xpaths such as //a. >> >> >The target takes a lot of time because it has to go through all the >test-cases in core and contribs trying to match the value given in >-Dtestcase. > >You could also do ant -Dtestcase=TestXPathRecordReader test-contrib which >should be a little faster. I run individual test cases directly through IDEA >which avoids these extra steps. > Shalin, Hmm, 6 seconds. I looked up IDEA and I guess I should be able to use it for free while working on solr. Is it easier to setup and come up the learning curve? Regards Fergus. -- === Fergus McMenemie Email:fer...@twig.me.uk Techmore Ltd Phone:(UK) 07721 376021 Unix/Mac/Intranets Analyst Programmer ===
Re: Specifying multiple documents in DataImportHandler dataConfig
You can only have one document tag, and the entities must be nested
within that. From the wiki, if you issue a simple
"/dataimport?command=full-import" all top-level entities will be
processed.

>Maybe I should be more clear: I have multiple tables in my DB that I
>need to save to my Solr index. In my app code I have logic to persist
>each table, which maps an application model to Solr. This is fine.
>I am just trying to speed up indexing time by using DIH instead of
>going through my application. From what I understand of DIH I can
>specify one dataSource element and then a series of document/entity
>sets, for each of my models. But like I said before, DIH only appears
>to want to index the first document declared under the dataSource tag.
>
>-Rupert
>
>On Tue, Sep 8, 2009 at 4:05 PM, Rupert Fiasco wrote:
>> I am using the DataImportHandler with a JDBC datasource. From my
>> understanding of DIH, for each of my "content types" e.g. Blog posts,
>> Mesh Categories, etc I would construct a series of document/entity
>> sets, like
>>
>>
>>
>> Solr parses this just fine and allows me to issue a
>> /dataimport?command=full-import and it runs, but it only runs against
>> the "first" document (blog_entries). It doesnt run against the 2nd
>> document (mesh_categories).
>>
>> If I remove the 2 document elements and wrap both entity sets in just
>> one document tag, then both sets get indexed, which seemingly achieves
>> my goal. This just doesnt make sense from my understanding of how DIH
>> works. My 2 content types are indeed separate so they logically
>> represent two document types, not one.
>>
>> Is this correct? What am I missing here?
>>
>> Thanks
>> -Rupert

--
===
Fergus McMenemie    Email:fer...@twig.me.uk
Techmore Ltd        Phone:(UK) 07721 376021
Unix/Mac/Intranets  Analyst Programmer
===
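By way of illustration, a minimal sketch of the single-document layout
being described, with both content types as top-level entities. The
database URL, credentials and column names are invented for the example;
only the overall shape matters:

<dataConfig>
  <dataSource driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost/mydb" user="u" password="p"/>
  <document>
    <!-- each top-level entity is processed by a full-import -->
    <entity name="blog_entries"
            query="select id, title, body from blog_entries">
      <field column="id"    name="id"/>
      <field column="title" name="title"/>
      <field column="body"  name="text"/>
    </entity>
    <entity name="mesh_categories"
            query="select id, name from mesh_categories">
      <field column="id"   name="id"/>
      <field column="name" name="title"/>
    </entity>
  </document>
</dataConfig>

Note the unique keys produced by the two entities must not collide,
since everything ends up in the one index.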
RE: Extract info from parent node during data import
>Hi Paul, >The forEach="/document/category/item | /document/category/name" didn't work >(no categoryname was stored or indexed). >However forEach="/document/category/item | /document/category" seems to work >well. I am not sure why category on its own works, but not category/name... >But thanks for tip. It wasn't as painful as I thought it would be. >Venn Hmmm, I had bother with this. Although each occurance of /document/category/item causes a new solr document to indexed, that document contained all the fields from the parent element as well. Did you see this? > >> From: noble.p...@corp.aol.com >> Date: Thu, 10 Sep 2009 09:58:21 +0530 >> Subject: Re: Extract info from parent node during data import >> To: solr-user@lucene.apache.org >> >> try this >> >> add two xpaths in your forEach >> >> forEach="/document/category/item | /document/category/name" >> >> and add a field as follows >> >> > commonField="true"/> >> >> Please try it out and let me know. >> >> On Thu, Sep 10, 2009 at 7:30 AM, venn hardy wrote: >> > >> > Hello, >> > >> > >> > >> > I am using SOLR 1.4 (from nighly build) and its URLDataSource in >> > conjunction with the XPathEntityProcessor. I have successfully imported >> > XML content, but I think I may have found a limitation when it comes to >> > the commonField attribute in the DataImportHandler. >> > >> > >> > >> > Before writing my own parser to read in a whole XML document, I thought >> > I'd post the question here (since I got some great advice last time). >> > >> > >> > >> > The bulk of my content is contained within each tag. However, each >> > item has a parent called and each category has a name which I >> > would like to import. In my forEach loop I specify the >> > /document/category/item as the collection of items I am interested in. Is >> > there anyway to extract an element from underneath a parent node? To be a >> > more more specific (see eg xml below). I would like to index the following: >> > >> > - category: Category 1; id: 1; author: Author 1 >> > >> > - category: Category 1; id: 2; author: Author 2 >> > >> > - category: Category 2; id: 3; author: Author 3 >> > >> > - category: Category 2; id: 4; author: Author 4 >> > >> > >> > >> > Any ideas on how I can get to a parent node from within a child during >> > data import? If it cant be done, what do you suggest would be the best way >> > so I can keep using the DataImportHandler... would XSLT be a good idea to >> > 'flatten out' the structure a bit? >> > >> > >> > >> > Thanks >> > >> > >> > >> > This is what my XML document looks like: >> > >> > >> > >> > Category 1 >> > >> > 1 >> > Author 1 >> > >> > >> > 2 >> > Author 2 >> > >> > >> > >> > Category 2 >> > >> > 3 >> > Author 3 >> > >> > >> > 4 >> > Author 4 >> > >> > >> > >> > >> > >> > >> > And this is what my dataConfig looks like: >> > >> > >> > >> > > > url="http://localhost:9080/data/20090817070752.xml"; >> > processor="XPathEntityProcessor" forEach="/document/category/item" >> > transformer="DateFormatTransformer" stream="true" dataSource="dataSource"> >> >> > commonField="true" /> >> > >> > >> > >> > >> > >> > >> > >> > >> > This is how I have specified my schema >> > >> > > > required="true" /> >> > >> > >> > >> > >> > id >> > id >> > >> > >> > >> > >> > >> > >> > _ >> > Need a place to rent, buy or share? Let us find your next place for you! 
>> > http://clk.atdmt.com/NMN/go/157631292/direct/01/ >> >> >> >> -- >> - >> Noble Paul | Principal Engineer| AOL | http://aol.com > >_ >Get Hotmail on your iPhone Find out how here >http://windowslive.ninemsn.com.au/article.aspx?id=845706 -- === Fergus McMenemie Email:fer...@twig.me.uk Techmore Ltd Phone:(UK) 07721 376021 Unix/Mac/Intranets Analyst Programmer ===
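For reference, a rough sketch of the arrangement Noble suggests above,
since the archive has stripped the config snippets from the quoted
mails. Element names and xpaths follow the example XML in this thread;
treat the details as an assumption rather than a tested config:

<entity name="items" processor="XPathEntityProcessor"
        url="http://localhost:9080/data/20090817070752.xml"
        dataSource="dataSource"
        forEach="/document/category/item | /document/category"
        stream="true">
  <!-- commonField carries the category name down onto each item row -->
  <field column="category" xpath="/document/category/name" commonField="true"/>
  <field column="id"       xpath="/document/category/item/id"/>
  <field column="author"   xpath="/document/category/item/author"/>
</entity>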
Re: Extract info from parent node during data import
e >>> >> > /document/category/item as the collection of items I am interested in. >>> >> > Is there anyway to extract an element from underneath a parent node? >>> >> > To be a more more specific (see eg xml below). I would like to index >>> >> > the following: >>> >> > >>> >> > - category: Category 1; id: 1; author: Author 1 >>> >> > >>> >> > - category: Category 1; id: 2; author: Author 2 >>> >> > >>> >> > - category: Category 2; id: 3; author: Author 3 >>> >> > >>> >> > - category: Category 2; id: 4; author: Author 4 >>> >> > >>> >> > >>> >> > >>> >> > Any ideas on how I can get to a parent node from within a child during >>> >> > data import? If it cant be done, what do you suggest would be the best >>> >> > way so I can keep using the DataImportHandler... would XSLT be a good >>> >> > idea to 'flatten out' the structure a bit? >>> >> > >>> >> > >>> >> > >>> >> > Thanks >>> >> > >>> >> > >>> >> > >>> >> > This is what my XML document looks like: >>> >> > >>> >> > >>> >> > >>> >> > Category 1 >>> >> > >>> >> > 1 >>> >> > Author 1 >>> >> > >>> >> > >>> >> > 2 >>> >> > Author 2 >>> >> > >>> >> > >>> >> > >>> >> > Category 2 >>> >> > >>> >> > 3 >>> >> > Author 3 >>> >> > >>> >> > >>> >> > 4 >>> >> > Author 4 >>> >> > >>> >> > >>> >> > >>> >> > >>> >> > >>> >> > >>> >> > And this is what my dataConfig looks like: >>> >> > >>> >> > >>> >> > >>> >> > >> >> > url="http://localhost:9080/data/20090817070752.xml"; >>> >> > processor="XPathEntityProcessor" forEach="/document/category/item" >>> >> > transformer="DateFormatTransformer" stream="true" >>> >> > dataSource="dataSource"> >>> >> > >> >> > commonField="true" /> >>> >> > >>> >> > >>> >> > >>> >> > >>> >> > >>> >> > >>> >> > >>> >> > >>> >> > This is how I have specified my schema >>> >> > >>> >> > >> >> > required="true" /> >>> >> > >>> >> > >>> >> > >>> >> > >>> >> > id >>> >> > id >>> >> > -- === Fergus McMenemie Email:fer...@twig.me.uk Techmore Ltd Phone:(UK) 07721 376021 Unix/Mac/Intranets Analyst Programmer ===
Re: [DIH] Multiple repeat XPath stmts
>I'm trying to import several RSS feeds using DIH and running into a
>bit of a problem. Some feeds define a GUID value that I map to my
>Solr ID, while others don't. I also have a link field which I fill in
>with the RSS link field. For the feeds that don't have the GUID value
>set, I want to use the link field as the id. However, if I define the
>same XPath twice, but map it to two diff. columns I don't get the id
>value set.
>
>For instance, I want to do:
>schema.xml
>required="true"/>
>
>DIH config:
>
>Because I am consolidating multiple fields, I'm not able to do
>copyFields, unless of course, I wanted to implement conditional copy
>fields (only copy if the field is not defined) which I would rather not.
>
>How do I solve this?

How about this: the TemplateTransformer does nothing if its source
expression is null. So the first transform assigns the fallback value to
the ID, and this is overwritten by the GUID if it is defined. You can
now sort of do if-then-else using a combination of the template and
regex transformers. Add a bit of maths to the transformers and I think
we will have a Turing-complete language :-)

fergus.

>Thanks,
>Grant

--
===
Fergus McMenemie    Email:fer...@twig.me.uk
Techmore Ltd        Phone:(UK) 07721 376021
Unix/Mac/Intranets  Analyst Programmer
===
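A minimal sketch of the fallback trick described above, guessing at the
DIH configuration since the original snippets were stripped from the
mail. The entity name, feed URL and xpaths are invented for
illustration, and the ordering relies on TemplateTransformer skipping a
template whose variable resolves to null:

<entity name="feed" processor="XPathEntityProcessor"
        url="http://example.com/rss.xml"
        forEach="/rss/channel/item"
        transformer="TemplateTransformer">
  <field column="link" xpath="/rss/channel/item/link"/>
  <field column="guid" xpath="/rss/channel/item/guid"/>
  <!-- first template: always seed id from the link -->
  <field column="id" template="${feed.link}"/>
  <!-- second template: overwrite id with the guid when it exists;
       skipped (does nothing) when ${feed.guid} is null -->
  <field column="id" template="${feed.guid}"/>
</entity>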
Re: FileListEntityProcessor and LineEntityProcessor
>Hi, > >I'm trying to import data from a list of files using the >FileListEntityProcessor. Here is my import configuration: > > > >baseDir="d:\my\directory\" fileName=".*WRK" recursive="false" >rootEntity="false"> > processor="LineEntityProcessor" >url="${f.fileAbsolutePath}" >dataSource="fileDataSource" >transformer="myTransformer"> > > > > >If I have only one file in d:\my\directory\ then everything works correctly. >If I have multiple files then I get the following exception: Sorry but I dont quite follow this. FileListEntityProcessor and LineEntityProcessor are somewhat similar in that they provide a list of filenames which the likes of XPathEntityProcessor then open and parse. Is the above your complete data-config.xml? Can you provide more detail on what you are trying to do? ... You seem to listing all files "d:\my\directory\.*WRK". Do these WRK files contain lists of files to be indexed? >Sep 16, 2009 9:48:46 AM org.apache.solr.handler.dataimport.DocBuilder >buildDocum >ent >SEVERE: Exception while processing: f document : null >org.apache.solr.handler.dataimport.DataImportHandlerException: Problem >reading f >rom input Processing Document # 53812 >at >org.apache.solr.handler.dataimport.LineEntityProcessor.nextRow(LineEn >tityProcessor.java:112) >at >org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(Ent >ityProcessorWrapper.java:237) >at >org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilde >r.java:348) >at >org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilde >r.java:376) >at >org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.j >ava:224) >at >org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java >:167) >at >org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImpo >rter.java:316) >at >org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.j >ava:376) >at >org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.ja >va:355) >Caused by: java.io.IOException: Stream closed >at java.io.BufferedReader.ensureOpen(Unknown Source) >at java.io.BufferedReader.readLine(Unknown Source) >at java.io.BufferedReader.readLine(Unknown Source) >at >org.apache.solr.handler.dataimport.LineEntityProcessor.nextRow(LineEn >tityProcessor.java:109) >... 
8 more >Sep 16, 2009 9:48:46 AM org.apache.solr.handler.dataimport.DataImporter >doFullIm >port >SEVERE: Full Import failed >org.apache.solr.handler.dataimport.DataImportHandlerException: Problem >reading f >rom input Processing Document # 53812 >at >org.apache.solr.handler.dataimport.LineEntityProcessor.nextRow(LineEn >tityProcessor.java:112) >at >org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(Ent >ityProcessorWrapper.java:237) >at >org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilde >r.java:348) >at >org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilde >r.java:376) >at >org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.j >ava:224) >at >org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java >:167) >at >org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImpo >rter.java:316) >at >org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.j >ava:376) >at >org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.ja >va:355) >Caused by: java.io.IOException: Stream closed >at java.io.BufferedReader.ensureOpen(Unknown Source) >at java.io.BufferedReader.readLine(Unknown Source) >at java.io.BufferedReader.readLine(Unknown Source) >at >org.apache.solr.handler.dataimport.LineEntityProcessor.nextRow(LineEn >tityProcessor.java:109) >... 8 more > > > >Note that my input files have 53812 lines, which is the same as the document >number that I'm choking on. Does anyone know what I'm doing wrong? > >Thanks, > >Wojtek >-- >View this message in context: >http://www.nabble.com/FileListEntityProcessor-and-LineEntityProcessor-tp25476443p25476443.html >Sent from the Solr - User mailing list archive at Nabble.com. -- === Fergus McMenemie Email:fer...@twig.me.uk Techmore Ltd Phone:(UK) 07721 376021 Unix/Mac/Intranets Analyst Programmer ===
Re: Extract info from parent node during data import (redirect:)
JIRA SOLR-1437 created "DIH: Enhance XPathRecordReader to deal with //tagname and other improvements." >Fergus, > >Implementing wildcard (//tagname) is definitely possible. I would love >to see it working. But if you wish to take a dig at it I shall do >whatever I can to help. > >>What is the use case that makes flow though so useful? >We do not know to which forEach xpath a given field is associated with. >Currently you can clean up the fields using a transformer. There is an >implicit field '$forEach' which tells you about the xpath tag for each >record that is emitted. > >>The recently added comments in XPathRecordReader are a great help and I >>was planning to add more. Might this be an issue? >I would love to have it. Give a patch and I shall commit it. >XPathRecordReader is a blackbox and AFAIK I am the only one who knows >it. I would love to have more eyes on that. > >>I would like to open a JIRA for improving XPathRecordReader. >Please go ahead. You can paste the contents of this mail in the list . >There may be others with similar ideas > >Noble. -- === Fergus McMenemie Email:fer...@twig.me.uk Techmore Ltd Phone:(UK) 07721 376021 Unix/Mac/Intranets Analyst Programmer ===
Number of terms in a SOLR field
Hi all,

I am attempting to test some changes I made to my DIH based indexing
process. The changes only affect the way I describe my fields in
data-config.xml; there should be no changes to the way the data is
indexed or stored.

As a QA check I want to compare the results from indexing the same data
before and after the change. I am looking for a way of getting counts of
terms in each field. I guess Luke etc must allow this, but how?

Regards Fergus.
Re: Number of terms in a SOLR field
>Fergus McMenemie wrote: >> Hi all, >> >> I am attempting to test some changes I made to my DIH based >> indexing process. The changes only affect the way I >> describe my fields in data-config.xml, there should be no >> changes to the way the data is indexed or stored. >> >> As a QA check I was wanting to compare the results from >> indexing the same data before/after the change. I was looking >> for a way of getting counts of terms in each field. I >> guess Luke etc most allow this but how? > >Luke uses brute force approach - it traverses all terms, and counts >terms per field. This is easy to implement yourself - just get >IndexReader.terms() enumeration and traverse it. > Thanks Andrzej This is just a one off QA check. How do I get Luke to display terms and counts? > >-- >Best regards, >Andrzej Bialecki Fergus. --
Re: Number of terms in a SOLR field
>Fergus McMenemie wrote: >>> Fergus McMenemie wrote: >>>> Hi all, >>>> >>>> I am attempting to test some changes I made to my DIH based >>>> indexing process. The changes only affect the way I >>>> describe my fields in data-config.xml, there should be no >>>> changes to the way the data is indexed or stored. >>>> >>>> As a QA check I was wanting to compare the results from >>>> indexing the same data before/after the change. I was looking >>>> for a way of getting counts of terms in each field. I >>>> guess Luke etc most allow this but how? >>> Luke uses brute force approach - it traverses all terms, and counts >>> terms per field. This is easy to implement yourself - just get >>> IndexReader.terms() enumeration and traverse it. >>> >> Thanks Andrzej >> >> This is just a one off QA check. How do I get Luke to display >> terms and counts? > >1. get Luke 0.9.9 >2. open index with Luke >3. Look at the Overview panel, you will see the list titled "Available >fields and term counts per field". > > Thanks, That got me going, and I felt a little stupid after stumbling across http://wiki.apache.org/solr/LukeRequestHandler Regards Fergus
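For anyone else landing here, the LukeRequestHandler page mentioned
above gives per-field term information over HTTP; something along these
lines should show the distinct term count and top terms for a field (the
host, port and field name are placeholders):

curl "http://localhost:8983/solr/admin/luke?fl=title&numTerms=20"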
Re: Query filters/analyzers
>On Thu, Oct 1, 2009 at 7:59 PM, Claudio Martella wrote:
>
>> About the copyField issue in general: as it copies the content to the
>> other field, what is the sense to define analyzers for the destination
>> field? The source is already analyzed so i guess that the RESULT of the
>> analysis is copied there.
>
>The copy is done before analysis. The original text is sent to the copyField
>which can choose to do analysis differently from the source field.

I have been wondering about this as well. The wiki is not explicit about
what happens. Is this correct:

  "The original text is sent to the copyField, before any configured
   analyzers for the originating or destination field are invoked."

If so, I will tweak the wiki!

Regds Fergus.
--
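To make the point concrete, a small schema.xml sketch (field names are
invented for the example, and the "text" and "string" types are assumed
to exist in the schema): the destination field receives the raw,
pre-analysis text and applies its own analysis.

<!-- source field: full text analysis -->
<field name="title"   type="text"   indexed="true" stored="true"/>
<!-- destination field: same raw input, but analyzed differently -->
<field name="title_s" type="string" indexed="true" stored="false"/>

<!-- the original, un-analyzed title text is fed to title_s as well -->
<copyField source="title" dest="title_s"/>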
Re: Error when indexing XML files
>Hi, > >I am trying to index XML files using SolrJ. The original XML file contains >nested elements. For example, the following is the snippet of the XML file. > > > SOMETHING > SOME_OTHER_THING > > >I have added the elements "name" and "facility" in Schema.xml file to make >these elements indexable. I have changed the XML document above to look like - > > > > .. > SOMETHING > .. > > > Can you send us the Schema.xml file you created? I suspect that one of the fields should be multivalued. -- Fergus.
Re: Using DIH's special commands....Help needed
Hi,

For example, my data-import.conf has the following. It allows me to
specify a parameter "single=pathname" on the URL used to invoke DIH, so
that a doc can be deleted from the index by its pathname (in my case),
which is stored in the field fileAbsolutePath. I feel sure this can be
optimised!

Fergus.

>On Thu, Oct 15, 2009 at 6:25 PM, William Pierce wrote:
>
>> Folks:
>>
>> I see in the DIH wiki that there are special commands which according to
>> the wiki
>>
>> "Special commands can be given to DIH by adding certain variables to the
>> row returned by any of the components."
>>
>> In my use case, my db contains rows that are marked "PendingDelete". How
>> do I use the $deleteDocByQuery special command to delete these rows using
>> DIH? In other words, where/how do I specify this?
>
>The $deleteDocByQuery is for deleting Solr documents by a Solr query and not
>DB rows.
>
>--
>Regards,
>Shalin Shekhar Mangar.

--
===
Fergus McMenemie    Email:fer...@twig.me.uk
Techmore Ltd        Phone:(UK) 07721 376021
Unix/Mac/Intranets  Analyst Programmer
===
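The configuration referred to above was stripped by the list archive.
Purely as a guess at its shape, and not the original or a tested config,
the idea combines the "single" request parameter with the
$deleteDocByQuery special command, attached to a row via a template
field inside an entity that uses the TemplateTransformer:

<!-- fragment only: lives inside an entity with
     transformer="TemplateTransformer" -->
<field column="$deleteDocByQuery"
       template="fileAbsolutePath:${dataimporter.request.single}"/>

The special command is only meaningful when the request actually
supplies the "single" parameter.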
Re: Error when indexing XML files
Hi, Please find the schema file attached. Please let me know what I am doing wrong. Regards Chaitali --- On Wed, 10/14/09, Fergus McMenemie wrote: From: Fergus McMenemie Subject: Re: Error when indexing XML files To: solr-user@lucene.apache.org Date: Wednesday, October 14, 2009, 2:25 AM >Hi, > >I am trying to index XML files using SolrJ. The original XML file contains >nested elements. For example, the following is the snippet of the XML file. > > > SOMETHING > SOME_OTHER_THING > > >I have added the elements "name" and "facility" in Schema.xml file to make >these elements indexable. I have changed the XML document above to look like - > > > > .. > SOMETHING > .. > > > Can you send us the Schema.xml file you created? I suspect that one of the fields should be multivalued. one or other, perhaps both your fields need to be -- Fergus.
Re: Error when indexing XML files
>Hi,
>
>Please find the schema file attached. Please let me know what I am doing wrong.
>
>Regards
>Chaitali
>
>--- On Wed, 10/14/09, Fergus McMenemie wrote:
>
>From: Fergus McMenemie
>Subject: Re: Error when indexing XML files
>To: solr-user@lucene.apache.org
>Date: Wednesday, October 14, 2009, 2:25 AM
>
>>Hi,
>>
>>I am trying to index XML files using SolrJ. The original XML file contains
>>nested elements. For example, the following is the snippet of the XML file.
>>
>> SOMETHING
>> SOME_OTHER_THING
>>
>>I have added the elements "name" and "facility" in Schema.xml file to make
>>these elements indexable. I have changed the XML document above to look like -
>>
>> ..
>> SOMETHING
>> ..
>
>Can you send us the Schema.xml file you created? I suspect that
>one of the fields should be multivalued.

One or other, perhaps both, of your fields need to be multiValued.

--
Fergus
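For what it is worth, repeated child elements generally call for a
multiValued field in schema.xml. A minimal sketch, using the field names
from this thread (the "text" type is an assumption):

<field name="name"     type="text" indexed="true" stored="true" multiValued="true"/>
<field name="facility" type="text" indexed="true" stored="true" multiValued="true"/>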
Re: Question about DIH execution order
Bertie, Not sure what you are trying to do, we need a clearer description of what "select *" returns and what you want to end up in the index. But to answer your question The transformations happen after DIH has performed the SQL statement. In fact the rows output from the SQL command are assigned to the DIH fields and then any transformations are applied. The examples in http://wiki.apache.org/solr/DataImportHandler are quite good. >Hi Noble, > > I tried to understand your suggestions and played different variations >according to your reply. But none of them work. Can you explain it in more >details? > Thanks a lot! > > > > >BTW, do you mean your solution as follows? > > > > template="Course:${Course.CourseId}" name="id"/> > > > > > > > But > 1) There is no TmpCourseId field column. > 2) Can we put two name CourseId and id in the same map? It seems not. > > > > > >2009/11/1 Noble Paul ?? Â Ë³Ë > >> On Sun, Nov 1, 2009 at 11:59 PM, Bertie Shen >> wrote: >> > Hi folks, >> > >> > I have the following data-config.xml. Is there a way to >> > let transformation take place after executing SQL "select comment from >> > Rating where Rating.CourseId = ${Course.CourseId}"? In MySQL database, >> > column CourseId in table Course is integer 1, 2, etc; >> > template transformation will make them like Course:1, Course:2; column >> > CourseId in table Rating is also integer 1, 2, etc. >> > >> > If transformation happens before executing "select comment from Rating >> > where Rating.CourseId = ${Course.CourseId}", then there will no match for >> > the SQL statement execution. >> > >> > >> > >> > > > column="CourseId" template="Course:${Course.CourseId}" name="id"/> >> > >> > >> > >> > >> > >> > >> >> keep the field as follows >> > column="TmpCourseId" name="CourseId" >> template="Course:${Course.CourseId}" name="id"/> >> >> >> >> >> -- >> - >> Noble Paul | Principal Engineer| AOL | http://aol.com >> -- === Fergus McMenemie Email:fer...@twig.me.uk Techmore Ltd Phone:(UK) 07721 376021 Unix/Mac/Intranets Analyst Programmer ===
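For reference, a cleaned-up guess at the arrangement Noble suggests in
the quoted mail, which sidesteps the ordering clash by leaving the raw
CourseId untouched and building the templated Solr key on a separate
column. This is untested and the table/column names simply follow the
quoted thread:

<entity name="Course" query="select * from Course"
        transformer="TemplateTransformer">
  <!-- keep the raw CourseId so the child query can still join on it -->
  <field column="CourseId" name="CourseId"/>
  <!-- build the Solr unique key from a template on a separate column -->
  <field column="id" template="Course:${Course.CourseId}"/>
  <entity name="Rating"
          query="select comment from Rating where Rating.CourseId = ${Course.CourseId}">
    <field column="comment" name="comment"/>
  </entity>
</entity>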
Trying to run solr-1.3.0 under tomcat 5.5.20 on OS X 10.5.5
(HostConfig.java:809) at org.apache.catalina.startup.HostConfig.deployWARs(HostConfig.java:698) at org.apache.catalina.startup.HostConfig.deployApps(HostConfig.java:472) at org.apache.catalina.startup.HostConfig.start(HostConfig.java:1122) at org.apache.catalina.startup.HostConfig.lifecycleEvent(HostConfig.java:310) at org.apache.catalina.util.LifecycleSupport.fireLifecycleEvent(LifecycleSupport.java:119) at org.apache.catalina.core.ContainerBase.start(ContainerBase.java:1021) at org.apache.catalina.core.StandardHost.start(StandardHost.java:718) at org.apache.catalina.core.ContainerBase.start(ContainerBase.java:1013) at org.apache.catalina.core.StandardEngine.start(StandardEngine.java:442) at org.apache.catalina.core.StandardService.start(StandardService.java:450) at org.apache.catalina.core.StandardServer.start(StandardServer.java:709) at org.apache.catalina.startup.Catalina.start(Catalina.java:551) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:585) at org.apache.catalina.startup.Bootstrap.start(Bootstrap.java:294) at org.apache.catalina.startup.Bootstrap.main(Bootstrap.java:432) So I guess the solrconfig.xml is been seen! Any help gratefully accepted! -- ======= Fergus McMenemie Email:[EMAIL PROTECTED] Techmore Ltd Phone:(UK) 07721 376021 Unix/Mac/Intranets Analyst Programmer ===
Re: Large Data Set Suggestions
>Greetings!
>
>I've been asked to do some indexing performance testing on Solr 1.3
>using large XML document data sets (10M-60M docs) with DIH versus SolrJ.
>
>Does anyone have any suggestions where I might find a good data set this
>size?
>
>I saw the wikipedia dump reference in the DIH wiki, but that is only in
>the 7M+ doc range.
>
>Any suggestions would be greatly appreciated.
>
>Thanks,
>
>Steve

How large should each document be? I quite often do testing using the
geonames_dd_dms_date_20081028 dataset from
http://earth-info.nga.mil/gns/html/namefiles.htm. It has 6.6M documents.
It is actually a CSV file but it is trivial to convert to XML.

--
=======
Fergus McMenemie    Email:[EMAIL PROTECTED]
Techmore Ltd        Phone:(UK) 07721 376021
Unix/Mac/Intranets  Analyst Programmer
===
Trying to run solr-1.3.0 under tomcat 5.5.20 on OS X 10.5.5 (works with 1.2.0)
Further to last message. I downloaded and repeated everything using Solr 1.2.0. This time everything worked fine! But I have to confess that my system is running 10.4.11 tiger rather than leopard, I do not know if that is significant. So it seems the instructions for deploying solr version 1.3.0 to tomcat under OS X tiger do not work. The only change is the version of solr. Any ideas? At 12:25 + 6/11/08, Fergus McMenemie wrote: >Hello all, > >I downloaded everything and set it up as per the instructions, and while it >does run under jetty, I can not get it to start under tomcat at all. I get >the following errors. This is with solrconfig.xml straight from the tgz file. > >HTTP Status 500 - > Severe errors in solr configuration. > Check your log files for more detailed information on what may be wrong. > If you want solr to continue after configuration errors, > change: false in > null > - >java.lang.RuntimeException: java.lang.NoSuchMethodError: > > org.apache.lucene.index.IndexReader.open(Lorg/apache/lucene/store/Directory;Z)Lorg/apache/lucene/index/IndexReader; > >at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:960) >at org.apache.solr.core.SolrCore.(SolrCore.java:470) >at >org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:119) > >at org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:69) > >at >org.apache.catalina.core.ApplicationFilterConfig.getFilter(ApplicationFilterConfig.java:223) > >at >org.apache.catalina.core.ApplicationFilterConfig.setFilterDef(ApplicationFilterConfig.java:304) > >at >org.apache.catalina.core.ApplicationFilterConfig.(ApplicationFilterConfig.java:77) > >at >org.apache.catalina.core.StandardContext.filterStart(StandardContext.java:3634) > >at org.apache.catalina.core.StandardContext.start(StandardContext.java:4217) >at >org.apache.catalina.core.ContainerBase.addChildInternal(ContainerBase.java:759) > >at org.apache.catalina.core.ContainerBase.addChild(ContainerBase.java:739) >at org.apache.catalina.core.StandardHost.addChild(StandardHost.java:524) >at org.apache.catalina.startup.HostConfig.deployWAR(HostConfig.java:809) >at org.apache.catalina.startup.HostConfig.deployWARs(HostConfig.java:698) >at org.apache.catalina.startup.HostConfig.deployApps(HostConfig.java:472) >at org.apache.catalina.startup.HostConfig.start(HostConfig.java:1122) >at org.apache.catalina.startup.HostConfig.lifecycleEvent(HostConfig.java:310) >at >org.apache.catalina.util.LifecycleSupport.fireLifecycleEvent(LifecycleSupport.java:119) > >at org.apache.catalina.core.ContainerBase.start(ContainerBase.java:1021) >at org.apache.catalina.core.StandardHost.start(StandardHost.java:718) >at org.apache.catalina.core.ContainerBase.start(ContainerBase.java:1013) >at org.apache.catalina.core.StandardEngine.start(StandardEngine.java:442) >at org.apache.catalina.core.StandardService.start(StandardService.java:450) >at org.apache.catalina.core.StandardServer.start(StandardServer.java:709) >at org.apache.catalina.startup.Catalina.start(Catalina.java:551) >at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) >at >sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) >at >sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) > >at java.lang.reflect.Method.invoke(Method.java:585) >at org.apache.catalina.startup.Bootstrap.start(Bootstrap.java:294) >at org.apache.catalina.startup.Bootstrap.main(Bootstrap.java:432). 
> >Now I retried the above with solrconfig.xml set to >false and saw no change. So I was wondering if the config file was not being >seen. So I renamed the .solr directory to ./solr-not and retried:- > >HTTP Status 500 - > Severe errors in solr configuration. > Check your log files for more detailed information on what may be wrong. > If you want solr to continue after configuration errors, > change: false in > null > - >java.lang.RuntimeException: Can't find resource 'solrconfig.xml' in classpath >or 'solr/conf/', cwd=/usr/local >at >org.apache.solr.core.SolrResourceLoader.openResource(SolrResourceLoader.java:194) > >at >org.apache.solr.core.SolrResourceLoader.openConfig(SolrResourceLoader.java:162) > >at org.apache.solr.core.Config.(Config.java:100) >at org.apache.solr.core.SolrConfig.(SolrConfig.java:113) >at org.apache.solr.core.SolrConfig.(SolrConfig.java:70) >at >org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:1
Newbie! Trying to run solr-1.3.0 under tomcat. Please help
Hello all, Further to various messages. I just cannot get solr 1.3 to launch under OS X with tomcat. Solr 1.2 works fine with tomcat and I am OK with 1.3 under jetty. I have tried tomcat-5.5.20 and 5.2.27. I have tried solr 1.3.0 plus the nightly build. I have tried under OS X 10.5 and 10.4 (leopard and tiger) all fail as follows. I also tried cutting and pasting the instructions from:- http://wiki.apache.org/solr/SolrTomcat Here is what I see on the browser. When I try to access http://localhost:8080/solr At 14:26 + 14/11/08, Fergus McMenemie wrote: >HTTP Status 500 - Severe errors in solr configuration. > Check your log files for more detailed information on what may be wrong. > If you want solr to continue after configuration errors, change: > false in null > - >java.lang.RuntimeException: java.lang.NoSuchMethodError: >org.apache.lucene.index.IndexReader.open(Lorg/apache/lucene/store/Directory;Z)Lorg/apache/lucene/index/IndexReader; > at org at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1065) > at org at org.apache.solr.core.SolrCore.(SolrCore.java:553) > at org at > org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:120) > at org at > org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:69) > at org at > org.apache.catalina.core.ApplicationFilterConfig.getFilter(ApplicationFilterConfig.java:221) > at org at > org.apache.catalina.core.ApplicationFilterConfig.setFilterDef(ApplicationFilterConfig.java:302) > at org at > org.apache.catalina.core.ApplicationFilterConfig.(ApplicationFilterConfig.java:78) > at org at > org.apache.catalina.core.StandardContext.filterStart(StandardContext.java:3635) > at org at > org.apache.catalina.core.StandardContext.start(StandardContext.java:4222) > at org at > org.apache.catalina.core.ContainerBase.addChildInternal(ContainerBase.java:760) > at org at > org.apache.catalina.core.ContainerBase.addChild(ContainerBase.java:740) > at org at > org.apache.catalina.core.StandardHost.addChild(StandardHost.java:544) > at org at > org.apache.catalina.startup.HostConfig.deployDescriptor(HostConfig.java:626) > at org at > org.apache.catalina.startup.HostConfig.deployDescriptors(HostConfig.java:553) > at org at > org.apache.catalina.startup.HostConfig.deployApps(HostConfig.java:488) > at org at org.apache.catalina.startup.HostConfig.start(HostConfig.java:1150) > at org at > org.apache.catalina.startup.HostConfig.lifecycleEvent(HostConfig.java:311) > at org at > org.apache.catalina.util.LifecycleSupport.fireLifecycleEvent(LifecycleSupport.java:120) > at org at > org.apache.catalina.core.ContainerBase.start(ContainerBase.java:1022) > at org at org.apache.catalina.core.StandardHost.start(StandardHost.java:736) > at org at > org.apache.catalina.core.ContainerBase.start(ContainerBase.java:1014) > at org at > org.apache.catalina.core.StandardEngine.start(StandardEngine.java:443) > at org at > org.apache.catalina.core.StandardService.start(StandardService.java:448) > at org at > org.apache.catalina.core.StandardServer.start(StandardServer.java:700) > at org at org.apache.catalina.startup.Catalina.start(Catalina.java:552) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) > at java.lang.reflect.Method.invoke(Method.java:585) > at org at org.apache.catalina.startup.Bootstrap.start(Bootstrap.java:295) > at org at 
org.apache.catalina.startup.Bootstrap.main(Bootstrap.java:433) >Caused by: java.lang.NoSuchMethodError: >org.apache.lucene.index.IndexReader.open(Lorg/apache/lucene/store/Directory;Z)Lorg/apache/lucene/index/IndexReader; > at org at > org.apache.solr.search.SolrIndexSearcher.(SolrIndexSearcher.java:109) > at org at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1055) ... > 30 more Here is a dump from tomcat/logs/catalina.out. It suggests there is something wrong with my solr/home property, however you can see that earlier on it seemed ok with this property. At 14:26 + 14/11/08, Fergus McMenemie wrote: >Nov 14, 2008 4:55:33 AM org.apache.catalina.core.AprLifecycleListener >lifecycleEvent INFO: > The Apache Tomcat Native library which allows optimal performance in > production environments was not found on the java.library.path: > /usr/local/bin:.:/Library/Java/Extensions:/System/Library/Java/Extensions:/usr/lib/java >Nov 14, 2008 4:55:33 AM org.apache.coyote.http11.Http11BaseProtocol init INFO: >Initializing Coyote HTTP/1.1 on http-8080 >Nov 14, 2008 4:55
Re: Newbie! Trying to run solr-1.3.0 under tomcat. Solved!
Erik, Thanks for "proving" the stuff for me. I started taking my system apart and was considering a fresh install, when I came across an old lucene jar file /Library/Java/Extensions/lucene-core-2.3.1.jar which was there after a bootcamp tutorial! It was on both my tiger and leopard machines as well. I guess that explains why solr1.2 worked! But what does jetty do differently from tomcat? Regards Fergus. >To be fair, my first message was about Solr trunk + Tomcat 5.5.27, but >I just tried it by pointing to a Solr 1.3.0 official release and it >worked fine as well. > > Erik > >On Nov 14, 2008, at 12:30 PM, Erik Hatcher wrote: > >> Fergus, >> >> I just downloaded Tomcat 5.5.27, put a solr.xml file in conf/ >> Catalina/localhost with the following: >> >> > debug="0" crossContext="true" > >> >> >> >> And Solr started up just fine and it's admin, etc worked as expected. >> >> Oh, and on Mac OS X (of course!), version 10.5.5. >> >> Erik >> >> On Nov 14, 2008, at 12:17 PM, Fergus McMenemie wrote: >> >>> Hello all, >>> >>> Further to various messages. I just cannot get solr 1.3 to launch >>> under OS X with tomcat. Solr 1.2 works fine with tomcat and I am >>> OK with 1.3 under jetty. >>> >>> >>> I have tried tomcat-5.5.20 and 5.2.27. I have tried solr >>> 1.3.0 plus the nightly build. I have tried under OS X 10.5 >>> and 10.4 (leopard and tiger) all fail as follows. I also >>> tried cutting and pasting the instructions from:- >>> http://wiki.apache.org/solr/SolrTomcat >>> >>> Here is what I see on the browser. When I try to access >>> http://localhost:8080/solr >>> >>> >>> At 14:26 + 14/11/08, Fergus McMenemie wrote: >>>> HTTP Status 500 - Severe errors in solr configuration. >>>> Check your log files for more detailed information on what may be >>>> wrong. 
>>>> If you want solr to continue after configuration errors, change: >>>> false in null >>>> - >>>> java.lang.RuntimeException: java.lang.NoSuchMethodError: >>>> org.apache.lucene.index.IndexReader.open(Lorg/apache/lucene/store/ >>>> Directory;Z)Lorg/apache/lucene/index/IndexReader; >>>> at org at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java: >>>> 1065) >>>> at org at org.apache.solr.core.SolrCore.(SolrCore.java:553) >>>> at org at org.apache.solr.core.CoreContainer >>>> $Initializer.initialize(CoreContainer.java:120) >>>> at org at >>>> org >>>> .apache >>>> .solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:69) >>>> at org at >>>> org >>>> .apache >>>> .catalina >>>> .core >>>> .ApplicationFilterConfig.getFilter(ApplicationFilterConfig.java:221) >>>> at org at >>>> org >>>> .apache >>>> .catalina >>>> .core >>>> .ApplicationFilterConfig.setFilterDef(ApplicationFilterConfig.java: >>>> 302) >>>> at org at >>>> org >>>> .apache >>>> .catalina >>>> .core.ApplicationFilterConfig.(ApplicationFilterConfig.java: >>>> 78) >>>> at org at >>>> org >>>> .apache >>>> .catalina.core.StandardContext.filterStart(StandardContext.java: >>>> 3635) >>>> at org at >>>> org >>>> .apache.catalina.core.StandardContext.start(StandardContext.java: >>>> 4222) >>>> at org at >>>> org >>>> .apache >>>> .catalina.core.ContainerBase.addChildInternal(ContainerBase.java: >>>> 760) >>>> at org at >>>> org.apache.catalina.core.ContainerBase.addChild(ContainerBase.java: >>>> 740) >>>> at org at >>>> org.apache.catalina.core.StandardHost.addChild(StandardHost.java: >>>> 544) >>>> at org at >>>> org >>>> .apache >>>> .catalina.startup.HostConfig.deployDescriptor(HostConfig.java:626) >>>> at org at >>>> org >>>> .apache >>>> .catalina.startup.HostConfig.deployDescriptors(HostConfig.java:553) >>>> at org at >>>> org.apache.catalina.startup.HostConfig.deployApps(HostCon
Upgrade from 1.2 to 1.3 gives 3x slowdown
Hello,

I have a CSV file with 6M records which took 22min to index with
solr 1.2. I then stopped tomcat, replaced the solr stuff inside webapps
with version 1.3, wiped my index and restarted tomcat.

Indexing the exact same content now takes 69min. My machine has 2GB of
RAM and tomcat is running with $JAVA_OPTS -Xmx512M -Xms512M.

Are there any tweaks I can use to get the original index time back? I
read through the release notes and was expecting a speed-up. I saw the
bit about increasing ramBufferSizeMB and set it to 64MB; it had no effect.

--
===
Fergus McMenemie    Email:[EMAIL PROTECTED]
Techmore Ltd        Phone:(UK) 07721 376021
Unix/Mac/Intranets  Analyst Programmer
===
Re: Upgrade from 1.2 to 1.3 gives 3x slowdown
Hello Grant,

>Were you overwriting the existing index or did you also clean out the
>Solr data directory, too? In other words, was it a fresh index, or an
>existing one? And was that also the case for the 22 minute time?

No, in each case it was a new index. I store the indexes (the "data"
dir) outside the solr home directory. For the moment I rm -rf the index
dir after each edit to the solrconfig.xml or schema.xml file and reindex
from scratch. The relaunch of tomcat recreates the index dir.

>Would it be possible to profile the two instances and see if you notice
>anything different?

I don't understand this. Do you mean run a profiler against the tomcat
image as indexing takes place, or somehow compare the indexes? I was
thinking of making a short script that replicates the results, and
posting it here; would that help?

>Thanks,
>Grant
>
>On Nov 19, 2008, at 8:25 AM, Fergus McMenemie wrote:
>
>> Hello,
>>
>> I have a CSV file with 6M records which took 22min to index with
>> solr 1.2. I then stopped tomcat replaced the solr stuff inside
>> webapps with version 1.3, wiped my index and restarted tomcat.
>>
>> Indexing the exact same content now takes 69min. My machine has
>> 2GB of RAM and tomcat is running with $JAVA_OPTS -Xmx512M -Xms512M.
>>
>> Are there any tweaks I can use to get the original index time
>> back. I read through the release notes and was expecting a
>> speed up. I saw the bit about increasing ramBufferSizeMB and set
>> it to 64MB; it had no effect.
>> --
>>
>> ===
>> Fergus McMenemie    Email:[EMAIL PROTECTED]
>> Techmore Ltd        Phone:(UK) 07721 376021
>>
>> Unix/Mac/Intranets  Analyst Programmer
>> ===

--
===
Fergus McMenemie    Email:[EMAIL PROTECTED]
Techmore Ltd        Phone:(UK) 07721 376021
Unix/Mac/Intranets  Analyst Programmer
===
Re: [VOTE] Community Logo Preferences
https://issues.apache.org/jira/secure/attachment/12394263/apache_solr_a_blue.jpg
Re: Upgrade from 1.2 to 1.3 gives 3x slowdown + script!
Hello Grant, Not much good with Java profilers (yet!) so I thought I would send a script! Details... details! Having decided to produce a script to replicate the 1.2 vis 1.3 speed problem. The required rigor revealed a lot more. 1) The faster version I have previously referred to as 1.2, was actually a "1.3-dev" I had downloaded as part of the solr bootcamp class at ApacheCon Europe 2008. The ID string in the CHANGES.txt document is:- $Id: CHANGES.txt 643465 2008-04-01 16:10:19Z gsingers $ 2) I did actually download and speed test a version of 1.2 from the internet. It's CHANGES.txt id is:- $Id: CHANGES.txt 543263 2007-05-31 21:19:02Z yonik $ Speed wise it was about the same as 1.3 at 64min. It also had lots of char set issues and is ignored from now on. 3) The version I was planning to use, till I found this, speed issue was the "latest" official version:- $Id: CHANGES.txt 694377 2008-09-11 17:40:11Z klaas $ I also verified the behavior with a nightly build. $Id: CHANGES.txt 712457 2008-11-09 01:24:11Z koji $ Anyway, The following script indexes the content in 22min for the 1.3-dev version and takes 68min for the newer releases of 1.3. I took the conf directory from the 1.3dev (bootcamp) release and used it replace the conf directory from the official 1.3 release. The 3x slow down was still there; it is not a configuration issue! = #! /bin/bash # This script assumes a /usr/local/tomcat link to whatever version # of tomcat you have installed. I have "apache-tomcat-5.5.20" Also # /usr/local/tomcat/conf/Catalina/localhost contains no solr.xml. # All the following was done as root. # I have a directory /usr/local/ts which contains four versions of solr. The # "official" 1.2 along with two 1.3 releases and a version of 1.2 or a 1.3beata # I got while attending a solr bootcamp. I indexed the same content using the # different versions of solr as follows: cd /usr/local/ts if [ "" ] then echo "Starting from a-fresh" sleep 5 # allow time for me to interrupt! cp -Rp apache-solr-bc/example/solr ./solrbc #bc = bootcamp cp -Rp apache-solr-nightly/example/solr ./solrnightly cp -Rp apache-solr-1.3.0/example/solr ./solr13 # the gaz is regularly updated and its name keeps changing :-) The page # http://earth-info.nga.mil/gns/html/namefiles.htm has a link to the latest # version. curl "http://earth-info.nga.mil/gns/html/geonames_dd_dms_date_20081118.zip"; > geonames.zip unzip -q geonames.zip # delete corrupt blips! perl -i -n -e 'print unless ($. > 2128495 and $. < 2128505) or ($. > 5944254 and $. < 5944260) ;' geonames_dd_dms_date_20081118.txt #following was used to detect bad short records #perl -a -F\\t -n -e ' print "line $. 
is bad with ",scalar(@F)," args\n" if (@F != 26);' geonames_dd_dms_date_20081118.txt # my set of fields and copyfields for the schema.xml fields=' ' copyfields=' ' # add in my fields and copyfields perl -i -p -e "print qq($fields) if s///;" solr*/conf/schema.xml perl -i -p -e "print qq($copyfields) if s[][];" solr*/conf/schema.xml # change the unique key and mark the "id" field as not required perl -i -p -e "s/id/UNI/i;" solr*/conf/schema.xml perl -i -p -e 's/required="true"//i if m/http://localhost:8080/solr/update/csv?commit=true&stream.file=$file&escape=%00&separator=%09&skip=$skip"; echo "Getting ready to index the data set using solrnightly" /usr/local/tomcat/bin/shutdown.sh sleep 15 if [ -n "`ps awxww | grep tomcat | grep -v grep`" ] then echo "Tomcat would not shutdown" exit fi rm -r /usr/local/tomcat/webapps/solr* rm -r /usr/local/tomcat/logs/*.out rm -r /usr/local/tomcat/work/Catalina/localhost/solr cp apache-solr-nightly/example/webapps/solr.war /usr/local/tomcat/webapps rm solr # rm the symbolic link ln -s solrnightly solr rm -r solr/data /usr/local/tomcat/bin/startup.sh sleep 10 # give solr time to launch and setup echo "Starting indexing at " `date` " with solrnightly" time curl "http://localhost:8080/solr/update/csv?commit=true&stream.file=$file&escape=%00&separator=%09&skip=$skip"; >On Nov 20, 2008, at 9:18 AM, Fergus McMenemie wrote: > >> Hello Grant, >> >>> Were you overwriting the existing index or did you also clean out the >>> Solr data directory, too? In other words, was it a fresh index, or >>> an >>> existing one? And was that also the case for the
Re: Upgrade from 1.2 to 1.3 gives 3x slowdown + script!
Hello Grant, > >Haven't forgotten about you, but I've been traveling and then into >some US Holidays here. Happy thanks giving! > >To confirm I am understanding, you are seeing a slowdown between 1.3- >dev from April and one from September, right? Yep. Here are the MD5 hashes:- fergus: md5 *.war MD5 (solr-bc.war) = 8d4f95628d6978c959d63d304788bc25 MD5 (solr-nightly.war) = 10281455a66b0035ee1f805496d880da This is the META-INF/MANIFEST.MF from a recent nightly build. (slow) Manifest-Version: 1.0 Ant-Version: Apache Ant 1.7.0 Created-By: 1.5.0_06-b05 (Sun Microsystems Inc.) Extension-Name: org.apache.solr Specification-Title: Apache Solr Search Server Specification-Version: 1.3.0.2008.11.13.08.16.12 Specification-Vendor: The Apache Software Foundation Implementation-Title: org.apache.solr Implementation-Version: nightly exported - yonik - 2008-11-13 08:16:12 Implementation-Vendor: The Apache Software Foundation X-Compile-Source-JDK: 1.5 X-Compile-Target-JDK: 1.5 This is war file we were given on the course Manifest-Version: 1.0 Ant-Version: Apache Ant 1.7.0 Created-By: 1.5.0_13-121 ("Apple Computer, Inc.") Extension-Name: org.apache.solr Specification-Title: Apache Solr Search Server Specification-Version: 1.2.2008.04.04.08.09.14 Specification-Vendor: The Apache Software Foundation Implementation-Title: org.apache.solr Implementation-Version: 1.3-dev exported - erik - 2008-04-04 08:09:14 Implementation-Vendor: The Apache Software Foundation X-Compile-Source-JDK: 1.5 X-Compile-Target-JDK: 1.5 I have copied both war files to a web site http://www.twig.me.uk/solr/solr-bc.war (solr 1.3 dev == bootcamp) http://www.twig.me.uk/solr/solr-nightly.war (nightly) Regards Fergus. >Can you produce an MD5 hash of the WAR file or something, such that I >can know I have the exact bits. Better yet, perhaps you can put those >files up somewhere where they can be downloaded. > >Thanks, >Grant > >On Nov 26, 2008, at 10:54 AM, Fergus McMenemie wrote: > >> Hello Grant, >> >> Not much good with Java profilers (yet!) so I thought I >> would send a script! >> >> Details... details! Having decided to produce a script to >> replicate the 1.2 vis 1.3 speed problem. The required rigor >> revealed a lot more. >> >> 1) The faster version I have previously referred to as 1.2, >> was actually a "1.3-dev" I had downloaded as part of the >> solr bootcamp class at ApacheCon Europe 2008. The ID >> string in the CHANGES.txt document is:- >> $Id: CHANGES.txt 643465 2008-04-01 16:10:19Z gsingers $ >> >> 2) I did actually download and speed test a version of 1.2 >> from the internet. It's CHANGES.txt id is:- >> $Id: CHANGES.txt 543263 2007-05-31 21:19:02Z yonik $ >> Speed wise it was about the same as 1.3 at 64min. It also >> had lots of char set issues and is ignored from now on. >> >> 3) The version I was planning to use, till I found this, >> speed issue was the "latest" official version:- >> $Id: CHANGES.txt 694377 2008-09-11 17:40:11Z klaas $ >> I also verified the behavior with a nightly build. >> $Id: CHANGES.txt 712457 2008-11-09 01:24:11Z koji $ >> >> Anyway, The following script indexes the content in 22min >> for the 1.3-dev version and takes 68min for the newer releases >> of 1.3. I took the conf directory from the 1.3dev (bootcamp) >> release and used it replace the conf directory from the >> official 1.3 release. The 3x slow down was still there; it is >> not a configuration issue! >> = >> >> >> >> >> >> >> #! 
/bin/bash >> >> # This script assumes a /usr/local/tomcat link to whatever version >> # of tomcat you have installed. I have "apache-tomcat-5.5.20" Also >> # /usr/local/tomcat/conf/Catalina/localhost contains no solr.xml. >> # All the following was done as root. >> >> >> # I have a directory /usr/local/ts which contains four versions of >> solr. The >> # "official" 1.2 along with two 1.3 releases and a version of 1.2 or >> a 1.3beata >> # I got while attending a solr bootcamp. I indexed the same content >> using the >> # different versions of solr as follows: >> cd /usr/local/ts >> if [ "" ] >> then >> echo "Starting from a-fresh" >> sleep 5 # allow time for me to interrupt! >> cp -Rp apache-solr-bc/example/solr ./solrbc #bc = bootcamp >> cp -Rp apache-solr-nightly/example/solr ./solrnightly >> cp -Rp apache-solr-1.3.0/example/solr ./solr13 >> >> # the gaz is regularly u
Re: Upgrade from 1.2 to 1.3 gives 3x slowdown + script!
Yonik >Another thought I just had - do you have autocommit enabled? > No; not as far as I know! The solrconfig.xml from the two versions are equivalent as best I can tell, also they are exactly as provided in the download. The only changes were made by the attached script and should not affect committing. Finally the indexing command has commit=true, which I think means do a single commit at the end of the file? Regards Fergus. >A lucene commit is now more expensive because it syncs the files for >safety. If you commit frequently, this could definitely cause a >slowdown. > >-Yonik > >On Wed, Nov 26, 2008 at 10:54 AM, Fergus McMenemie <[EMAIL PROTECTED]> wrote: >> Hello Grant, >> >> Not much good with Java profilers (yet!) so I thought I >> would send a script! >> >> Details... details! Having decided to produce a script to >> replicate the 1.2 vis 1.3 speed problem. The required rigor >> revealed a lot more. >> >> 1) The faster version I have previously referred to as 1.2, >> was actually a "1.3-dev" I had downloaded as part of the >> solr bootcamp class at ApacheCon Europe 2008. The ID >> string in the CHANGES.txt document is:- >> $Id: CHANGES.txt 643465 2008-04-01 16:10:19Z gsingers $ >> >> 2) I did actually download and speed test a version of 1.2 >> from the internet. It's CHANGES.txt id is:- >> $Id: CHANGES.txt 543263 2007-05-31 21:19:02Z yonik $ >> Speed wise it was about the same as 1.3 at 64min. It also >> had lots of char set issues and is ignored from now on. >> >> 3) The version I was planning to use, till I found this, >> speed issue was the "latest" official version:- >> $Id: CHANGES.txt 694377 2008-09-11 17:40:11Z klaas $ >> I also verified the behavior with a nightly build. >> $Id: CHANGES.txt 712457 2008-11-09 01:24:11Z koji $ >> >> Anyway, The following script indexes the content in 22min >> for the 1.3-dev version and takes 68min for the newer releases >> of 1.3. I took the conf directory from the 1.3dev (bootcamp) >> release and used it replace the conf directory from the >> official 1.3 release. The 3x slow down was still there; it is >> not a configuration issue! >> = >> >> >> >> >> >> >> #! /bin/bash >> >> # This script assumes a /usr/local/tomcat link to whatever version >> # of tomcat you have installed. I have "apache-tomcat-5.5.20" Also >> # /usr/local/tomcat/conf/Catalina/localhost contains no solr.xml. >> # All the following was done as root. >> >> >> # I have a directory /usr/local/ts which contains four versions of solr. The >> # "official" 1.2 along with two 1.3 releases and a version of 1.2 or a >> 1.3beata >> # I got while attending a solr bootcamp. I indexed the same content using the >> # different versions of solr as follows: >> cd /usr/local/ts >> if [ "" ] >> then >> echo "Starting from a-fresh" >> sleep 5 # allow time for me to interrupt! >> cp -Rp apache-solr-bc/example/solr ./solrbc #bc = bootcamp >> cp -Rp apache-solr-nightly/example/solr ./solrnightly >> cp -Rp apache-solr-1.3.0/example/solr ./solr13 >> >> # the gaz is regularly updated and its name keeps changing :-) The page >> # http://earth-info.nga.mil/gns/html/namefiles.htm has a link to the latest >> # version. >> curl >> "http://earth-info.nga.mil/gns/html/geonames_dd_dms_date_20081118.zip"; > >> geonames.zip >> unzip -q geonames.zip >> # delete corrupt blips! >> perl -i -n -e 'print unless >> ($. > 2128495 and $. < 2128505) or >> ($. > 5944254 and $. 
< 5944260) >> ;' geonames_dd_dms_date_20081118.txt >> #following was used to detect bad short records >> #perl -a -F\\t -n -e ' print "line $. is bad with ",scalar(@F)," args\n" >> if (@F != 26);' geonames_dd_dms_date_20081118.txt >> >> # my set of fields and copyfields for the schema.xml >> fields=' >> >> > required="true" /> >> > stored="true"/> >> > stored="true"/> >> > stored="true"/> >> > stored="true"/> >> > stored="true"/> >> > stored="true"/> >> > stored="true"/> >> > stored="true"/> >> > stored="true"/> >>
correct use of copyFields in schema.xml
Hello all, Reviewing the various examples that come with Solr I can't make up my mind whether the copyField elements should be nested within the fields element or not. The http://wiki.apache.org/solr/SchemaXml documentation makes it clear they should be outside, yet a number of examples have them nested. IMHO, being able to nest copyField inside fields makes for more self-documenting code! Regards Fergus -- === Fergus McMenemie Email:fer...@twig.me.uk Techmore Ltd Phone:(UK) 07721 376021 Unix/Mac/Intranets Analyst Programmer ===
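As a concrete illustration of the layout the wiki describes, here is a minimal schema.xml sketch with made-up field names; the copyField declarations sit after the fields element as its siblings, not inside it.

  <schema name="example" version="1.1">
    <types>
      <fieldType name="string" class="solr.StrField"/>
      <fieldType name="text" class="solr.TextField">
        <analyzer>
          <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        </analyzer>
      </fieldType>
    </types>
    <fields>
      <field name="id"    type="string" indexed="true" stored="true" required="true"/>
      <field name="title" type="text"   indexed="true" stored="true"/>
      <field name="body"  type="text"   indexed="true" stored="true"/>
      <field name="text"  type="text"   indexed="true" stored="false" multiValued="true"/>
    </fields>
    <uniqueKey>id</uniqueKey>
    <defaultSearchField>text</defaultSearchField>
    <!-- copyField elements are siblings of <fields>, not children of it -->
    <copyField source="title" dest="text"/>
    <copyField source="body"  dest="text"/>
  </schema>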
getting DIH to read my XML files
Hello, I am trying to use DIH with FileListEntityProcessor to to walk the disk and read XML documents. I have a dataConfig.xml as follows:- 0 But when I try and start the walker I get:- INFO: [jdocs] REMOVING ALL DOCUMENTS FROM INDEX Jan 13, 2009 3:38:11 PM org.apache.solr.core.SolrDeletionPolicy onInit INFO: SolrDeletionPolicy.onInit: commits:num=2 commit{dir=/Volumes/spare/ts/solrnightlyj/data/index,segFN=segments_1,version=1231861070710,generation=1,filenames=[segments_1] commit{dir=/Volumes/spare/ts/solrnightlyj/data/index,segFN=segments_2,version=1231861070711,generation=2,filenames=[segments_2] Jan 13, 2009 3:38:11 PM org.apache.solr.core.SolrDeletionPolicy updateCommits INFO: last commit = 1231861070711 Jan 13, 2009 3:38:11 PM org.apache.solr.handler.dataimport.DocBuilder buildDocument SEVERE: Exception while processing: jcurrent document : null org.apache.solr.handler.dataimport.DataImportHandlerException: No dataSource :null available for entity :x Processing Document # 1 at org.apache.solr.handler.dataimport.DataImporter.getDataSourceInstance(DataImporter.java:287) at org.apache.solr.handler.dataimport.ContextImpl.getDataSource(ContextImpl.java:86) at org.apache.solr.handler.dataimport.XPathEntityProcessor.init(XPathEntityProcessor.java:78) at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:243) at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:309) at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:179) at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:137) at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:337) at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:397) at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:378) Jan 13, 2009 3:38:11 PM org.apache.solr.handler.dataimport.DataImporter doFullImport SEVERE: Full Import failed org.apache.solr.handler.dataimport.DataImportHandlerException: No dataSource :null available for entity :x Processing Document # 1 at org.apache.solr.handler.dataimport.DataImporter.getDataSourceInstance(DataImporter.java:287) at org.apache.solr.handler.dataimport.ContextImpl.getDataSource(ContextImpl.java:86) at org.apache.solr.handler.dataimport.XPathEntityProcessor.init(XPathEntityProcessor.java:78) at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:243) at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:309) at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:179) at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:137) at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:337) at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:397) at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:378) Anybody able to point out what I have done wrong? Regards Fergus. -- === Fergus McMenemie Email:fer...@twig.me.uk Techmore Ltd Phone:(UK) 07721 376021 Unix/Mac/Intranets Analyst Programmer ===
Re: getting DIH to read my XML files
putFactory.java:543) at com.ctc.wstx.stax.WstxInputFactory.createSR(WstxInputFactory.java:604) at com.ctc.wstx.stax.WstxInputFactory.createSR(WstxInputFactory.java:660) at com.ctc.wstx.stax.WstxInputFactory.createXMLStreamReader(WstxInputFactory.java:331) at org.apache.solr.handler.dataimport.XPathRecordReader.streamRecords(XPathRecordReader.java:81) ... 10 more > >On Tue, Jan 13, 2009 at 9:28 PM, Fergus McMenemie wrote: > >> Hello, >> >> I am trying to use DIH with FileListEntityProcessor to to walk the >> disk and read XML documents. I have a dataConfig.xml as follows:- >> >> >> >> > processor="FileListEntityProcessor" >> fileName=".*xml" >> newerThan="'NOW-1000DAYS'" >> recursive="true" >> rootEntity="false" >> dataSource="null" >> baseDir="/Volumes/spare/ts/j/groups"> >> > processor="XPathEntityProcessor" >> url="${jcurrent.fileAbsolutePath}" >> stream="false" >> forEach="/record" >> transformer="DateFormatTransformer">0 >> >> > xpath="/record/metadata/subje...@qualifier='fullTitle']"/> >> >> > xpath="/record/metadata/subje...@qualifier='publication']"/> >> > xpath="/record/metadata/subje...@qualifier='pubAbbrev']"/> >> > xpath="/record/metadata/da...@qualifier='pubDate']"/> >> >> >> >> >> >> >> But when I try and start the walker I get:- >> >> INFO: [jdocs] REMOVING ALL DOCUMENTS FROM INDEX >> Jan 13, 2009 3:38:11 PM org.apache.solr.core.SolrDeletionPolicy onInit >> INFO: SolrDeletionPolicy.onInit: commits:num=2 >> >> commit{dir=/Volumes/spare/ts/solrnightlyj/data/index,segFN=segments_1,version=1231861070710,generation=1,filenames=[segments_1] >> >> commit{dir=/Volumes/spare/ts/solrnightlyj/data/index,segFN=segments_2,version=1231861070711,generation=2,filenames=[segments_2] >> Jan 13, 2009 3:38:11 PM org.apache.solr.core.SolrDeletionPolicy >> updateCommits >> INFO: last commit = 1231861070711 >> Jan 13, 2009 3:38:11 PM org.apache.solr.handler.dataimport.DocBuilder >> buildDocument >> SEVERE: Exception while processing: jcurrent document : null >> org.apache.solr.handler.dataimport.DataImportHandlerException: No >> dataSource :null available for entity :x Processing Document # 1 >> at >> org.apache.solr.handler.dataimport.DataImporter.getDataSourceInstance(DataImporter.java:287) >> at >> org.apache.solr.handler.dataimport.ContextImpl.getDataSource(ContextImpl.java:86) >> at >> org.apache.solr.handler.dataimport.XPathEntityProcessor.init(XPathEntityProcessor.java:78) >> at >> org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:243) >> at >> org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:309) >> at >> org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:179) >> at >> org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:137) >> at >> org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:337) >> at >> org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:397) >> at >> org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:378) >> Jan 13, 2009 3:38:11 PM org.apache.solr.handler.dataimport.DataImporter >> doFullImport >> SEVERE: Full Import failed >> org.apache.solr.handler.dataimport.DataImportHandlerException: No >> dataSource :null available for entity :x Processing Document # 1 >> at >> org.apache.solr.handler.dataimport.DataImporter.getDataSourceInstance(DataImporter.java:287) >> at >> org.apache.solr.handler.dataimport.ContextImpl.getDataSource(ContextImpl.java:86) >> at >> 
org.apache.solr.handler.dataimport.XPathEntityProcessor.init(XPathEntityProcessor.java:78) >> at >> org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:243) >> at >> org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:309)
DIH XPathEntityProcessor fails with docs containing a DOCTYPE declaration
Hello all, as the subject says: DIH XPathEntityProcessor fails with docs containing This is using a solr nightly build from monday. INFO: Server startup in 3623 ms Jan 16, 2009 9:54:12 AM org.apache.solr.handler.dataimport.SolrWriter readIndexerProperties INFO: Read dataimport.properties Jan 16, 2009 9:54:12 AM org.apache.solr.core.SolrCore execute INFO: [jdocs] webapp=/solr path=/walkj params={command=full-import} status=0 QTime=13 Jan 16, 2009 9:54:12 AM org.apache.solr.handler.dataimport.DataImporter doFullImport INFO: Starting Full Import Jan 16, 2009 9:54:12 AM org.apache.solr.update.DirectUpdateHandler2 deleteAll INFO: [jdocs] REMOVING ALL DOCUMENTS FROM INDEX Jan 16, 2009 9:54:12 AM org.apache.solr.core.SolrDeletionPolicy onInit INFO: SolrDeletionPolicy.onInit: commits:num=2 commit{dir=/Volumes/spare/ts/solrnightlyj/data/index,segFN=segments_c,version=1232026423291,generation=12,filenames=[segments_c, _4.fnm, _4.frq, _4.prx, _4.tis, _4.tii, _4.nrm, _4.fdx, _4.fdt] commit{dir=/Volumes/spare/ts/solrnightlyj/data/index,segFN=segments_d,version=1232026423292,generation=13,filenames=[segments_d] Jan 16, 2009 9:54:12 AM org.apache.solr.core.SolrDeletionPolicy updateCommits INFO: last commit = 1232026423292 Jan 16, 2009 9:54:13 AM org.apache.solr.handler.dataimport.DocBuilder buildDocument SEVERE: Exception while processing: jcurrent document : null org.apache.solr.handler.dataimport.DataImportHandlerException: Parsing failed for xml, url:/j/dtd/jxml/data/news/2008/frp70450.xmlrows processed :0 Processing Document # 1 at org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:72) at org.apache.solr.handler.dataimport.XPathEntityProcessor.initQuery(XPathEntityProcessor.java:252) at org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow(XPathEntityProcessor.java:177) at org.apache.solr.handler.dataimport.XPathEntityProcessor.nextRow(XPathEntityProcessor.java:160) at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:313) at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:339) at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:202) at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:147) at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:321) at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:381) at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:362) Caused by: java.lang.RuntimeException: com.ctc.wstx.exc.WstxParsingException: (was java.io.FileNotFoundException) /../config/jml-delivery-norm-2.1.dtd (No such file or directory) at [row,col {unknown-source}]: [3,81] at org.apache.solr.handler.dataimport.XPathRecordReader.streamRecords(XPathRecordReader.java:85) at org.apache.solr.handler.dataimport.XPathEntityProcessor.initQuery(XPathEntityProcessor.java:242) ... 
9 more Caused by: com.ctc.wstx.exc.WstxParsingException: (was java.io.FileNotFoundException) /../config/jml-delivery-norm-2.1.dtd (No such file or directory) at [row,col {unknown-source}]: [3,81] at com.ctc.wstx.sr.StreamScanner.constructWfcException(StreamScanner.java:630) at com.ctc.wstx.sr.StreamScanner.throwParseError(StreamScanner.java:461) at com.ctc.wstx.sr.ValidatingStreamReader.findDtdExtSubset(ValidatingStreamReader.java:475) at com.ctc.wstx.sr.ValidatingStreamReader.finishDTD(ValidatingStreamReader.java:358) at com.ctc.wstx.sr.BasicStreamReader.skipToken(BasicStreamReader.java:3351) at com.ctc.wstx.sr.BasicStreamReader.nextFromProlog(BasicStreamReader.java:1988) at com.ctc.wstx.sr.BasicStreamReader.next(BasicStreamReader.java:1069) at org.apache.solr.handler.dataimport.XPathRecordReader$Node.parse(XPathRecordReader.java:141) at org.apache.solr.handler.dataimport.XPathRecordReader$Node.access$000(XPathRecordReader.java:89) at org.apache.solr.handler.dataimport.XPathRecordReader.streamRecords(XPathRecordReader.java:82) ... 10 more Jan 16, 2009 9:54:13 AM org.apache.solr.handler.dataimport.DataImporter doFullImport SEVERE: Full Import failed A fragment from the top of the failing document is http://dtd.j.com/2002/Content/"; id="frp70450" urname="record"> http://www.w3.org/1999/xlink"; xlink:href="" urname="metadata" xlink:type="simple"> http://purl.org/dc/elements/1.1/"; qualifier="pdate">20080131 The DTD does exist at the specified location. Removing the DOCTYPE directive fixes everything. I know that use of DOCTYPE is out of fashion, and it does not exist in our newer documents, however there are lots of older XML docs about! Regards Fergus. -
Re: Is it just me or multicore default is broken? Can't ping
Julian, This is with the nightly from jan 12. I am using mutli core and playing about with DIH. I cant its Interactive development mode to work properly and suspect that to use it I need to run in single core mode. I am still developing, so I have nothing setup within tomcat startup files, it all depends on the directory I launch tomcat from which is /Volumes/spare/ts:- fergus: ls -al /Volumes/spare/ts total 2657816 drwxrwxrwx 19 rootfergus646 Jan 14 11:06 . drwxrwxr-x 18 rootadmin 680 Jan 13 10:46 .. -rw-rw-rw-@ 1 fergus fergus 6148 Jan 16 14:58 .DS_Store drwxr-xr-x 16 fergus fergus544 Apr 8 2008 apache-solr-bc drwxr-xr-x@ 15 fergus fergus510 Jan 14 11:06 apache-solr-nightly drwxr-xr-x 3 fergus fergus102 Jan 13 11:06 solr -rw-r--r--@ 1 fergus fergus 57874925 Jan 12 22:31 solr-2009-01-12.tgz drwxr-xr-x 8 fergus fergus272 Dec 16 17:53 solrbc drwxr-xr-x 7 fergus fergus238 Jan 16 12:08 solrnightlyjanes fergus: ls -al /Volumes/spare/ts/solr total 8 drwxr-xr-x 3 fergus fergus 102 Jan 13 11:06 . drwxrwxrwx 19 rootfergus 646 Jan 14 11:06 .. -rw-rw-rw-@ 1 fergus fergus 500 Jan 13 11:07 solr.xml fergus: more /Volumes/spare/ts/solr/solr.xml Here is a fragment from the top of one of my solrconfig.xml file. Note the use of solr.data.dir. fergus: more /Volumes/spare/ts/solrnightlyjanes/conf/solrconfig.xml file. ${solr.abortOnConfigurationError:true} ${solr.data.dir:./solr/data} fergus: get 'http://localhost:8080/solr/admin/cores' | perl -p -e 's[()][$1\n ]g;' 0 2 gazetteer solr/../solrbc/ solrbc/data/ 2009-01-16T12:08:56.033Z 3078174 6705364 6705364 1229202899164 false true false org.apache.lucene.store.NIOFSDirectory:org.apache.lucene.store.NIOFSDirectory@/Volumes/spare/ts/solrbc/data/index 2008-12-13T21:39:08Z janesdocs solr/../solrnightlyjanes/ solrnightlyjanes/data/ 2009-01-16T12:08:56.613Z 3077596 269 269 1232107736664 true true false org.apache.lucene.store.NIOFSDirectory:org.apache.lucene.store.NIOFSDirectory@/Volumes/spare/ts/solrnightlyjanes/data/index 2009-01-16T12:57:40Z fergus: get 'http://localhost:8080/solr/janesdocs/admin/ping' | perl -p -e 's[()][$1\n ]g;' 0 2 all all solrpingquery standard OK fergus: get 'http://localhost:8080/solr/gazetteer/admin/ping' | perl -p -e 's[()][$1\n ]g;' 0 2 all all solrpingquery standard OK Hope this helps. >I gave few new shots today: >- with jetty and nightly build 16 Jan - same problem null pointer exception >- Then I decided I am not using solr multicore but rather tomcat to >handle this. So I get latets tomcat and again with using 1.3.0 solr.war >I setup all >as explained >http://wiki.apache.org/solr/SolrTomcat#head-024d7e11209030f1dbcac9974e55106abae837ac >Links again are all smooth for admin and all but I still get 500 on pings :( > >Is everyone using solr with single index(core) ? > >Cheers > >All setup is smooth, working > >Julian Davchev wrote: >> Hi, >> >> I am trying with 1.3.0from >> http://apache.cbox.biz/lucene/solr/1.3.0/apache-solr-1.3.0.tgz >> >> which I supposed is stable release. >> >> Otis Gospodnetic wrote: >> >>> Not sure, I'd have to try it. But you didn't mention which version of Solr >>> you are using. Nightly build? >>> >>> >>> Otis >>> -- >>> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch >>> >>> >>> >>> - Original Message >>> >>> >>>> From: Julian Davchev >>>> To: solr-user@lucene.apache.org >>>> Sent: Thursday, January 15, 2009 9:53:37 AM >>>> Subject: Is it just me or multicore default is broken? Can't ping >>>> >>>> Hi, >>>> I am trying to setup multicore solr. 
So I just download default one with >>>> jetty...goto example/ >>>> and run >>>> java -Dsolr.solr.home=multicore -jar start.jar >>>> >>>> >>>> All looks smooth without errors on startup. >>>> Also can can open admin at >>>> >>>> http://localhost:8983/solr/core1/admin/ >>>> >>>&
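The solr.xml and solrconfig.xml fragments referred to above were stripped by the list archive. For a rough idea of the two-core layout being described, here is a sketch of solr.xml using the core names and instance directories reported in the core-admin output; the persistent flag and the exact relative paths are assumptions.

  <solr persistent="false">
    <cores adminPath="/admin/cores">
      <core name="gazetteer" instanceDir="../solrbc/"/>
      <core name="janesdocs" instanceDir="../solrnightlyjanes/"/>
    </cores>
  </solr>

Each core's solrconfig.xml then points its index at ${solr.data.dir:./solr/data}, as quoted in the message above.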
Re: getting DIH to read my XML files: solved
Shalin, thanks for the pointer. The following data-config.xml worked. The trick was realising that EVERY entity tag needs to have its own dataSource; I guess I had been assuming that it was implicit for certain processors. The whole thing is confusing in that there is both the dataSource element(s), which is to all intents and purposes required, and an optional dataSource attribute of the entity element. If the entity dataSource attribute is missing it defaults to one of the defined ones??? Unless you are using FileListEntityProcessor, where you have to explicitly state you are not using a dataSource. As a newbie I think my lesson learnt is to name every dataSource element I define and to reference a named dataSource from every entity element I add, except for FileListEntityProcessor where it has to be set to null. Regards Fergus. -- === Fergus McMenemie Email:fer...@twig.me.uk Techmore Ltd Phone:(UK) 07721 376021 Unix/Mac/Intranets Analyst Programmer ===
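The worked data-config.xml did not survive the list archive, so here is a sketch of the shape just described: a named FileDataSource, dataSource="null" on the FileListEntityProcessor entity, and the named dataSource referenced from the inner XPathEntityProcessor entity. The entity names and baseDir are the ones from the original posting; the xpaths and field columns are placeholders.

  <dataConfig>
    <dataSource name="myfilereader" type="FileDataSource"/>
    <document>
      <!-- FileListEntityProcessor walks the disk itself, so it takes no dataSource -->
      <entity name="jcurrent"
              processor="FileListEntityProcessor"
              fileName=".*xml"
              recursive="true"
              rootEntity="false"
              dataSource="null"
              baseDir="/Volumes/spare/ts/j/groups">
        <!-- each file found above is read through the named FileDataSource -->
        <entity name="x"
                dataSource="myfilereader"
                processor="XPathEntityProcessor"
                url="${jcurrent.fileAbsolutePath}"
                forEach="/record">
          <field column="title"   xpath="/record/metadata/subject[@qualifier='fullTitle']"/>
          <field column="pubdate" xpath="/record/metadata/date[@qualifier='pubDate']"/>
        </entity>
      </entity>
    </document>
  </dataConfig>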
Can't get HTMLStripTransformer's stripHTML to work in DIH.
Hello all, I have the following DIH data-config.xml file. Adding HTMLStripTransformer and the associated stripHTML on the para tag seems to have broken things. I am using a nightly build from 12-jan-2009. The /record/sect1/para contains HTML sub-tags which need to be discarded. Is my use of stripHTML correct? -- === Fergus McMenemie Email:fer...@twig.me.uk Techmore Ltd Phone:(UK) 07721 376021 Unix/Mac/Intranets Analyst Programmer ===
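The data-config.xml itself was stripped by the list archive; based on the copy quoted in the reply below, the relevant entity looks roughly like this, with HTMLStripTransformer added to the transformer list and stripHTML="true" on the para column. The entity name, the trimmed field list and the dateTimeFormat value are assumptions.

  <entity name="x"
          dataSource="myfilereader"
          processor="XPathEntityProcessor"
          url="${jcurrent.fileAbsolutePath}"
          stream="false"
          forEach="/record"
          transformer="DateFormatTransformer,TemplateTransformer,RegexTransformer,HTMLStripTransformer">
    <!-- stripHTML is the per-field switch for HTMLStripTransformer -->
    <field column="para"    xpath="/record/sect1/para" stripHTML="true"/>
    <field column="title"   xpath="/record/metadata/subject[@qualifier='fullTitle']"/>
    <field column="pubdate" xpath="/record/metadata/date[@qualifier='pubDate']" dateTimeFormat="yyyyMMdd"/>
  </entity>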
Re: Can't get HTMLStripTransformer's stripHTML to work in DIH.
extRow(XPathEntityProcessor.java:197) at org.apache.solr.handler.dataimport.XPathEntityProcessor.nextRow(XPathEntityProcessor.java:160) at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:313) at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:339) at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:202) at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:147) at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:321) at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:381) at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:362) Caused by: java.lang.NullPointerException at java.io.StringReader.(StringReader.java:33) at org.apache.solr.handler.dataimport.HTMLStripTransformer.stripHTML(HTMLStripTransformer.java:71) at org.apache.solr.handler.dataimport.HTMLStripTransformer.transformRow(HTMLStripTransformer.java:54) at org.apache.solr.handler.dataimport.EntityProcessorBase.applyTransformer(EntityProcessorBase.java:187) ... 9 more Jan 19, 2009 11:14:06 AM org.apache.solr.update.DirectUpdateHandler2 rollback INFO: start rollback Jan 19, 2009 11:14:06 AM org.apache.solr.update.DirectUpdateHandler2 rollback INFO: end_rollback >On Mon, Jan 19, 2009 at 4:14 PM, Fergus McMenemie wrote: > >> Hello all, >> >> I have the following DIH data-config.xml file. Adding >> HTMLStripTransformer and the associated stripHTML on the >> para tag seems to have broke things. I am using a nightly >> build from 12-jan-2009 >> >> The /record/sect1/para contains HTML sub tags which need >> to be discarded. Is my use of stripHTML correct? >> >> >> >> >> >processor="FileListEntityProcessor" >>fileName=".*xml" >>newerThan="'NOW-1000DAYS'" >>recursive="true" >>rootEntity="false" >>dataSource="null" >>baseDir="/Volumes/spare/ts/jxml/data/news/groups"> >> >>> dataSource="myfilereader" >> processor="XPathEntityProcessor" >> url="${jcurrent.fileAbsolutePath}" >> stream="false" >> forEach="/record" >> >> transformer="DateFormatTransformer,TemplateTransformer,RegexTransformer,HTMLStripTransformer"> >> >> > template="${jcurrent.fileAbsolutePath}" /> >> > replaceWith="$1" sourceColName="fileAbsePath"/> >> >> > stripHTML="true" /> >> > xpath="/record/metadata/subje...@qualifier='fullTitle']" /> >> > xpath="/record/metadata/subje...@qualifier='publication']" /> >> > xpath="/record/metadata/da...@qualifier='pubDate']" >> dateTimeFormat="MMdd" /> >> >> >> >> >> >> -- >> >> === >> Fergus McMenemie >> Email:fer...@twig.me.uk >> Techmore Ltd Phone:(UK) 07721 376021 >> >> Unix/Mac/Intranets Analyst Programmer >> === >> > > > >-- >Regards, >Shalin Shekhar Mangar. -- === Fergus McMenemie Email:fer...@twig.me.uk Techmore Ltd Phone:(UK) 07721 376021 Unix/Mac/Intranets Analyst Programmer ===
Re: Can't get HTMLStripTransformer's stripHTML to work in DIH.
AM org.apache.solr.handler.dataimport.DataImporter >doFullImport >SEVERE: Full Import failed >org.apache.solr.handler.dataimport.DataImportHandlerException: >java.lang.NullPointerException > at > org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:64) > at > org.apache.solr.handler.dataimport.EntityProcessorBase.applyTransformer(EntityProcessorBase.java:203) > at > org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow(XPathEntityProcessor.java:197) > at > org.apache.solr.handler.dataimport.XPathEntityProcessor.nextRow(XPathEntityProcessor.java:160) > at > org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:313) > at > org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:339) > at > org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:202) > at > org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:147) > at > org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:321) > at > org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:381) > at > org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:362) >Caused by: java.lang.NullPointerException > at java.io.StringReader.(StringReader.java:33) > at > org.apache.solr.handler.dataimport.HTMLStripTransformer.stripHTML(HTMLStripTransformer.java:71) > at > org.apache.solr.handler.dataimport.HTMLStripTransformer.transformRow(HTMLStripTransformer.java:54) > at > org.apache.solr.handler.dataimport.EntityProcessorBase.applyTransformer(EntityProcessorBase.java:187) > ... 9 more >Jan 19, 2009 11:14:06 AM org.apache.solr.update.DirectUpdateHandler2 rollback >INFO: start rollback >Jan 19, 2009 11:14:06 AM org.apache.solr.update.DirectUpdateHandler2 rollback >INFO: end_rollback > > >>On Mon, Jan 19, 2009 at 4:14 PM, Fergus McMenemie wrote: >> >>> Hello all, >>> >>> I have the following DIH data-config.xml file. Adding >>> HTMLStripTransformer and the associated stripHTML on the >>> para tag seems to have broke things. I am using a nightly >>> build from 12-jan-2009 >>> >>> The /record/sect1/para contains HTML sub tags which need >>> to be discarded. Is my use of stripHTML correct? >>> >>> >>> >>> >>> >>processor="FileListEntityProcessor" >>>fileName=".*xml" >>>newerThan="'NOW-1000DAYS'" >>>recursive="true" >>>rootEntity="false" >>>dataSource="null" >>>baseDir="/Volumes/spare/ts/jxml/data/news/groups"> >>> >>>>> dataSource="myfilereader" >>> processor="XPathEntityProcessor" >>> url="${jcurrent.fileAbsolutePath}" >>> stream="false" >>> forEach="/record" >>> >>> transformer="DateFormatTransformer,TemplateTransformer,RegexTransformer,HTMLStripTransformer"> >>> >>> >> template="${jcurrent.fileAbsolutePath}" /> >>> >> replaceWith="$1" sourceColName="fileAbsePath"/> >>> >>> >> stripHTML="true" /> >>> >> xpath="/record/metadata/subje...@qualifier='fullTitle']" /> >>> >> xpath="/record/metadata/subje...@qualifier='publication']" /> >>> >> xpath="/record/metadata/da...@qualifier='pubDate']" >>> dateTimeFormat="MMdd" /> >>> >>> >>> >>> >>> >>> -- >>> >>> === >>> Fergus McMenemie >>> Email:fer...@twig.me.uk >>> Techmore Ltd Phone:(UK) 07721 376021 >>> >>> Unix/Mac/Intranets Analyst Programmer >>> === >>> >> >> >> >>-- >>Regards, >>Shalin Shekhar Mangar. 
> >-- > >=== >Fergus McMenemie Email:fer...@twig.me.uk >Techmore Ltd Phone:(UK) 07721 376021 > >Unix/Mac/Intranets Analyst Programmer >=== -- === Fergus McMenemie Email:fer...@twig.me.uk Techmore Ltd Phone:(UK) 07721 376021 Unix/Mac/Intranets Analyst Programmer ===
Re: Can't get HTMLStripTransformer's stripHTML to work in DIH.
(DataImporter.java:321) at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:381) at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:362) Caused by: java.lang.RuntimeException: java.util.NoSuchElementException at org.apache.solr.handler.dataimport.XPathRecordReader.streamRecords(XPathRecordReader.java:85) at org.apache.solr.handler.dataimport.XPathEntityProcessor.initQuery(XPathEntityProcessor.java:242) ... 9 more Caused by: java.util.NoSuchElementException at com.ctc.wstx.sr.BasicStreamReader.next(BasicStreamReader.java:1083) at org.apache.solr.handler.dataimport.XPathRecordReader$Node.parse(XPathRecordReader.java:141) at org.apache.solr.handler.dataimport.XPathRecordReader$Node.parse(XPathRecordReader.java:174) at org.apache.solr.handler.dataimport.XPathRecordReader$Node.access$000(XPathRecordReader.java:89) at org.apache.solr.handler.dataimport.XPathRecordReader.streamRecords(XPathRecordReader.java:82) ... 10 more Jan 21, 2009 12:07:40 PM org.apache.solr.update.DirectUpdateHandler2 rollback INFO: start rollback >Ah, it needs a null check for multi valued fields. I've committed a fix to >trunk. The next nightly build should have it. You can checkout and build >from the trunk if need this immediately. > >On Mon, Jan 19, 2009 at 7:02 PM, Fergus McMenemie wrote: > >> Hmmm, >> >> Just to clarify I retested the thing using the nightly as of today >> 18-jan-2009. The problem is still there and this traceback is from >> that nightly. >> >> >>This looks fine. Can you post the stack trace? >> >> >> >Yep, here is the juicy bit. Let me know if you need more. >> > >> >Jan 19, 2009 11:08:03 AM org.apache.catalina.startup.Catalina start >> >INFO: Server startup in 2390 ms >> >Jan 19, 2009 11:14:06 AM org.apache.solr.core.SolrCore execute >> >INFO: [janesdocs] webapp=/solr path=/dataimport >> params={command=full-import} status=0 QTime=12 >> >Jan 19, 2009 11:14:06 AM org.apache.solr.handler.dataimport.SolrWriter >> readIndexerProperties >> >INFO: Read dataimport.properties >> >Jan 19, 2009 11:14:06 AM org.apache.solr.handler.dataimport.DataImporter >> doFullImport >> >INFO: Starting Full Import >> >Jan 19, 2009 11:14:06 AM org.apache.solr.update.DirectUpdateHandler2 >> deleteAll >> >INFO: [janesdocs] REMOVING ALL DOCUMENTS FROM INDEX >> >Jan 19, 2009 11:14:06 AM org.apache.solr.core.SolrDeletionPolicy onInit >> >INFO: SolrDeletionPolicy.onInit: commits:num=2 >> > >> commit{dir=/Volumes/spare/ts/solrnightlyjanes/data/index,segFN=segments_1,version=1232363283058,generation=1,filenames=[segments_1] >> > >> commit{dir=/Volumes/spare/ts/solrnightlyjanes/data/index,segFN=segments_2,version=1232363283059,generation=2,filenames=[segments_2] >> >Jan 19, 2009 11:14:06 AM org.apache.solr.core.SolrDeletionPolicy >> updateCommits >> >INFO: last commit = 1232363283059 >> >Jan 19, 2009 11:14:06 AM >> org.apache.solr.handler.dataimport.EntityProcessorBase applyTransformer >> >WARNING: transformer threw error >> >java.lang.NullPointerException >> > at java.io.StringReader.(StringReader.java:33) >> > at >> org.apache.solr.handler.dataimport.HTMLStripTransformer.stripHTML(HTMLStripTransformer.java:71) >> > at >> org.apache.solr.handler.dataimport.HTMLStripTransformer.transformRow(HTMLStripTransformer.java:54) >> > at >> org.apache.solr.handler.dataimport.EntityProcessorBase.applyTransformer(EntityProcessorBase.java:187) >> > at >> org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow(XPathEntityProcessor.java:197) >> > at >> 
org.apache.solr.handler.dataimport.XPathEntityProcessor.nextRow(XPathEntityProcessor.java:160) >> > at >> org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:313) >> > at >> org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:339) >> > at >> org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:202) >> > at >> org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:147) >> > at >> org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:321) >> > at >> org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:381) >> > at >> org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:362) >> >Jan 19, 2009 11:14:06 AM org.apache.solr.handler.dataimport.DocBuilder &
Re: DIH XPathEntityProcessor fails with docs containing a DOCTYPE declaration
at > org.apache.solr.handler.dataimport.XPathRecordReader.streamRecords(XPathRecordReader.java:82) > ... 10 more >Jan 16, 2009 9:54:13 AM org.apache.solr.handler.dataimport.DataImporter >doFullImport >SEVERE: Full Import failed > >A fragment from the top of the failing document is > > >href="../../../../config/support/j-deliver.xsl"?> > >http://dtd.j.com/2002/Content/"; id="frp70450" >urname="record"> > http://www.w3.org/1999/xlink"; xlink:href="" > urname="metadata" xlink:type="simple"> > http://purl.org/dc/elements/1.1/"; > qualifier="pdate">20080131 > >The DTD does exist at the specified location. Removing the DOCTYPE directive >fixes everything. I know that use of DOCTYPE is out of fashion, and it does >not exist in our newer documents, however there are lots of older XML docs >about! -- === Fergus McMenemie Email:fer...@twig.me.uk Techmore Ltd Phone:(UK) 07721 376021 Unix/Mac/Intranets Analyst Programmer ===
Re: Can't get HTMLStripTransformer's stripHTML to work in DIH.
>Hi Fergus, > >It seems a field it is expecting is missing from the XML. You mean there is some field in the document we are indexing that is missing? > >sourceColName="*fileAbsePath*"/> > >I guess "fileAbsePath" is a typo? Can you check if that is the cause? Well spotted. I had made a mess of sanitizing the config file I sent to you. I will in future make sure the stuff I am messing with matches what I send to the list. However there is no typo in the underlying file; at least not on that line:-) > > >On Wed, Jan 21, 2009 at 5:40 PM, Fergus McMenemie wrote: > >> Shalin >> >> Downloaded nightly for 21jan and tried DIH again. Its better but >> still broken. Dozens of embeded tags are stripped from documents >> but it now fails every few documents for no reason I can see. Manually >> removing embeded tags causes a given problem document to be indexed, >> only to have a it fail on one of the next few documents. I think the >> problem is still in stripHTML >> >> Here is the traceback. >> >> Jan 21, 2009 12:06:53 PM org.apache.catalina.startup.Catalina start >> INFO: Server startup in 3377 ms >> Jan 21, 2009 12:07:39 PM org.apache.solr.handler.dataimport.SolrWriter >> readIndexerProperties >> INFO: Read dataimport.properties >> Jan 21, 2009 12:07:39 PM org.apache.solr.core.SolrCore execute >> INFO: [fdocs] webapp=/solr path=/dataimport params={command=full-import} >> status=0 QTime=13 >> Jan 21, 2009 12:07:39 PM org.apache.solr.handler.dataimport.DataImporter >> doFullImport >> INFO: Starting Full Import >> Jan 21, 2009 12:07:39 PM org.apache.solr.update.DirectUpdateHandler2 >> deleteAll >> INFO: [fdocs] REMOVING ALL DOCUMENTS FROM INDEX >> Jan 21, 2009 12:07:39 PM org.apache.solr.core.SolrDeletionPolicy onInit >> INFO: SolrDeletionPolicy.onInit: commits:num=2 >> >> >> commit{dir=/Volumes/spare/ts/solrnightlyf/data/index,segFN=segments_1,version=1232539612130,generation=1,filenames=[segments_1] >> >> >> commit{dir=/Volumes/spare/ts/solrnightlyf/data/index,segFN=segments_2,version=1232539612131,generation=2,filenames=[segments_2] >> Jan 21, 2009 12:07:39 PM org.apache.solr.core.SolrDeletionPolicy >> updateCommits >> INFO: last commit = 1232539612131 >> Jan 21, 2009 12:07:40 PM org.apache.solr.handler.dataimport.DocBuilder >> buildDocument >> SEVERE: Exception while processing: jc document : null >> org.apache.solr.handler.dataimport.DataImportHandlerException: Parsing >> failed for xml, url:/Volumes/spare/ts/ftic/groups/j0036.xmlrows processed :0 >> Processing Document # 9 >>at >> org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:72) >>at >> org.apache.solr.handler.dataimport.XPathEntityProcessor.initQuery(XPathEntityProcessor.java:252) >>at >> org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow(XPathEntityProcessor.java:177) >> at >> org.apache.solr.handler.dataimport.XPathEntityProcessor.nextRow(XPathEntityProcessor.java:160) >>at >> org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:313) >>at >> org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:339) >>at >> org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:202) >>at >> org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:147) >>at >> org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:321) >>at >> org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:381) >>at >> 
org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:362) >> Caused by: java.lang.RuntimeException: java.util.NoSuchElementException >>at >> org.apache.solr.handler.dataimport.XPathRecordReader.streamRecords(XPathRecordReader.java:85) >>at >> org.apache.solr.handler.dataimport.XPathEntityProcessor.initQuery(XPathEntityProcessor.java:242) >>... 9 more >> Caused by: java.util.NoSuchElementException >>at >> com.ctc.wstx.sr.BasicStreamReader.next(BasicStreamReader.java:1083) >>at >> org.apache.solr.handler.dataimport.XPathRecordReader$Node.parse(XPathRecordReader.java:141) >>at >> org.apache.solr.handler.dataimport.XPathRecordReader$Node.parse(XPathRecordReader.java:174) >>at >> org.apache.solr.handler.dataimport.XPathRecordReader$Node.access$000(XPathRecordReader.java:89) >>at >> or
Re: DIH XPathEntityProcessor fails with docs containing a DOCTYPE declaration
Seems to work fine on this morning's 23-jan-2009 nightly. Thanks very much. >On Wed, Jan 21, 2009 at 6:05 PM, Fergus McMenemie wrote: > >> >> After looking at http://issues.apache.org/jira/browse/SOLR-964, >> where >> it seems this issue has been addressed, I had another go at indexing >> documents >> containing DOCTYPE. It failed as follows. >> >> >That patch has not been committed to the trunk yet. I'll take it up. > >-- >Regards, >Shalin Shekhar Mangar. -- === Fergus McMenemie Email:fer...@twig.me.uk Techmore Ltd Phone:(UK) 07721 376021 Unix/Mac/Intranets Analyst Programmer ===
Re: How to make Relationships work for Multi-valued Index Fields?
Hello, I am also a newbie and was wanting to do almost the exact same thing. I was planning on doing the equivalent of:- ***change** ID is no longer unique within Solr, There would be multiple "documents" with a given ID; one for each address. You can then search on ID and get the three addresses, you can also search on an address more sensibly. I have not been able to try this yet as other issues are still to be dealt with. Comments? >Hi >I may be completely off on this being new to SOLR but I am not sure >how to index related groups of fields in a document and preserver >their 'grouping'. I would appreciate any help on this.Detailed >description of the problem below. > >I am trying to index an entity that can have multiple occurrences in >the same document - e.g. Address. The address could be Shipping, >Home, Office etc. Each address element has multiple values in it >like street, state etc.Thus each address element is a group with >the state and street in one address element being related to each other. > >It looks like this in my source xml > > > > > > > > >I have setup my DIH to treat these as entities as below > > > > >baseDir="***" > fileName=".*xml" > rootEntity="false" > dataSource="null" > > name="record" > processor="XPathEntityProcessor" > stream="false" > forEach="/record" >url="${f.fileAbsolutePath}"> > > > >name="record_adr" >processor="XPathEntityProcessor" >stream="false" >forEach="/record/address" >url="${f.fileAbsolutePath}"> > > xpath="/record/address//@state" /> > > > > > > > > >The problem is as follows. DIH seems to treat these as entities but >solr seems to flatten them out on indexing to fields in a document >(losing the entity part). > >So when I search for the an ID - in the response all the street fields >are bunched to-gather, followed by all the state fields type etc. >Thus I can't associate which street address corresponds to which >address type in the response. > >What seems harder is this - say I need to query on 'Street' = XYZ1 and >type="Office". This should NOT return a document since the street for >the office address is "XY2" and not "XYZ1". However when I query for >address_state:"XYZ1" and address_type:"Office" I get back this document. > >The problem seems to be that while DIH allows 'entities' within a >document the SOLR schema does not preserve them - it 'flattens' all >of them out as indices for the document. > >I could work around the problem by creating SOLR fields like >"home_address_street" and "office_address_street" and do some xpath >mapping. However I don't want to do it as we can have multiple >'other' addresses. Also I have other fields whose type is not easily >distinguished like address. > >As I mentioned being new to SOLR I might have completely goofed on a >way to set it up - much appreciate any direction on it. I am using >SOLR 1.3 > >Regards, >Guna -- === Fergus McMenemie Email:fer...@twig.me.uk Techmore Ltd Phone:(UK) 07721 376021 Unix/Mac/Intranets Analyst Programmer ===
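A sketch of the kind of data-config.xml the "equivalent of" above is pointing at, assuming the record ID and the address values are attributes as in the original posting: forEach is pointed at /record/address so each address element becomes its own Solr document carrying the (now non-unique) record ID. The element names, baseDir and the commonField usage are assumptions.

  <dataConfig>
    <dataSource name="xmlreader" type="FileDataSource"/>
    <document>
      <entity name="f"
              processor="FileListEntityProcessor"
              fileName=".*xml"
              rootEntity="false"
              dataSource="null"
              baseDir="/path/to/records">
        <entity name="record_adr"
                dataSource="xmlreader"
                processor="XPathEntityProcessor"
                url="${f.fileAbsolutePath}"
                forEach="/record/address">
          <!-- carried onto every address document -->
          <field column="id"             xpath="/record/@id" commonField="true"/>
          <field column="address_type"   xpath="/record/address/@type"/>
          <field column="address_street" xpath="/record/address/@street"/>
          <field column="address_state"  xpath="/record/address/@state"/>
        </entity>
      </entity>
    </document>
  </dataConfig>

If this is done, the schema's uniqueKey can no longer be the record ID on its own; it would need to be dropped or built from a combination of fields (for example with TemplateTransformer), otherwise the address documents would overwrite one another.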
DIH FileListEntityProcessor recursion and fileName clash
Hello I have been trying to find out why DIH in FileListEntityProcessor mode did not appear to be recursing into subdirectories. Going through FileListEntityProcessor.java I eventually tumbled to the fact that my filename filter setting from data-config.xml also applied to directory names. Now, I feel that the fieldName filter should be applied to files fed into the parser, it should not be applied to the directory names we are recursing through. I bodged the code as follows to adjust the behavior so that the "FileName" and "excludes" attributes of "entity" only apply to filenames and not directory names. It now recurses though my directory tree only indexing the appropriate files! I think the new behavior is more standard. Is this a change valid? Regards Fergus. --- /Volumes/spare/ts/apache-solr-nightlyjan23/contrib/dataimporthandler/src/main/java/org/apache/solr/handler/dataimport/FileListEntityProcessor.java 2009-02-01 18:19:38.0 + +++ /Volumes/spare/ts/apache-solr-nightlyjan29/contrib/dataimporthandler/src/main/java/org/apache/solr/handler/dataimport/FileListEntityProcessor.java 2008-10-02 20:38:30.0 +0100 @@ -85,10 +85,11 @@ if (r != null) recursive = Boolean.parseBoolean(r); excludes = context.getEntityAttribute(EXCLUDES); -if (excludes != null) { +if (excludes != null) excludes = resolver.replaceTokens(excludes); +if (excludes != null) excludesPattern = Pattern.compile(excludes); -} + } private Date getDate(String dateStr) { @@ -139,41 +140,42 @@ return getFromRowCache(); while (true) { Map r = getNext(); - if (r != null) r = applyTransformer(r); - return r; + if (r == null) +return null; + r = applyTransformer(r); + if (r != null) +return r; } } private void getFolderFiles(File dir, final List> fileDetails) { -// Fetch an array of file objects that pass the filter, however the -// returned array is never populated; accept() always returns false. -// Rather we make use of the fileDetails array which is populated as -// a side affect of the accept method. dir.list(new FilenameFilter() { public boolean accept(File dir, String name) { -File fileObj = new File(dir, name); -LOG.info("Testing acceptance of dir:"+dir +" name:"+name); - if (fileObj.isDirectory()) { - LOG.info(" Recursing into directory "+fileObj); - if (recursive) getFolderFiles(fileObj, fileDetails); - } -else if (fileNamePattern == null) { +if (fileNamePattern == null) { addDetails(fileDetails, dir, name); - } -else if (fileNamePattern.matcher(name).find()) { - if (excludesPattern != null && excludesPattern.matcher(name).find()) return false; + return false; +} +if (fileNamePattern.matcher(name).find()) { + if (excludesPattern != null && excludesPattern.matcher(name).find()) +return false; addDetails(fileDetails, dir, name); - } -return false; } - }); -} + +return false; + } +}); + } private void addDetails(List> files, File dir, String name) { Map details = new HashMap(); File aFile = new File(dir, name); -if (aFile.isDirectory()) return; +if (aFile.isDirectory()) { + if (!recursive) +return; + getFolderFiles(aFile, files); + return; +} long sz = aFile.length(); Date lastModified = new Date(aFile.lastModified()); if (biggerThan != -1 && sz <= biggerThan) -- === Fergus McMenemie Email:fer...@twig.me.uk Techmore Ltd Phone:(UK) 07721 376021 Unix/Mac/Intranets Analyst Programmer ===
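For reference, this is roughly the entity definition the patch is aimed at (a fuller copy is quoted later in the thread); with the change, fileName and excludes are matched against file names only, so recursive="true" can descend into sub-directories whose names do not happen to match .*\.xml. The entity name is an assumption.

  <entity name="jc"
          processor="FileListEntityProcessor"
          fileName=".*\.xml"
          newerThan="'NOW-1000DAYS'"
          recursive="true"
          rootEntity="false"
          dataSource="null"
          baseDir="/Volumes/spare/ts/stuff/ford">
    <!-- nested XPathEntityProcessor entity for each matching file goes here -->
  </entity>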
DIH using values from solrconfig.xml inside data-config.xml
Hello As per several postings I noted that I can define variables inside an invariants list section of the DIH handler of solrconfig.xml:- data-config.xml /Volumes/spare/ts I can also reference these variables within data-config.xml. This works, the solr field "test" is nicely populated. However how do I use this variable within my regex transformer? Here is my data-config.xml:- indexing my content I get an error as follows:- INFO: SolrDeletionPolicy.onInit: commits:num=2 commit{dir=/Volumes/spare/ts/solrnightlyjanes/data/index,segFN=segments_7,version=1233583868834,generation=7,filenames=[_7.frq, _4.fdt, _7.tii, _7.fnm, _4.fdx, _7.tis, segments_7, _7.nrm, _7.prx] commit{dir=/Volumes/spare/ts/solrnightlyjanes/data/index,segFN=segments_8,version=1233583868835,generation=8,filenames=[segments_8] Feb 2, 2009 5:00:50 PM org.apache.solr.core.SolrDeletionPolicy updateCommits INFO: last commit = 1233583868835 Feb 2, 2009 5:00:57 PM org.apache.solr.handler.dataimport.EntityProcessorBase applyTransformer WARNING: transformer threw error java.util.regex.PatternSyntaxException: Illegal repetition near index 0 ${dataimporter.request.finstalldir}(.*) ^ at java.util.regex.Pattern.error(Pattern.java:1650) at java.util.regex.Pattern.closure(Pattern.java:2706) at java.util.regex.Pattern.sequence(Pattern.java:1798) at java.util.regex.Pattern.expr(Pattern.java:1687) at java.util.regex.Pattern.compile(Pattern.java:1397) at java.util.regex.Pattern.(Pattern.java:1124) at java.util.regex.Pattern.compile(Pattern.java:817) at org.apache.solr.handler.dataimport.RegexTransformer.getPattern(RegexTransformer.java:129) at org.apache.solr.handler.dataimport.RegexTransformer.process(RegexTransformer.java:88) at org.apache.solr.handler.dataimport.RegexTransformer.transformRow(RegexTransformer.java:74) at org.apache.solr.handler.dataimport.RegexTransformer.transformRow(RegexTransformer.java:42) at org.apache.solr.handler.dataimport.EntityProcessorBase.applyTransformer(EntityProcessorBase.java:187) at org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow(XPathEntityProcessor.java:197) at org.apache.solr.handler.dataimport.XPathEntityProcessor.nextRow(XPathEntityProcessor.java:160) at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:333) at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:359) at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:222) at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:155) at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:324) at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:384) at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:365) Is there some simple escape or other syntax to be used or is this an enhancement? Regards Fergus. -- ======= Fergus McMenemie Email:fer...@twig.me.uk Techmore Ltd Phone:(UK) 07721 376021 Unix/Mac/Intranets Analyst Programmer ===
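The XML fragments in the message above were stripped by the list archive, so here is a sketch of the two pieces being described: the invariants entry in the DIH handler definition in solrconfig.xml, and the data-config.xml fields that consume it. The handler name, column names and replaceWith value are assumptions; the variable name finstalldir, its value and the regex come from the stack trace above.

  <!-- solrconfig.xml -->
  <requestHandler name="/dataimport"
                  class="org.apache.solr.handler.dataimport.DataImportHandler">
    <lst name="defaults">
      <str name="config">data-config.xml</str>
    </lst>
    <lst name="invariants">
      <str name="finstalldir">/Volumes/spare/ts</str>
    </lst>
  </requestHandler>

  <!-- data-config.xml, inside the XPathEntityProcessor entity -->
  <!-- this works: the variable is resolved before TemplateTransformer runs -->
  <field column="test" template="${dataimporter.request.finstalldir}"/>
  <!-- this throws PatternSyntaxException: the regex is compiled before the
       ${dataimporter.request.finstalldir} variable is resolved -->
  <field column="fpath" regex="${dataimporter.request.finstalldir}(.*)"
         replaceWith="$1" sourceColName="fileAbsolutePath"/>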
Re: DIH FileListEntityProcessor recursion and fileName clash
Shalin, OK! I got myself a JIRA account and opened solr-1000 and followed the wiki instructions on creating a patch which I have now uploaded! Only problem is that while the fix seems fine the test case I added to TestFileListEntityProcessor.java fails. I need somebody who knows what they are doing to point out what I am doing wrong and/or how to debug test failures. It would also be nice if I knew how to run or debug one Junit test rather than all of them, which takes almost 8min. @Test public void testRECURSION() throws IOException { long time = System.currentTimeMillis(); File childdir = new File("." + time + "/child" ); childdir.mkdirs(); childdir.deleteOnExit(); createFile(childdir, "a.xml", "a.xml".getBytes(), true); createFile(childdir, "b.xml", "b.xml".getBytes(), true); createFile(childdir, "c.props", "c.props".getBytes(), true); Map attrs = AbstractDataImportHandlerTest.createMap( FileListEntityProcessor.FILE_NAME, "^.*\\.xml$", FileListEntityProcessor.BASE_DIR, childdir.getAbsolutePath(), FileListEntityProcessor.RECURSIVE, true); Context c = AbstractDataImportHandlerTest.getContext(null, new VariableResolverImpl(), null, 0, Collections.EMPTY_LIST, attrs); FileListEntityProcessor fileListEntityProcessor = new FileListEntityProcessor(); fileListEntityProcessor.init(c); List fList = new ArrayList(); while (true) { // add the documents to the index Map f = fileListEntityProcessor.nextRow(); if (f == null) break; fList.add((String) f.get(FileListEntityProcessor.ABSOLUTE_FILE)); } System.out.println("List of files indexed -- " + fList); Assert.assertEquals(3, fList.size()); } Regards Fergus. >On Mon, Feb 2, 2009 at 2:36 AM, Fergus McMenemie wrote: > >> Hello >> >> I have been trying to find out why DIH in FileListEntityProcessor >> mode did not appear to be recursing into subdirectories. Going through >> FileListEntityProcessor.java I eventually tumbled to the fact that my >> filename filter setting from data-config.xml also applied to directory >> names. > > >Hmm, not good. > > >> >> >>> processor="FileListEntityProcessor" >> fileName=".*\.xml" >> newerThan="'NOW-1000DAYS'" >> recursive="true" >> rootEntity="false" >> dataSource="null" >> baseDir="/Volumes/spare/ts/stuff/ford"> >> >> Now, I feel that the fieldName filter should be applied to files fed >> into the parser, it should not be applied to the directory names we are >> recursing through. I bodged the code as follows to adjust the behavior >> so that the "FileName" and "excludes" attributes of "entity" only >> apply to filenames and not directory names. > > >I agree with you. > >Perhaps we can have separate filters for directories and files but let's >hold on till the need comes up. > >> >> >> It now recurses though my directory tree only indexing the appropriate >> files! I think the new behavior is more standard. >> >> Is this a change valid? > > >Absolutely. Can you please create an issue and attach the patch? Thanks! > >-- >Regards, >Shalin Shekhar Mangar. -- === Fergus McMenemie Email:fer...@twig.me.uk Techmore Ltd Phone:(UK) 07721 376021 Unix/Mac/Intranets Analyst Programmer ===
Re: DIH using values from solrconfig.xml inside data-config.xml
The solr data field is populated properly. So I guess that bit works. I really wish I could use xpath="//para" >A separate problem: when I used the DIH in December, the xpath >implementation had few features. '[...@qualifier='Date']' may not be >supported. > > dateTimeFormat="MMdd" /> > > >On Mon, Feb 2, 2009 at 9:24 AM, Noble Paul ?? Â Ë³Ë < >noble.p...@gmail.com> wrote: > >> this patch must help >> >> On Mon, Feb 2, 2009 at 10:49 PM, Shalin Shekhar Mangar >> wrote: >> > On Mon, Feb 2, 2009 at 10:34 PM, Fergus McMenemie >> wrote: >> > >> >> >> >> Is there some simple escape or other syntax to be used or is >> >> this an enhancement? >> >> >> > >> > I guess the problem is that we are creating the regex Pattern without >> first >> > resolving the variable. So we need to call VariableResolver.resolve on >> the >> > 'regex' attribute's value before creating the Pattern object. >> > >> > Please raise an issue for this change. Nice use-case though. I guess we >> > never thought someone would need to use a variable in the regex attribute >> :) >> > >> > -- >> > Regards, >> > Shalin Shekhar Mangar. >> > >> >> >> >> -- >> --Noble Paul >> > > > >-- >Lance Norskog >goks...@gmail.com >650-922-8831 (US) -- === Fergus McMenemie Email:fer...@twig.me.uk Techmore Ltd Phone:(UK) 07721 376021 Unix/Mac/Intranets Analyst Programmer ===
Re: DIH using values from solrconfig.xml inside data-config.xml
>: > The solr data field is populated properly. So I guess that bit works. >: > I really wish I could use xpath="//para" > >: The limitation comes from streaming the XML instead of creating a DOM. >: XPathRecordReader is a custom streaming XPath parser implementation and >: streaming is easy only because we limit the syntax. You can use >: PlainTextEntityProcessor which gives the XML as a string to a custom >: Transformer. This Transformer can create a DOM, run your XPath query and >: populate the fields. It's more expensive but it is an option. > >Maybe it's just me, but it seems like i'm noticing that as DIH gets used >more, many people are noting that the XPath processing in DIH doesn't work >the way they expect because it's a custom XPath parser/engine designed for >streaming. > >It seems like it would be helpful to have an alternate processor for >people who don't need the streaming support (ie: are dealing with small >enough docs that they can load the full DOM tree into memory) that would >use the default Java XPath engine (and have less caveats/suprises) ... i >wou think it would probably even make sense for this new XPath processor >to be the one we suggest for new users, and only suggest the existing >(stream based) processor if they have really big xml docs to deal with. > >(In hindsight XPathEntityProcessor and XPathRecordReader should probably >have been named StreamingXPathEntityProcessor and >StreamingXPathRecordReader) > Four thoughts! 1) My use case involves a few million XML documents ranging in size from a few K to 500K. 95% of the documents are under 25KBytes, 5 of the documents are around 0.5Mbytes. So.. sod it, I think I need a streaming parser. 2) "streaming XPath parser"? I only half understand all this stuff, but, and this is based on the little bit of SAX stuff I have written, I would have thought that //para was trivial for any kind of streaming XML parser. 3) Much of the confusion may be arising because the DIH wiki page is not to clear on what is and is not allowed. We need better, more explicit examples. What seems to be allowed is:- I will add these to the wiki. Just to be sure, I tested xpath="//para". It does not work! 4) XML documents are ether well structured with good separation of data and presentation in which case absolute xpaths work fine. Or older, in my case text documents, which have been forced into XML format with poor structure where the data and presentation is all mixed up. I suspect that the addition of //para would cover many of the use cases, and what was left could be covered by a preceding XSLT transform. -- === Fergus McMenemie Email:fer...@twig.me.uk Techmore Ltd Phone:(UK) 07721 376021 Unix/Mac/Intranets Analyst Programmer ===
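The examples that were meant to follow "What seems to be allowed is:-" were eaten by the list archive. As a sketch of the point being made, the streaming XPathRecordReader of that era accepts absolute element paths, attribute selections and attribute-value predicates, but not descendant paths; the element names below are only illustrative.

  <!-- supported by the streaming parser -->
  <field column="title"   xpath="/record/metadata/title"/>
  <field column="vurl"    xpath="/record/mediaBlock/mediaObject/@vurl"/>
  <field column="pubdate" xpath="/record/metadata/date[@qualifier='pubDate']"/>

  <!-- not supported: the descendant axis -->
  <!-- <field column="para" xpath="//para"/> -->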
Re: DIH, assigning multiple xpaths to the same solr field: solved
Thanks Shalin, Using the following appears to work properly! Regards Fergus >On Wed, Feb 4, 2009 at 1:35 AM, Fergus McMenemie wrote: > >> > dataSource="myfilereader" >> processor="XPathEntityProcessor" >> url="${jc.fileAbsolutePath}" >> stream="false" >> forEach="/record"> >> >> >> >> >> >> Below is the line from my schema.xml >> >> > multiValued="true"/> >> >> Now a given document will only have one style of layout, and of course >> the /a/b/c /d/e/f/g stuff is made up. For a document that has a single >> Hello world element I see search results as follows, the >> one string seems to have been entered into the index four times. >> I only saw duplicate results before adding the extra made-up stuff. >> >> >I think there is something fishy with the XPathEntityProcessor. For now, I >think you can work around by giving each field a different 'column' and >attribute 'name=para' on each of them. > >-- >Regards, >Shalin Shekhar Mangar. -- === Fergus McMenemie Email:fer...@twig.me.uk Techmore Ltd Phone:(UK) 07721 376021 Unix/Mac/Intranets Analyst Programmer ===
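The working configuration was stripped from the message above; reconstructed from the suggestion quoted below it, the idea is to give each xpath its own column and map them all onto the same Solr field through the name attribute. The xpaths here are the made-up ones from the original question, and "para" is declared multiValued="true" in schema.xml as in the quoted schema line.

  <entity name="jc"
          dataSource="myfilereader"
          processor="XPathEntityProcessor"
          url="${jc.fileAbsolutePath}"
          stream="false"
          forEach="/record">
    <field column="para1" name="para" xpath="/record/sect1/para"/>
    <field column="para2" name="para" xpath="/a/b/c"/>
    <field column="para3" name="para" xpath="/d/e/f/g"/>
  </entity>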
DIH fails to import after svn update
Hello, I had a nice working version of SOLR building from trunk, I think it was from about 2-4th Feb, On the 7th I performed a "svn update" and it now fails as follows when performing get 'http://localhost:8080/apache-solr-1.4-dev/dataimport?command=full-import' I have performed a "svn update" on the 11th (today) again. It still fails. Feb 11, 2009 4:27:34 AM org.apache.solr.update.DirectUpdateHandler2 deleteAll INFO: [] REMOVING ALL DOCUMENTS FROM INDEX Feb 11, 2009 4:27:34 AM org.apache.solr.core.SolrDeletionPolicy onInit INFO: SolrDeletionPolicy.onInit: commits:num=2 commit{dir=/Volumes/spare/ts/solrnightlyjanes/data/index,segFN=segments_1,version=1234326438927,generation=1,filenames=[segments_1] commit{dir=/Volumes/spare/ts/solrnightlyjanes/data/index,segFN=segments_2,version=1234326438928,generation=2,filenames=[segments_2] Feb 11, 2009 4:27:34 AM org.apache.solr.core.SolrDeletionPolicy updateCommits INFO: last commit = 1234326438928 Feb 11, 2009 4:27:34 AM org.apache.solr.handler.dataimport.DataImporter doFullImport SEVERE: Full Import failed java.lang.NoSuchFieldError: docCount at org.apache.solr.handler.dataimport.SolrWriter.getDocCount(SolrWriter.java:231) at org.apache.solr.handler.dataimport.DataImportHandlerException.(DataImportHandlerException.java:42) at org.apache.solr.handler.dataimport.FileListEntityProcessor.init(FileListEntityProcessor.java:81) at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:293) at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:222) at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:155) at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:324) at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:384) at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:365) Feb 11, 2009 4:27:34 AM org.apache.solr.update.DirectUpdateHandler2 rollback INFO: start rollback Feb 11, 2009 4:27:34 AM org.apache.solr.update.DirectUpdateHandler2 rollback INFO: end_rollback Feb 11, 2009 4:27:34 AM org.apache.solr.update.DirectUpdateHandler2 commit INFO: start commit(optimize=false,waitFlush=false,waitSearcher=true) Feb 11, 2009 4:27:34 AM org.apache.solr.search.SolrIndexSearcher Regards to all. -- ======= Fergus McMenemie Email:fer...@twig.me.uk Techmore Ltd Phone:(UK) 07721 376021 Unix/Mac/Intranets Analyst Programmer ===
"ant dist" of a nightly download fails
Hi, I have been looking at the nightly downloads, trying to work backwards through the nightly's till my code starts working again! I have downloaded all the available nightly's and they all fail to "ant dist" as follows:- >root: ant dist >Buildfile: build.xml > >init-forrest-entities: > >compile-solrj: > >make-manifest: > >dist-solrj: > [jar] Building jar: > /Volumes/spare/ts/apache-solr-nightly/dist/apache-solr-solrj-1.4-dev.jar > >compile: > >dist-jar: > [jar] Building jar: > /Volumes/spare/ts/apache-solr-nightly/dist/apache-solr-core-1.4-dev.jar > >dist-contrib: > >init: > >init-forrest-entities: > >compile-solrj: > >compile: > >make-manifest: > >compile: > >build: > [jar] Building jar: > /Volumes/spare/ts/apache-solr-nightly/contrib/dataimporthandler/target/apache-solr-dataimporthandler-1.4-dev.jar > >dist: > [copy] Copying 2 files to /Volumes/spare/ts/apache-solr-nightly/build/web >[mkdir] Created dir: > /Volumes/spare/ts/apache-solr-nightly/build/web/WEB-INF/lib > [copy] Copying 1 file to > /Volumes/spare/ts/apache-solr-nightly/build/web/WEB-INF/lib > [copy] Copying 1 file to /Volumes/spare/ts/apache-solr-nightly/dist > >init: > >init-forrest-entities: > >compile-solrj: > >compile: > >make-manifest: > >compile: > >build: > [jar] Building jar: > /Volumes/spare/ts/apache-solr-nightly/contrib/extraction/build/apache-solr-cell-1.4-dev.jar > >dist: > [copy] Copying 1 file to /Volumes/spare/ts/apache-solr-nightly/dist > >clean: > [delete] Deleting directory > /Volumes/spare/ts/apache-solr-nightly/contrib/javascript/dist > >create-dist-folder: >[mkdir] Created dir: > /Volumes/spare/ts/apache-solr-nightly/contrib/javascript/dist > >concat: > >docs: >[mkdir] Created dir: > /Volumes/spare/ts/apache-solr-nightly/contrib/javascript/dist/doc > [java] Exception in thread "main" java.lang.NoClassDefFoundError: > org/mozilla/javascript/tools/shell/Main > [java]at JsRun.main(Unknown Source) > >BUILD FAILED >/Volumes/spare/ts/apache-solr-nightly/common-build.xml:338: The following >error occurred while executing this line: >/Volumes/spare/ts/apache-solr-nightly/common-build.xml:215: The following >error occurred while executing this line: >/Volumes/spare/ts/apache-solr-nightly/contrib/javascript/build.xml:74: Java >returned: 1 > >Total time: 3 seconds >root: Performing "ant test" is fine. Removing the javascript contrib directory allows the "ant dist" to complete and I have a usable war file. However I suspect this may not represent best practise; however "ant test" is still fine. What does removal of the this contrib function loose me? I was wondering if it went with the DIH ScriptTransformer? Regards Fergus. -- === Fergus McMenemie Email:fer...@twig.me.uk Techmore Ltd Phone:(UK) 07721 376021 Unix/Mac/Intranets Analyst Programmer ===
Re: DIH fails to import after svn update
Thanks, That fixed it. >On Wed, Feb 11, 2009 at 4:19 PM, Fergus McMenemie wrote: > > >> java.lang.NoSuchFieldError: docCount >>at >> org.apache.solr.handler.dataimport.SolrWriter.getDocCount(SolrWriter.java:231) >>at >> org.apache.solr.handler.dataimport.DataImportHandlerException.(DataImportHandlerException.java:42) >>at >> org.apache.solr.handler.dataimport.FileListEntityProcessor.init(FileListEntityProcessor.java:81) >>at >> org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:293) >>at >> org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:222) >>at >> org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:155) >>at >> org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:324) >>at >> org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:384) >>at >> org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:365) >> > >Seems like this was not a clean compile. The AtomicInteger field docCount >was changed to a AtomicLong. > >Can you please do a "ant clean dist"? > >-- >Regards, >Shalin Shekhar Mangar. -- === Fergus McMenemie Email:fer...@twig.me.uk Techmore Ltd Phone:(UK) 07721 376021 Unix/Mac/Intranets Analyst Programmer ===
Is this DIH entity forEach expression OK?
Hello, I am having bother with forEach. I have XML source documents containing many embedded images within mediaBlock elements. Each image has an associated caption. I want to implement a separate image search function which searches the captions and brings back the associated image. Is it OK to have an xpath expression within forEach which is a child of another of the forEach xpath expressions? Or is there a better way of doing this? Regards -- === Fergus McMenemie Email:fer...@twig.me.uk Techmore Ltd Phone:(UK) 07721 376021 Unix/Mac/Intranets Analyst Programmer ===
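The sort of entity being described looks roughly like this — a sketch, assuming an entity named "x" over a "myfilereader" datasource, with vurl and imgCaption as illustrative column names (the xpaths are the ones quoted in the reply):

    <entity name="x"
            dataSource="myfilereader"
            processor="XPathEntityProcessor"
            url="${jc.fileAbsolutePath}"
            stream="false"
            forEach="/record | /record/mediaBlock">
      <!-- one row per /record, plus one per /record/mediaBlock -->
      <field column="vurl"       xpath="/record/mediaBlock/mediaObject/@vurl" />
      <field column="imgCaption" xpath="/record/mediaBlock/caption" />
    </entity>

The second forEach expression (/record/mediaBlock) is a child of the first (/record), which is exactly the situation the question is about.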
Re: Is this DIH entity forEach expression OK? ... yes
>Hello, > >I am having bother with forEach. I have XML source documents containing >many embedded images within mediaBlock elements. Each image has a an >associated caption. I want to implement a separate image search function >which searches the captions and brings back the associated image. > > dataSource="myfilereader" >processor="XPathEntityProcessor" >url="${jc.fileAbsolutePath}" >stream="false" >forEach="/record | /record/mediaBlock" >> > > xpath="/record/mediaBlock/mediaObject/@vurl" /> > xpath="/record/mediaBlock/caption" /> > >Is is OK to have an xpath expression within forEach which is a child >of another of the forEach xpath expressions? > Yes. It works fine, duplicate "uniqueKey"s were making it appear otherwise. But -- === Fergus McMenemie Email:fer...@twig.me.uk Techmore Ltd Phone:(UK) 07721 376021 Unix/Mac/Intranets Analyst Programmer ===
Problem using DIH templatetransformer to create uniqueKey
Hello, TemplateTransformer behaves rather ungracefully if one of the replacement fields is missing. I am parsing a single XML document into multiple separate Solr documents. It turns out that none of the source document's fields can be used to create a uniqueKey on its own; I need to combine two of them using TemplateTransformer, as sketched below. The trouble is that vurl is only defined as a child of "/record/mediaBlock", so my attempt to create id, the uniqueKey, fails for the parent document "/record". I am hacking around with "TemplateTransformer.java" to sort this, but was wondering if there was a good reason for this behavior. Regards. -- === Fergus McMenemie Email:fer...@twig.me.uk Techmore Ltd Phone:(UK) 07721 376021 Unix/Mac/Intranets Analyst Programmer ===
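The combination in question is along these lines — a sketch assembled from the field definitions quoted later in the thread, assuming jc is the outer file-listing entity and x the XPath entity:

    <!-- vurl only exists for the /record/mediaBlock rows -->
    <field column="vurl" xpath="/record/mediaBlock/mediaObject/@vurl" />
    <field column="id"   template="${jc.fileAbsolutePath}${x.vurl}" />

For the parent /record row ${x.vurl} is undefined, so TemplateTransformer logs a warning and drops the row rather than producing an id, which is the ungraceful behavior being complained about.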
Re: Problem using DIH templatetransformer to create uniqueKey
>Hello, > >templatetransformer behaves rather ungracefully if one of the replacement >fields is missing. Looking at TemplateString.java I see that left to itself fillTokens would replace a missing variable with "". It is an extra check in TemplateTransformer that is throwing the warning and stopping the row being returned. Commenting out the check seems to solve my problem. Having done this, an undefined replacement string in TemplateTransformer is replaced with "". However a neater fix would probably involve making use of the default value which can be assigned to a row? in schema.xml. >I am parsing a single XML document into multiple separate solr documents. >It turns out that none of the source documents fields can be used to create >a uniqueKey alone. I need to combine two, using template transformer as >follows: > > dataSource="myfilereader" > processor="XPathEntityProcessor" > url="${jc.fileAbsolutePath}" > rootEntity="true" > stream="false" > forEach="/record | /record/mediaBlock" > transformer="DateFormatTransformer,TemplateTransformer,RegexTransformer" > > > > > regex="${dataimporter.request.installdir}(.*)" replaceWith="/ford$1" > sourceColName="fileAbsolutePath"/> > template="${jc.fileAbsolutePath}${x.vurl}" /> > xpath="/record/mediaBlock/mediaObject/@vurl" /> > >The trouble is that vurl is only defined as a child of "/record/mediaBlock" >so my attempt to create id, the uniqueKey fails for the parent document >"/record" > >I am hacking around with "TemplateTransformer.java" to sort this but was >wondering if there was a good reason for this behavior. > -- === Fergus McMenemie Email:fer...@twig.me.uk Techmore Ltd Phone:(UK) 07721 376021 Unix/Mac/Intranets Analyst Programmer ===
Re: Problem using DIH templatetransformer to create uniqueKey
Paul, Following up your usenet sussgetion: and to add more to what I was thinking... if the field is undefined in the input document, but the schema.xml does allow a default value, then TemplateTransformer can use the default value. If there is no default value defined in schema.xml then it can fail as at present. This would allow "" or any other value to be fed into TemplateTransformer, and still enable avoidance of the partial strings you referred to. Regards Fergus. >>Hello, >> >>templatetransformer behaves rather ungracefully if one of the replacement >>fields is missing. > >Looking at TemplateString.java I see that left to itself fillTokens would >replace a missing variable with "". It is an extra check in TemplateTransformer >that is throwing the warning and stopping the row being returned. Commenting >out the check seems to solve my problem. > >Having done this, an undefined replacement string in TemplateTransformer >is replaced with "". However a neater fix would probably involve making >use of the default value which can be assigned to a row? in schema.xml. > >>I am parsing a single XML document into multiple separate solr documents. >>It turns out that none of the source documents fields can be used to create >>a uniqueKey alone. I need to combine two, using template transformer as >>follows: >> >>> dataSource="myfilereader" >> processor="XPathEntityProcessor" >> url="${jc.fileAbsolutePath}" >> rootEntity="true" >> stream="false" >> forEach="/record | /record/mediaBlock" >> transformer="DateFormatTransformer,TemplateTransformer,RegexTransformer" >> > >> >> >> > regex="${dataimporter.request.installdir}(.*)" replaceWith="/ford$1" >> sourceColName="fileAbsolutePath"/> >> > template="${jc.fileAbsolutePath}${x.vurl}" /> >> > xpath="/record/mediaBlock/mediaObject/@vurl" /> >> >>The trouble is that vurl is only defined as a child of "/record/mediaBlock" >>so my attempt to create id, the uniqueKey fails for the parent document >>"/record" >> >>I am hacking around with "TemplateTransformer.java" to sort this but was >>wondering if there was a good reason for this behavior. >> -- === Fergus McMenemie Email:fer...@twig.me.uk Techmore Ltd Phone:(UK) 07721 376021 Unix/Mac/Intranets Analyst Programmer ===
Re: Problem using DIH templatetransformer to create uniqueKey
Hmmm. Just gave that a go! No luck But how many layers of defaults do we need? Rgds Fergus >What about having the template transformer support ${field:default} >syntax? I'm assuming it doesn't support that currently right? The >replace stuff in the config files does though. > > Erik > > >On Feb 13, 2009, at 8:17 AM, Fergus McMenemie wrote: > >> Paul, >> >> Following up your usenet sussgetion: >> >> > ignoreMissingVariables="true"/> >> >> and to add more to what I was thinking... >> >> if the field is undefined in the input document, but the schema.xml >> does allow a default value, then TemplateTransformer can use the >> default value. If there is no default value defined in schema.xml >> then it can fail as at present. This would allow "" or any other >> value to be fed into TemplateTransformer, and still enable avoidance >> of the partial strings you referred to. >> >> Regards Fergus. >> >>>> Hello, >>>> >>>> templatetransformer behaves rather ungracefully if one of the >>>> replacement >>>> fields is missing. >>> >>> Looking at TemplateString.java I see that left to itself fillTokens >>> would >>> replace a missing variable with "". It is an extra check in >>> TemplateTransformer >>> that is throwing the warning and stopping the row being returned. >>> Commenting >>> out the check seems to solve my problem. >>> >>> Having done this, an undefined replacement string in >>> TemplateTransformer >>> is replaced with "". However a neater fix would probably involve >>> making >>> use of the default value which can be assigned to a row? in >>> schema.xml. >>> >>>> I am parsing a single XML document into multiple separate solr >>>> documents. >>>> It turns out that none of the source documents fields can be used >>>> to create >>>> a uniqueKey alone. I need to combine two, using template >>>> transformer as >>>> follows: >>>> >>>> >>> dataSource="myfilereader" >>>> processor="XPathEntityProcessor" >>>> url="${jc.fileAbsolutePath}" >>>> rootEntity="true" >>>> stream="false" >>>> forEach="/record | /record/mediaBlock" >>>> transformer >>>> ="DateFormatTransformer,TemplateTransformer,RegexTransformer" >>>>> >>>> >>>> >>>> >>> sourceColName="fileAbsolutePath"/> >>>> >>>> >>>> >>>> The trouble is that vurl is only defined as a child of "/record/ >>>> mediaBlock" >>>> so my attempt to create id, the uniqueKey fails for the parent >>>> document "/record" >>>> >>>> I am hacking around with "TemplateTransformer.java" to sort this >>>> but was >>>> wondering if there was a good reason for this behavior. >>>> >> >> -- >> >> === >> Fergus McMenemie Email:fer...@twig.me.uk >> Techmore Ltd Phone:(UK) 07721 376021 >> >> Unix/Mac/Intranets Analyst Programmer >> === -- === Fergus McMenemie Email:fer...@twig.me.uk Techmore Ltd Phone:(UK) 07721 376021 Unix/Mac/Intranets Analyst Programmer ===
DIH transformers
Hello. I have been beating my head around the data-config.xml listed at the end of this message. It breaks in a few different ways.

1) I have bodged TemplateTransformer to allow it to return when one of the variables is undefined. This ensures my uniqueKey is always defined. But thinking more on Noble's comments there is use in having it work both ways, i.e. leaving the column undefined or replacing the variable with "". I still like my idea about using the default value of a solr field from schema.xml, but I can't figure out how/where to best implement it.

2) Having used TemplateTransformer to assign a value to an entity column, that column cannot be used in other TemplateTransformer operations. In my project I am attempting to reuse "x.fileWebPath". To fix this, the last line of transformRow() in TemplateTransformer.java needs to be replaced with the following, which as well as 'putting' the templated string in 'row' also saves it into the 'resolver'.

**originally**
    row.put(column, resolver.replaceTokens(expr));
  }

**new**
    String columnName = map.get(DataImporter.COLUMN);
    expr = resolver.replaceTokens(expr);
    row.put(columnName, expr);
    resolverMapCopy.put(columnName, expr);
  }

As an aside I think I ran into the issues covered by SOLR-993. It took a while to figure out that I could not add a single column name/value to the resolver; I had instead to add to the map that was already stored within the resolver.

3) No entity column names can be used within RegexTransformer. I guess all the stuff that was added to TemplateTransformer to allow column names to be used in templates needs to be re-added to RegexTransformer. I am doing that now... but am confused by the fragment of code which copies from resolverMap into resolverMapCopy. As best I can see resolverMap is always empty; but I am barely able to follow the code! Can somebody explain when/why resolverMap would be populated. Also, I begin to understand comments made by Noble in SOLR-1001 about resolving "entity attributes in ContextImpl.getEntityAttribute" and I guess Shalin was right as well. However it also seems wrong that at the top of every transformer we are going to repeat the same code to load the resolver with information about the entity.

4) In that I am reusing template output within other templates, the order of execution becomes important. Can I assume that the explicitly listed columns in an entity are processed by the various transformers in the order they appear within data-config.xml? I *think* that the list of columns within an entity as returned by getAllEntityFields() is actually an ArrayList, which I think is order dependent. Is this correct?

5) Should I raise this as a single JIRA issue?

6) Having played with this stuff, I was going to add a bit more to the wiki highlighting some of the possibilities and issues with transformers. But want to check with the list first!

Regards Fergus. -- === Fergus McMenemie Email:fer...@twig.me.uk Techmore Ltd Phone:(UK) 07721 376021 Unix/Mac/Intranets Analyst Programmer ===
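The kind of reuse points 2) and 3) are after looks roughly like this — a sketch based on the config quoted in the follow-up message; fileWebPath is the column named in the text, while imgWebPath and the "(.*)$" regexes are assumptions:

    <field column="fileWebPath" regex="${dataimporter.request.installdir}(.*)"
           replaceWith="/ford$1" sourceColName="fileAbsolutePath"/>
    <field column="imgWebPath"  regex="(.*)$"
           replaceWith="$1/imagery/${x.vurl}.jpg" sourceColName="fileWebPath"/>
    <field column="id"          template="${jc.fileAbsolutePath}#${x.vurl}" />

In other words a column produced by one transformer (fileWebPath) is consumed, via sourceColName and ${x....} references, by later regex and template fields — which only works if the transformers make freshly computed values visible to the VariableResolver, hence the patch above.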
Re: DIH transformers - sect 2
>On Mon, Feb 16, 2009 at 3:22 PM, Fergus McMenemie wrote: >> >> 2) Having used TemplateTransformer to assign a value to an >> entity column that column cannot be used in other >> TemplateTransformer operations. In my project I am >> attempting to reuse "x.fileWebPath". To fix this, the >> last line of transformRow() in TemplateTransformer.java >> needs replaced with the following which as well as >> 'putting' the templated-ed string in 'row' also saves it >> into the 'resolver'. >> >> **originally** >> row.put(column, resolver.replaceTokens(expr)); >> } >> >> **new** >> String columnName = map.get(DataImporter.COLUMN); >> expr=resolver.replaceTokens(expr); >> row.put(columnName, expr); >> resolverMapCopy.put(columnName, expr); >> } > >isn't it better to write a custom transformer to achieve this. I did >not want a standard component to change the state of the >VariableResolver . > >I am not sure what is the best way. > Noble, (Good to have email working :-) Hmm not sure why this requires a custom transformer. Why is this not more in the nature of a bug fix? Also the current behavior temporarily adds all the column names into the resolver for the duration of the TemplateTransformer's operation, removing them again at the end. I do not think there is any permanent change to the state of the VariableResolver. Surely if we have defined a value for a column, that value should be temporarily available in subsequent template or regexp operations? Fergus. >> >> >> >> >> >>> processor="FileListEntityProcessor" >> fileName="^.*\.xml$" >> newerThan="'NOW-1000DAYS'" >> recursive="true" >> rootEntity="false" >> dataSource="null" >> baseDir="/Volumes/spare/ts/solr/content" >> > >>> dataSource="myfilereader" >> processor="XPathEntityProcessor" >> url="${jc.fileAbsolutePath}" >> rootEntity="true" >> stream="false" >> forEach="/record | /record/mediaBlock" >> >> transformer="DateFormatTransformer,TemplateTransformer,RegexTransformer"> >> >> >> > replaceWith="/ford$1" sourceColName="fileAbsolutePath"/> >> >> >> >> > xpath="/record/metadata/da...@qualifier='pubDate']" >> dateTimeFormat="MMdd" /> >> >> > xpath="/record/mediaBlock/mediaObject/@vurl" /> >> > template="${dataimporter.request.fordinstalldir}" /> >> >> >> > template="${dataimporter.request.contentinstalldir}" /> >> >> > replaceWith="$1/imagery/${x.vurl}s.jpg" sourceColName="fileWebPath"/> >> > replaceWith="$1/imagery/${x.vurl}.jpg" sourceColName="fileWebPath"/> >> > template="${jc.fileAbsolutePath}#${x.vurl}" /> >> >> >> >> -- === Fergus McMenemie Email:fer...@twig.me.uk Techmore Ltd Phone:(UK) 07721 376021 Unix/Mac/Intranets Analyst Programmer ===
Re: DIH transformers - sect 2 - SOLR-1033
I have created SOLR-1033 in JIRA to address this issue. At 13:32 + 21/2/09, Fergus McMenemie wrote: >>On Mon, Feb 16, 2009 at 3:22 PM, Fergus McMenemie wrote: >>> >>> 2) Having used TemplateTransformer to assign a value to an >>> entity column that column cannot be used in other >>> TemplateTransformer operations. In my project I am >>> attempting to reuse "x.fileWebPath". To fix this, the >>> last line of transformRow() in TemplateTransformer.java >>> needs replaced with the following which as well as >>> 'putting' the templated-ed string in 'row' also saves it >>> into the 'resolver'. >>> >>> **originally** >>> row.put(column, resolver.replaceTokens(expr)); >>> } >>> >>> **new** >>> String columnName = map.get(DataImporter.COLUMN); >>> expr=resolver.replaceTokens(expr); >>> row.put(columnName, expr); >>> resolverMapCopy.put(columnName, expr); >>> } >> >>isn't it better to write a custom transformer to achieve this. I did >>not want a standard component to change the state of the >>VariableResolver . >> >>I am not sure what is the best way. >> > >Noble, (Good to have email working :-) > >Hmm not sure why this requires a custom transformer. Why is this not >more in the nature of a bug fix? Also the current behavior temporarily >adds all the column names into the resolver for the duration of the >TemplateTransformer's operation, removing them again at the end. I >do not think there is any permanent change to the state of the >VariableResolver. > >Surely if we have defined a value for a column, that value should be >temporarily available in subsequent template or regexp operations? > >Fergus. > >>> >>> >>> >>> >>> >>>>> processor="FileListEntityProcessor" >>> fileName="^.*\.xml$" >>> newerThan="'NOW-1000DAYS'" >>> recursive="true" >>> rootEntity="false" >>> dataSource="null" >>> baseDir="/Volumes/spare/ts/solr/content" >>> > >>>>> dataSource="myfilereader" >>> processor="XPathEntityProcessor" >>> url="${jc.fileAbsolutePath}" >>> rootEntity="true" >>> stream="false" >>> forEach="/record | /record/mediaBlock" >>> >>> transformer="DateFormatTransformer,TemplateTransformer,RegexTransformer"> >>> >>> >>> >> replaceWith="/ford$1" sourceColName="fileAbsolutePath"/> >>> >>> >>> >>> >> xpath="/record/metadata/da...@qualifier='pubDate']" >>> dateTimeFormat="MMdd" /> >>> >>> >> xpath="/record/mediaBlock/mediaObject/@vurl" /> >>> >> template="${dataimporter.request.fordinstalldir}" /> >>> >> /> >>> >>> >> template="${dataimporter.request.contentinstalldir}" /> >>> >>> >> replaceWith="$1/imagery/${x.vurl}s.jpg" sourceColName="fileWebPath"/> >>> >> replaceWith="$1/imagery/${x.vurl}.jpg" sourceColName="fileWebPath"/> >>> >> template="${jc.fileAbsolutePath}#${x.vurl}" /> >>> >>> >>> >>> -- === Fergus McMenemie Email:fer...@twig.me.uk Techmore Ltd Phone:(UK) 07721 376021 Unix/Mac/Intranets Analyst Programmer ===
passing parameters into the XSLTResponseWriter: particularly hostname
Hello all, I was wondering if there was a way of passing parameters into the XSLTResponseWriter. I always like the option of formatting my search results as an RSS feed: users can then configure their phone, browser, etc. to redo a search every so often and have new items in the result set highlighted to them. However, many RSS clients require links to the underlying content to be absolute, so I need to pass the full hostname of the machine serving the results into the transform generating my RSS feed. How do I do this? Regards Fergus -- === Fergus McMenemie Email:fer...@twig.me.uk Techmore Ltd Phone:(UK) 07721 376021 Unix/Mac/Intranets Analyst Programmer ===
Re: passing parameters into the XSLTResponseWriter: particularly hostname
>: I was wondering if there was a way of passing parameters into >: the XSLTResponseWriter writer. > >I don't think there's anyway to pass input in the traditional >sense, but you can set default/invariant params along with echoParams=all >to get the values you want into the XML doc itself where your stylesheet >has access to it. > > >-Hoss Doh! of course. Thanks. -- ======= Fergus McMenemie Email:fer...@twig.me.uk Techmore Ltd Phone:(UK) 07721 376021 Unix/Mac/Intranets Analyst Programmer ===
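In solrconfig.xml terms, Hoss's suggestion looks something like this — the /rss handler name, the stylesheet name and the hostname parameter are invented for illustration; echoParams=all makes the defaults visible in the response header, where the stylesheet can read them:

    <requestHandler name="/rss" class="solr.SearchHandler">
      <lst name="defaults">
        <str name="echoParams">all</str>
        <str name="wt">xslt</str>
        <str name="tr">rss.xsl</str>
        <!-- hypothetical parameter, echoed into the response for the stylesheet -->
        <str name="hostname">http://search.example.com</str>
      </lst>
    </requestHandler>

The stylesheet then picks hostname out of the echoed parameters in the response, rather than having it passed in as a real XSLT parameter.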
a new DIH manifestEnityProcessor
Hello, I have almost finished a new DIH EntityProcessor which I am calling the manifestEnityProcessor. It is designed around the idea that whatever daemon is used to maintain your set of a few 100,000 xml documents, it is likely to drop a report or log file explaining what has been changed within your content store. This assumes a file-based content repository.

The manifestEnityProcessor is used as sketched below. The idea is that you have a log file or other report, perhaps from tar or zip, and you wish to use this to control the indexing of the new content. The new entity fields are as follows.

  manifestFileName   the name of the manifest file. If this value is relative, it is
                     assumed to be relative to baseDir. Required.

  manifestAddRegex   a required regex to identify lines which, when matched, should
                     cause docs to be added to the index.

  manifestDelRegex   an optional regex to identify documents which, when matched,
                     should be deleted from the index. **PLANNED**

  allowRegex         a required regex to identify the portion of the ADD/DELete line
                     identified above which contains the file or pathname to be ADDed
                     or DELeted. If the resulting value is relative, it is assumed to
                     be relative to baseDir.

What do I do next?
Raise a JIRA issue and add the code?
Is DIH the right place to add this?
Suggestions for a different name?
Suggestions on how to do the delete bit from within an entity?

Regards Fergus. -- === Fergus McMenemie Email:fer...@twig.me.uk Techmore Ltd Phone:(UK) 07721 376021 Unix/Mac/Intranets Analyst Programmer ===
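Putting the attribute values quoted later in the thread back together, the entity declaration looks roughly like this (the entity name is an assumption; the paths and regexes are the ones from the example):

    <entity name="manifest"
            processor="ManifestEntityProcessor"
            baseDir="/Volumes/Techmore/ts/aaa/schema/data"
            rootEntity="false"
            dataSource="null"
            allowRegex="^.*\.xml$"
            manifestFileName="/Volumes/ts/man-find.txt"
            manifestAddRegex="(.*)$">
      <!-- a nested XPathEntityProcessor entity would then index each file the manifest names -->
    </entity>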
Re: a new DIH manifestEnityProcessor
>manifest processing has a very limited usecase. Why can't it be >processed using a PlainTextEntityProcessor and write a Tranformer to >read lines using regex? > Ehmmm Ok. The PlainTextEntityProcessor docs do not give me enough insight to see how this could be used to index each of the files listed by a 'tar xvf' report. Can you explain further? About the limited usecase. Verity thought it was useful enough to have there own "bulk insert file" or bif file format that did the same and was far less flexible. In my experience we generally start off with some kind of file walker or crawler looking after file repositories. But these always proved slow and unreliable and over time they were always replaced it with some kind of manifest based control of the indexer. Where we could get a report of changes we always used it, and only relied on walkers or crawlers where we had to. Fergus > >--Noble > >On Mon, Mar 9, 2009 at 8:30 PM, Fergus McMenemie wrote: >> Hello, >> >> I have almost finished a new DIH EntityProcessor which >> I am calling the manifestEnityProcessor. It is designed >> around the idea that whatever demon is used to maintain >> your set of a few 100,000 xml documents it is likely to >> drop a report or log file explaining what has been changed >> within your content store. This assumes a file based >> content repository. >> >> The manifestEnityProcessor is used as follows >> >> > processor="ManifestEntityProcessor" >> baseDir="/Volumes/Techmore/ts/aaa/schema/data" >> rootEntity="false" >> dataSource="null" >> >> allowRegex="^.*\.xml$" >> manifestFileName="/Volumes/ts/man-find.txt" >> manifestAddRegex="(.*)$" >> > >> >> The idea is you have a log file or other report, perhaps >> from tar or zip, and you wish to use this to control the >> indexing of the new content. The new entity fields are as >> follows. >> >> manifestFileName is the name of the manifest file. If >> this value is relative, it assumed to >> be relative to baseDir. Required. >> >> manifestAddRegex is a required regex to identify lines >> which when matched should cause docs to >> be added to the index. >> >> manifestDelRegex is an optional value of a regex to >> identify documents which when matched should >> be deleted from the index **PLANNED** >> >> allowRegex a required regex to identify the portion >> of the ADD/DELete line identified above >> which contains the file or pathname to >> ADDed or DELeted. If the resulting value >> relative, it assumed to be relative to >> baseDir. >> >> What do I do next? >> Raise a JIRA issue and add the code? >> Is DIH the right place to add this? >> Suggestions for a different name? >> Suggestions on how to do the delete bitty from within an entity? >> >> Regards Fergus. >--Noble Paul -- === Fergus McMenemie Email:fer...@twig.me.uk Techmore Ltd Phone:(UK) 07721 376021 Unix/Mac/Intranets Analyst Programmer ===
Re: a new DIH manifestEnityProcessor
>Hi Fergus, >The idea is that we have something generic which can be applicable to >a large set of users. If the manifest is a text file it can be read in >somestandard way (say line by line). So we can have an EntityProcessor >which reads a text file line and filer it by a regex like the way >'grep' works. Yes. That is what I have written. It is just an alternate form of the FileListEntityProcessor except that rather than walking the file system it reads from a file, line by line, and identifies the portion of the line containing the filename using a regexp. > >On Mon, Mar 9, 2009 at 10:44 PM, Fergus McMenemie wrote: >>>manifest processing has a very limited usecase. Why can't it be >>>processed using a PlainTextEntityProcessor and write a Tranformer to >>>read lines using regex? >>> >> Ehmmm Ok. The PlainTextEntityProcessor docs do not give me enough >> insight to see how this could be used to index each of the files >> listed by a 'tar xvf' report. Can you explain further? >> >> About the limited usecase. Verity thought it was useful enough >> to have there own "bulk insert file" or bif file format that >> did the same and was far less flexible. >> >> In my experience we generally start off with some kind of >> file walker or crawler looking after file repositories. But >> these always proved slow and unreliable and over time they >> were always replaced it with some kind of manifest based >> control of the indexer. Where we could get a report of changes >> we always used it, and only relied on walkers or crawlers >> where we had to. >> >> Fergus >> >>> >>>--Noble >>> >>>On Mon, Mar 9, 2009 at 8:30 PM, Fergus McMenemie wrote: >>>> Hello, >>>> >>>> I have almost finished a new DIH EntityProcessor which >>>> I am calling the manifestEnityProcessor. It is designed >>>> around the idea that whatever demon is used to maintain >>>> your set of a few 100,000 xml documents it is likely to >>>> drop a report or log file explaining what has been changed >>>> within your content store. This assumes a file based >>>> content repository. >>>> >>>> The manifestEnityProcessor is used as follows >>>> >>>> >>> processor="ManifestEntityProcessor" >>>> baseDir="/Volumes/Techmore/ts/aaa/schema/data" >>>> rootEntity="false" >>>> dataSource="null" >>>> >>>> allowRegex="^.*\.xml$" >>>> manifestFileName="/Volumes/ts/man-find.txt" >>>> manifestAddRegex="(.*)$" >>>> > >>>> >>>> The idea is you have a log file or other report, perhaps >>>> from tar or zip, and you wish to use this to control the >>>> indexing of the new content. The new entity fields are as >>>> follows. >>>> >>>> manifestFileName is the name of the manifest file. If >>>> this value is relative, it assumed to >>>> be relative to baseDir. Required. >>>> >>>> manifestAddRegex is a required regex to identify lines >>>> which when matched should cause docs to >>>> be added to the index. >>>> >>>> manifestDelRegex is an optional value of a regex to >>>> identify documents which when matched should >>>> be deleted from the index **PLANNED** >>>> >>>> allowRegex a required regex to identify the portion >>>> of the ADD/DELete line identified above >>>> which contains the file or pathname to >>>> ADDed or DELeted. If the resulting value >>>> relative, it assumed to be relative to >>>> baseDir. >>>> >>>> What do I do next? >>>> Raise a JIRA issue and add the code? >>>> Is DIH the right place to add this? >>>> Suggestions for a different name? >>>> Suggestions on how to do the delete bitty from within an entity? >>>> >>>> Regards Fergus. 
>>>--Noble Paul >> >> -- >> >> === >> Fergus McMenemie Email:fer...@twig.me.uk >> Techmore Ltd Phone:(UK) 07721 376021 >> >> Unix/Mac/Intranets Analyst Programmer >> === >> > > > >-- >--Noble Paul -- === Fergus McMenemie Email:fer...@twig.me.uk Techmore Ltd Phone:(UK) 07721 376021 Unix/Mac/Intranets Analyst Programmer ===
Re: DIH with a list of changed documents?
>Hello List, > >how would I implement entity-processor if I were able to get the list >of recently changed documents of our sites? > >thanks for hints. > >paul Hmmm, this sounds like a job for my manifestEnityProcessor; see if you can find the thread titled:- "a new DIH manifestEnityProcessor" Is your list of changed documents a list of additions and updates only, or does it contain deletes as well? Fergus. -- === Fergus McMenemie Email:fer...@twig.me.uk Techmore Ltd Phone:(UK) 07721 376021 Unix/Mac/Intranets Analyst Programmer ===
Re: DIH with a list of changed documents?
>Le 09-mars-09 à 22:29, Fergus McMenemie a écrit : >>> how would I implement entity-processor if I were able to get the list >>> of recently changed documents of our sites? >> >> H, this sounds like a job for my manifestEnityProcessor >> see if you can find the thread titled:- >> >> "a new DIH manifestEnityProcessor" >> >> is your list of changed documents a list of additions and >> updates only, or does it contain deletes as well? > >Fergus, > >I think you should then rename it... Manifest is not the right name to >me (manifest refers to something such as the manifest of a jar or of >an IMS-content-package, both are a metadata of the data). Its all in the jargon, I guess. Our content repositories are changed by update kits, some of the kits come with manifests or in other cases we capture the output from un-tar or un-zip commands and we call these manifests. The name is up for grabs if a better suggestion comes along; I would have used FileListEntityProcessor except the name was taken;-) >I looked at your original description and I could not read anything >about the changed files. >The regex approach is a nice one for sure... Yep, our "manifest"s quite often include jpegs, avis etc which we do not want indexed. And if it's a tar output it will contain directory stubs as well. >I think a useful DIH Entity-processor that would maintain its deltas >well would have as parameters, url to a list of recently updated urls, >url to a list of recently deleted urls. Is this yours? urls hu! Never thought of that, i was just assuming it would be a local file. However I guess that could be added... so "manifestFileName" would become "manifestURL"? In my use cases some of the "manifests" are along the lines of ADD -checksum-xxx --pathname_1-- DEL --pathname_b-- Hence "manifestAddRegex" and "manifestDelRegex". I also, in other cases, have separate files, one for adding another for deleting. This I was going to deal with as two separate DIH imports. >I would have one for URLs with the list of recent things basically >from an RSS; the transformer is custom in all cases. The output from my manifestEnityProcessor is fed to an XPathEntityProcessor > >paul > Fergus. -- === Fergus McMenemie Email:fer...@twig.me.uk Techmore Ltd Phone:(UK) 07721 376021 Unix/Mac/Intranets Analyst Programmer ===
Re: a new DIH manifestEnityProcessor SOLR-1060 on jira
OK, SOLR-1060 created. >To this requirement I would add the basic requirement that this file >(what Fergus calls the manifest to which I still don't agree) >represents a update-set and that there should be a delete-set as well. > >ChangeSetEntityProcessor, on there I would jump with two feet. > >paul > > >Le 10-mars-09 à 05:40, Noble Paul ?? >Â Ë³Ë a écrit : > >> Hi Fergus open a JIRA issue anyway. put in your thoughts and we can >> refine the requirements as a part of the discussion. >> >> Basically the requirements are , >> 1)read a file line by line >> 2) filter out lines (include or exclude ) based on a regex >> 3) extract parts (named parts) from the line using another regex >> >> Noble >> >> >> On Tue, Mar 10, 2009 at 1:50 AM, Fergus McMenemie >> wrote: >>>> Hi Fergus, >>>> The idea is that we have something generic which can be applicable >>>> to >>>> a large set of users. If the manifest is a text file it can be >>>> read in >>>> somestandard way (say line by line). So we can have an >>>> EntityProcessor >>>> which reads a text file line and filer it by a regex like the way >>>> 'grep' works. >>> Yes. That is what I have written. It is just an alternate form of the >>> FileListEntityProcessor except that rather than walking the file >>> system >>> it reads from a file, line by line, and identifies the portion of the >>> line containing the filename using a regexp. >>> >>> >>>> >>>> On Mon, Mar 9, 2009 at 10:44 PM, Fergus McMenemie >>>> wrote: >>>>>> manifest processing has a very limited usecase. Why can't it be >>>>>> processed using a PlainTextEntityProcessor and write a >>>>>> Tranformer to >>>>>> read lines using regex? >>>>>> >>>>> Ehmmm Ok. The PlainTextEntityProcessor docs do not give me enough >>>>> insight to see how this could be used to index each of the files >>>>> listed by a 'tar xvf' report. Can you explain further? >>>>> >>>>> About the limited usecase. Verity thought it was useful enough >>>>> to have there own "bulk insert file" or bif file format that >>>>> did the same and was far less flexible. >>>>> >>>>> In my experience we generally start off with some kind of >>>>> file walker or crawler looking after file repositories. But >>>>> these always proved slow and unreliable and over time they >>>>> were always replaced it with some kind of manifest based >>>>> control of the indexer. Where we could get a report of changes >>>>> we always used it, and only relied on walkers or crawlers >>>>> where we had to. >>>>> >>>>> Fergus >>>>> >>>>>> >>>>>> --Noble >>>>>> >>>>>> On Mon, Mar 9, 2009 at 8:30 PM, Fergus McMenemie >>>>> > wrote: >>>>>>> Hello, >>>>>>> >>>>>>> I have almost finished a new DIH EntityProcessor which >>>>>>> I am calling the manifestEnityProcessor. It is designed >>>>>>> around the idea that whatever demon is used to maintain >>>>>>> your set of a few 100,000 xml documents it is likely to >>>>>>> drop a report or log file explaining what has been changed >>>>>>> within your content store. This assumes a file based >>>>>>> content repository. >>>>>>> >>>>>>> The manifestEnityProcessor is used as follows >>>>>>> >>>>>>> >>>>>> processor="ManifestEntityProcessor" >>>>>>> baseDir="/Volumes/Techmore/ts/aaa/schema/data" >>>>>>> rootEntity="false" >>>>>>> dataSource="null" >>>>>>> >>>>>>> allowRegex="^.*\.xml$" >>>>>>> manifestFileName="/Volumes/ts/man-find.txt" >>>>>>> manifestAddRegex="(.*)$" >>>>>>> > >>>>>>> >>>>>>> The idea is you have a log file or other
Re: Problem using DIH templatetransformer to create uniqueKey: solved
Folks, Template transformer will fail to return if a variable if undefined, however the regex transformer does still return. So where the following would fail:- This can be used instead:- So I guess we have the best of both worlds! Fergus. >Hmmm. Just gave that a go! No luck >But how many layers of defaults do we need? > > >Rgds Fergus > >>What about having the template transformer support ${field:default} >>syntax? I'm assuming it doesn't support that currently right? The >>replace stuff in the config files does though. >> >> Erik >> >> >>On Feb 13, 2009, at 8:17 AM, Fergus McMenemie wrote: >> >>> Paul, >>> >>> Following up your usenet sussgetion: >>> >>> >> ignoreMissingVariables="true"/> >>> >>> and to add more to what I was thinking... >>> >>> if the field is undefined in the input document, but the schema.xml >>> does allow a default value, then TemplateTransformer can use the >>> default value. If there is no default value defined in schema.xml >>> then it can fail as at present. This would allow "" or any other >>> value to be fed into TemplateTransformer, and still enable avoidance >>> of the partial strings you referred to. >>> >>> Regards Fergus. >>> >>>>> Hello, >>>>> >>>>> templatetransformer behaves rather ungracefully if one of the >>>>> replacement >>>>> fields is missing. >>>> >>>> Looking at TemplateString.java I see that left to itself fillTokens >>>> would >>>> replace a missing variable with "". It is an extra check in >>>> TemplateTransformer >>>> that is throwing the warning and stopping the row being returned. >>>> Commenting >>>> out the check seems to solve my problem. >>>> >>>> Having done this, an undefined replacement string in >>>> TemplateTransformer >>>> is replaced with "". However a neater fix would probably involve >>>> making >>>> use of the default value which can be assigned to a row? in >>>> schema.xml. >>>> >>>>> I am parsing a single XML document into multiple separate solr >>>>> documents. >>>>> It turns out that none of the source documents fields can be used >>>>> to create >>>>> a uniqueKey alone. I need to combine two, using template >>>>> transformer as >>>>> follows: >>>>> >>>>> >>>> dataSource="myfilereader" >>>>> processor="XPathEntityProcessor" >>>>> url="${jc.fileAbsolutePath}" >>>>> rootEntity="true" >>>>> stream="false" >>>>> forEach="/record | /record/mediaBlock" >>>>> transformer >>>>> ="DateFormatTransformer,TemplateTransformer,RegexTransformer" >>>>>> >>>>> >>>>> >>>>> >>>> sourceColName="fileAbsolutePath"/> >>>>> >>>>> >>>>> >>>>> The trouble is that vurl is only defined as a child of "/record/ >>>>> mediaBlock" >>>>> so my attempt to create id, the uniqueKey fails for the parent >>>>> document "/record" >>>>> >>>>> I am hacking around with "TemplateTransformer.java" to sort this >>>>> but was >>>>> wondering if there was a good reason for this behavior. >>>>> -- === Fergus McMenemie Email:fer...@twig.me.uk Techmore Ltd Phone:(UK) 07721 376021 Unix/Mac/Intranets Analyst Programmer ===
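The contrast being drawn is roughly the following — a sketch reusing the column names from earlier in the thread, with the exact regex being an assumption:

    <!-- TemplateTransformer: the whole row is dropped when ${x.vurl} is undefined -->
    <field column="id" template="${jc.fileAbsolutePath}#${x.vurl}" />

    <!-- RegexTransformer: a row is still returned even when ${x.vurl} is undefined -->
    <field column="id" regex="(.*)$" replaceWith="$1#${x.vurl}"
           sourceColName="fileAbsolutePath" />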
DIH use of the ?command=full-import entity= command option
Hello, Can anybody describe the intended purpose, or provide a few examples, of how the DIH entity= command option works? Am I supposed to build a data-config.xml file which contains many different alternate entities.. or Regards -- === Fergus McMenemie Email:fer...@twig.me.uk Techmore Ltd Phone:(UK) 07721 376021 Unix/Mac/Intranets Analyst Programmer ===
Re: DIH use of the ?command=full-import entity= command option
If my data-config.xml contains multiple root level entities what is the expected action if I call full-import without an entity=XXX sub-command? Does it process all entities one after the other or only the first? (It would be useful IMHO if it only did the first.) >On Fri, Mar 13, 2009 at 3:17 AM, Fergus McMenemie wrote: > >> Hello, >> >> Can anybody describe the intended purpose, or provide a >> few examples, of how the DIH entity= command option works. >> >> Am I supposed to build a data-conf.xml file which contains >> many different alternate entities.. or >> > >With the entity parameter you can specify the name of any root entity and >import only that one. You can specify multiple entity parameters too. For >example: >/dataimport?command=full-import&entity=x&entity=y > >You may need to specify preImportDeleteQuery separately on each entity to >make sure all documents are not deleted. >-- >Regards, >Shalin Shekhar Mangar. -- === Fergus McMenemie Email:fer...@twig.me.uk Techmore Ltd Phone:(UK) 07721 376021 Unix/Mac/Intranets Analyst Programmer ===
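Shalin's suggestion corresponds to a data-config.xml with several root-level entities, each selectable by name — a sketch, with the entity names taken from the example URL and the delete queries invented for illustration:

    <document>
      <entity name="x" preImportDeleteQuery="sourcetype:x" ...>
        ...
      </entity>
      <entity name="y" preImportDeleteQuery="sourcetype:y" ...>
        ...
      </entity>
    </document>

With that in place, /dataimport?command=full-import&entity=x rebuilds only the x documents, and the per-entity preImportDeleteQuery keeps the import from first deleting everything, including y's documents.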
Problem encoding ':' char in a solr query
Hello I have a solr field:- which an unrelated query reveals is populated with:- file:///Volumes/spare/ts/ford/schema/data/news/fdw2008/jn71796.xml however when I try and query for that exact document explicitly:- http://localhost:8080/apache-solr-1.4-dev/select?q=fileAbsolutePath:file%3a///Volumes/spare/ts/ford/schema/data/news/fdw2008/jn71796.xml&wt=xml it fails. HTTP Status 400 - org.apache.lucene.queryParser.ParseException: Cannot parse 'fileAbsolutePath:file:///Volumes/spare/ts/ford/schema/data/news/fdw2008/jn71796.xml': Encountered " ":" ": "" at line 1, column 21. Was expecting one of: ... ... ... "+" ... "-" ... "(" ... "*" ... "^" ... ... ... ... ... ... "[" ... "{" ... ... My encoding did not work! Help! -- === Fergus McMenemie Email:fer...@twig.me.uk Techmore Ltd Phone:(UK) 07721 376021 Unix/Mac/Intranets Analyst Programmer ===
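The %3a only protects the colon from the HTTP layer; the Lucene query parser still sees the raw ':' characters and treats them as field separators. The usual workaround — standard query-parser escaping rather than anything confirmed in this thread — is to backslash-escape the colons in the value, or to quote the whole value:

    fileAbsolutePath:file\:///Volumes/spare/ts/ford/schema/data/news/fdw2008/jn71796.xml
    fileAbsolutePath:"file:///Volumes/spare/ts/ford/schema/data/news/fdw2008/jn71796.xml"

(In the request URL the backslash itself then needs encoding as %5C.)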
Re: DIH - read datasource param values from property file or configure JNDI datasource
>I am looking for a implementation of DIH feature: It also takes in a >properties file for the data source configuration >(http://issues.apache.org/jira/browse/SOLR-469) > >I want to externalize the data source parameters like driver, url, user and >password to property file outside the solr. My aim to hide the parameters from >developer code in Production environment. So that admin can enter these values. > >Or else can DIH read JNDI data source from server environment. > >Let me know the best practice to follow in production environment? > >Thanks >Shyamsunder > This is an idea rather than a recommendation. But, as per the DIH FAQ, you can pass in extra arguments on the URL used to invoke DIH and use these arguments within data-config.xml. Not too sure the extent they can be used with the various dataSource's. But if you are happy to pass the information in via the URL or even solconfig.xml then this may the route to go down. Fergus. --
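A sketch of that idea, with invented property names, using the same ${dataimporter.request.*} syntax that appears elsewhere in these threads for installdir — whether every dataSource attribute is actually resolved this way is exactly the open question above:

    <dataSource type="JdbcDataSource"
                driver="${dataimporter.request.driver}"
                url="${dataimporter.request.dburl}"
                user="${dataimporter.request.dbuser}"
                password="${dataimporter.request.dbpassword}" />

It would be invoked with something like /dataimport?command=full-import&driver=...&dburl=...&dbuser=...&dbpassword=..., so the credentials live with whoever runs the import rather than in the config file.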
Re: Scheduling DIH
H, my tuppence worth! IMHO I do not think this should be built into solr. Doing it properly leads to all kinds of nasty platform dependent issues... will we then want to add notification features on success/failure? via email? Ideally, all the scheduled activities on a system should be centralised in one place such as cron, or as few places as possible. From a system administration point of view there is then a single locations from where everything can be viewed and controlled. There are generally dependencies between different activities and having to chase around and configure many separate proprietary schedulers is a nuisance as well as being error prone. Fergus. Tricia Williams wrote: Hello, Is there a best way to schedule the DataImportHandler? The idea being to schedule a delta-import every Sunday morning at 7am or perhaps every hour without human intervention. Writing a cron job to do this wouldn't be difficult. I'm just wondering is this a built in feature? Tricia
Clarifying use of "appends" in solrconfig.xml
Hello, Due to limitations in the way my content is organised and DIH, I have to add "-imgCaption:[* TO *]" to some of my queries. I discovered the name="appends" functionality tucked away inside solrconfig.xml. This looks a very useful feature, and I created a new requestHandler to deal with my problem queries. Appending the clause to the q parameter did not work; however appending it as an fq worked fine and is also more efficient (a sketch is below). I guess I was caught by the "identify values which should be appended to the list of multi-val params from the query" portion of the comment within solrconfig.xml. I am now wondering how I know which query params are "multi-val" and which are not. Is this documented anywhere? Regards Fergus.
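The working variant as a sketch — the handler name is invented, and the assumption is that the intent is to keep the image sub-documents (those carrying an imgCaption) out of this handler's results:

    <requestHandler name="/general" class="solr.SearchHandler">
      <lst name="appends">
        <str name="fq">-imgCaption:[* TO *]</str>
      </lst>
    </requestHandler>

As for which parameters are "multi-val": broadly, the ones that may legitimately be repeated on a request (fq, facet.field, facet.query and friends) are the ones that can usefully be appended to; q is single-valued, which is why appending to it does not behave as hoped.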
Re: [solr-user] Upgrade from 1.2 to 1.3 gives 3x slowdown
Grant, After all my playing about at boot camp, I gave things a rest. It was not till months later that got back to looking at solr again. So after 643465 (2008-Apr-01) the next version I tried was 694377 from (2008-Sep-11). Nothing in between. Yep so 643465 is the latest version I tried that still performs. Every later revision is slower. However I need to repeat the tests using 643465, 694377 and whatever is the latest version. On my macbook I am only seeing a 2x slowdown of 643465 vis today, where as I had been seeing a 3x slowdown using my Imac. Fergus >Fregus, > >Is rev 643465 the absolute latest you tried that still performs? i.e. >every revision after is slower? > >-Grant > >On Mar 30, 2009, at 12:45 PM, Grant Ingersoll wrote: > >> Fergus, >> >> I think the problem may actually be due to something that was >> introduced by a change to Solr's StopFilterFactory and the way it >> loads the stop words set. See >> https://issues.apache.org/jira/browse/SOLR-1095 >> >> I am in the process of testing it out and will let you know. >> >> -Grant >> >> On Mar 28, 2009, at 11:00 AM, Grant Ingersoll wrote: >> >>> Hey Fergus, >>> >>> Finally got a chance to run your scripts, etc. per the thread: >>> http://www.lucidimagination.com/search/document/5c3de15a4e61095c/upgrade_from_1_2_to_1_3_gives_3x_slowdown_script#8324a98d8840c623 >>> >>> I can reproduce your slowdown. >>> >>> One oddity with rev 643465 is: >>> >>> On the old version, there is an exception during startup: >>> Mar 28, 2009 10:44:31 AM org.apache.solr.common.SolrException log >>> SEVERE: java.lang.NullPointerException >>> at >>> org >>> .apache >>> .solr >>> .handler >>> .component.SearchHandler.handleRequestBody(SearchHandler.java:129) >>> at >>> org >>> .apache >>> .solr >>> .handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java: >>> 125) >>> at org.apache.solr.core.SolrCore.execute(SolrCore.java:953) >>> at org.apache.solr.core.SolrCore.execute(SolrCore.java:968) >>> at >>> org >>> .apache >>> .solr.core.QuerySenderListener.newSearcher(QuerySenderListener.java: >>> 50) >>> at org.apache.solr.core.SolrCore$3.call(SolrCore.java:797) >>> at java.util.concurrent.FutureTask >>> $Sync.innerRun(FutureTask.java:303) >>> at java.util.concurrent.FutureTask.run(FutureTask.java:138) >>> at java.util.concurrent.ThreadPoolExecutor >>> $Worker.runTask(ThreadPoolExecutor.java:885) >>> at java.util.concurrent.ThreadPoolExecutor >>> $Worker.run(ThreadPoolExecutor.java:907) >>> at java.lang.Thread.run(Thread.java:637) >>> >>> I see two things in CHANGES.txt that might apply, but I'm not sure: >>> 1. I think commons-csv was upgraded >>> 2. The CSV loader stuff was refactored to share common code >>> >>> I'm still investigating. >>> >>> -Grant >> >> -- >> Grant Ingersoll >> http://www.lucidimagination.com/ >> >> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) >> using Solr/Lucene: >> http://www.lucidimagination.com/search -- === Fergus McMenemie Email:fer...@twig.me.uk Techmore Ltd Phone:(UK) 07721 376021 Unix/Mac/Intranets Analyst Programmer ===
Re: [solr-user] Upgrade from 1.2 to 1.3 gives 3x slowdown
>Can you verify that rev 701485 still performs reasonably well? This >is from October 2008 and I get similar results to the earlier rev. >Am now trying some other versions between October and when you first >reported the issue in November. OK. Can you tell me how to get a hold of revision 701485. What is the magic svn line? >On Mar 30, 2009, at 3:37 PM, Grant Ingersoll wrote: > >> Fregus, >> >> Is rev 643465 the absolute latest you tried that still performs? >> i.e. every revision after is slower? >> >> -Grant >> >> On Mar 30, 2009, at 12:45 PM, Grant Ingersoll wrote: >> >>> Fergus, >>> >>> I think the problem may actually be due to something that was >>> introduced by a change to Solr's StopFilterFactory and the way it >>> loads the stop words set. See >>> https://issues.apache.org/jira/browse/SOLR-1095 >>> >>> I am in the process of testing it out and will let you know. >>> >>> -Grant >>> >>> On Mar 28, 2009, at 11:00 AM, Grant Ingersoll wrote: >>> >>>> Hey Fergus, >>>> >>>> Finally got a chance to run your scripts, etc. per the thread: >>>> http://www.lucidimagination.com/search/document/5c3de15a4e61095c/upgrade_from_1_2_to_1_3_gives_3x_slowdown_script#8324a98d8840c623 >>>> >>>> I can reproduce your slowdown. >>>> >>>> One oddity with rev 643465 is: >>>> >>>> On the old version, there is an exception during startup: >>>> Mar 28, 2009 10:44:31 AM org.apache.solr.common.SolrException log >>>> SEVERE: java.lang.NullPointerException >>>> at >>>> org >>>> .apache >>>> .solr >>>> .handler >>>> .component.SearchHandler.handleRequestBody(SearchHandler.java:129) >>>> at >>>> org >>>> .apache >>>> .solr >>>> .handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java: >>>> 125) >>>> at org.apache.solr.core.SolrCore.execute(SolrCore.java:953) >>>> at org.apache.solr.core.SolrCore.execute(SolrCore.java:968) >>>> at >>>> org >>>> .apache >>>> .solr >>>> .core.QuerySenderListener.newSearcher(QuerySenderListener.java:50) >>>> at org.apache.solr.core.SolrCore$3.call(SolrCore.java:797) >>>> at java.util.concurrent.FutureTask >>>> $Sync.innerRun(FutureTask.java:303) >>>> at java.util.concurrent.FutureTask.run(FutureTask.java:138) >>>> at java.util.concurrent.ThreadPoolExecutor >>>> $Worker.runTask(ThreadPoolExecutor.java:885) >>>> at java.util.concurrent.ThreadPoolExecutor >>>> $Worker.run(ThreadPoolExecutor.java:907) >>>> at java.lang.Thread.run(Thread.java:637) >>>> >>>> I see two things in CHANGES.txt that might apply, but I'm not sure: >>>> 1. I think commons-csv was upgraded >>>> 2. The CSV loader stuff was refactored to share common code >>>> >>>> I'm still investigating. >>>> >>>> -Grant >>> >>> -- >>> Grant Ingersoll >>> http://www.lucidimagination.com/ >>> >>> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) >>> using Solr/Lucene: >>> http://www.lucidimagination.com/search >>> >> >> -- >> Grant Ingersoll >> http://www.lucidimagination.com/ >> >> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) >> using Solr/Lucene: >> http://www.lucidimagination.com/search >> > >-- >Grant Ingersoll >http://www.lucidimagination.com/ > >Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) >using Solr/Lucene: >http://www.lucidimagination.com/search -- === Fergus McMenemie Email:fer...@twig.me.uk Techmore Ltd Phone:(UK) 07721 376021 Unix/Mac/Intranets Analyst Programmer ===
Re: DIH; Hardcode field value/replacement based on source column
Hmmm, I am sure I have seen this as well! I get the #${x.imgvurl} added twice. Fergus. >On 3/31/09 11:50 AM, "Wesley Small" wrote: > >> I am trying to find a clean way to *hardcode* a field/column to a specific >> value during the DIH process. It does seems to be possible but I am getting >> an slightly invalid constant value in my index. >> >> > replaceWith="Video" /> >> >> However, the value in the index was set to "VideoVideo" for all documents. >> >> Any idea why this DIH instruction would see constant value appear twice?? >> -- === Fergus McMenemie Email:fer...@twig.me.uk Techmore Ltd Phone:(UK) 07721 376021 Unix/Mac/Intranets Analyst Programmer ===
Re: [solr-user] Upgrade from 1.2 to 1.3 gives 3x slowdown
Grant, I am messing with the script, and with your tip I expect I can make it recurse over as many releases as needed. I did run it again using the full file, this time using my Imac:- 643465took 22min 14sec 2008-04-01 734796 73min 58sec 2009-01-15 758795 70min 55sec 2009-03-26 I then ran it again using only the first 1M records:- 643465took 2m51.516s 2008-04-01 734796 7m29.326s 2009-01-15 758795 8m18.403s 2009-03-26 this time with commit=true. 643465took 2m49.200s 2008-04-01 734796 8m27.414s 2009-01-15 758795 9m32.459s 2009-03-26 this time with commit=false&overwrite=false. 643465took 2m46.149s 2008-04-01 734796 3m29.909s 2009-01-15 758795 3m26.248s 2009-03-26 Just read your latest post. I will apply the patches and retest the above. >Can you try adding &overwrite=false and running against the latest >version? My current working theory is that Solr/Lucene has changed >how deletes are handled such that work that was deferred before is now >not deferred as often. In fact, you are not seeing this cost paid (or >at least not noticing it) because you are not committing, but I >believe you do see it when you are closing down Solr, which is why it >takes so long to exit. It can take ages! (>15min to get tomcat to quit). Also my script does have the separate commit step, which does not take any time! >I also think that Lucene adding fsync() into >the equation may cause some slow down, but that is a penalty we are >willing to pay as it gives us higher data integrity. Data integrity is always good. However if performance seems unreasonable, user/customers tend to take things into their own hands and kill the process or machine. This tends to be very bad for data integrity. >So, depending on how you have your data, I think a workaround is to: >Add a field that contains a single term identifying the data type for >this particular CSV file, i.e. something like field: type, value: >fergs-csv >Then, before indexing, you can issue a Delete By Query: type:fergs-csv >and then add your CSV file using overwrite=false. This amounts to a >batch delete followed by a batch add, but without the add having to >issue deletes for each add. Ok.. but... for these test cases I am starting off with an empty index. The script does a "rm -rf solr/data" before tomcat is launched. So I do not understand how the above helps. UNLESS there are duplicate gaz entries. >In the meantime, I'm trying to see if I can pinpoint down a specific >change and see if there is anything that might help it perform better. > >-Grant > -- === Fergus McMenemie Email:fer...@twig.me.uk Techmore Ltd Phone:(UK) 07721 376021 Unix/Mac/Intranets Analyst Programmer ===
Re: [solr-user] Upgrade from 1.2 to 1.3 gives 3x slowdown
Grant, Redoing the work with your patch applied does not seem to make a difference! Is this the expected result?

I did run it again using the full file, this time using my Imac:-
   643465     took 22min 14sec    2008-04-01
   734796          73min 58sec    2009-01-15
   758795          70min 55sec    2009-03-26

Again using only the first 1M records with commit=false&overwrite=true:-
   643465     took 2m51.516s      2008-04-01
   734796          7m29.326s      2009-01-15
   758795          8m18.403s      2009-03-26
   SOLR-1095       7m41.699s

This time with commit=true&overwrite=true:
   643465     took 2m49.200s      2008-04-01
   734796          8m27.414s      2009-01-15
   758795          9m32.459s      2009-03-26
   SOLR-1095       7m58.825s

This time with commit=false&overwrite=false:
   643465     took 2m46.149s      2008-04-01
   734796          3m29.909s      2009-01-15
   758795          3m26.248s      2009-03-26
   SOLR-1095       2m49.997s

-- === Fergus McMenemie Email:fer...@twig.me.uk Techmore Ltd Phone:(UK) 07721 376021 Unix/Mac/Intranets Analyst Programmer ===
Re: [solr-user] Upgrade from 1.2 to 1.3 gives 3x slowdown
>On Apr 1, 2009, at 9:39 AM, Fergus McMenemie wrote: > >> Grant, >> >> Redoing the work with your patch applied does not seem to > >> >> make a difference! Is this the expected result? > >No, I didn't expect Solr 1095 to fix the problem. Overwrite = false + >1095, does, however, AFAICT by your last line, right? > >> >> >> I did run it again using the full file, this time using my Imac:- >> 643465took 22min 14sec 2008-04-01 >> 734796 73min 58sec 2009-01-15 >> 758795 70min 55sec 2009-03-26 >> Again using only the first 1M records with >> commit=false&overwrite=true:- >> 643465took 2m51.516s 2008-04-01 >> 734796 7m29.326s 2009-01-15 >> 758795 8m18.403s 2009-03-26 >> SOLR-1095 7m41.699s >> this time with commit=true&overwrite=true. >> 643465took 2m49.200s 2008-04-01 >> 734796 8m27.414s 2009-01-15 >> 758795 9m32.459s 2009-03-26 >> SOLR-1095 7m58.825s >> this time with commit=false&overwrite=false. >> 643465took 2m46.149s 2008-04-01 >> 734796 3m29.909s 2009-01-15 >> 758795 3m26.248s 2009-03-26 >> SOLR-1095 2m49.997s >> Grant, Hmmm, the big difference is made by &overwrite=false. But, can you explain why &overwrite=false makes such a difference. I am starting off with an empty index and I have checked the content there are no duplicates in the uniqueKey field. I guess if &overwrite=false then a few checks can be removed from the indexing process, and if I am confident that my content contains no duplicates then this is a good speed up. http://wiki.apache.org/solr/UpdateCSV says that if overwrite is true (the default) then overwrite documents based on the uniqueKey. However what will solr/lucene do if the uniqueKey is not unique and overwrite=false? fergus: perl -nlaF"\t" -e 'print "$F[2]";' geonames.txt | wc -l 100 fergus: perl -nlaF"\t" -e 'print "$F[2]";' geonames.txt | sort -u | wc -l 100 fergus: /usr/bin/head geonames.txt RC UFI UNI LAT LONGDMS_LAT DMS_LONGMGRSJOG FC DSG PC CC1 ADM1ADM2POP ELEVCC2 NT LC SHORT_FORM GENERIC SORT_NAME FULL_NAME FULL_NAME_ND MODIFY_DATE 1 -130782860524 12.47 -69.9 122800 -695400 19PDP0219578323 ND19-14 T MT AA 00 PALUMARGA Palu Marga Palu Marga 1995-03-23 1 -1307756-189172012.5-70.016667 123000 -700100 19PCP8952982056 ND19-14 P PPLX PS. do you want me to do some kind of chop through the different versions to see where the slow down happened or are you happy you have nailed it? -- === Fergus McMenemie Email:fer...@twig.me.uk Techmore Ltd Phone:(UK) 07721 376021 Unix/Mac/Intranets Analyst Programmer ===
Problem using ExtractingRequestHandler with tomcat
Hello all,

I can't get ExtractingRequestHandler to work with tomcat. Using the latest version from svn and then a "make clean dist" and copying the war file to a clean tomcat does not work. Adding the following to solrconfig.xml and restarting tomcat, I get the error below:

   <requestHandler name="/update/extract"
       class="org.apache.solr.handler.extraction.ExtractingRequestHandler">
     <lst name="defaults">
       <str name="ext.map.Last-Modified">last_modified</str>
       <bool name="ext.ignore.und.fl">true</bool>
     </lst>
   </requestHandler>

>Apr 2, 2009 9:20:02 AM org.apache.solr.util.plugin.AbstractPluginLoader load
>INFO: created /update/javabin: org.apache.solr.handler.BinaryUpdateRequestHandler
>Apr 2, 2009 9:20:02 AM org.apache.solr.common.SolrException log
>SEVERE: org.apache.solr.common.SolrException: Error loading class 'org.apache.solr.handler.extraction.ExtractingRequestHandler'
>        at org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:310)
>        at org.apache.solr.core.SolrResourceLoader.newInstance(SolrResourceLoader.java:325)
>        at org.apache.solr.util.plugin.AbstractPluginLoader.create(AbstractPluginLoader.java:84)
>        at org.apache.solr.core.RequestHandlers$1.create(RequestHandlers.java:154)
>        at org.apache.solr.core.RequestHandlers$1.create(RequestHandlers.java:163)

Any ideas?

-- 
===
Fergus McMenemie               Email:fer...@twig.me.uk
Techmore Ltd                   Phone:(UK) 07721 376021
Unix/Mac/Intranets             Analyst Programmer
===
Re: Problem using ExtractingRequestHandler with tomcat
>On Apr 2, 2009, at 4:26 AM, Fergus McMenemie wrote:
>> I can't get ExtractingRequestHandler to work with tomcat. Using the
>> latest version from svn and then a "make clean dist" and copying the
>> war file to a clean tomcat does not work.
>
>make?! :)

Oops!

>
>try "ant example" to see if that gets it working - it copies the
>ExtractingRequestHandler JAR and dependencies to /lib
>
>       Erik
>

Thanks. Copying all those jar files to my solr/lib directory was the trick. But why do I have to do this; is it by design, or because ExtractingRequestHandler is yet to be fully incorporated into Solr?

Regards Fergus.

-- 
===
Fergus McMenemie               Email:fer...@twig.me.uk
Techmore Ltd                   Phone:(UK) 07721 376021
Unix/Mac/Intranets             Analyst Programmer
===
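For anyone hitting the same class-loading error under tomcat, the fix described above amounts to getting the solr-cell jar and its Tika dependencies onto the core's classpath. A sketch, assuming a 1.4-dev checkout where "ant example" drops those jars under the example's lib directory (the destination path is whatever Solr home your tomcat context points at; both paths here are assumptions):

   # build the example app, which gathers the solr-cell jar and its dependencies
   ant example
   # copy those jars into the lib directory of the Solr home used by tomcat
   cp example/solr/lib/*.jar /path/to/tomcat/solrhome/lib/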
Using ExtractingRequestHandler to index a large PDF
Hello,

Sorry if this is a FAQ; I suspect it could be. But how do I work around the following:-

INFO: [] webapp=/apache-solr-1.4-dev path=/update/extract params={ext.def.fl=text&ext.literal.id=factbook/reference_maps/pdf/oceania.pdf} status=0 QTime=318
Apr 2, 2009 11:17:46 AM org.apache.solr.common.SolrException log
SEVERE: org.apache.commons.fileupload.FileUploadBase$SizeLimitExceededException: the request was rejected because its size (4585774) exceeds the configured maximum (2097152)
        at org.apache.commons.fileupload.FileUploadBase$FileItemIteratorImpl.(FileUploadBase.java:914)
        at org.apache.commons.fileupload.FileUploadBase.getItemIterator(FileUploadBase.java:331)
        at org.apache.commons.fileupload.FileUploadBase.parseRequest(FileUploadBase.java:349)
        at org.apache.commons.fileupload.servlet.ServletFileUpload.parseRequest(ServletFileUpload.java:126)
        at org.apache.solr.servlet.MultipartRequestParser.parseParamsAndFillStreams(SolrRequestParsers.java:343)
        at org.apache.solr.servlet.StandardRequestParser.parseParamsAndFillStreams(SolrRequestParsers.java:396)
        at org.apache.solr.servlet.SolrRequestParsers.parse(SolrRequestParsers.java:114)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:217)
        at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:202)
        at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:173)
        at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:213)
        at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:178)
        at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:126)
        at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:105)

Although the PDF is big, it contains very little text; it is a map.

 "java -jar solr/lib/tika-0.3.jar -g" appears to have no bother with it.

Fergus...

-- 
===
Fergus McMenemie               Email:fer...@twig.me.uk
Techmore Ltd                   Phone:(UK) 07721 376021
Unix/Mac/Intranets             Analyst Programmer
===
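For reference, the request that trips this limit is a plain multipart file upload to the extraction handler; something like the following (host and webapp path are illustrative, the ext.* parameters are the ones from the log above, and the form field name is arbitrary). It is the multipart body that the 2097152-byte upload limit is applied to:

   curl "http://localhost:8080/apache-solr-1.4-dev/update/extract?ext.def.fl=text&ext.literal.id=factbook/reference_maps/pdf/oceania.pdf" \
        -F "myfile=@oceania.pdf"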
Re: [solr-user] Upgrade from 1.2 to 1.3 gives 3x slowdown
Grant,

>I should note, however, that the speed difference you are seeing may
>not be as pronounced as it appears. If I recall during ApacheCon, I
>commented on how long it takes to shutdown your Solr instance when
>exiting it. That time it takes is in fact Solr doing the work that
>was put off by not committing earlier and having all those deletes
>pile up.
>

I am confused about "work that was put off" vs committing. My script was doing a commit right after the CSV import, and you are right about the massive times required to shut tomcat down. But in my tests the time taken to do the commit was under a second, yet I had to allow 300secs for tomcat shutdown. Also, I don't have any duplicates. So what sort of work was being done at shutdown that was not being done by a commit? Optimise!

Thanks for all the help.

Fergus.

-- 
===
Fergus McMenemie               Email:fer...@twig.me.uk
Techmore Ltd                   Phone:(UK) 07721 376021
Unix/Mac/Intranets             Analyst Programmer
===
Re: Additive filter queries
>I have a design question for all of those who might be willing to provide an
>answer.
>
>We are looking for a way to do a type of additive filters. Our documents
>are comprised of a single item of a specified color. We will use shoes as
>an example. Each document contains a multivalued "size" field with all
>sizes and a multivalued "width" field for all widths available for a given
>color. Our issue is that the values are not linked to each other. This
>issue can be seen when a user chooses a size (e.g. 7) and we filter the
>options down to only size 7. When the width facet is displayed it will have
>all widths available for all documents that match on size 7 even though most
>don't come in a wide width. We are looking for strategies to filter facets
>based on other facets in separate queries.
>
>--
>Jeff Newburn
>Software Engineer, Zappos.com
>jnewb...@zappos.com - 702-943-7562

Ditto! As best I understand it, you somehow need to arrange for each different combination of colour, size and width to be indexed as a separate Solr document.

-- 
===
Fergus McMenemie               Email:fer...@twig.me.uk
Techmore Ltd                   Phone:(UK) 07721 376021
Unix/Mac/Intranets             Analyst Programmer
===
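For what it is worth, the behaviour being described falls out of queries of this shape (field names follow the shoe example above; host and path are illustrative). The width counts come from whole documents that happen to match size:7, not from the size/width pairs that actually exist together, which is exactly the problem, and is why splitting each combination into its own document changes the answer:

   curl "http://localhost:8080/solr/select?q=*:*&fq=size:7&facet=true&facet.field=width&rows=0"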
Re: DIH API for specifying a either specific or all configurations imported
>Good Morning,
>
>Is there any way to specify or debug a specific DIH configuration via the
>API/http request?
>
>I have the following:
>
>   dih_pc_default_feed.xml
>   dih_pc_cms_article_feed.xml
>   dih_pc_local_event_feed.xml
>
>For example, is there any way to specify that only the "pc_local_event" be
>processed (imported)?
>
>Another question: if command=full-import, this should effectively mean that
>all DIH configurations are executed in sequential order. Is that correct? I
>am not seeing that behaviour at present.
>

Wesley,

I do not think the above is valid syntactically. I am still coming up to speed on DIH; however, I have taken to storing all my DIH import configurations in a single file. Each of your different configurations would then be within its own top-level entity tag, each of which MUST be named. It is also a good idea to explicitly name each of your datasource descriptions, and then have the entities reference their datasource by name. I can then invoke only one entity from the URL as follows:-

http://localhost:8080/apache-solr-1.4-dev/dataimport?command=full-import&entity=jc

See the docs at:-

http://wiki.apache.org/solr/DataImportHandler#head-1582242c1bfc1f3e89f4025bf2055791848acefb

Fergus.

-- 
===
Fergus McMenemie               Email:fer...@twig.me.uk
Techmore Ltd                   Phone:(UK) 07721 376021
Unix/Mac/Intranets             Analyst Programmer
===
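To expand a little on the URL above, the useful knobs when running just one named entity are the standard DIH request parameters; for example (webapp path and entity name are illustrative):

   # run only the named entity, without wiping what the other entities loaded
   curl "http://localhost:8080/apache-solr-1.4-dev/dataimport?command=full-import&entity=pc_local_event&clean=false&commit=true"

   # check progress / outcome of the last import
   curl "http://localhost:8080/apache-solr-1.4-dev/dataimport?command=status"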
Re: Using ExtractingRequestHandler to index a large PDF ~solved
Hmmm,

Not sure how this all hangs together, but editing my solrconfig.xml as follows sorted the problem:-

   <requestParsers enableRemoteStreaming="false"
                   multipartUploadLimitInKB="2048" />
to
   <requestParsers enableRemoteStreaming="false"
                   multipartUploadLimitInKB="20048" />

Also, my initial report of the issue was misled by the log messages. The mention of "oceania.pdf" refers to a previous successful tika extract. There is no mention of the filename that was rejected in the logs, or any information that would help me identify it!

Regards Fergus.

>Sorry if this is a FAQ; I suspect it could be. But how do I work around the following:-
>
>INFO: [] webapp=/apache-solr-1.4-dev path=/update/extract params={ext.def.fl=text&ext.literal.id=factbook/reference_maps/pdf/oceania.pdf} status=0 QTime=318
>Apr 2, 2009 11:17:46 AM org.apache.solr.common.SolrException log
>SEVERE: org.apache.commons.fileupload.FileUploadBase$SizeLimitExceededException: the request was rejected because its size (4585774) exceeds the configured maximum (2097152)
>        at org.apache.commons.fileupload.FileUploadBase$FileItemIteratorImpl.(FileUploadBase.java:914)
>        at org.apache.commons.fileupload.FileUploadBase.getItemIterator(FileUploadBase.java:331)
>        at org.apache.commons.fileupload.FileUploadBase.parseRequest(FileUploadBase.java:349)
>        at org.apache.commons.fileupload.servlet.ServletFileUpload.parseRequest(ServletFileUpload.java:126)
>        at org.apache.solr.servlet.MultipartRequestParser.parseParamsAndFillStreams(SolrRequestParsers.java:343)
>        at org.apache.solr.servlet.StandardRequestParser.parseParamsAndFillStreams(SolrRequestParsers.java:396)
>        at org.apache.solr.servlet.SolrRequestParsers.parse(SolrRequestParsers.java:114)
>        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:217)
>        at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:202)
>        at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:173)
>        at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:213)
>        at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:178)
>        at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:126)
>        at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:105)
>
>Although the PDF is big, it contains very little text; it is a map.
>
> "java -jar solr/lib/tika-0.3.jar -g" appears to have no bother with it.
>
>Fergus...

-- 
===
Fergus McMenemie               Email:fer...@twig.me.uk
Techmore Ltd                   Phone:(UK) 07721 376021
Unix/Mac/Intranets             Analyst Programmer
===
Re: Searching on multi-core Solr
vivek, 404 from the URL you provided in the message! Similar URLs work OK for me. hmm try http://localhost:8080/solr/admin/cores?action=status and see if that gives a 404. Also are you running a nightly build or a svn checkout? Using tomcat? Perhaps it should be http://localhost:8080/apache-solr-1.4-dev/admin/cores?action=status Fergus. >Hi, > > Any help on this. I've looked at DistributedSearch on Wiki, but that >doesn't seem to be working for me on multi-core and multiple Solr >instances on the same box. > >Scenario, > >1) Two boxes (localhost, 10.4.x.x) >2) Two Solr instances on each box (8080 and 8085 ports) >3) Two cores on each instance (core0, core1) > >I'm not sure how to construct my search on the above setup if I need >to search across all the cores on all the boxes. Here is what I'm >trying, > >http://localhost:8080/solr/core0/select?shards=localhost:8080/solr/core0,localhost:8085/solr/core0,localhost:8080/solr/core1,localhost:8085/solr/core1,10.4.x.x:8080/solr/core0,10.4.x.x:8085/solr/core0,10.4.x.x:8080/solr/core1,10.4.x.x:8085/solr/core1&indent=true&q=vivek+japan > >I get 404 error. Is this the right URL construction for my setup? How >else can I do this? > >Thanks, >-vivek > >On Fri, Apr 3, 2009 at 1:02 PM, vivek sar wrote: >> Hi, >> >> I've a multi-core system (one core per day), so there would be around >> 30 cores in a month on a box running one Solr instance. We have two >> boxes running the Solr instance and input data is feeded to them in >> round-robin fashion. Each box can have up to 30 cores in a month. Here >> are questions, >> >> 1) How would I search for a term in multiple cores on same box? >> >> Single core I'm able to search like, >> http://localhost:8080/solr/20090402/select?q=*:* >> >> 2) How would I search for a term in multiple cores on both boxes at >> the same time? >> >> 3) Is it possible to have two Solr instances on one box with one doing >> the indexing and other perform only searches on that index? The idea >> is have two JVMs with each doing its own task - I'm not sure whether >> the indexer process needs to know about searcher process - like do >> they need to have the same solr.xml (for multicore etc). We don't want >> to replicate the indexes also (we got very light search traffic, but >> very high indexing traffic) so they need to use the same index. >> >> >> Thanks, >> -vivek >> -- === Fergus McMenemie Email:fer...@twig.me.uk Techmore Ltd Phone:(UK) 07721 376021 Unix/Mac/Intranets Analyst Programmer ===
Re: How could I avoid reindexing same files?
Veselin,

Well, as far as Solr is concerned, there are two issues here:-

1) To stop the same document ending up in the indexes twice, use the document pathname as the unique ID. Then, if you do index it twice, the previous index information will be discarded. Not very efficient, but it may be tolerable. IMHO using the pathname as the unique ID is often best practice.

2) To stop a document even being submitted to Solr, you need to implement some middleware that either performs a search/lookup using a document's pathname to see if it is already indexed, or, after examining timestamps, only submits documents which have changed since the last folder scan.

Fergus.

>Hello Paul,
>I'm indexing with "curl http://localhost... -F myfi...@file.pdf"
>
>Regards,
>Veselin K
>
>
>On Mon, Apr 06, 2009 at 02:56:20PM +0530, Noble Paul ? ?? wrote:
>> how are you indexing?
>>
>> On Mon, Apr 6, 2009 at 2:54 PM, Veselin Kantsev wrote:
>> > Hello,
>> > apologies for the basic question.
>> >
>> > How can I avoid double indexing files?
>> >
>> > In case all my files are in one folder which is scanned frequently, is
>> > there a Solr feature of checking and skipping a file if it has already
>> > been indexed and not changed since?
>> >
>> > Thank you.
>> >
>> > Regards,
>> > Veselin K
>>
>> --
>> --Noble Paul

-- 
===
Fergus McMenemie               Email:fer...@twig.me.uk
Techmore Ltd                   Phone:(UK) 07721 376021
Unix/Mac/Intranets             Analyst Programmer
===
Re: How could I avoid reindexing same files?
>Thank you much Fergus,
>
>I was considering implementing a database which would hold a path name
>and an MD5 sum of each file.

Snap. That is close to what we did. However, due to our previous duff full-text search engine we had to hold this information in a separate checksums file. Solr is much better at allowing you to add extra meta information as the document is being submitted for indexing:

   curl http://localhost...update/extract
      -F "myfi...@file.pdf;ext.literal.id=file.pdf;ext.literal.chksum=X"

>Then as a part of Solr indexing, one could check against the DB if a
>file path exists, if Yes, then compare MD5 and only index if different.

Using Solr you could hold the checksum and pathname as Solr fields; then rather than looking up a DB you would look up Solr. Having everything in the one place is better for consistency and quality. You could also dump all checksums and pathnames from Solr if/when you wanted to validate your folder structure and/or indexes.

>Regards,
>Veselin K
>
>On Tue, Apr 07, 2009 at 09:01:31AM +0100, Fergus McMenemie wrote:
>> Veselin,
>>
>> Well, as far as Solr is concerned, there are two issues here:-
>>
>> 1) To stop the same document ending up in the indexes twice, use the document
>>    pathname as the unique ID. Then, if you do index it twice, the previous index
>>    information will be discarded. Not very efficient, but it may be tolerable.
>>    IMHO using the pathname as the unique ID is often best practice.
>>
>> 2) To stop a document even being submitted to Solr, you need to implement some
>>    middleware that either performs a search/lookup using a document's pathname
>>    to see if it is already indexed, or, after examining timestamps, only submits
>>    documents which have changed since the last folder scan.
>>
>> Fergus.
>> >Hello Paul,
>> >I'm indexing with "curl http://localhost... -F myfi...@file.pdf"
>> >
>> >Regards,
>> >Veselin K
>> >
>> >On Mon, Apr 06, 2009 at 02:56:20PM +0530, Noble Paul ? ?? wrote:
>> >> how are you indexing?
>> >>
>> >> On Mon, Apr 6, 2009 at 2:54 PM, Veselin Kantsev wrote:
>> >> > Hello,
>> >> > apologies for the basic question.
>> >> >
>> >> > How can I avoid double indexing files?
>> >> >
>> >> > In case all my files are in one folder which is scanned frequently, is
>> >> > there a Solr feature of checking and skipping a file if it has already
>> >> > been indexed and not changed since?
>> >> >
>> >> > Thank you.
>> >> >
>> >> > Regards,
>> >> > Veselin K
>> >> --Noble Paul

-- 
===
Fergus McMenemie               Email:fer...@twig.me.uk
Techmore Ltd                   Phone:(UK) 07721 376021
Unix/Mac/Intranets             Analyst Programmer
===
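Spelling the same idea out a little more fully, the checksum and pathname can be supplied as literal fields on the extract request itself; a sketch, assuming the schema has id and chksum fields and using the ext.literal.* parameter style of this 1.4-dev thread (host, webapp path and file paths are illustrative):

   # compute the checksum, then index the file with id and chksum attached
   CHK=$(md5 -q file.pdf)        # use md5sum on Linux instead of md5 -q
   curl "http://localhost:8080/apache-solr-1.4-dev/update/extract?ext.literal.id=/docs/file.pdf&ext.literal.chksum=$CHK" \
        -F "myfile=@file.pdf"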
Re: DIH; Hardcode field value/replacement based on source column
>: Indeed. I wrote the following test:
>:
>: Pattern p = Pattern.compile("(.*)");
>: Matcher m = p.matcher("xyz");
>: Assert.assertEquals("", "Video", m.replaceAll("Video"));
>:
>: The test fails. It gives "VideoVideo" as the actual result. I guess there is
>: something about Matcher.replaceAll that I don't know. Off to read the
>: javadocs then.
>
>".*" matches the empty string (for that matter any regex clause with the
>"*" modifier applied matches the empty string), and iterating over pattern
>matches (ie: what happens if you call Matcher.find() or
>Matcher.replaceAll()) always advances to "first character not matched by
>[the previous] match." (ie: let prev = m.end(); if (m.find) then prev <=
>m.start()).
>
>So ".*" always matches twice on any given String x ... once when it
>matches from 0 to x.length()-1, and one when it matches the empty string
>starting and ending at x.length()-1.
>
>That's why using "^.*" doesn't have this problem ... "*" is greedy so it
>only matches once at the start of the string and then there can't be any
>more matches. Conversly: ".*$" and ".*\z" will still have this problem,
>because any number of matches can have the same ending offset.
>
>
>-Hoss

Hmmm, given the chance perl behaves the same. Although attempting to use /*/ fails. Another lesson learnt!

#! /usr/local/bin/perl
use strict;

my($s)="cat mat rat hat";
my($c)=0;

print " a-match", ++$c, "='$1'\n" while( $s =~ m/(at)/g );
$c=0;
print " b-match", ++$c, "='$1'\n" while( $s =~ m/(.*)/g );
$c=0;
print " c-match", ++$c, "='$1'\n" while( $s =~ m/^(.*)/g );
$c=0;
print " d-match", ++$c, "='$1'\n" while( $s =~ m/(.*)$/g );

-- 
===
Fergus McMenemie               Email:fer...@twig.me.uk
Techmore Ltd                   Phone:(UK) 07721 376021
Unix/Mac/Intranets             Analyst Programmer
===
Re: How could I avoid reindexing same files?
>Hi Fergus,
>
>On Tue, Apr 07, 2009 at 05:06:23PM +0100, Fergus McMenemie wrote:
>> >Thank you much Fergus,
>> >
>> >I was considering implementing a database which would hold a path name
>> >and an MD5 sum of each file.
>> Snap. That is close to what we did. However, due to our previous
>> duff full-text search engine we had to hold this information in
>> a separate checksums file. Solr is much better at allowing you
>> to add extra meta information as the document is being submitted
>> for indexing.
>>
>> curl http://localhost...update/extract
>>    -F "myfi...@file.pdf;ext.literal.id=file.pdf;ext.literal.chksum=X"
>
>- Great idea, simpler and cleaner!
>
>
>> >Then as a part of Solr indexing, one could check against the DB if a
>> >file path exists, if Yes, then compare MD5 and only index if different.
>> Using Solr you could hold the checksum and pathname as Solr fields;
>> then rather than looking up a DB you would look up Solr. Having
>> everything in the one place is better for consistency and quality. You
>> could also dump all checksums and pathnames from Solr if/when you wanted
>> to validate your folder structure and/or indexes.
>
>- What kind of query could I use with Solr, to check for a specific
>  filename/checksum and get an answer as close to "TRUE or FALSE" as possible?

Some thought needs to be given to this to make sure that the performance is adequate, but at its simplest:-

   curl "http://localhost.../select?q=id:file.pdf&fl=id,chksum"

-- 
===
Fergus McMenemie               Email:fer...@twig.me.uk
Techmore Ltd                   Phone:(UK) 07721 376021
Unix/Mac/Intranets             Analyst Programmer
===
Re: Searching on multi-core Solr
Valve.invoke(StandardContextValve.java:191) >> at >> org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:128) >> at >> org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102) >> at >> org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109) >> at >> org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:286) >> at >> org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:845) >> at >> org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:583) >> at >> org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:447) >> at java.lang.Thread.run(Thread.java:637) >> >> >> Any tips on how can I search on multicore on same solr instance? >> >> Thanks, >> -vivek >> >> On Mon, Apr 6, 2009 at 2:40 PM, Fergus McMenemie wrote: >>> vivek, >>> >>> 404 from the URL you provided in the message! Similar URLs work >>> OK for me. >>> >>> hmm try http://localhost:8080/solr/admin/cores?action=status and see >>> if that gives a 404. >>> >>> Also are you running a nightly build or a svn checkout? Using tomcat? >>> Perhaps it should be >>> >>> http://localhost:8080/apache-solr-1.4-dev/admin/cores?action=status >>> >>> Fergus. >>> >>>>Hi, >>>> >>>> Any help on this. I've looked at DistributedSearch on Wiki, but that >>>>doesn't seem to be working for me on multi-core and multiple Solr >>>>instances on the same box. >>>> >>>>Scenario, >>>> >>>>1) Two boxes (localhost, 10.4.x.x) >>>>2) Two Solr instances on each box (8080 and 8085 ports) >>>>3) Two cores on each instance (core0, core1) >>>> >>>>I'm not sure how to construct my search on the above setup if I need >>>>to search across all the cores on all the boxes. Here is what I'm >>>>trying, >>>> >>>>http://localhost:8080/solr/core0/select?shards=localhost:8080/solr/core0,localhost:8085/solr/core0,localhost:8080/solr/core1,localhost:8085/solr/core1,10.4.x.x:8080/solr/core0,10.4.x.x:8085/solr/core0,10.4.x.x:8080/solr/core1,10.4.x.x:8085/solr/core1&indent=true&q=vivek+japan >>>> >>>>I get 404 error. Is this the right URL construction for my setup? How >>>>else can I do this? >>>> >>>>Thanks, >>>>-vivek >>>> >>>>On Fri, Apr 3, 2009 at 1:02 PM, vivek sar wrote: >>>>> Hi, >>>>> >>>>> I've a multi-core system (one core per day), so there would be around >>>>> 30 cores in a month on a box running one Solr instance. We have two >>>>> boxes running the Solr instance and input data is feeded to them in >>>>> round-robin fashion. Each box can have up to 30 cores in a month. Here >>>>> are questions, >>>>> >>>>> 1) How would I search for a term in multiple cores on same box? >>>>> >>>>> Single core I'm able to search like, >>>>> http://localhost:8080/solr/20090402/select?q=*:* >>>>> >>>>> 2) How would I search for a term in multiple cores on both boxes at >>>>> the same time? >>>>> >>>>> 3) Is it possible to have two Solr instances on one box with one doing >>>>> the indexing and other perform only searches on that index? The idea >>>>> is have two JVMs with each doing its own task - I'm not sure whether >>>>> the indexer process needs to know about searcher process - like do >>>>> they need to have the same solr.xml (for multicore etc). We don't want >>>>> to replicate the indexes also (we got very light search traffic, but >>>>> very high indexing traffic) so they need to use the same index. 
>>>>> >>>>> >>>>> Thanks, >>>>> -vivek >>>>> >>> >>> -- >>> >>> === >>> Fergus McMenemie Email:fer...@twig.me.uk >>> Techmore Ltd Phone:(UK) 07721 376021 >>> >>> Unix/Mac/Intranets Analyst Programmer >>> === >>> >> -- === Fergus McMenemie Email:fer...@twig.me.uk Techmore Ltd Phone:(UK) 07721 376021 Unix/Mac/Intranets Analyst Programmer ===
Re: Using ExtractingRequestHandler to index a large PDF ~solved
>On Apr 6, 2009, at 10:16 AM, Fergus McMenemie wrote: > >> Hmmm, >> >> Not sure how this all hangs together. But editing my solrconfig.xml >> as follows >> sorted the problem:- >> >>> multipartUploadLimitInKB="2048" /> >> to >> >>> multipartUploadLimitInKB="20048" /> >> > >We should document this on the wiki or in the config, if it isn't >already. As best I could tell it is not documented. I stumbled across the idea of changing multipartUploadLimitInKB after reviewing http://wiki.apache.org/solr/UpdateRichDocuments. But this leads onto wondering if streaming files from a local disk was in some way also available via enableRemoteStreaming for the solr-cell feature? With 20:20 hindsight I see that http://wiki.apache.org/solr/SolrConfigXml does briefly refer to "file upload size" I feel that the requestDispatcher section of solrconfig.xml needs a more complete description. I get the impression it acts a filter on *any* URL sent to SOLR? What does it do? I will mark up the wiki when this is clarified > >> Also, my initial report of the issue was misled by the log messages. >> The mention >> of "oceania.pdf" refers to a previous successful tika extract. There >> no mention >> of the filename that was rejected in the logs or any information >> that would help >> me identify it! > >We should fix this so it at least spits out a meaningful message. Can >you open a JIRA? > OK SOLR-1113 raised. >> >> Regards Fergus. >> >>> Sorry if this is a FAQ; I suspect it could be. But how do I work >>> around the following:- >>> >>> INFO: [] webapp=/apache-solr-1.4-dev path=/update/extract >>> params={ext.def.fl=text&ext.literal.id=factbook/reference_maps/pdf/ >>> oceania.pdf} status=0 QTime=318 >>> Apr 2, 2009 11:17:46 AM org.apache.solr.common.SolrException log >>> SEVERE: org.apache.commons.fileupload.FileUploadBase >>> $SizeLimitExceededException: the request was rejected because its >>> size (4585774) exceeds the configured maximum (2097152) >>> at org.apache.commons.fileupload.FileUploadBase >>> $FileItemIteratorImpl.(FileUploadBase.java:914) >>> at >>> org >>> .apache >>> .commons >>> .fileupload.FileUploadBase.getItemIterator(FileUploadBase.java:331) >>> at >>> org >>> .apache >>> .commons.fileupload.FileUploadBase.parseRequest(FileUploadBase.java: >>> 349) >>> at >>> org >>> .apache >>> .commons >>> .fileupload >>> .servlet.ServletFileUpload.parseRequest(ServletFileUpload.java:126) >>> at >>> org >>> .apache >>> .solr >>> .servlet >>> .MultipartRequestParser >>> .parseParamsAndFillStreams(SolrRequestParsers.java:343) >>> at >>> org >>> .apache >>> .solr >>> .servlet >>> .StandardRequestParser >>> .parseParamsAndFillStreams(SolrRequestParsers.java:396) >>> at >>> org >>> .apache >>> .solr.servlet.SolrRequestParsers.parse(SolrRequestParsers.java:114) >>> at >>> org >>> .apache >>> .solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java: >>> 217) >>> at >>> org >>> .apache >>> .catalina >>> .core >>> .ApplicationFilterChain >>> .internalDoFilter(ApplicationFilterChain.java:202) >>> at >>> org >>> .apache >>> .catalina >>> .core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java: >>> 173) >>> at >>> org >>> .apache >>> .catalina >>> .core.StandardWrapperValve.invoke(StandardWrapperValve.java:213) >>> at >>> org >>> .apache >>> .catalina >>> .core.StandardContextValve.invoke(StandardContextValve.java:178) >>> at >>> org >>> .apache >>> .catalina.core.StandardHostValve.invoke(StandardHostValve.java:126) >>> at >>> org >>> .apache >>> 
.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:105) >>> >>> Although the PDF is big, it contains very little text; it is a map. >>> >>> "java -jar solr/lib/tika-0.3.jar -g" appears to have no bother >>> with it. >>> >>> Fergus... > >-- >Grant Ingersoll >http://www.lucidimagination.com/ > >Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) >using Solr/Lucene: >http://www.lucidimagination.com/search -- === Fergus McMenemie Email:fer...@twig.me.uk Techmore Ltd Phone:(UK) 07721 376021 Unix/Mac/Intranets Analyst Programmer ===
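On the remote-streaming question raised above: once enableRemoteStreaming="true" is set on the requestParsers element, the extraction handler can be pointed at a file already on the server's disk instead of uploading it, which sidesteps the multipart upload limit altogether. A sketch (host, webapp path and file location are illustrative; the ext.* parameters are the ones used earlier in this thread):

   # no multipart body at all: Solr reads the file straight from local disk
   curl "http://localhost:8080/apache-solr-1.4-dev/update/extract?ext.def.fl=text&ext.literal.id=oceania.pdf&stream.file=/data/factbook/reference_maps/pdf/oceania.pdf"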
Re: indexing txt file
>Hi all, >I'm trying to use solr1.3 and trying to index a text file. I wrote a >schema.xsd and a xml file. Just to make sure I understand things Do you just have one of these text files, containing many reports? Or Do you have many of these text files each containing one report? Also, is the report a single line, that has been wrapped for email? Fergus. > >*The content of my text file is * >#src dstprotook >sportdportpktsbytesflowsfirst >atest >192.168.220.13526.147.238.1466 13283980 >6 463 1 1237333861.4657640001237333861.664701000 > >*schema file is * > > >http://www.w3.org/2001/XMLSchema";> > > > > > >type="xs:string" use="required"/> >use="required"/> >use="required"/> >type="xs:string" use="required"/> >use="required"/> >use="required"/> >type="xs:string" use="required"/> >use="required"/> >use="required"/> >type="xs:string" use="required"/> >use="required"/> > > > > > > > > >*and my xml file is * > > >http://www.w3.org/2001/XMLSchema-instance"; >xsi:noNamespaceSchemaLocation="C:\DOCUME~1\tpham\Desktop\networkTraffic.xsd"> >protocolPortNumber="6" ok="1" sourcePort="32439" destinationPort="80" >packets="6" bytes="463" flows="1" initialTimestamp="1237963861.465764000" >terminationTimestamp="1237963861.664701000"/> >protocolPortNumber="17" ok="1" sourcePort="32439" destinationPort="80" >packets="6" bytes="463" flows="1" initialTimestamp="1237963861.465764000" >terminationTimestamp="1237963861.664701000"/> >protocolPortNumber="6" ok="1" sourcePort="32139" destinationPort="80" >packets="6" bytes="463" flows="1" initialTimestamp="1237963861.465764000" >terminationTimestamp="1237963861.664701000"/> >protocolPortNumber="6" ok="1" sourcePort="32839" destinationPort="80" >packets="6" bytes="463" flows="1" initialTimestamp="1237963861.465764000" >terminationTimestamp="1237963861.664701000"/> >protocolPortNumber="17" ok="1" sourcePort="32839" destinationPort="80" >packets="6" bytes="463" flows="1" initialTimestamp="1237963861.465764000" >terminationTimestamp="1237963861.664701000"/> >protocolPortNumber="17" ok="1" sourcePort="32439" destinationPort="80" >packets="6" bytes="463" flows="1" initialTimestamp="1237963861.465764000" >terminationTimestamp="1237963861.664701000"/> >protocolPortNumber="6" ok="1" sourcePort="36839" destinationPort="80" >packets="6" bytes="463" flows="1" initialTimestamp="1237963861.465764000" >terminationTimestamp="1237963861.664701000"/> >protocolPortNumber="6" ok="1" sourcePort="32839" destinationPort="80" >packets="6" bytes="463" flows="1" initialTimestamp="1237963861.465764000" >terminationTimestamp="1237963861.664701000"/> > > > > >Can someone please show me where do I put these files? I'm aware that the >schema.xsd file goes into the directory conf. What about my xml file, and >txt file? > >Thank you, >Alex > > >On Tue, Apr 14, 2009 at 12:37 AM, Alejandro Gonzalez < >alejandrogonzalezd...@gmail.com> wrote: > >> you should construct the xml containing the fields defined in your >> schema.xml and give them the values from the text files. for example if you >> have an schema defining two fields "title" and "text" you should construct >> an xml with a field "title" and its value and another called "text" >> containing the body of your doc. then you can post it to Solr you have >> deployed and make a commit an it's done. it's possible to construct an xml >> defining more than jus t a doc >> >> >> >> >> "doc1 title" >> "doc1 text" >> >> . >> . >> . >> >> "docn title" >> "docn text" >> >> >> >> >> >> 2009/4/14 Noble Paul ?? 
Â Ë³Ë >> >> > what is the cntent of your text file? >> > Solr does not directly index files >> > --Noble >> > >> > On Tue, Apr 14, 2009 at 3:54 AM, Alex Vu wrote: >> > > Hi all, >> > > >> > > Currently I wrote an xml file and schema.xml file. What is the next >> step >> > to >> > > index a txt file? Where should I put my txt file I want to index? >> > > >> > > thank you, >> > > Alex V. >> > > >> > >> > >> > >> > -- >> > --Noble Paul >> > >> -- === Fergus McMenemie Email:fer...@twig.me.uk Techmore Ltd Phone:(UK) 07721 376021 Unix/Mac/Intranets Analyst Programmer ===
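Following on from the quoted advice about building an <add><doc> style document: once such a file exists it is simply POSTed to the standard update handler and then committed, for instance (host and file name are illustrative):

   # post the add document, then make it visible to searches
   curl "http://localhost:8080/solr/update" -H "Content-Type: text/xml" --data-binary @networkTraffic-add.xml
   curl "http://localhost:8080/solr/update" -H "Content-Type: text/xml" --data-binary "<commit/>"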
Re: [solr-user] Upgrade from 1.2 to 1.3 gives 3x slowdown
>On Apr 2, 2009, at 9:23 AM, Fergus McMenemie wrote:
>
>> Grant,
>>
>>> I should note, however, that the speed difference you are seeing may
>>> not be as pronounced as it appears. If I recall during ApacheCon, I
>>> commented on how long it takes to shutdown your Solr instance when
>>> exiting it. That time it takes is in fact Solr doing the work that
>>> was put off by not committing earlier and having all those deletes
>>> pile up.
>>>
>> I am confused about "work that was put off" vs committing. My script
>> was doing a commit right after the CSV import, and you are right
>> about the massive times required to shut tomcat down. But in my tests
>> the time taken to do the commit was under a second, yet I had to allow
>> 300secs for tomcat shutdown. Also, I don't have any duplicates. So
>> what sort of work was being done at shutdown that was not being done
>> by a commit? Optimise!
>>
>
>The work being done is addressing the deletes, AIUI, but of course
>there are other things happening during shutdown, too.

There are no deletes to do. It was a clean index to begin with and there were no duplicates.

>How long is the shutdown if you do a commit first and then a shutdown?

Still very long, sometimes 300sec. My script always did a commit!

>At any rate, I don't know that there is a satisfying answer to the
>larger issue due to the things like the fsync stuff, which is an
>overall win for Lucene/Solr despite it being slower. Have you
>tried running the tests on other machines (non-Mac?)

Nope, although next week I will have a real "PC" running Vista, so I could try it there. I think we should knock this on the head and move on. I rarely need to index this content and I can take the performance hit, and of course your workaround provides a good speed up.

Regards Fergus.

-- 
===
Fergus McMenemie               Email:fer...@twig.me.uk
Techmore Ltd                   Phone:(UK) 07721 376021
Unix/Mac/Intranets             Analyst Programmer
===