Re: Aliases for fields

2009-08-18 Thread Fergus McMenemie
>What could possibly be a use case for such a need?
>

I would love to see such a feature.

I have a multi core solr setup with each core having utterly 
different content. Each core has its own "custom search app"
that exploits nuances specific to a particular data set. The
fieldnames are chosen as best fits a particular data set.

However I would also like to have one or two general search
features that span all cores. This is a crude, one-size-fits-all
type of search:-

One core has fields:-
author
title
text
Another has
sender
subject
message
Another has
placename
description

I either need to rename all fields within some of the
"custom search apps" to account for the needs of the global
search, or perform lots of copyFields, or construct really
nasty queries.

I currently use the copyFields approach. I think aliases
would allow for far more efficient indexes and clearer code.
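For example, with copyField the crude global search ends up needing
something like the following in each core's schema.xml (a sketch only;
the g_* target names are illustrative, the source names are the fields
listed above):

   <field name="g_author" type="text" indexed="true" stored="false"/>
   <field name="g_title"  type="text" indexed="true" stored="false"/>
   <field name="g_text"   type="text" indexed="true" stored="false"/>

   <!-- core with author/title/text -->
   <copyField source="author" dest="g_author"/>
   <copyField source="title"  dest="g_title"/>
   <copyField source="text"   dest="g_text"/>

   <!-- another core maps sender/subject/message the same way -->
   <copyField source="sender"  dest="g_author"/>
   <copyField source="subject" dest="g_title"/>
   <copyField source="message" dest="g_text"/>

An alias attribute would avoid both the duplicated index data and the
extra schema clutter.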

Regards Fergus.

>Cheers
>Avlesh
>
>2009/8/18 Licinio Fernández Maurelo 
>
>> Hello everybody,
>>
>> can i set an alias for a field? Something like :
>>
>> > stored="true" multiValued="false" termVectors="false"
>> alias="source.date"/>
>>
>> is there any jira issue related?
>>
>> Thx
>>
>> --
>> Lici
>>

-- 

===
Fergus McMenemie   Email:fer...@twig.me.uk
Techmore Ltd   Phone:(UK) 07721 376021

Unix/Mac/Intranets Analyst Programmer
===


Netbeans and Solr : Whac-A-Mole

2009-09-07 Thread Fergus McMenemie
Hello all,

I would appreciate help from somebody who has set up Solr within
NetBeans. I am wanting to do more work with DIH and particularly its
XPathEntityProcessor stuff. I wish to perform the following from
within the IDE:

   ant -Dtestcase=TestXPathRecordReader.java test

I have spent a few hours playing Whac-A-Mole with classpath and source
settings. In the end I got it down to zero flags, but I then added
some test cases and the scanner thing went off and flagged dozens of
files with undefined classes. I removed my change, but the rescan did
not remove the dozens of flagged files.

PS: I am a total netbeans newbie. 

-- 

===
Fergus McMenemie   Email:fer...@twig.me.uk
Techmore Ltd   Phone:(UK) 07721 376021

Unix/Mac/Intranets Analyst Programmer
===


Re: Netbeans and Solr : Whac-A-Mole

2009-09-07 Thread Fergus McMenemie
>We've set-up NetBeans with Solr but we are using command-line for most of
>the stuff except for editing the code.
>
>Does your code build from NetBeans? If not what errors do you see?

The code builds and runs from NetBeans because the underlying build.xml
file is being used. But when you want to run testcases... you are doing
that from the command line? Are you only using the IDE as an editor?

>Regards
>Rajan
>
>On Mon, Sep 7, 2009 at 3:26 PM, Fergus McMenemie  wrote:
>
>> Hello all,
>>
>> I would appreciate help from somebody who has set up Solr within
>> netbeans, I am wanting to do more work with DIH and particularly its
>> XpathEntityProcessor stuff. I wish to preform the following from
>> within the IDE
>>
>>   ant -Dtestcase=TestXPathRecordReader.java test
>>
>> I have spent a few hours playing Whac-A-Mole with classpath and source
>> settings. In the end I got it down to zero flags, but I then added
>> some test cases and the scanner thing then went off and flagged dozens
>> files with undefined classes I removed my change but the rescan did not
>> remove the dozens of flagged files.
>>
>> PS: I am a total netbeans newbie.
>>
>> --
>>
>> ===
>> Fergus McMenemie   
>> Email:fer...@twig.me.uk
>> Techmore Ltd   Phone:(UK) 07721 376021
>>
>> Unix/Mac/Intranets     Analyst Programmer
>> ===
>>

-- 

===
Fergus McMenemie   Email:fer...@twig.me.uk
Techmore Ltd   Phone:(UK) 07721 376021

Unix/Mac/Intranets Analyst Programmer
===


Re: Netbeans and Solr : Whac-A-Mole

2009-09-07 Thread Fergus McMenemie
>This testcase is quite independent of anything in Solr. It is a
>standalone utility and the only dependency is stax.
>discalimer (I run these testcases from Intellij and command line)
>BTW are you using XpathRecordReader outside of DIH?

Noble,

Is there a better way to test and play with XPathRecordReader.java
other than

 ant -Dtestcase=TestXPathRecordReader test

which takes 8 secs to run here? I am not using XPathRecordReader
outside of DIH, but I am looking to see how I would add support for
xpaths such as //a.

Fergus.

>
>On Mon, Sep 7, 2009 at 3:26 PM, Fergus McMenemie wrote:
>> Hello all,
>>
>> I would appreciate help from somebody who has set up Solr within
>> netbeans, I am wanting to do more work with DIH and particularly its
>> XpathEntityProcessor stuff. I wish to preform the following from
>> within the IDE
>>
>>   ant -Dtestcase=TestXPathRecordReader.java test
>>
>> I have spent a few hours playing Whac-A-Mole with classpath and source
>> settings. In the end I got it down to zero flags, but I then added
>> some test cases and the scanner thing then went off and flagged dozens
>> files with undefined classes I removed my change but the rescan did not
>> remove the dozens of flagged files.
>>
>> PS: I am a total netbeans newbie.
>>
>> --
>>
>> ===
>> Fergus McMenemie               Email:fer...@twig.me.uk
>> Techmore Ltd                   Phone:(UK) 07721 376021
>>
>> Unix/Mac/Intranets             Analyst Programmer
>> ===
>>
>
>
>
>-- 
>-----
>Noble Paul | Principal Engineer| AOL | http://aol.com

-- 

===
Fergus McMenemie   Email:fer...@twig.me.uk
Techmore Ltd   Phone:(UK) 07721 376021

Unix/Mac/Intranets Analyst Programmer
===


Re: Netbeans and Solr : Whac-A-Mole

2009-09-07 Thread Fergus McMenemie
>On Mon, Sep 7, 2009 at 5:58 PM, Fergus McMenemie  wrote:
>
>> >This testcase is quite independent of anything in Solr. It is a
>> >standalone utility and the only dependency is stax.
>> >discalimer (I run these testcases from Intellij and command line)
>> >BTW are you using XpathRecordReader outside of DIH?
>>
>> Nobel,
>>
>> Is there a better way to test and play with XPathRecordReader.java
>> other than
>>
>>  ant -Dtestcase=TestXPathRecordReader test
>>
>> Which takes 8secs to run here? I am not using XpathRecordReader
>> outside of DIH, but looking to see how I would add support for
>> xpaths such as //a.
>>
>>
>The target takes a lot of time because it has to go through all the
>test-cases in core and contribs trying to match the value given in
>-Dtestcase.
>
>You could also do ant -Dtestcase=TestXPathRecordReader test-contrib which
>should be a little faster. I run individual test cases directly through IDEA
>which avoids these extra steps.
>
Shalin,

Hmm, 6 seconds. I looked up IDEA and I guess I should be able
to use it for free while working on Solr. Is it easier to
set up, and is the learning curve any gentler?

Regards Fergus.
-- 

===
Fergus McMenemie   Email:fer...@twig.me.uk
Techmore Ltd   Phone:(UK) 07721 376021

Unix/Mac/Intranets Analyst Programmer
===


Re: Specifying multiple documents in DataImportHandler dataConfig

2009-09-08 Thread Fergus McMenemie

You can only have one document tag, and the entities must be nested
within that.

From the wiki: if you issue a simple "/dataimport?command=full-import",
all top-level entities will be processed.
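So your config wants to collapse into something like the sketch below
(the table names are taken from your message; the column-to-field
mappings are illustrative):

   <dataConfig>
     <dataSource driver="..." url="..." user="..." password="..."/>
     <document>
       <entity name="blog_entries" query="select * from blog_entries">
         <field column="id"    name="id"/>
         <field column="title" name="title"/>
       </entity>
       <entity name="mesh_categories" query="select * from mesh_categories">
         <field column="id"   name="id"/>
         <field column="name" name="name"/>
       </entity>
     </document>
   </dataConfig>

Each top-level entity is then processed in turn by a single
full-import.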


>Maybe I should be more clear: I have multiple tables in my DB that I
>need to save to my Solr index. In my app code I have logic to persist
>each table, which maps to an application model to Solr. This is fine.
>I am just trying to speed up indexing time by using DIH instead of
>going through my application. From what I understand of DIH I can
>specify one dataSource element and then a series of document/entity
>sets, for each of my models. But like I said before, DIH only appears
>to want to index the first document declared under the dataSource tag.
>
>-Rupert
>
>On Tue, Sep 8, 2009 at 4:05 PM, Rupert Fiasco wrote:
>> I am using the DataImportHandler with a JDBC datasource. From my
>> understanding of DIH, for each of my "content types" e.g. Blog posts,
>> Mesh Categories, etc I would construct a series of document/entity
>> sets, like
>>
>>
>> [dataConfig with two document elements - the XML was stripped by the
>> mail archive]
>>
>>
>>
>> Solr parses this just fine and allows me to issue a
>> /dataimport?command=full-import and it runs, but it only runs against
>> the "first" document (blog_entries). It doesnt run against the 2nd
>> document (mesh_categories).
>>
>> If I remove the 2 document elements and wrap both entity sets in just
>> one document tag, then both sets get indexed, which seemingly achieves
>> my goal. This just doesnt make sense from my understanding of how DIH
>> works. My 2 content types are indeed separate so they logically
>> represent two document types, not one.
>>
>> Is this correct? What am I missing here?
>>
>> Thanks
>> -Rupert
>>

-- 

===
Fergus McMenemie   Email:fer...@twig.me.uk
Techmore Ltd   Phone:(UK) 07721 376021

Unix/Mac/Intranets Analyst Programmer
===


RE: Extract info from parent node during data import

2009-09-10 Thread Fergus McMenemie
>Hi Paul,
>The forEach="/document/category/item | /document/category/name" didn't work 
>(no categoryname was stored or indexed).
>However forEach="/document/category/item | /document/category" seems to work 
>well. I am not sure why category on its own works, but not category/name...
>But thanks for tip. It wasn't as painful as I thought it would be.
>Venn

Hmmm, I had bother with this. Although each occurrence of
/document/category/item causes a new solr document to be indexed, that
document contained all the fields from the parent element as well.
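For reference, the sort of configuration I was testing looks roughly
like this (a sketch; the URL and xpaths follow the example quoted
below):

   <entity name="x"
           url="http://localhost:9080/data/20090817070752.xml"
           processor="XPathEntityProcessor"
           forEach="/document/category/item | /document/category"
           dataSource="dataSource">
     <field column="category" xpath="/document/category/name" commonField="true"/>
     <field column="id"       xpath="/document/category/item/id"/>
     <field column="author"   xpath="/document/category/item/author"/>
   </entity>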

Did you see this?

>
>> From: noble.p...@corp.aol.com
>> Date: Thu, 10 Sep 2009 09:58:21 +0530
>> Subject: Re: Extract info from parent node during data import
>> To: solr-user@lucene.apache.org
>> 
>> try this
>> 
>> add two xpaths in your forEach
>> 
>> forEach="/document/category/item | /document/category/name"
>> 
>> and add a field as follows
>> 
>> > commonField="true"/>
>> 
>> Please try it out and let me know.
>> 
>> On Thu, Sep 10, 2009 at 7:30 AM, venn hardy  wrote:
>> >
>> > Hello,
>> >
>> >
>> >
>> > I am using SOLR 1.4 (from nighly build) and its URLDataSource in 
>> > conjunction with the XPathEntityProcessor. I have successfully imported 
>> > XML content, but I think I may have found a limitation when it comes to 
>> > the commonField attribute in the DataImportHandler.
>> >
>> >
>> >
>> > Before writing my own parser to read in a whole XML document, I thought 
>> > I'd post the question here (since I got some great advice last time).
>> >
>> >
>> >
>> > The bulk of my content is contained within each  tag. However, each 
>> > item has a parent called  and each category has a name which I 
>> > would like to import. In my forEach loop I specify the 
>> > /document/category/item as the collection of items I am interested in. Is 
>> > there anyway to extract an element from underneath a parent node? To be a 
>> > more more specific (see eg xml below). I would like to index the following:
>> >
>> > - category: Category 1; id: 1; author: Author 1
>> >
>> > - category: Category 1; id: 2; author: Author 2
>> >
>> > - category: Category 2; id: 3; author: Author 3
>> >
>> > - category: Category 2; id: 4; author: Author 4
>> >
>> >
>> >
>> > Any ideas on how I can get to a parent node from within a child during 
>> > data import? If it cant be done, what do you suggest would be the best way 
>> > so I can keep using the DataImportHandler... would XSLT be a good idea to 
>> > 'flatten out' the structure a bit?
>> >
>> >
>> >
>> > Thanks
>> >
>> >
>> >
>> > This is what my XML document looks like:
>> >
>> > 
>> >  
>> >  Category 1
>> >  
>> >   1
>> >   Author 1
>> >  
>> >  
>> >   2
>> >   Author 2
>> >  
>> >  
>> >  
>> >  Category 2
>> >  
>> >   3
>> >   Author 3
>> >  
>> >  
>> >   4
>> >   Author 4
>> >  
>> >  
>> > 
>> >
>> >
>> >
>> > And this is what my dataConfig looks like:
>> > 
>> >  
>> >  
>> >   > > url="http://localhost:9080/data/20090817070752.xml"; 
>> > processor="XPathEntityProcessor" forEach="/document/category/item" 
>> > transformer="DateFormatTransformer" stream="true" dataSource="dataSource">
>> >> > commonField="true" />
>> >
>> >
>> >   
>> >  
>> > 
>> >
>> >
>> >
>> > This is how I have specified my schema
>> > 
>> >   > > required="true" />
>> >   
>> >   
>> > 
>> >
>> > id
>> > id
>> >
>> >
>> >
>> >
>> >
>> >
>> > _
>> > Need a place to rent, buy or share? Let us find your next place for you!
>> > http://clk.atdmt.com/NMN/go/157631292/direct/01/
>> 
>> 
>> 
>> -- 
>> -
>> Noble Paul | Principal Engineer| AOL | http://aol.com
>
>_
>Get Hotmail on your iPhone Find out how here
>http://windowslive.ninemsn.com.au/article.aspx?id=845706

-- 

===
Fergus McMenemie   Email:fer...@twig.me.uk
Techmore Ltd   Phone:(UK) 07721 376021

Unix/Mac/Intranets Analyst Programmer
===


Re: Extract info from parent node during data import

2009-09-11 Thread Fergus McMenemie
e 
>>> >> > /document/category/item as the collection of items I am interested in. 
>>> >> > Is there anyway to extract an element from underneath a parent node? 
>>> >> > To be a more more specific (see eg xml below). I would like to index 
>>> >> > the following:
>>> >> >
>>> >> > - category: Category 1; id: 1; author: Author 1
>>> >> >
>>> >> > - category: Category 1; id: 2; author: Author 2
>>> >> >
>>> >> > - category: Category 2; id: 3; author: Author 3
>>> >> >
>>> >> > - category: Category 2; id: 4; author: Author 4
>>> >> >
>>> >> >
>>> >> >
>>> >> > Any ideas on how I can get to a parent node from within a child during 
>>> >> > data import? If it cant be done, what do you suggest would be the best 
>>> >> > way so I can keep using the DataImportHandler... would XSLT be a good 
>>> >> > idea to 'flatten out' the structure a bit?
>>> >> >
>>> >> >
>>> >> >
>>> >> > Thanks
>>> >> >
>>> >> >
>>> >> >
>>> >> > This is what my XML document looks like:
>>> >> >
>>> >> > 
>>> >> > 
>>> >> > Category 1
>>> >> > 
>>> >> > 1
>>> >> > Author 1
>>> >> > 
>>> >> > 
>>> >> > 2
>>> >> > Author 2
>>> >> > 
>>> >> > 
>>> >> > 
>>> >> > Category 2
>>> >> > 
>>> >> > 3
>>> >> > Author 3
>>> >> > 
>>> >> > 
>>> >> > 4
>>> >> > Author 4
>>> >> > 
>>> >> > 
>>> >> > 
>>> >> >
>>> >> >
>>> >> >
>>> >> > And this is what my dataConfig looks like:
>>> >> > 
>>> >> > 
>>> >> > 
>>> >> > >> >> > url="http://localhost:9080/data/20090817070752.xml"; 
>>> >> > processor="XPathEntityProcessor" forEach="/document/category/item" 
>>> >> > transformer="DateFormatTransformer" stream="true" 
>>> >> > dataSource="dataSource">
>>> >> > >> >> > commonField="true" />
>>> >> > 
>>> >> > 
>>> >> > 
>>> >> > 
>>> >> > 
>>> >> >
>>> >> >
>>> >> >
>>> >> > This is how I have specified my schema
>>> >> > 
>>> >> > >> >> > required="true" />
>>> >> > 
>>> >> > 
>>> >> > 
>>> >> >
>>> >> > id
>>> >> > id
>>> >> >

-- 

===
Fergus McMenemie   Email:fer...@twig.me.uk
Techmore Ltd   Phone:(UK) 07721 376021

Unix/Mac/Intranets Analyst Programmer
===


Re: [DIH] Multiple repeat XPath stmts

2009-09-13 Thread Fergus McMenemie
>I'm trying to import several RSS feeds using DIH and running into a  
>bit of a problem.  Some feeds define a GUID value that I map to my  
>Solr ID, while others don't.  I also have a link field which I fill in  
>with the RSS link field.  For the feeds that don't have the GUID value  
>set, I want to use the link field as the id.  However, if I define the  
>same XPath twice, but map it to two diff. columns I don't get the id  
>value set.
>
>For instance, I want to do:
>schema.xml
>required="true"/>
>
>
>DIH config:
>
>
>
>Because I am consolidating multiple fields, I'm not able to do  
>copyFields, unless of course, I wanted to implement conditional copy  
>fields (only copy if the field is not defined) which I would rather not.
>
>How do I solve this?
>

How about:
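Something along these lines (a rough sketch; I am assuming guid and
link are the RSS fields involved, and the url is whatever your feed
entity already uses):

   <entity name="feed"
           processor="XPathEntityProcessor"
           forEach="/rss/channel/item"
           url="..."
           transformer="TemplateTransformer">
     <field column="link" xpath="/rss/channel/item/link"/>
     <field column="guid" xpath="/rss/channel/item/guid"/>
     <!-- fallback: id starts out as the link value... -->
     <field column="id" template="${feed.link}"/>
     <!-- ...and is overwritten by the guid when one is present -->
     <field column="id" template="${feed.guid}"/>
   </entity>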

The TemplateTransformer does nothing if its source expression is null.
So the first transform assigns the fallback value to id; this is then
overwritten by the GUID if it is defined.

You can now sort of do if-then-else using a combination of the template
and regex transformers. Add a bit of maths to the transformers and I
think we will have a Turing-complete language:-)

fergus.

>Thanks,
>Grant

-- 

===
Fergus McMenemie   Email:fer...@twig.me.uk
Techmore Ltd   Phone:(UK) 07721 376021

Unix/Mac/Intranets Analyst Programmer
===


Re: FileListEntityProcessor and LineEntityProcessor

2009-09-16 Thread Fergus McMenemie
>Hi,
>
>I'm trying to import data from a list of files using the
>FileListEntityProcessor. Here is my import configuration:
>
>  
>  
>baseDir="d:\my\directory\" fileName=".*WRK" recursive="false"
>rootEntity="false">
>  processor="LineEntityProcessor"
>url="${f.fileAbsolutePath}"
>dataSource="fileDataSource"
>transformer="myTransformer">
>  
>
>  
>
>If I have only one file in d:\my\directory\ then everything works correctly.
>If I have multiple files then I get the following exception: 

Sorry, but I don't quite follow this. FileListEntityProcessor and
LineEntityProcessor are somewhat similar in that they both provide a
list of filenames, which the likes of XPathEntityProcessor then open
and parse.

Is the above your complete data-config.xml?

Can you provide more detail on what you are trying to do? ...
You seem to be listing all files matching "d:\my\directory\.*WRK". Do
these WRK files contain lists of files to be indexed?
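For what it is worth, the nesting I would expect, reconstructed loosely
from the fragments quoted above (the inner field mapping is a guess),
is:

   <dataConfig>
     <dataSource name="fileDataSource" type="FileDataSource"/>
     <document>
       <entity name="f" processor="FileListEntityProcessor"
               baseDir="d:\my\directory\" fileName=".*WRK"
               recursive="false" rootEntity="false">
         <entity name="lines" processor="LineEntityProcessor"
                 url="${f.fileAbsolutePath}"
                 dataSource="fileDataSource"
                 transformer="myTransformer">
           <field column="rawLine" name="rawLine"/>
         </entity>
       </entity>
     </document>
   </dataConfig>

That is, the outer entity emits one row per WRK file and the inner
entity then reads each file line by line.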





>Sep 16, 2009 9:48:46 AM org.apache.solr.handler.dataimport.DocBuilder
>buildDocum
>ent
>SEVERE: Exception while processing: f document : null
>org.apache.solr.handler.dataimport.DataImportHandlerException: Problem
>reading f
>rom input Processing Document # 53812
>at
>org.apache.solr.handler.dataimport.LineEntityProcessor.nextRow(LineEn
>tityProcessor.java:112)
>at
>org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(Ent
>ityProcessorWrapper.java:237)
>at
>org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilde
>r.java:348)
>at
>org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilde
>r.java:376)
>at
>org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.j
>ava:224)
>at
>org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java
>:167)
>at
>org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImpo
>rter.java:316)
>at
>org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.j
>ava:376)
>at
>org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.ja
>va:355)
>Caused by: java.io.IOException: Stream closed
>at java.io.BufferedReader.ensureOpen(Unknown Source)
>at java.io.BufferedReader.readLine(Unknown Source)
>at java.io.BufferedReader.readLine(Unknown Source)
>at
>org.apache.solr.handler.dataimport.LineEntityProcessor.nextRow(LineEn
>tityProcessor.java:109)
>... 8 more
>Sep 16, 2009 9:48:46 AM org.apache.solr.handler.dataimport.DataImporter
>doFullIm
>port
>SEVERE: Full Import failed
>org.apache.solr.handler.dataimport.DataImportHandlerException: Problem
>reading f
>rom input Processing Document # 53812
>at
>org.apache.solr.handler.dataimport.LineEntityProcessor.nextRow(LineEn
>tityProcessor.java:112)
>at
>org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(Ent
>ityProcessorWrapper.java:237)
>at
>org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilde
>r.java:348)
>at
>org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilde
>r.java:376)
>at
>org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.j
>ava:224)
>at
>org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java
>:167)
>at
>org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImpo
>rter.java:316)
>at
>org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.j
>ava:376)
>at
>org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.ja
>va:355)
>Caused by: java.io.IOException: Stream closed
>at java.io.BufferedReader.ensureOpen(Unknown Source)
>at java.io.BufferedReader.readLine(Unknown Source)
>at java.io.BufferedReader.readLine(Unknown Source)
>at
>org.apache.solr.handler.dataimport.LineEntityProcessor.nextRow(LineEn
>tityProcessor.java:109)
>... 8 more
>
>
>
>Note that my input files have 53812 lines, which is the same as the document
>number that I'm choking on. Does anyone know what I'm doing wrong?
>
>Thanks,
>
>Wojtek
>-- 
>View this message in context: 
>http://www.nabble.com/FileListEntityProcessor-and-LineEntityProcessor-tp25476443p25476443.html
>Sent from the Solr - User mailing list archive at Nabble.com.

-- 

===
Fergus McMenemie   Email:fer...@twig.me.uk
Techmore Ltd   Phone:(UK) 07721 376021

Unix/Mac/Intranets Analyst Programmer
===


Re: Extract info from parent node during data import (redirect:)

2009-09-17 Thread Fergus McMenemie
JIRA SOLR-1437 created 

  "DIH: Enhance XPathRecordReader to deal with //tagname and other 
improvements."

>Fergus,
>
>Implementing  wildcard (//tagname) is definitely possible. I would love
>to see it working. But if you wish to take a dig at it I shall do
>whatever I can to help.
>
>>What is the use case that makes flow though so useful? 
>We do not know to which forEach xpath a given field is associated with.
>Currently you can clean up the fields using a transformer. There is an
>implicit field '$forEach' which tells you about the xpath tag for each
>record that is emitted.
>
>>The recently added comments in XPathRecordReader are a great help and I
>>was planning to add more. Might this be an issue?
>I would love to have it. Give a patch and I shall commit it.
>XPathRecordReader is a blackbox and AFAIK I am the only one who knows
>it. I would love to have more eyes on that.
>
>>I would like to open a JIRA for improving XPathRecordReader.
>Please go ahead. You can paste the contents of this mail in the list .
>There may be others with similar ideas
>
>Noble.

-- 

===
Fergus McMenemie   Email:fer...@twig.me.uk
Techmore Ltd   Phone:(UK) 07721 376021

Unix/Mac/Intranets Analyst Programmer
===


Number of terms in a SOLR field

2009-09-29 Thread Fergus McMenemie
Hi all,

I am attempting to test some changes I made to my DIH based
indexing process. The changes only affect the way I 
describe my fields in data-config.xml, there should be no
changes to the way the data is indexed or stored.

As a QA check I was wanting to compare the results from
indexing the same data before/after the change. I was looking
for a way of getting counts of terms in each field. I
guess Luke etc. must allow this, but how?

Regards Fergus.


Re: Number of terms in a SOLR field

2009-09-30 Thread Fergus McMenemie

>Fergus McMenemie wrote:
>> Hi all,
>> 
>> I am attempting to test some changes I made to my DIH based
>> indexing process. The changes only affect the way I 
>> describe my fields in data-config.xml, there should be no
>> changes to the way the data is indexed or stored.
>> 
>> As a QA check I was wanting to compare the results from
>> indexing the same data before/after the change. I was looking
>> for a way of getting counts of terms in each field. I 
>> guess Luke etc most allow this but how?
>
>Luke uses brute force approach - it traverses all terms, and counts 
>terms per field. This is easy to implement yourself - just get 
>IndexReader.terms() enumeration and traverse it.
>
Thanks Andrzej,

This is just a one-off QA check. How do I get Luke to display the
terms and counts?

>
>-- 
>Best regards,
>Andrzej Bialecki 

Fergus.  
-- 


Re: Number of terms in a SOLR field

2009-09-30 Thread Fergus McMenemie
>Fergus McMenemie wrote:
>>> Fergus McMenemie wrote:
>>>> Hi all,
>>>>
>>>> I am attempting to test some changes I made to my DIH based
>>>> indexing process. The changes only affect the way I 
>>>> describe my fields in data-config.xml, there should be no
>>>> changes to the way the data is indexed or stored.
>>>>
>>>> As a QA check I was wanting to compare the results from
>>>> indexing the same data before/after the change. I was looking
>>>> for a way of getting counts of terms in each field. I 
>>>> guess Luke etc most allow this but how?
>>> Luke uses brute force approach - it traverses all terms, and counts 
>>> terms per field. This is easy to implement yourself - just get 
>>> IndexReader.terms() enumeration and traverse it.
>>>
>> Thanks Andrzej 
>> 
>> This is just a one off QA check. How do I get Luke to display
>> terms and counts?
>
>1. get Luke 0.9.9
>2. open index with Luke
>3. Look at the Overview panel, you will see the list titled "Available 
>fields and term counts per field".
>
>
Thanks,

That got me going, and I felt a little stupid after stumbling
across http://wiki.apache.org/solr/LukeRequestHandler

Regards Fergus


Re: Query filters/analyzers

2009-10-02 Thread Fergus McMenemie
>On Thu, Oct 1, 2009 at 7:59 PM, Claudio Martella > wrote:
>
>>
>> About the copyField issue in general: as it copies the content to the
>> other field, what is the sense to define analyzers for the destination
>> field? The source is already analyzed so i guess that the RESULT of the
>> analysis is copied there.
>
>
>The copy is done before analysis. The original text is sent to the copyField
>which can choose to do analysis differently from the source field.
>
I have been wondering about this as well. The wiki is not explicit about
what happens. Is this correct:-

"The original text is sent to the copyField, before any configured
analyzers for the originating or destination field are invoked."

If so, I will tweak the wiki!
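In other words, with a schema fragment like this (illustrative field
names and types), it is the raw title text, not the analysed tokens,
that lands in title_exact, which is then analysed according to its own
field type:

   <field name="title"       type="text"   indexed="true" stored="true"/>
   <field name="title_exact" type="string" indexed="true" stored="false"/>

   <copyField source="title" dest="title_exact"/>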

Regds Fergus.
-- 


Re: Error when indexing XML files

2009-10-13 Thread Fergus McMenemie
>Hi, 
>
>I am trying to index XML files using SolrJ. The original XML file contains 
>nested elements. For example, the following is the snippet of the XML file. 
>
>
>  SOMETHING 
>  SOME_OTHER_THING
> 
>
>I have added the elements "name" and "facility" in Schema.xml file to make 
>these elements indexable. I have changed the XML document above to look like - 
>
>
>
> ..
> SOMETHING 
> ..
>
>
>
Can you send us the schema.xml file you created? I suspect that
one of the fields should be multiValued.

-- 
Fergus.


Re: Using DIH's special commands....Help needed

2009-10-15 Thread Fergus McMenemie
Hi,

For example, my data-import.conf has the following. It allows me to
specify a parameter "single=pathname" on the URL used to invoke DIH,
so that a doc can be deleted from the index by, in my case, its
pathname, which is stored in the field fileAbsolutePath.
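Roughly, the relevant entity looks like this (a sketch; the attribute
detail is illustrative, and I am assuming the TemplateTransformer is
allowed to populate the special $deleteDocByQuery column):

   <entity name="x"
           transformer="TemplateTransformer"
           ... >
     <!-- when the request URL carries &single=some/path, emit a delete
          query matching that document (assumption: the special column
          can be filled from a template) -->
     <field column="$deleteDocByQuery"
            template="fileAbsolutePath:${dataimporter.request.single}"/>
     ...
   </entity>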

I feel sure this can be optimised!

Fergus.

>On Thu, Oct 15, 2009 at 6:25 PM, William Pierce wrote:
>
>> Folks:
>>
>> I see in the DIH wiki that there are special commands which according to
>> the wiki
>>
>> "Special commands can be given to DIH by adding certain variables to the
>> row returned by any of the components . "
>>
>> In my use case,  my db contains rows that are marked "PendingDelete".   How
>> do I use the $deleteDocByQuery special command to delete these rows using
>> DIH?In other words,  where/how do I specify this?
>>
>>
>The $deleteDocByQuery is for deleting Solr documents by a Solr query and not
>DB rows.
>
>-- 
>Regards,
>Shalin Shekhar Mangar.

-- 

===
Fergus McMenemie   Email:fer...@twig.me.uk
Techmore Ltd   Phone:(UK) 07721 376021

Unix/Mac/Intranets Analyst Programmer
===


Re: Error when indexing XML files

2009-10-15 Thread Fergus McMenemie
Hi,

Please find the schema file attached. Please let me know what I am doing wrong.

Regards
Chaitali

--- On Wed, 10/14/09, Fergus McMenemie  wrote:


From: Fergus McMenemie 
Subject: Re: Error when indexing XML files
To: solr-user@lucene.apache.org
Date: Wednesday, October 14, 2009, 2:25 AM

>Hi,
>
>I am trying to index XML files using SolrJ. The original XML file contains 
>nested elements. For example, the following is the snippet of the XML file.
>
>
> SOMETHING 
> SOME_OTHER_THING
> 
>
>I have added the elements "name" and "facility" in Schema.xml file to make 
>these elements indexable. I have changed the XML document above to look like -
>
>
>
> ..
> SOMETHING
> ..
>
>
>
Can you send us the Schema.xml file you created? I suspect that
one of the fields should be multivalued.



One or other, perhaps both, of your fields needs to be declared
multiValued.



-- 
Fergus.


Re: Error when indexing XML files

2009-10-15 Thread Fergus McMenemie
>Hi,
>
>Please find the schema file attached. Please let me know what I am doing wrong.
>
>Regards
>Chaitali
>
>--- On Wed, 10/14/09, Fergus McMenemie  wrote:
>
>
>From: Fergus McMenemie 
>Subject: Re: Error when indexing XML files
>To: solr-user@lucene.apache.org
>Date: Wednesday, October 14, 2009, 2:25 AM
>
>>Hi,
>>
>>I am trying to index XML files using SolrJ. The original XML file contains 
>>nested 
>> elements. For example, the following is the snippet of the XML file.
>>
>>
>> SOMETHING 
>> SOME_OTHER_THING
>> 
>>
>>I have added the elements "name" and "facility" in Schema.xml file to make 
>>these 
>>elements indexable. I have changed the XML document above to look like -
>>
>>
>>
>> ..
>> SOMETHING
>> ..
>>
>>
>>
>Can you send us the Schema.xml file you created? I suspect that
>one of the fields should be multivalued.

One or other, perhaps both, of your fields needs to be declared
multiValued, along these lines:
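A sketch, assuming the fields in question are the name and facility
elements from your sample XML and the example schema's text type:

   <field name="name"     type="text" indexed="true" stored="true" multiValued="true"/>
   <field name="facility" type="text" indexed="true" stored="true" multiValued="true"/>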


-- 
Fergus


Re: Question about DIH execution order

2009-11-02 Thread Fergus McMenemie
Bertie,

Not sure what you are trying to do; we need a clearer description of
what "select *" returns and what you want to end up in the index. But
to answer your question: the transformations happen after DIH has
performed the SQL statement. In fact the rows output from the SQL
command are assigned to the DIH fields, and then any transformations
are applied. The examples in
http://wiki.apache.org/solr/DataImportHandler
are quite good.
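For the case below, one way to set things up (a sketch; the table and
column names are from the quoted config) is to leave CourseId untouched
and put the templated value into a separate column, so the raw integer
stays available for the child query:

   <entity name="Course" query="select * from Course"
           transformer="TemplateTransformer">
     <!-- the raw CourseId (1, 2, ...) remains available to the child
          entity; the templated id only exists after the row is fetched -->
     <field column="id" template="Course:${Course.CourseId}"/>
     <entity name="Rating"
             query="select comment from Rating where Rating.CourseId = ${Course.CourseId}">
       <field column="comment" name="comment"/>
     </entity>
   </entity>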

>Hi Noble,
>
>   I tried to understand your suggestions and played different variations
>according to your reply.  But none of them work. Can you explain it in  more
>details?
>   Thanks a lot!
>
>
>
>
>BTW, do you mean your solution as follows?
>
>
>   
>   template="Course:${Course.CourseId}" name="id"/>
> 
>   
> 
>  
> 
>
> But
>   1) There is no TmpCourseId field column.
>   2) Can we put two name CourseId and id in the same map? It seems not.
>
>
>
>
>
>2009/11/1 Noble Paul ?? Â Ë³Ë 
>
>> On Sun, Nov 1, 2009 at 11:59 PM, Bertie Shen 
>> wrote:
>> > Hi folks,
>> >
>> >  I have the following data-config.xml. Is there a way to
>> > let transformation take place after executing SQL "select comment from
>> > Rating where Rating.CourseId = ${Course.CourseId}"?  In MySQL database,
>> > column CourseId in table Course is integer 1, 2, etc;
>> > template transformation will make them like Course:1, Course:2; column
>> > CourseId in table Rating is also integer 1, 2, etc.
>> >
>> >  If transformation happens before executing "select comment from Rating
>> > where Rating.CourseId = ${Course.CourseId}", then there will no match for
>> > the SQL statement execution.
>> >
>> >  
>> > 
>> >  > > column="CourseId" template="Course:${Course.CourseId}" name="id"/>
>> >  
>> >
>> >  
>> >
>> >  
>> >
>>
>> keep the field as follows
>>  > column="TmpCourseId" name="CourseId"
>> template="Course:${Course.CourseId}" name="id"/>
>>
>>
>>
>>
>> --
>> -
>> Noble Paul | Principal Engineer| AOL | http://aol.com
>>

-- 

===
Fergus McMenemie   Email:fer...@twig.me.uk
Techmore Ltd   Phone:(UK) 07721 376021

Unix/Mac/Intranets Analyst Programmer
===


Trying to run solr-1.3.0 under tomcat 5.5.20 on OS X 10.5.5

2008-11-05 Thread Fergus McMenemie
(HostConfig.java:809) 
at org.apache.catalina.startup.HostConfig.deployWARs(HostConfig.java:698) 
at org.apache.catalina.startup.HostConfig.deployApps(HostConfig.java:472) 
at org.apache.catalina.startup.HostConfig.start(HostConfig.java:1122) 
at org.apache.catalina.startup.HostConfig.lifecycleEvent(HostConfig.java:310) 
at 
org.apache.catalina.util.LifecycleSupport.fireLifecycleEvent(LifecycleSupport.java:119)
 
at org.apache.catalina.core.ContainerBase.start(ContainerBase.java:1021) 
at org.apache.catalina.core.StandardHost.start(StandardHost.java:718) 
at org.apache.catalina.core.ContainerBase.start(ContainerBase.java:1013) 
at org.apache.catalina.core.StandardEngine.start(StandardEngine.java:442) 
at org.apache.catalina.core.StandardService.start(StandardService.java:450) 
at org.apache.catalina.core.StandardServer.start(StandardServer.java:709) 
at org.apache.catalina.startup.Catalina.start(Catalina.java:551) 
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) 
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
 
at java.lang.reflect.Method.invoke(Method.java:585) 
at org.apache.catalina.startup.Bootstrap.start(Bootstrap.java:294) 
at org.apache.catalina.startup.Bootstrap.main(Bootstrap.java:432) 

So I guess the solrconfig.xml is being seen! Any help gratefully accepted!


-- 

=======
Fergus McMenemie   Email:[EMAIL PROTECTED]
Techmore Ltd   Phone:(UK) 07721 376021

Unix/Mac/Intranets Analyst Programmer
===


Re: Large Data Set Suggestions

2008-11-05 Thread Fergus McMenemie
>Greetings!
> 
>I've been asked to do some indexing performance testing on Solr 1.3
>using large XML document data sets (10M-60M docs) with DIH versus SolrJ.
>
> 
>Does anyone have any suggestions where I might find a good data set this
>size?  
> 
>I saw the wikipedia dump reference in the DIH wiki, but that is only in
>the 7M+ doc range.
> 
>Any suggestions would be greatly appreciated.
> 
>Thanks,
> 
>Steve

How large should each document be?

I quite often do testing using the geonames_dd_dms_date_20081028
dataset from http://earth-info.nga.mil/gns/html/namefiles.htm. It has
6.6M documents. It is actually a CSV file but it is trivial
to convert to XML.


-- 

=======
Fergus McMenemie   Email:[EMAIL PROTECTED]
Techmore Ltd   Phone:(UK) 07721 376021

Unix/Mac/Intranets Analyst Programmer
===


Trying to run solr-1.3.0 under tomcat 5.5.20 on OS X 10.5.5 (works with 1.2.0)

2008-11-06 Thread Fergus McMenemie
Further to my last message: I downloaded and repeated everything
using Solr 1.2.0. This time everything worked fine! But I have
to confess that my system is running 10.4.11 (Tiger) rather than
Leopard; I do not know if that is significant.

So it seems the instructions for deploying solr version 1.3.0
to tomcat under OS X tiger do not work.  The only change is 
the version of solr.

Any ideas?

At 12:25 + 6/11/08, Fergus McMenemie wrote:
>Hello all,
>
>I downloaded everything and set it up as per the instructions, and while it
>does run under jetty, I can not get it to start under tomcat at all. I get 
>the following errors. This is with solrconfig.xml straight from the tgz file.
>
>HTTP Status 500 - 
>   Severe errors in solr configuration. 
>   Check your log files for more detailed information on what may be wrong. 
>   If you want solr to continue after configuration errors, 
>   change: false in 
> null 
>   - 
>java.lang.RuntimeException: java.lang.NoSuchMethodError: 
>  
> org.apache.lucene.index.IndexReader.open(Lorg/apache/lucene/store/Directory;Z)Lorg/apache/lucene/index/IndexReader;
>  
>at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:960) 
>at org.apache.solr.core.SolrCore.(SolrCore.java:470) 
>at 
>org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:119)
> 
>at org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:69) 
> 
>at 
>org.apache.catalina.core.ApplicationFilterConfig.getFilter(ApplicationFilterConfig.java:223)
>  
>at 
>org.apache.catalina.core.ApplicationFilterConfig.setFilterDef(ApplicationFilterConfig.java:304)
>  
>at 
>org.apache.catalina.core.ApplicationFilterConfig.(ApplicationFilterConfig.java:77)
>  
>at 
>org.apache.catalina.core.StandardContext.filterStart(StandardContext.java:3634)
>  
>at org.apache.catalina.core.StandardContext.start(StandardContext.java:4217)  
>at 
>org.apache.catalina.core.ContainerBase.addChildInternal(ContainerBase.java:759)
>  
>at org.apache.catalina.core.ContainerBase.addChild(ContainerBase.java:739)  
>at org.apache.catalina.core.StandardHost.addChild(StandardHost.java:524)  
>at org.apache.catalina.startup.HostConfig.deployWAR(HostConfig.java:809)  
>at org.apache.catalina.startup.HostConfig.deployWARs(HostConfig.java:698)  
>at org.apache.catalina.startup.HostConfig.deployApps(HostConfig.java:472)  
>at org.apache.catalina.startup.HostConfig.start(HostConfig.java:1122)  
>at org.apache.catalina.startup.HostConfig.lifecycleEvent(HostConfig.java:310)  
>at 
>org.apache.catalina.util.LifecycleSupport.fireLifecycleEvent(LifecycleSupport.java:119)
>  
>at org.apache.catalina.core.ContainerBase.start(ContainerBase.java:1021)  
>at org.apache.catalina.core.StandardHost.start(StandardHost.java:718)  
>at org.apache.catalina.core.ContainerBase.start(ContainerBase.java:1013)  
>at org.apache.catalina.core.StandardEngine.start(StandardEngine.java:442)  
>at org.apache.catalina.core.StandardService.start(StandardService.java:450)  
>at org.apache.catalina.core.StandardServer.start(StandardServer.java:709)  
>at org.apache.catalina.startup.Catalina.start(Catalina.java:551)  
>at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)  
>at 
>sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)  
>at 
>sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>  
>at java.lang.reflect.Method.invoke(Method.java:585)  
>at org.apache.catalina.startup.Bootstrap.start(Bootstrap.java:294)  
>at org.apache.catalina.startup.Bootstrap.main(Bootstrap.java:432).
>
>Now I retried the above with solrconfig.xml  set to 
>false and saw no change. So I was wondering if the config file was not being 
>seen. So I renamed the .solr directory to ./solr-not and retried:-
>
>HTTP Status 500 -
>   Severe errors in solr configuration. 
>   Check your log files for more detailed information on what may be wrong. 
>   If you want solr to continue after configuration errors, 
>   change: false in 
> null 
>   - 
>java.lang.RuntimeException: Can't find resource 'solrconfig.xml' in classpath 
>or 'solr/conf/', cwd=/usr/local 
>at 
>org.apache.solr.core.SolrResourceLoader.openResource(SolrResourceLoader.java:194)
> 
>at 
>org.apache.solr.core.SolrResourceLoader.openConfig(SolrResourceLoader.java:162)
> 
>at org.apache.solr.core.Config.(Config.java:100) 
>at org.apache.solr.core.SolrConfig.(SolrConfig.java:113) 
>at org.apache.solr.core.SolrConfig.(SolrConfig.java:70) 
>at 
>org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:1

Newbe! Trying to run solr-1.3.0 under tomcat. Please help

2008-11-14 Thread Fergus McMenemie
Hello all, 

Further to various messages. I just cannot get solr 1.3 to launch
under OS X with tomcat. Solr 1.2 works fine with tomcat and I am
OK with 1.3 under jetty.


I have tried tomcat-5.5.20 and 5.5.27. I have tried solr
1.3.0 plus the nightly build. I have tried under OS X 10.5
and 10.4 (Leopard and Tiger); all fail as follows. I also
tried cutting and pasting the instructions from:-
http://wiki.apache.org/solr/SolrTomcat

Here is what I see on the browser. When I try to access 
http://localhost:8080/solr


At 14:26 + 14/11/08, Fergus McMenemie wrote:
>HTTP Status 500 - Severe errors in solr configuration. 
> Check your log files for more detailed information on what may be wrong.
> If you want solr to continue after configuration errors, change: 
> false in null 
> - 
>java.lang.RuntimeException: java.lang.NoSuchMethodError: 
>org.apache.lucene.index.IndexReader.open(Lorg/apache/lucene/store/Directory;Z)Lorg/apache/lucene/index/IndexReader;
> at org at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1065)
> at org at org.apache.solr.core.SolrCore.(SolrCore.java:553)
> at org at 
> org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:120)
> at org at 
> org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:69)
> at org at 
> org.apache.catalina.core.ApplicationFilterConfig.getFilter(ApplicationFilterConfig.java:221)
> at org at 
> org.apache.catalina.core.ApplicationFilterConfig.setFilterDef(ApplicationFilterConfig.java:302)
> at org at 
> org.apache.catalina.core.ApplicationFilterConfig.(ApplicationFilterConfig.java:78)
> at org at 
> org.apache.catalina.core.StandardContext.filterStart(StandardContext.java:3635)
> at org at 
> org.apache.catalina.core.StandardContext.start(StandardContext.java:4222)
> at org at 
> org.apache.catalina.core.ContainerBase.addChildInternal(ContainerBase.java:760)
> at org at 
> org.apache.catalina.core.ContainerBase.addChild(ContainerBase.java:740)
> at org at 
> org.apache.catalina.core.StandardHost.addChild(StandardHost.java:544)
> at org at 
> org.apache.catalina.startup.HostConfig.deployDescriptor(HostConfig.java:626)
> at org at 
> org.apache.catalina.startup.HostConfig.deployDescriptors(HostConfig.java:553)
> at org at 
> org.apache.catalina.startup.HostConfig.deployApps(HostConfig.java:488)
> at org at org.apache.catalina.startup.HostConfig.start(HostConfig.java:1150)
> at org at 
> org.apache.catalina.startup.HostConfig.lifecycleEvent(HostConfig.java:311)
> at org at 
> org.apache.catalina.util.LifecycleSupport.fireLifecycleEvent(LifecycleSupport.java:120)
> at org at 
> org.apache.catalina.core.ContainerBase.start(ContainerBase.java:1022)
> at org at org.apache.catalina.core.StandardHost.start(StandardHost.java:736)
> at org at 
> org.apache.catalina.core.ContainerBase.start(ContainerBase.java:1014)
> at org at 
> org.apache.catalina.core.StandardEngine.start(StandardEngine.java:443)
> at org at 
> org.apache.catalina.core.StandardService.start(StandardService.java:448)
> at org at 
> org.apache.catalina.core.StandardServer.start(StandardServer.java:700)
> at org at org.apache.catalina.startup.Catalina.start(Catalina.java:552)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>  at java.lang.reflect.Method.invoke(Method.java:585)
> at org at org.apache.catalina.startup.Bootstrap.start(Bootstrap.java:295)
> at org at org.apache.catalina.startup.Bootstrap.main(Bootstrap.java:433)
>Caused by: java.lang.NoSuchMethodError: 
>org.apache.lucene.index.IndexReader.open(Lorg/apache/lucene/store/Directory;Z)Lorg/apache/lucene/index/IndexReader;
> at org at 
> org.apache.solr.search.SolrIndexSearcher.(SolrIndexSearcher.java:109)
> at org at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1055) ... 
> 30 more 


Here is a dump from tomcat/logs/catalina.out. It suggests there
is something wrong with my solr/home property, however you can
see that earlier on it seemed ok with this property.

At 14:26 + 14/11/08, Fergus McMenemie wrote:
>Nov 14, 2008 4:55:33 AM org.apache.catalina.core.AprLifecycleListener 
>lifecycleEvent INFO:
> The Apache Tomcat Native library which allows optimal performance in 
> production environments was not found on the java.library.path: 
> /usr/local/bin:.:/Library/Java/Extensions:/System/Library/Java/Extensions:/usr/lib/java
>Nov 14, 2008 4:55:33 AM org.apache.coyote.http11.Http11BaseProtocol init INFO: 
>Initializing Coyote HTTP/1.1 on http-8080
>Nov 14, 2008 4:55

Re: Newbe! Trying to run solr-1.3.0 under tomcat. Solved!

2008-11-15 Thread Fergus McMenemie
Erik,

Thanks for "proving" the stuff for me. I started taking my system apart
and was considering a fresh install, when I came across an old lucene jar
file /Library/Java/Extensions/lucene-core-2.3.1.jar which was there after
a bootcamp tutorial! It was on both my tiger and leopard machines as well.

I guess that explains why solr 1.2 worked! But what does jetty do
differently from tomcat?

Regards Fergus.

>To be fair, my first message was about Solr trunk + Tomcat 5.5.27, but  
>I just tried it by pointing to a Solr 1.3.0 official release and it  
>worked fine as well.
>
>   Erik
>
>On Nov 14, 2008, at 12:30 PM, Erik Hatcher wrote:
>
>> Fergus,
>>
>> I just downloaded Tomcat 5.5.27, put a solr.xml file in conf/ 
>> Catalina/localhost with the following:
>>
>>  > debug="0" crossContext="true" >
>> 
>>  
>>
>> And Solr started up just fine and it's admin, etc worked as expected.
>>
>> Oh, and on Mac OS X (of course!), version 10.5.5.
>>
>>  Erik
>>
>> On Nov 14, 2008, at 12:17 PM, Fergus McMenemie wrote:
>>
>>> Hello all,
>>>
>>> Further to various messages. I just cannot get solr 1.3 to launch
>>> under OS X with tomcat. Solr 1.2 works fine with tomcat and I am
>>> OK with 1.3 under jetty.
>>>
>>>
>>> I have tried tomcat-5.5.20 and 5.2.27. I have tried solr
>>> 1.3.0 plus the nightly build. I have tried under OS X 10.5
>>> and 10.4 (leopard and tiger) all fail as follows. I also
>>> tried cutting and pasting the instructions from:-
>>> http://wiki.apache.org/solr/SolrTomcat
>>>
>>> Here is what I see on the browser. When I try to access
>>> http://localhost:8080/solr
>>>
>>>
>>> At 14:26 + 14/11/08, Fergus McMenemie wrote:
>>>> HTTP Status 500 - Severe errors in solr configuration.
>>>> Check your log files for more detailed information on what may be  
>>>> wrong.
>>>> If you want solr to continue after configuration errors, change:  
>>>> false in null
>>>> -
>>>> java.lang.RuntimeException: java.lang.NoSuchMethodError:  
>>>> org.apache.lucene.index.IndexReader.open(Lorg/apache/lucene/store/ 
>>>> Directory;Z)Lorg/apache/lucene/index/IndexReader;
>>>> at org at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java: 
>>>> 1065)
>>>> at org at org.apache.solr.core.SolrCore.(SolrCore.java:553)
>>>> at org at org.apache.solr.core.CoreContainer 
>>>> $Initializer.initialize(CoreContainer.java:120)
>>>> at org at  
>>>> org 
>>>> .apache 
>>>> .solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:69)
>>>> at org at  
>>>> org 
>>>> .apache 
>>>> .catalina 
>>>> .core 
>>>> .ApplicationFilterConfig.getFilter(ApplicationFilterConfig.java:221)
>>>> at org at  
>>>> org 
>>>> .apache 
>>>> .catalina 
>>>> .core 
>>>> .ApplicationFilterConfig.setFilterDef(ApplicationFilterConfig.java: 
>>>> 302)
>>>> at org at  
>>>> org 
>>>> .apache 
>>>> .catalina 
>>>> .core.ApplicationFilterConfig.(ApplicationFilterConfig.java: 
>>>> 78)
>>>> at org at  
>>>> org 
>>>> .apache 
>>>> .catalina.core.StandardContext.filterStart(StandardContext.java: 
>>>> 3635)
>>>> at org at  
>>>> org 
>>>> .apache.catalina.core.StandardContext.start(StandardContext.java: 
>>>> 4222)
>>>> at org at  
>>>> org 
>>>> .apache 
>>>> .catalina.core.ContainerBase.addChildInternal(ContainerBase.java: 
>>>> 760)
>>>> at org at  
>>>> org.apache.catalina.core.ContainerBase.addChild(ContainerBase.java: 
>>>> 740)
>>>> at org at  
>>>> org.apache.catalina.core.StandardHost.addChild(StandardHost.java: 
>>>> 544)
>>>> at org at  
>>>> org 
>>>> .apache 
>>>> .catalina.startup.HostConfig.deployDescriptor(HostConfig.java:626)
>>>> at org at  
>>>> org 
>>>> .apache 
>>>> .catalina.startup.HostConfig.deployDescriptors(HostConfig.java:553)
>>>> at org at  
>>>> org.apache.catalina.startup.HostConfig.deployApps(HostCon

Upgrade from 1.2 to 1.3 gives 3x slowdown

2008-11-19 Thread Fergus McMenemie
Hello,

I have a CSV file with 6M records which took 22min to index with
solr 1.2. I then stopped tomcat, replaced the solr stuff inside
webapps with version 1.3, wiped my index and restarted tomcat.

Indexing the exact same content now takes 69min. My machine has
2GB of RAM and tomcat is running with $JAVA_OPTS -Xmx512M -Xms512M.

Are there any tweaks I can use to get the original index time
back? I read through the release notes and was expecting a
speed-up. I saw the bit about increasing ramBufferSizeMB and set
it to 64MB; it had no effect.
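For reference, the setting I changed lives in the indexDefaults section
of solrconfig.xml, i.e. something like:

   <indexDefaults>
     ...
     <ramBufferSizeMB>64</ramBufferSizeMB>
     ...
   </indexDefaults>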
-- 

===
Fergus McMenemie   Email:[EMAIL PROTECTED]
Techmore Ltd   Phone:(UK) 07721 376021

Unix/Mac/Intranets Analyst Programmer
===


Re: Upgrade from 1.2 to 1.3 gives 3x slowdown

2008-11-20 Thread Fergus McMenemie
Hello Grant, 

>Were you overwriting the existing index or did you also clean out the  
>Solr data directory, too?  In other words, was it a fresh index, or an  
>existing one?  And was that also the case for the 22 minute time?

No, in each case it was a new index. I store the indexes (the "data" dir)
outside the solr home directory. For the moment I rm -rf the index dir
after each edit to the solrconfig.xml or schema.xml file and reindex
from scratch. The relaunch of tomcat recreates the index dir.

>Would it be possible to profile the two instance and see if you notice  
>anything different?
I don't understand this. Do you mean run a profiler against the tomcat
image as indexing takes place, or somehow compare the indexes?

I was thinking of making a short script that replicates the results
and posting it here; would that help?

>
>Thanks,
>Grant
>
>On Nov 19, 2008, at 8:25 AM, Fergus McMenemie wrote:
>
>> Hello,
>>
>> I have a CSV file with 6M records which took 22min to index with
>> solr 1.2. I then stopped tomcat replaced the solr stuff inside
>> webapps with version 1.3, wiped my index and restarted tomcat.
>>
>> Indexing the exact same content now takes 69min. My machine has
>> 2GB of RAM and tomcat is running with $JAVA_OPTS -Xmx512M -Xms512M.
>>
>> Are there any tweaks I can use to get the original index time
>> back. I read through the release notes and was expecting a
>> speed up. I saw the bit about increasing ramBufferSizeMB and set
>> it to 64MB; it had no effect.
>> -- 
>>
>> ===
>> Fergus McMenemie   Email:[EMAIL PROTECTED]
>> Techmore Ltd   Phone:(UK) 07721 376021
>>
>> Unix/Mac/Intranets     Analyst Programmer
>> ===

-- 

===
Fergus McMenemie   Email:[EMAIL PROTECTED]
Techmore Ltd   Phone:(UK) 07721 376021

Unix/Mac/Intranets Analyst Programmer
===


Re: [VOTE] Community Logo Preferences

2008-11-24 Thread Fergus McMenemie
https://issues.apache.org/jira/secure/attachment/12394263/apache_solr_a_blue.jpg


Re: Upgrade from 1.2 to 1.3 gives 3x slowdown + script!

2008-11-26 Thread Fergus McMenemie
Hello Grant, 

Not much good with Java profilers (yet!) so I thought I 
would send a script!

Details... details! Having decided to produce a script to
replicate the 1.2 vs 1.3 speed problem, the required rigor
revealed a lot more.

1) The faster version I have previously referred to as 1.2,
   was actually a "1.3-dev" I had downloaded as part of the
   solr bootcamp class at ApacheCon Europe 2008. The ID
   string in the CHANGES.txt document is:-
   $Id: CHANGES.txt 643465 2008-04-01 16:10:19Z gsingers $
   
2) I did actually download and speed test a version of 1.2 
   from the internet. Its CHANGES.txt id is:-
   $Id: CHANGES.txt 543263 2007-05-31 21:19:02Z yonik $
   Speed wise it was about the same as 1.3 at 64min. It also
   had lots of char set issues and is ignored from now on.
   
3) The version I was planning to use, till I found this,
   speed issue was the "latest" official version:-
   $Id: CHANGES.txt 694377 2008-09-11 17:40:11Z klaas $
   I also verified the behavior with a nightly build.
   $Id: CHANGES.txt 712457 2008-11-09 01:24:11Z koji $
   
Anyway, the following script indexes the content in 22min
for the 1.3-dev version and takes 68min for the newer releases
of 1.3. I took the conf directory from the 1.3-dev (bootcamp)
release and used it to replace the conf directory from the
official 1.3 release. The 3x slow down was still there; it is
not a configuration issue!
=






#! /bin/bash

# This script assumes a /usr/local/tomcat link to whatever version
# of tomcat you have installed. I have "apache-tomcat-5.5.20" Also 
# /usr/local/tomcat/conf/Catalina/localhost contains no solr.xml. 
# All the following was done as root.


# I have a directory /usr/local/ts which contains four versions of solr. The
# "official" 1.2 along with two 1.3 releases and a version of 1.2 or a 1.3beata
# I got while attending a solr bootcamp. I indexed the same content using the
# different versions of solr as follows:
cd /usr/local/ts
if [ "" ] 
then 
   echo "Starting from a-fresh"
   sleep 5 # allow time for me to interrupt!
   cp -Rp apache-solr-bc/example/solr  ./solrbc  #bc = bootcamp
   cp -Rp apache-solr-nightly/example/solr ./solrnightly
   cp -Rp apache-solr-1.3.0/example/solr   ./solr13
   
   # the gaz is regularly updated and its name keeps changing :-) The page
   # http://earth-info.nga.mil/gns/html/namefiles.htm has a link to the latest
   # version.
   curl "http://earth-info.nga.mil/gns/html/geonames_dd_dms_date_20081118.zip"; 
> geonames.zip
   unzip -q geonames.zip
   # delete corrupt blips!
   perl -i -n -e 'print unless  
   ($. > 2128495 and $. < 2128505) or
   ($. > 5944254 and $. < 5944260) 
   ;' geonames_dd_dms_date_20081118.txt
   #following was used to detect bad short records
   #perl -a -F\\t -n -e ' print "line $. is bad with ",scalar(@F)," args\n" if 
(@F != 26);' geonames_dd_dms_date_20081118.txt
   
   # my set of fields and copyfields for the schema.xml
   fields='
   
   
  
  
  
  
  
  
  
  
  
  
  
   '
   copyfields='
  
  
  
   '
   
   # add in my fields and copyfields
   perl -i -p -e "print qq($fields) if s///;"   
solr*/conf/schema.xml
   perl -i -p -e "print qq($copyfields) if s[][];" 
solr*/conf/schema.xml
   # change the unique key and mark the "id" field as not required
   perl -i -p -e "s/id/UNI/i;"
solr*/conf/schema.xml
   perl -i -p -e 's/required="true"//i if m/http://localhost:8080/solr/update/csv?commit=true&stream.file=$file&escape=%00&separator=%09&skip=$skip";

echo "Getting ready to index the data set using solrnightly"
/usr/local/tomcat/bin/shutdown.sh
sleep 15
if [ -n "`ps awxww | grep tomcat | grep -v grep`" ] 
   then 
   echo "Tomcat would not shutdown"
   exit
   fi
rm -r /usr/local/tomcat/webapps/solr*
rm -r /usr/local/tomcat/logs/*.out
rm -r /usr/local/tomcat/work/Catalina/localhost/solr
cp apache-solr-nightly/example/webapps/solr.war /usr/local/tomcat/webapps
rm solr # rm the symbolic link
ln -s solrnightly solr
rm -r solr/data
/usr/local/tomcat/bin/startup.sh
sleep 10 # give solr time to launch and setup
echo "Starting indexing at " `date` " with solrnightly"
time curl 
"http://localhost:8080/solr/update/csv?commit=true&stream.file=$file&escape=%00&separator=%09&skip=$skip";




>On Nov 20, 2008, at 9:18 AM, Fergus McMenemie wrote:
>
>> Hello Grant,
>>
>>> Were you overwriting the existing index or did you also clean out the
>>> Solr data directory, too?  In other words, was it a fresh index, or  
>>> an
>>> existing one?  And was that also the case for the

Re: Upgrade from 1.2 to 1.3 gives 3x slowdown + script!

2008-12-01 Thread Fergus McMenemie
Hello Grant,
>
>Haven't forgotten about you, but I've been traveling and then into  
>some US Holidays here.
Happy Thanksgiving!

>
>To confirm I am understanding, you are seeing a slowdown between 1.3- 
>dev from April and one from September, right?
Yep.

Here are the MD5 hashes:-
fergus: md5 *.war
MD5 (solr-bc.war) = 8d4f95628d6978c959d63d304788bc25
MD5 (solr-nightly.war) = 10281455a66b0035ee1f805496d880da

This is the META-INF/MANIFEST.MF from a recent nightly build. (slow)
  Manifest-Version: 1.0
  Ant-Version: Apache Ant 1.7.0
  Created-By: 1.5.0_06-b05 (Sun Microsystems Inc.)
  Extension-Name: org.apache.solr
  Specification-Title: Apache Solr Search Server
  Specification-Version: 1.3.0.2008.11.13.08.16.12
  Specification-Vendor: The Apache Software Foundation
  Implementation-Title: org.apache.solr
  Implementation-Version: nightly exported - yonik - 2008-11-13 08:16:12
  Implementation-Vendor: The Apache Software Foundation
  X-Compile-Source-JDK: 1.5
  X-Compile-Target-JDK: 1.5

This is the war file we were given on the course
  Manifest-Version: 1.0
  Ant-Version: Apache Ant 1.7.0
  Created-By: 1.5.0_13-121 ("Apple Computer, Inc.")
  Extension-Name: org.apache.solr
  Specification-Title: Apache Solr Search Server
  Specification-Version: 1.2.2008.04.04.08.09.14
  Specification-Vendor: The Apache Software Foundation
  Implementation-Title: org.apache.solr
  Implementation-Version: 1.3-dev exported - erik - 2008-04-04 08:09:14
  Implementation-Vendor: The Apache Software Foundation
  X-Compile-Source-JDK: 1.5
  X-Compile-Target-JDK: 1.5

I have copied both war files to a web site

http://www.twig.me.uk/solr/solr-bc.war (solr 1.3 dev == bootcamp)

http://www.twig.me.uk/solr/solr-nightly.war (nightly)


Regards Fergus.

>Can you produce an MD5 hash of the WAR file or something, such that I  
>can know I have the exact bits.  Better yet, perhaps you can put those  
>files up somewhere where they can be downloaded.
>
>Thanks,
>Grant
>
>On Nov 26, 2008, at 10:54 AM, Fergus McMenemie wrote:
>
>> Hello Grant,
>>
>> Not much good with Java profilers (yet!) so I thought I
>> would send a script!
>>
>> Details... details! Having decided to produce a script to
>> replicate the 1.2 vs 1.3 speed problem. The required rigor
>> revealed a lot more.
>>
>> 1) The faster version I have previously referred to as 1.2,
>>   was actually a "1.3-dev" I had downloaded as part of the
>>   solr bootcamp class at ApacheCon Europe 2008. The ID
>>   string in the CHANGES.txt document is:-
>>   $Id: CHANGES.txt 643465 2008-04-01 16:10:19Z gsingers $
>>
>> 2) I did actually download and speed test a version of 1.2
>>   from the internet. It's CHANGES.txt id is:-
>>   $Id: CHANGES.txt 543263 2007-05-31 21:19:02Z yonik $
>>   Speed wise it was about the same as 1.3 at 64min. It also
>>   had lots of char set issues and is ignored from now on.
>>
>> 3) The version I was planning to use, till I found this,
>>   speed issue was the "latest" official version:-
>>   $Id: CHANGES.txt 694377 2008-09-11 17:40:11Z klaas $
>>   I also verified the behavior with a nightly build.
>>   $Id: CHANGES.txt 712457 2008-11-09 01:24:11Z koji $
>>
>> Anyway, The following script indexes the content in 22min
>> for the 1.3-dev version and takes 68min for the newer releases
>> of 1.3. I took the conf directory from the 1.3dev (bootcamp)
>> release and used it to replace the conf directory from the
>> official 1.3 release. The 3x slow down was still there; it is
>> not a configuration issue!
>> =
>>
>>
>>
>>
>>
>>
>> #! /bin/bash
>>
>> # This script assumes a /usr/local/tomcat link to whatever version
>> # of tomcat you have installed. I have "apache-tomcat-5.5.20" Also
>> # /usr/local/tomcat/conf/Catalina/localhost contains no solr.xml.
>> # All the following was done as root.
>>
>>
>> # I have a directory /usr/local/ts which contains four versions of  
>> solr. The
>> # "official" 1.2 along with two 1.3 releases and a version of 1.2 or  
>> a 1.3beata
>> # I got while attending a solr bootcamp. I indexed the same content  
>> using the
>> # different versions of solr as follows:
>> cd /usr/local/ts
>> if [ "" ]
>> then
>>   echo "Starting from a-fresh"
>>   sleep 5 # allow time for me to interrupt!
>>   cp -Rp apache-solr-bc/example/solr  ./solrbc  #bc = bootcamp
>>   cp -Rp apache-solr-nightly/example/solr ./solrnightly
>>   cp -Rp apache-solr-1.3.0/example/solr   ./solr13
>>
>>   # the gaz is regularly u

Re: Upgrade from 1.2 to 1.3 gives 3x slowdown + script!

2008-12-11 Thread Fergus McMenemie
Yonik

>Another thought I just had - do you have autocommit enabled?
>
No; not as far as I know!

The solrconfig.xml files from the two versions are equivalent as best I can tell,
also they are exactly as provided in the download. The only changes were
made by the attached script and should not affect committing. Finally the
indexing command has commit=true, which I think means do a single commit
at the end of the file?

Regards Fergus.


>A lucene commit is now more expensive because it syncs the files for
>safety.  If you commit frequently, this could definitely cause a
>slowdown.
>
>-Yonik
>
>On Wed, Nov 26, 2008 at 10:54 AM, Fergus McMenemie <[EMAIL PROTECTED]> wrote:
>> Hello Grant,
>>
>> Not much good with Java profilers (yet!) so I thought I
>> would send a script!
>>
>> Details... details! Having decided to produce a script to
>> replicate the 1.2 vs 1.3 speed problem. The required rigor
>> revealed a lot more.
>>
>> 1) The faster version I have previously referred to as 1.2,
>>   was actually a "1.3-dev" I had downloaded as part of the
>>   solr bootcamp class at ApacheCon Europe 2008. The ID
>>   string in the CHANGES.txt document is:-
>>   $Id: CHANGES.txt 643465 2008-04-01 16:10:19Z gsingers $
>>
>> 2) I did actually download and speed test a version of 1.2
>>   from the internet. It's CHANGES.txt id is:-
>>   $Id: CHANGES.txt 543263 2007-05-31 21:19:02Z yonik $
>>   Speed wise it was about the same as 1.3 at 64min. It also
>>   had lots of char set issues and is ignored from now on.
>>
>> 3) The version I was planning to use, till I found this,
>>   speed issue was the "latest" official version:-
>>   $Id: CHANGES.txt 694377 2008-09-11 17:40:11Z klaas $
>>   I also verified the behavior with a nightly build.
>>   $Id: CHANGES.txt 712457 2008-11-09 01:24:11Z koji $
>>
>> Anyway, The following script indexes the content in 22min
>> for the 1.3-dev version and takes 68min for the newer releases
>> of 1.3. I took the conf directory from the 1.3dev (bootcamp)
>> release and used it to replace the conf directory from the
>> official 1.3 release. The 3x slow down was still there; it is
>> not a configuration issue!
>> =
>>
>>
>>
>>
>>
>>
>> #! /bin/bash
>>
>> # This script assumes a /usr/local/tomcat link to whatever version
>> # of tomcat you have installed. I have "apache-tomcat-5.5.20" Also
>> # /usr/local/tomcat/conf/Catalina/localhost contains no solr.xml.
>> # All the following was done as root.
>>
>>
>> # I have a directory /usr/local/ts which contains four versions of solr. The
>> # "official" 1.2 along with two 1.3 releases and a version of 1.2 or a 
>> 1.3beata
>> # I got while attending a solr bootcamp. I indexed the same content using the
>> # different versions of solr as follows:
>> cd /usr/local/ts
>> if [ "" ]
>> then
>>   echo "Starting from a-fresh"
>>   sleep 5 # allow time for me to interrupt!
>>   cp -Rp apache-solr-bc/example/solr  ./solrbc  #bc = bootcamp
>>   cp -Rp apache-solr-nightly/example/solr ./solrnightly
>>   cp -Rp apache-solr-1.3.0/example/solr   ./solr13
>>
>>   # the gaz is regularly updated and its name keeps changing :-) The page
>>   # http://earth-info.nga.mil/gns/html/namefiles.htm has a link to the latest
>>   # version.
>>   curl 
>> "http://earth-info.nga.mil/gns/html/geonames_dd_dms_date_20081118.zip"; > 
>> geonames.zip
>>   unzip -q geonames.zip
>>   # delete corrupt blips!
>>   perl -i -n -e 'print unless
>>   ($. > 2128495 and $. < 2128505) or
>>   ($. > 5944254 and $. < 5944260)
>>   ;' geonames_dd_dms_date_20081118.txt
>>   #following was used to detect bad short records
>>   #perl -a -F\\t -n -e ' print "line $. is bad with ",scalar(@F)," args\n" 
>> if (@F != 26);' geonames_dd_dms_date_20081118.txt
>>
>>   # my set of fields and copyfields for the schema.xml
>>   fields='
>>   
>>  > required="true" />
>>  > stored="true"/>
>>  > stored="true"/>
>>  > stored="true"/>
>>  > stored="true"/>
>>  > stored="true"/>
>>  > stored="true"/>
>>  > stored="true"/>
>>  > stored="true"/>
>>  > stored="true"/>
>>  

correct use of copyFields in schema.xml

2008-12-17 Thread Fergus McMenemie
Hello all,

Reviewing the various examples that come with Solr I can't make up
my mind whether the copyFields element should be nested within the
fields element or not. The http://wiki.apache.org/solr/SchemaXml
documentation makes it clear it should be outside, yet a number
of examples have it nested.

IMHO, being able to nest copyFields inside fields makes for
more self-documenting code!
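
For what it's worth, the layout the wiki describes is roughly the
following (a cut-down sketch; the field names are only illustrative):

   <schema name="example" version="1.1">
     <types>
       <!-- fieldType definitions omitted -->
     </types>
     <fields>
       <field name="title" type="text" indexed="true" stored="true"/>
       <field name="text"  type="text" indexed="true" stored="true"
              multiValued="true"/>
     </fields>
     <!-- copyField sits outside the fields element, directly under schema -->
     <copyField source="title" dest="text"/>
   </schema>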

Regards Fergus
-- 

===
Fergus McMenemie   Email:fer...@twig.me.uk
Techmore Ltd   Phone:(UK) 07721 376021

Unix/Mac/Intranets Analyst Programmer
===


getting DIH to read my XML files

2009-01-13 Thread Fergus McMenemie
Hello,

I am trying to use DIH with FileListEntityProcessor to walk the
disk and read XML documents. I have a dataConfig.xml as follows:-

   

   
   0
   
   
   
   
   
   
   
   
   
   


But when I try and start the walker I get:-

   INFO: [jdocs] REMOVING ALL DOCUMENTS FROM INDEX
   Jan 13, 2009 3:38:11 PM org.apache.solr.core.SolrDeletionPolicy onInit
   INFO: SolrDeletionPolicy.onInit: commits:num=2
   
commit{dir=/Volumes/spare/ts/solrnightlyj/data/index,segFN=segments_1,version=1231861070710,generation=1,filenames=[segments_1]
   
commit{dir=/Volumes/spare/ts/solrnightlyj/data/index,segFN=segments_2,version=1231861070711,generation=2,filenames=[segments_2]
   Jan 13, 2009 3:38:11 PM org.apache.solr.core.SolrDeletionPolicy updateCommits
   INFO: last commit = 1231861070711
   Jan 13, 2009 3:38:11 PM org.apache.solr.handler.dataimport.DocBuilder 
buildDocument
   SEVERE: Exception while processing: jcurrent document : null
   org.apache.solr.handler.dataimport.DataImportHandlerException: No dataSource 
:null available for entity :x Processing Document # 1
   at 
org.apache.solr.handler.dataimport.DataImporter.getDataSourceInstance(DataImporter.java:287)
   at 
org.apache.solr.handler.dataimport.ContextImpl.getDataSource(ContextImpl.java:86)
   at 
org.apache.solr.handler.dataimport.XPathEntityProcessor.init(XPathEntityProcessor.java:78)
   at 
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:243)
   at 
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:309)
   at 
org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:179)
   at 
org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:137)
   at 
org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:337)
   at 
org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:397)
   at 
org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:378)
   Jan 13, 2009 3:38:11 PM org.apache.solr.handler.dataimport.DataImporter 
doFullImport
   SEVERE: Full Import failed
   org.apache.solr.handler.dataimport.DataImportHandlerException: No dataSource 
:null available for entity :x Processing Document # 1
   at 
org.apache.solr.handler.dataimport.DataImporter.getDataSourceInstance(DataImporter.java:287)
   at 
org.apache.solr.handler.dataimport.ContextImpl.getDataSource(ContextImpl.java:86)
   at 
org.apache.solr.handler.dataimport.XPathEntityProcessor.init(XPathEntityProcessor.java:78)
   at 
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:243)
   at 
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:309)
   at 
org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:179)
   at 
org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:137)
   at 
org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:337)
   at 
org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:397)
   at 
org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:378)

Anybody able to point out what I have done wrong?

Regards Fergus.
-- 
===
Fergus McMenemie   Email:fer...@twig.me.uk
Techmore Ltd   Phone:(UK) 07721 376021
Unix/Mac/Intranets Analyst Programmer
===


Re: getting DIH to read my XML files

2009-01-13 Thread Fergus McMenemie
putFactory.java:543)
   at 
com.ctc.wstx.stax.WstxInputFactory.createSR(WstxInputFactory.java:604)
   at 
com.ctc.wstx.stax.WstxInputFactory.createSR(WstxInputFactory.java:660)
   at 
com.ctc.wstx.stax.WstxInputFactory.createXMLStreamReader(WstxInputFactory.java:331)
   at 
org.apache.solr.handler.dataimport.XPathRecordReader.streamRecords(XPathRecordReader.java:81)
   ... 10 more


>
>On Tue, Jan 13, 2009 at 9:28 PM, Fergus McMenemie  wrote:
>
>> Hello,
>>
>> I am trying to use DIH with FileListEntityProcessor to walk the
>> disk and read XML documents. I have a dataConfig.xml as follows:-
>>
>>   
>>
>>   >   processor="FileListEntityProcessor"
>>   fileName=".*xml"
>>   newerThan="'NOW-1000DAYS'"
>>   recursive="true"
>>   rootEntity="false"
>>   dataSource="null"
>>   baseDir="/Volumes/spare/ts/j/groups">
>>   >   processor="XPathEntityProcessor"
>>   url="${jcurrent.fileAbsolutePath}"
>>   stream="false"
>>   forEach="/record"
>>   transformer="DateFormatTransformer">0
>>   
>>   > xpath="/record/metadata/subje...@qualifier='fullTitle']"/>
>>   
>>   > xpath="/record/metadata/subje...@qualifier='publication']"/>
>>   >  xpath="/record/metadata/subje...@qualifier='pubAbbrev']"/>
>>   > xpath="/record/metadata/da...@qualifier='pubDate']"/>
>>
>>   
>>   
>>   
>>
>>
>> But when I try and start the walker I get:-
>>
>>   INFO: [jdocs] REMOVING ALL DOCUMENTS FROM INDEX
>>   Jan 13, 2009 3:38:11 PM org.apache.solr.core.SolrDeletionPolicy onInit
>>   INFO: SolrDeletionPolicy.onInit: commits:num=2
>>
>> commit{dir=/Volumes/spare/ts/solrnightlyj/data/index,segFN=segments_1,version=1231861070710,generation=1,filenames=[segments_1]
>>
>> commit{dir=/Volumes/spare/ts/solrnightlyj/data/index,segFN=segments_2,version=1231861070711,generation=2,filenames=[segments_2]
>>   Jan 13, 2009 3:38:11 PM org.apache.solr.core.SolrDeletionPolicy
>> updateCommits
>>   INFO: last commit = 1231861070711
>>   Jan 13, 2009 3:38:11 PM org.apache.solr.handler.dataimport.DocBuilder
>> buildDocument
>>   SEVERE: Exception while processing: jcurrent document : null
>>   org.apache.solr.handler.dataimport.DataImportHandlerException: No
>> dataSource :null available for entity :x Processing Document # 1
>>   at
>> org.apache.solr.handler.dataimport.DataImporter.getDataSourceInstance(DataImporter.java:287)
>>   at
>> org.apache.solr.handler.dataimport.ContextImpl.getDataSource(ContextImpl.java:86)
>>   at
>> org.apache.solr.handler.dataimport.XPathEntityProcessor.init(XPathEntityProcessor.java:78)
>>   at
>> org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:243)
>>   at
>> org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:309)
>>   at
>> org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:179)
>>   at
>> org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:137)
>>   at
>> org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:337)
>>   at
>> org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:397)
>>   at
>> org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:378)
>>   Jan 13, 2009 3:38:11 PM org.apache.solr.handler.dataimport.DataImporter
>> doFullImport
>>   SEVERE: Full Import failed
>>   org.apache.solr.handler.dataimport.DataImportHandlerException: No
>> dataSource :null available for entity :x Processing Document # 1
>>   at
>> org.apache.solr.handler.dataimport.DataImporter.getDataSourceInstance(DataImporter.java:287)
>>   at
>> org.apache.solr.handler.dataimport.ContextImpl.getDataSource(ContextImpl.java:86)
>>   at
>> org.apache.solr.handler.dataimport.XPathEntityProcessor.init(XPathEntityProcessor.java:78)
>>   at
>> org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:243)
>>   at
>> org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:309)

DIH XPathEntityProcessor fails with docs containing

2009-01-16 Thread Fergus McMenemie
Hello all, as the subject says:
   DIH XPathEntityProcessor fails with docs containing 
   
This is using a solr nightly build from monday.

INFO: Server startup in 3623 ms
Jan 16, 2009 9:54:12 AM org.apache.solr.handler.dataimport.SolrWriter 
readIndexerProperties
INFO: Read dataimport.properties
Jan 16, 2009 9:54:12 AM org.apache.solr.core.SolrCore execute
INFO: [jdocs] webapp=/solr path=/walkj params={command=full-import} status=0 
QTime=13 
Jan 16, 2009 9:54:12 AM org.apache.solr.handler.dataimport.DataImporter 
doFullImport
INFO: Starting Full Import
Jan 16, 2009 9:54:12 AM org.apache.solr.update.DirectUpdateHandler2 deleteAll
INFO: [jdocs] REMOVING ALL DOCUMENTS FROM INDEX
Jan 16, 2009 9:54:12 AM org.apache.solr.core.SolrDeletionPolicy onInit
INFO: SolrDeletionPolicy.onInit: commits:num=2

commit{dir=/Volumes/spare/ts/solrnightlyj/data/index,segFN=segments_c,version=1232026423291,generation=12,filenames=[segments_c,
 _4.fnm, _4.frq, _4.prx, _4.tis, _4.tii, _4.nrm, _4.fdx, _4.fdt]

commit{dir=/Volumes/spare/ts/solrnightlyj/data/index,segFN=segments_d,version=1232026423292,generation=13,filenames=[segments_d]
Jan 16, 2009 9:54:12 AM org.apache.solr.core.SolrDeletionPolicy updateCommits
INFO: last commit = 1232026423292
Jan 16, 2009 9:54:13 AM org.apache.solr.handler.dataimport.DocBuilder 
buildDocument
SEVERE: Exception while processing: jcurrent document : null
org.apache.solr.handler.dataimport.DataImportHandlerException: Parsing failed 
for xml, url:/j/dtd/jxml/data/news/2008/frp70450.xmlrows processed :0 
Processing Document # 1
at 
org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:72)
at 
org.apache.solr.handler.dataimport.XPathEntityProcessor.initQuery(XPathEntityProcessor.java:252)
at 
org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow(XPathEntityProcessor.java:177)
at 
org.apache.solr.handler.dataimport.XPathEntityProcessor.nextRow(XPathEntityProcessor.java:160)
at 
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:313)
at 
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:339)
at 
org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:202)
at 
org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:147)
at 
org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:321)
at 
org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:381)
at 
org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:362)
Caused by: java.lang.RuntimeException: com.ctc.wstx.exc.WstxParsingException: 
(was java.io.FileNotFoundException) /../config/jml-delivery-norm-2.1.dtd (No 
such file or directory)
 at [row,col {unknown-source}]: [3,81]
at 
org.apache.solr.handler.dataimport.XPathRecordReader.streamRecords(XPathRecordReader.java:85)
at 
org.apache.solr.handler.dataimport.XPathEntityProcessor.initQuery(XPathEntityProcessor.java:242)
... 9 more
Caused by: com.ctc.wstx.exc.WstxParsingException: (was 
java.io.FileNotFoundException) /../config/jml-delivery-norm-2.1.dtd (No such 
file or directory)
 at [row,col {unknown-source}]: [3,81]
at 
com.ctc.wstx.sr.StreamScanner.constructWfcException(StreamScanner.java:630)
at com.ctc.wstx.sr.StreamScanner.throwParseError(StreamScanner.java:461)
at 
com.ctc.wstx.sr.ValidatingStreamReader.findDtdExtSubset(ValidatingStreamReader.java:475)
at 
com.ctc.wstx.sr.ValidatingStreamReader.finishDTD(ValidatingStreamReader.java:358)
at 
com.ctc.wstx.sr.BasicStreamReader.skipToken(BasicStreamReader.java:3351)
at 
com.ctc.wstx.sr.BasicStreamReader.nextFromProlog(BasicStreamReader.java:1988)
at com.ctc.wstx.sr.BasicStreamReader.next(BasicStreamReader.java:1069)
at 
org.apache.solr.handler.dataimport.XPathRecordReader$Node.parse(XPathRecordReader.java:141)
at 
org.apache.solr.handler.dataimport.XPathRecordReader$Node.access$000(XPathRecordReader.java:89)
at 
org.apache.solr.handler.dataimport.XPathRecordReader.streamRecords(XPathRecordReader.java:82)
... 10 more
Jan 16, 2009 9:54:13 AM org.apache.solr.handler.dataimport.DataImporter 
doFullImport
SEVERE: Full Import failed

A fragment from the top of the failing document is




http://dtd.j.com/2002/Content/"; id="frp70450"  
urname="record">
  http://www.w3.org/1999/xlink"; xlink:href="" 
urname="metadata" xlink:type="simple">
http://purl.org/dc/elements/1.1/"; 
qualifier="pdate">20080131

The DTD does exist at the specified location. Removing the DOCTYPE directive
fixes everything. I know that use of DOCTYPE is out of fashion, and it does
not exist in our newer documents, however there are lots of older XML docs 
about!

Regards Fergus.
-

Re: Is it just me or multicore default is broken? Can't ping

2009-01-16 Thread Fergus McMenemie
Julian,

This is with the nightly from jan 12.

I am using multi core and playing about with DIH. I can't get its 
interactive development mode to work properly and suspect
that to use it I need to run in single core mode.

I am still developing, so I have nothing set up within the tomcat startup
files; it all depends on the directory I launch tomcat from, which is
/Volumes/spare/ts:-

fergus: ls -al /Volumes/spare/ts 
   total 2657816
   drwxrwxrwx  19 rootfergus646 Jan 14 11:06 .
   drwxrwxr-x  18 rootadmin 680 Jan 13 10:46 ..
   -rw-rw-rw-@  1 fergus  fergus   6148 Jan 16 14:58 .DS_Store
   drwxr-xr-x  16 fergus  fergus544 Apr  8  2008 apache-solr-bc
   drwxr-xr-x@ 15 fergus  fergus510 Jan 14 11:06 apache-solr-nightly
   drwxr-xr-x   3 fergus  fergus102 Jan 13 11:06 solr
   -rw-r--r--@  1 fergus  fergus   57874925 Jan 12 22:31 solr-2009-01-12.tgz
   drwxr-xr-x   8 fergus  fergus272 Dec 16 17:53 solrbc
   drwxr-xr-x   7 fergus  fergus238 Jan 16 12:08 solrnightlyjanes

fergus: ls -al /Volumes/spare/ts/solr
   total 8
   drwxr-xr-x   3 fergus  fergus  102 Jan 13 11:06 .
   drwxrwxrwx  19 rootfergus  646 Jan 14 11:06 ..
   -rw-rw-rw-@  1 fergus  fergus  500 Jan 13 11:07 solr.xml

fergus: more /Volumes/spare/ts/solr/solr.xml
   
   
  
 


 


 
  

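In outline the file is something like this (a sketch rather than the
exact markup; the core names and instanceDir values are as reported by
the core status output further down):

   <solr persistent="false">
     <cores adminPath="/admin/cores">
       <core name="gazetteer" instanceDir="../solrbc/"/>
       <core name="janesdocs" instanceDir="../solrnightlyjanes/"/>
     </cores>
   </solr>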

Here is a fragment from the top of one of my solrconfig.xml files. Note the use
of solr.data.dir.

fergus: more /Volumes/spare/ts/solrnightlyjanes/conf/solrconfig.xml file. 
   
   
 
 
${solr.abortOnConfigurationError:true}
   
 
 ${solr.data.dir:./solr/data}






fergus: get 'http://localhost:8080/solr/admin/cores' | perl -p -e 
's[()][$1\n  ]g;'
  
  
  0
 2
 
 gazetteer
 solr/../solrbc/
 solrbc/data/
 2009-01-16T12:08:56.033Z
 3078174
 6705364
 6705364
 1229202899164
 false
 true
 false
 org.apache.lucene.store.NIOFSDirectory:org.apache.lucene.store.NIOFSDirectory@/Volumes/spare/ts/solrbc/data/index
 2008-12-13T21:39:08Z
 
 
 janesdocs
 solr/../solrnightlyjanes/
 solrnightlyjanes/data/
 2009-01-16T12:08:56.613Z
 3077596
 269
 269
 1232107736664
 true
 true
 false
 org.apache.lucene.store.NIOFSDirectory:org.apache.lucene.store.NIOFSDirectory@/Volumes/spare/ts/solrnightlyjanes/data/index
 2009-01-16T12:57:40Z
 
 
   
 
 
fergus: get 'http://localhost:8080/solr/janesdocs/admin/ping' | perl -p -e 
's[()][$1\n  ]g;'
  
  
  0
 2
 all
 all
 solrpingquery
 standard
 
 
 OK
 
 
fergus: get 'http://localhost:8080/solr/gazetteer/admin/ping' | perl -p -e 
's[()][$1\n  ]g;'
 
 
 0
 2
 all
 all
 solrpingquery
 standard
 
 
 OK
 
 
Hope this helps.


>I gave few new shots today:
>- with jetty and nightly build 16 Jan  - same problem null pointer exception
>- Then I decided I am not using solr multicore but rather tomcat to
>handle this. So I get latets tomcat and again with using 1.3.0 solr.war
>I setup all
>as explained
>http://wiki.apache.org/solr/SolrTomcat#head-024d7e11209030f1dbcac9974e55106abae837ac
>Links again are all smooth for admin and all but I still get 500 on pings :(
>
>Is everyone using solr with single index(core)   ?
>
>Cheers
>
>All setup is smooth, working
>
>Julian Davchev wrote:
>> Hi,
>>
>> I am trying with 1.3.0from 
>> http://apache.cbox.biz/lucene/solr/1.3.0/apache-solr-1.3.0.tgz
>>
>> which I supposed is stable release.
>>
>> Otis Gospodnetic wrote:
>>   
>>> Not sure, I'd have to try it.  But you didn't mention which version of Solr 
>>> you are using.  Nightly build?
>>>
>>>
>>> Otis
>>> --
>>> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>>>
>>>
>>>
>>> - Original Message 
>>>   
>>> 
>>>> From: Julian Davchev 
>>>> To: solr-user@lucene.apache.org
>>>> Sent: Thursday, January 15, 2009 9:53:37 AM
>>>> Subject: Is it just me or multicore default is broken? Can't ping
>>>>
>>>> Hi,
>>>> I am trying to setup multicore solr. So I just download default one with
>>>> jetty...goto example/
>>>> and run
>>>> java -Dsolr.solr.home=multicore -jar start.jar
>>>>
>>>>
>>>> All looks smooth without errors on startup.
>>>> Also can can open admin at
>>>>
>>>> http://localhost:8983/solr/core1/admin/
>>>>
>>>&

Re: getting DIH to read my XML files: solved

2009-01-19 Thread Fergus McMenemie
Shalin, thanks for the pointer.

The following data-config.xml worked. The trick was realising
that EVERY entity tag needs to have its own dataSource; I guess
I had been assuming that it was implicit for certain processors.

The whole thing is confusing in that there is both the dataSource
element(s), which is to all intents and purposes required, and an
optional dataSource attribute of the entity element. If the entity
dataSource attribute is missing it defaults to one of the defined
ones??? Unless you are using FileListEntityProcessor where you have
to explicitly state you are not using a dataSource.

As a newbie I think my lesson learnt is to name every dataSource
element I define and to reference named dataSources from every
entity element I add, except for FileListEntityProcessor where
it has to be set to null.

   

  
  
   0
   
   
   
   
   
   
   
   
   



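In outline the working pattern is: one named FileDataSource, a
FileListEntityProcessor entity with dataSource="null", and a nested
XPathEntityProcessor entity that points back at the named source.
A trimmed-down sketch (the entity names, baseDir and the single field
mapping are only illustrative; the real file maps many more fields):

   <dataConfig>
     <dataSource name="myfilereader" type="FileDataSource"/>
     <document>
       <entity name="jcurrent" processor="FileListEntityProcessor"
               fileName=".*xml" recursive="true" rootEntity="false"
               dataSource="null" baseDir="/Volumes/spare/ts/j/groups">
         <entity name="x" processor="XPathEntityProcessor"
                 dataSource="myfilereader" forEach="/record"
                 url="${jcurrent.fileAbsolutePath}">
           <field column="title" xpath="/record/metadata/title"/>
         </entity>
       </entity>
     </document>
   </dataConfig>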
Regards Fergus.
-- 

===
Fergus McMenemie   Email:fer...@twig.me.uk
Techmore Ltd   Phone:(UK) 07721 376021

Unix/Mac/Intranets Analyst Programmer
===


Cant get HTMLStripTransformer's stripHTML to work in DIH.

2009-01-19 Thread Fergus McMenemie
Hello all,

I have the following DIH data-config.xml file. Adding 
HTMLStripTransformer and the associated stripHTML on the 
para tag seems to have broken things. I am using a nightly 
build from 12-jan-2009

The /record/sect1/para contains HTML sub tags which need
to be discarded. Is my use of stripHTML correct?


 
  
 



   
   
   
   
   
   
   
   

 
  

-- 

===
Fergus McMenemie   Email:fer...@twig.me.uk
Techmore Ltd   Phone:(UK) 07721 376021

Unix/Mac/Intranets Analyst Programmer
===


Re: Cant get HTMLStripTransformer's stripHTML to work in DIH.

2009-01-19 Thread Fergus McMenemie
extRow(XPathEntityProcessor.java:197)
at 
org.apache.solr.handler.dataimport.XPathEntityProcessor.nextRow(XPathEntityProcessor.java:160)
at 
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:313)
at 
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:339)
at 
org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:202)
at 
org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:147)
at 
org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:321)
at 
org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:381)
at 
org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:362)
Caused by: java.lang.NullPointerException
at java.io.StringReader.(StringReader.java:33)
at 
org.apache.solr.handler.dataimport.HTMLStripTransformer.stripHTML(HTMLStripTransformer.java:71)
at 
org.apache.solr.handler.dataimport.HTMLStripTransformer.transformRow(HTMLStripTransformer.java:54)
at 
org.apache.solr.handler.dataimport.EntityProcessorBase.applyTransformer(EntityProcessorBase.java:187)
... 9 more
Jan 19, 2009 11:14:06 AM org.apache.solr.update.DirectUpdateHandler2 rollback
INFO: start rollback
Jan 19, 2009 11:14:06 AM org.apache.solr.update.DirectUpdateHandler2 rollback
INFO: end_rollback


>On Mon, Jan 19, 2009 at 4:14 PM, Fergus McMenemie  wrote:
>
>> Hello all,
>>
>> I have the following DIH data-config.xml file. Adding
>> HTMLStripTransformer and the associated stripHTML on the
>> para tag seems to have broke things. I am using a nightly
>> build from 12-jan-2009
>>
>> The /record/sect1/para contains HTML sub tags which need
>> to be discarded. Is my use of stripHTML correct?
>>
>> 
>>  
>>  
>> >processor="FileListEntityProcessor"
>>fileName=".*xml"
>>newerThan="'NOW-1000DAYS'"
>>recursive="true"
>>rootEntity="false"
>>dataSource="null"
>>baseDir="/Volumes/spare/ts/jxml/data/news/groups">
>>
>>>   dataSource="myfilereader"
>>   processor="XPathEntityProcessor"
>>   url="${jcurrent.fileAbsolutePath}"
>>   stream="false"
>>   forEach="/record"
>>
>> transformer="DateFormatTransformer,TemplateTransformer,RegexTransformer,HTMLStripTransformer">
>>
>>   > template="${jcurrent.fileAbsolutePath}" />
>>   > replaceWith="$1" sourceColName="fileAbsePath"/>
>>   
>>   > stripHTML="true" />
>>   >  xpath="/record/metadata/subje...@qualifier='fullTitle']"   />
>>   >  xpath="/record/metadata/subje...@qualifier='publication']" />
>>   >  xpath="/record/metadata/da...@qualifier='pubDate']"
>> dateTimeFormat="MMdd"   />
>>   
>>
>> 
>>  
>>
>> --
>>
>> ===
>> Fergus McMenemie   
>> Email:fer...@twig.me.uk
>> Techmore Ltd   Phone:(UK) 07721 376021
>>
>> Unix/Mac/Intranets Analyst Programmer
>> ===
>>
>
>
>
>-- 
>Regards,
>Shalin Shekhar Mangar.

-- 

===
Fergus McMenemie   Email:fer...@twig.me.uk
Techmore Ltd   Phone:(UK) 07721 376021

Unix/Mac/Intranets Analyst Programmer
===


Re: Cant get HTMLStripTransformer's stripHTML to work in DIH.

2009-01-19 Thread Fergus McMenemie
AM org.apache.solr.handler.dataimport.DataImporter 
>doFullImport
>SEVERE: Full Import failed
>org.apache.solr.handler.dataimport.DataImportHandlerException: 
>java.lang.NullPointerException
>   at 
> org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:64)
>   at 
> org.apache.solr.handler.dataimport.EntityProcessorBase.applyTransformer(EntityProcessorBase.java:203)
>   at 
> org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow(XPathEntityProcessor.java:197)
>   at 
> org.apache.solr.handler.dataimport.XPathEntityProcessor.nextRow(XPathEntityProcessor.java:160)
>   at 
> org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:313)
>   at 
> org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:339)
>   at 
> org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:202)
>   at 
> org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:147)
>   at 
> org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:321)
>   at 
> org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:381)
>   at 
> org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:362)
>Caused by: java.lang.NullPointerException
>   at java.io.StringReader.(StringReader.java:33)
>   at 
> org.apache.solr.handler.dataimport.HTMLStripTransformer.stripHTML(HTMLStripTransformer.java:71)
>   at 
> org.apache.solr.handler.dataimport.HTMLStripTransformer.transformRow(HTMLStripTransformer.java:54)
>   at 
> org.apache.solr.handler.dataimport.EntityProcessorBase.applyTransformer(EntityProcessorBase.java:187)
>   ... 9 more
>Jan 19, 2009 11:14:06 AM org.apache.solr.update.DirectUpdateHandler2 rollback
>INFO: start rollback
>Jan 19, 2009 11:14:06 AM org.apache.solr.update.DirectUpdateHandler2 rollback
>INFO: end_rollback
>
>
>>On Mon, Jan 19, 2009 at 4:14 PM, Fergus McMenemie  wrote:
>>
>>> Hello all,
>>>
>>> I have the following DIH data-config.xml file. Adding
>>> HTMLStripTransformer and the associated stripHTML on the
>>> para tag seems to have broke things. I am using a nightly
>>> build from 12-jan-2009
>>>
>>> The /record/sect1/para contains HTML sub tags which need
>>> to be discarded. Is my use of stripHTML correct?
>>>
>>> 
>>>  
>>>  
>>> >>processor="FileListEntityProcessor"
>>>fileName=".*xml"
>>>newerThan="'NOW-1000DAYS'"
>>>recursive="true"
>>>rootEntity="false"
>>>dataSource="null"
>>>baseDir="/Volumes/spare/ts/jxml/data/news/groups">
>>>
>>>>>   dataSource="myfilereader"
>>>   processor="XPathEntityProcessor"
>>>   url="${jcurrent.fileAbsolutePath}"
>>>   stream="false"
>>>   forEach="/record"
>>>
>>> transformer="DateFormatTransformer,TemplateTransformer,RegexTransformer,HTMLStripTransformer">
>>>
>>>   >> template="${jcurrent.fileAbsolutePath}" />
>>>   >> replaceWith="$1" sourceColName="fileAbsePath"/>
>>>   
>>>   >> stripHTML="true" />
>>>   >>  xpath="/record/metadata/subje...@qualifier='fullTitle']"   />
>>>       >>  xpath="/record/metadata/subje...@qualifier='publication']" />
>>>   >>  xpath="/record/metadata/da...@qualifier='pubDate']"
>>> dateTimeFormat="MMdd"   />
>>>   
>>>
>>> 
>>>  
>>>
>>> --
>>>
>>> ===
>>> Fergus McMenemie   
>>> Email:fer...@twig.me.uk
>>> Techmore Ltd   Phone:(UK) 07721 376021
>>>
>>> Unix/Mac/Intranets Analyst Programmer
>>> ===
>>>
>>
>>
>>
>>-- 
>>Regards,
>>Shalin Shekhar Mangar.
>
>-- 
>
>===
>Fergus McMenemie   Email:fer...@twig.me.uk
>Techmore Ltd   Phone:(UK) 07721 376021
>
>Unix/Mac/Intranets Analyst Programmer
>===

-- 

===
Fergus McMenemie   Email:fer...@twig.me.uk
Techmore Ltd   Phone:(UK) 07721 376021

Unix/Mac/Intranets Analyst Programmer
===


Re: Cant get HTMLStripTransformer's stripHTML to work in DIH.

2009-01-21 Thread Fergus McMenemie
(DataImporter.java:321)
at 
org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:381)
at 
org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:362)
Caused by: java.lang.RuntimeException: java.util.NoSuchElementException
at 
org.apache.solr.handler.dataimport.XPathRecordReader.streamRecords(XPathRecordReader.java:85)
at 
org.apache.solr.handler.dataimport.XPathEntityProcessor.initQuery(XPathEntityProcessor.java:242)
... 9 more
Caused by: java.util.NoSuchElementException
at com.ctc.wstx.sr.BasicStreamReader.next(BasicStreamReader.java:1083)
at 
org.apache.solr.handler.dataimport.XPathRecordReader$Node.parse(XPathRecordReader.java:141)
at 
org.apache.solr.handler.dataimport.XPathRecordReader$Node.parse(XPathRecordReader.java:174)
at 
org.apache.solr.handler.dataimport.XPathRecordReader$Node.access$000(XPathRecordReader.java:89)
at 
org.apache.solr.handler.dataimport.XPathRecordReader.streamRecords(XPathRecordReader.java:82)
... 10 more
Jan 21, 2009 12:07:40 PM org.apache.solr.update.DirectUpdateHandler2 rollback
INFO: start rollback



>Ah, it needs a null check for multi valued fields. I've committed a fix to
>trunk. The next nightly build should have it. You can checkout and build
>from the trunk if need this immediately.
>
>On Mon, Jan 19, 2009 at 7:02 PM, Fergus McMenemie  wrote:
>
>> Hmmm,
>>
>> Just to clarify I retested the thing using the nightly as of today
>> 18-jan-2009. The problem is still there and this traceback is from
>> that nightly.
>>
>> >>This looks fine. Can you post the stack trace?
>> >>
>> >Yep, here is the juicy bit. Let me know if you need more.
>> >
>> >Jan 19, 2009 11:08:03 AM org.apache.catalina.startup.Catalina start
>> >INFO: Server startup in 2390 ms
>> >Jan 19, 2009 11:14:06 AM org.apache.solr.core.SolrCore execute
>> >INFO: [janesdocs] webapp=/solr path=/dataimport
>> params={command=full-import} status=0 QTime=12
>> >Jan 19, 2009 11:14:06 AM org.apache.solr.handler.dataimport.SolrWriter
>> readIndexerProperties
>> >INFO: Read dataimport.properties
>> >Jan 19, 2009 11:14:06 AM org.apache.solr.handler.dataimport.DataImporter
>> doFullImport
>> >INFO: Starting Full Import
>> >Jan 19, 2009 11:14:06 AM org.apache.solr.update.DirectUpdateHandler2
>> deleteAll
>> >INFO: [janesdocs] REMOVING ALL DOCUMENTS FROM INDEX
>> >Jan 19, 2009 11:14:06 AM org.apache.solr.core.SolrDeletionPolicy onInit
>> >INFO: SolrDeletionPolicy.onInit: commits:num=2
>> >
>> commit{dir=/Volumes/spare/ts/solrnightlyjanes/data/index,segFN=segments_1,version=1232363283058,generation=1,filenames=[segments_1]
>> >
>> commit{dir=/Volumes/spare/ts/solrnightlyjanes/data/index,segFN=segments_2,version=1232363283059,generation=2,filenames=[segments_2]
>> >Jan 19, 2009 11:14:06 AM org.apache.solr.core.SolrDeletionPolicy
>> updateCommits
>> >INFO: last commit = 1232363283059
>> >Jan 19, 2009 11:14:06 AM
>> org.apache.solr.handler.dataimport.EntityProcessorBase applyTransformer
>> >WARNING: transformer threw error
>> >java.lang.NullPointerException
>> >   at java.io.StringReader.(StringReader.java:33)
>> >   at
>> org.apache.solr.handler.dataimport.HTMLStripTransformer.stripHTML(HTMLStripTransformer.java:71)
>> >   at
>> org.apache.solr.handler.dataimport.HTMLStripTransformer.transformRow(HTMLStripTransformer.java:54)
>> >   at
>> org.apache.solr.handler.dataimport.EntityProcessorBase.applyTransformer(EntityProcessorBase.java:187)
>> >   at
>> org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow(XPathEntityProcessor.java:197)
>> >   at
>> org.apache.solr.handler.dataimport.XPathEntityProcessor.nextRow(XPathEntityProcessor.java:160)
>> >   at
>> org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:313)
>> >   at
>> org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:339)
>> >   at
>> org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:202)
>> >   at
>> org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:147)
>> >   at
>> org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:321)
>> >   at
>> org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:381)
>> >   at
>> org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:362)
>> >Jan 19, 2009 11:14:06 AM org.apache.solr.handler.dataimport.DocBuilder
&

Re: DIH XPathEntityProcessor fails with docs containing

2009-01-21 Thread Fergus McMenemie
   at 
> org.apache.solr.handler.dataimport.XPathRecordReader.streamRecords(XPathRecordReader.java:82)
>   ... 10 more
>Jan 16, 2009 9:54:13 AM org.apache.solr.handler.dataimport.DataImporter 
>doFullImport
>SEVERE: Full Import failed
>
>A fragment from the top of the failing document is
>
>
>href="../../../../config/support/j-deliver.xsl"?>
>
>http://dtd.j.com/2002/Content/"; id="frp70450"  
>urname="record">
>  http://www.w3.org/1999/xlink"; xlink:href="" 
> urname="metadata" xlink:type="simple">
>    http://purl.org/dc/elements/1.1/"; 
> qualifier="pdate">20080131
>
>The DTD does exist at the specified location. Removing the DOCTYPE directive
>fixes everything. I know that use of DOCTYPE is out of fashion, and it does
>not exist in our newer documents, however there are lots of older XML docs 
>about!

-- 

===
Fergus McMenemie   Email:fer...@twig.me.uk
Techmore Ltd   Phone:(UK) 07721 376021

Unix/Mac/Intranets Analyst Programmer
===


Re: Cant get HTMLStripTransformer's stripHTML to work in DIH.

2009-01-21 Thread Fergus McMenemie
>Hi Fergus,
>
>It seems a field it is expecting is missing from the XML.

You mean there is some field in the document we are indexing
that is missing?

>
>sourceColName="*fileAbsePath*"/>
>
>I guess "fileAbsePath" is a typo? Can you check if that is the cause?
Well spotted. I had made a mess of sanitizing the config file I sent
to you. I will in future make sure the stuff I am messing with matches
what I send to the list. However there is no typo in the underlying file;
at least not on that line:-) 


>
>
>On Wed, Jan 21, 2009 at 5:40 PM, Fergus McMenemie  wrote:
>
>> Shalin
>>
>> Downloaded nightly for 21jan and tried DIH again. Its better but
>> still broken. Dozens of embeded tags are stripped from documents
>> but it now fails every few documents for no reason I can see. Manually
>> removing embeded tags causes a given problem document to be indexed,
>> only to have a it fail on one of the next few documents. I think the
>> problem is still in stripHTML
>>
>> Here is the traceback.
>>
>> Jan 21, 2009 12:06:53 PM org.apache.catalina.startup.Catalina start
>> INFO: Server startup in 3377 ms
>> Jan 21, 2009 12:07:39 PM org.apache.solr.handler.dataimport.SolrWriter
>> readIndexerProperties
>> INFO: Read dataimport.properties
>> Jan 21, 2009 12:07:39 PM org.apache.solr.core.SolrCore execute
>> INFO: [fdocs] webapp=/solr path=/dataimport params={command=full-import}
>> status=0 QTime=13
>> Jan 21, 2009 12:07:39 PM org.apache.solr.handler.dataimport.DataImporter
>> doFullImport
>> INFO: Starting Full Import
>> Jan 21, 2009 12:07:39 PM org.apache.solr.update.DirectUpdateHandler2
>> deleteAll
>> INFO: [fdocs] REMOVING ALL DOCUMENTS FROM INDEX
>> Jan 21, 2009 12:07:39 PM org.apache.solr.core.SolrDeletionPolicy onInit
>> INFO: SolrDeletionPolicy.onInit: commits:num=2
>>
>>  
>> commit{dir=/Volumes/spare/ts/solrnightlyf/data/index,segFN=segments_1,version=1232539612130,generation=1,filenames=[segments_1]
>>
>>  
>> commit{dir=/Volumes/spare/ts/solrnightlyf/data/index,segFN=segments_2,version=1232539612131,generation=2,filenames=[segments_2]
>> Jan 21, 2009 12:07:39 PM org.apache.solr.core.SolrDeletionPolicy
>> updateCommits
>> INFO: last commit = 1232539612131
>> Jan 21, 2009 12:07:40 PM org.apache.solr.handler.dataimport.DocBuilder
>> buildDocument
>> SEVERE: Exception while processing: jc document : null
>> org.apache.solr.handler.dataimport.DataImportHandlerException: Parsing
>> failed for xml, url:/Volumes/spare/ts/ftic/groups/j0036.xmlrows processed :0
>> Processing Document # 9
>>at
>> org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:72)
>>at
>> org.apache.solr.handler.dataimport.XPathEntityProcessor.initQuery(XPathEntityProcessor.java:252)
>>at
>> org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow(XPathEntityProcessor.java:177)
>> at
>> org.apache.solr.handler.dataimport.XPathEntityProcessor.nextRow(XPathEntityProcessor.java:160)
>>at
>> org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:313)
>>at
>> org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:339)
>>at
>> org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:202)
>>at
>> org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:147)
>>at
>> org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:321)
>>at
>> org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:381)
>>at
>> org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:362)
>> Caused by: java.lang.RuntimeException: java.util.NoSuchElementException
>>at
>> org.apache.solr.handler.dataimport.XPathRecordReader.streamRecords(XPathRecordReader.java:85)
>>at
>> org.apache.solr.handler.dataimport.XPathEntityProcessor.initQuery(XPathEntityProcessor.java:242)
>>... 9 more
>> Caused by: java.util.NoSuchElementException
>>at
>> com.ctc.wstx.sr.BasicStreamReader.next(BasicStreamReader.java:1083)
>>at
>> org.apache.solr.handler.dataimport.XPathRecordReader$Node.parse(XPathRecordReader.java:141)
>>at
>> org.apache.solr.handler.dataimport.XPathRecordReader$Node.parse(XPathRecordReader.java:174)
>>at
>> org.apache.solr.handler.dataimport.XPathRecordReader$Node.access$000(XPathRecordReader.java:89)
>>at
>> or

Re: DIH XPathEntityProcessor fails with docs containing

2009-01-23 Thread Fergus McMenemie
Seems to work fine on this morning's 23-jan-2009 nightly.

Thanks very much.



>On Wed, Jan 21, 2009 at 6:05 PM, Fergus McMenemie  wrote:
>
>>
>> After looking looking at http://issues.apache.org/jira/browse/SOLR-964,
>> where
>> it seems this issue has been addressed, I had another go at indexing
>> documents
>> containing DOCTYPE. It failed as follows.
>>
>>
>That patch has not been committed to the trunk yet. I'll take it up.
>
>-- 
>Regards,
>Shalin Shekhar Mangar.

-- 

===
Fergus McMenemie   Email:fer...@twig.me.uk
Techmore Ltd   Phone:(UK) 07721 376021

Unix/Mac/Intranets Analyst Programmer
===


Re: How to make Relationships work for Multi-valued Index Fields?

2009-01-24 Thread Fergus McMenemie
Hello,

I am also a newbie and was wanting to do almost the exact same thing.
I was planning on doing the equivalent of:-




  
 
  
***change**
 
  
  
  
  


  



ID is no longer unique within Solr; there would be multiple "documents"
with a given ID, one for each address. You can then search on ID and get
the three addresses, and you can also search on an address more sensibly.
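
In other words each address goes in as its own Solr document, along
the lines of (a sketch; the field names are invented):

   <add>
     <doc>
       <field name="id">1234</field>
       <field name="address_type">Home</field>
       <field name="address_street">XYZ1</field>
     </doc>
     <doc>
       <field name="id">1234</field>
       <field name="address_type">Office</field>
       <field name="address_street">XY2</field>
     </doc>
   </add>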

I have not been able to try this yet as other issues are still to be
dealt with.

Comments?

>Hi
>I may be completely off on this being new to SOLR but I am not sure  
>how to index related groups of fields in a document and preserver  
>their 'grouping'.   I  would appreciate any help on this.Detailed  
>description of the problem below.
>
>I am trying to index an entity that can have multiple occurrences in  
>the same document - e.g. Address.  The address could be Shipping,  
>Home, Office etc.   Each address element has multiple values in it  
>like street, state etc.Thus each address element is a group with  
>the state and street in one address element being related to each other.
>
>It looks like this in my source xml
>
>
>
>
>
>
>
>
>I have setup my DIH to treat these as entities as below
>
>
>
>
>baseDir="***"
>  fileName=".*xml"
>  rootEntity="false"
>  dataSource="null" >
> name="record"
>  processor="XPathEntityProcessor"
>  stream="false"
>  forEach="/record"
>url="${f.fileAbsolutePath}">
> 
>
> 
>name="record_adr"
>processor="XPathEntityProcessor"
>stream="false"
>forEach="/record/address"
>url="${f.fileAbsolutePath}">
>  
> xpath="/record/address//@state" />
>  
>   
>
>  
>
>
>
>
>The problem is as follows.  DIH seems to treat these as entities but  
>solr seems to flatten them out on indexing to fields in a document  
>(losing the entity part).
>
>So when I search for the an ID - in the response all the street fields  
>are bunched to-gather, followed by all the state fields type etc.   
>Thus I can't associate which street address corresponds to which  
>address type in the response.
>
>What seems harder is this - say I need to query on 'Street' = XYZ1 and  
>type="Office".  This should NOT return a document since the street for  
>the office address is "XY2" and not "XYZ1".  However when I query for  
>address_state:"XYZ1" and address_type:"Office" I get back this document.
>
>The problem seems to be that while DIH allows 'entities' within a  
>document  the SOLR schema does not preserve them - it 'flattens' all  
>of them out as indices for the document.
>
>I could work around the problem by creating SOLR fields like  
>"home_address_street" and "office_address_street" and do some xpath  
>mapping.  However I don't want to do it as we can have multiple  
>'other' addresses.  Also I have other fields whose type is not easily  
>distinguished like address.
>
>As I mentioned being new to SOLR I might have completely goofed on a  
>way to set it up - much appreciate any direction on it. I am using  
>SOLR 1.3
>
>Regards,
>Guna

-- 

===
Fergus McMenemie   Email:fer...@twig.me.uk
Techmore Ltd   Phone:(UK) 07721 376021

Unix/Mac/Intranets Analyst Programmer
===


DIH FileListEntityProcessor recursion and fileName clash

2009-02-01 Thread Fergus McMenemie
Hello

I have been trying to find out why DIH in FileListEntityProcessor
mode did not appear to be recursing into subdirectories. Going through
FileListEntityProcessor.java I eventually tumbled to the fact that my
filename filter setting from data-config.xml also applied to directory
names.



Now, I feel that the fileName filter should be applied to files fed
into the parser; it should not be applied to the directory names we are
recursing through. I bodged the code as follows to adjust the behavior
so that the "fileName" and "excludes" attributes of "entity" only
apply to filenames and not directory names.

It now recurses through my directory tree, only indexing the appropriate
files! I think the new behavior is more standard.

Is this change valid?

Regards Fergus.

--- 
/Volumes/spare/ts/apache-solr-nightlyjan23/contrib/dataimporthandler/src/main/java/org/apache/solr/handler/dataimport/FileListEntityProcessor.java
  2009-02-01 18:19:38.0 +
+++ 
/Volumes/spare/ts/apache-solr-nightlyjan29/contrib/dataimporthandler/src/main/java/org/apache/solr/handler/dataimport/FileListEntityProcessor.java
  2008-10-02 20:38:30.0 +0100
@@ -85,10 +85,11 @@
 if (r != null)
   recursive = Boolean.parseBoolean(r);
 excludes = context.getEntityAttribute(EXCLUDES);
-if (excludes != null) {
+if (excludes != null)
   excludes = resolver.replaceTokens(excludes);
+if (excludes != null)
   excludesPattern = Pattern.compile(excludes);
-}
+
   }
 
   private Date getDate(String dateStr) {
@@ -139,41 +140,42 @@
   return getFromRowCache();
 while (true) {
   Map r = getNext();
-  if (r != null) r = applyTransformer(r);
-  return r;
+  if (r == null)
+return null;
+  r = applyTransformer(r);
+  if (r != null)
+return r;
 }
   }
 
   private void getFolderFiles(File dir,
   final List> fileDetails) {
-// Fetch an array of file objects that pass the filter, however the
-// returned array is never populated; accept() always returns false.
-// Rather we make use of the fileDetails array which is populated as
-// a side affect of the accept method.
 dir.list(new FilenameFilter() {
   public boolean accept(File dir, String name) {
-File fileObj = new File(dir, name);
-LOG.info("Testing acceptance of dir:"+dir +" name:"+name);
-   if (fileObj.isDirectory()) {
- LOG.info("   Recursing into directory "+fileObj);
- if (recursive) getFolderFiles(fileObj, fileDetails);
- }
-else if (fileNamePattern == null) {
+if (fileNamePattern == null) {
   addDetails(fileDetails, dir, name);
-  }
-else if (fileNamePattern.matcher(name).find()) {
-  if (excludesPattern != null && excludesPattern.matcher(name).find()) 
return false;
+  return false;
+}
+if (fileNamePattern.matcher(name).find()) {
+  if (excludesPattern != null && excludesPattern.matcher(name).find())
+return false;
   addDetails(fileDetails, dir, name);
-  }
-return false;
 }
-  });
-}
+
+return false;
+  }
+});
+  }
 
   private void addDetails(List> files, File dir, String 
name) {
 Map details = new HashMap();
 File aFile = new File(dir, name);
-if (aFile.isDirectory()) return;
+if (aFile.isDirectory()) {
+  if (!recursive)
+return;
+  getFolderFiles(aFile, files);
+  return;
+}
 long sz = aFile.length();
 Date lastModified = new Date(aFile.lastModified());
 if (biggerThan != -1 && sz <= biggerThan)

-- 

===
Fergus McMenemie   Email:fer...@twig.me.uk
Techmore Ltd   Phone:(UK) 07721 376021

Unix/Mac/Intranets Analyst Programmer
===


DIH using values from solrconfig.xml inside data-config.xml

2009-02-02 Thread Fergus McMenemie
Hello

As per several postings I noted that I can define variables
inside an invariants list section of the DIH handler of
solrconfig.xml:-

  

   data-config.xml
   

   /Volumes/spare/ts
   
  


I can also reference these variables within data-config.xml. This
works,  the solr field "test" is nicely populated. However how do
I use this variable within my regex transformer? Here is my 
data-config.xml:-

   
   

   
  

   
   
   
   
   
   
 
   
   


indexing my content I get an error as follows:-


INFO: SolrDeletionPolicy.onInit: commits:num=2

commit{dir=/Volumes/spare/ts/solrnightlyjanes/data/index,segFN=segments_7,version=1233583868834,generation=7,filenames=[_7.frq,
 _4.fdt, _7.tii, _7.fnm, _4.fdx, _7.tis, segments_7, _7.nrm, _7.prx]

commit{dir=/Volumes/spare/ts/solrnightlyjanes/data/index,segFN=segments_8,version=1233583868835,generation=8,filenames=[segments_8]
Feb 2, 2009 5:00:50 PM org.apache.solr.core.SolrDeletionPolicy updateCommits
INFO: last commit = 1233583868835
Feb 2, 2009 5:00:57 PM org.apache.solr.handler.dataimport.EntityProcessorBase 
applyTransformer
WARNING: transformer threw error
java.util.regex.PatternSyntaxException: Illegal repetition near index 0
${dataimporter.request.finstalldir}(.*)
^
at java.util.regex.Pattern.error(Pattern.java:1650)
at java.util.regex.Pattern.closure(Pattern.java:2706)
at java.util.regex.Pattern.sequence(Pattern.java:1798)
at java.util.regex.Pattern.expr(Pattern.java:1687)
at java.util.regex.Pattern.compile(Pattern.java:1397)
at java.util.regex.Pattern.(Pattern.java:1124)
at java.util.regex.Pattern.compile(Pattern.java:817)
at 
org.apache.solr.handler.dataimport.RegexTransformer.getPattern(RegexTransformer.java:129)
at 
org.apache.solr.handler.dataimport.RegexTransformer.process(RegexTransformer.java:88)
at 
org.apache.solr.handler.dataimport.RegexTransformer.transformRow(RegexTransformer.java:74)
at 
org.apache.solr.handler.dataimport.RegexTransformer.transformRow(RegexTransformer.java:42)
at 
org.apache.solr.handler.dataimport.EntityProcessorBase.applyTransformer(EntityProcessorBase.java:187)
at 
org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow(XPathEntityProcessor.java:197)
at 
org.apache.solr.handler.dataimport.XPathEntityProcessor.nextRow(XPathEntityProcessor.java:160)
at 
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:333)
at 
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:359)
at 
org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:222)
at 
org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:155)
at 
org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:324)
at 
org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:384)
at 
org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:365)


Is there some simple escape or other syntax to be used or is
this an enhancement?
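
For reference, the pieces involved look roughly like this (a sketch;
the handler declaration is trimmed and the column names in the field
mappings are only examples):

   <!-- solrconfig.xml -->
   <requestHandler name="/dataimport"
       class="org.apache.solr.handler.dataimport.DataImportHandler">
     <lst name="defaults">
       <str name="config">data-config.xml</str>
     </lst>
     <lst name="invariants">
       <str name="finstalldir">/Volumes/spare/ts</str>
     </lst>
   </requestHandler>

   <!-- data-config.xml: the variable resolves fine in a template... -->
   <field column="test" template="${dataimporter.request.finstalldir}"/>
   <!-- ...but not when it appears inside the regex attribute -->
   <field column="fileWebPath" regex="${dataimporter.request.finstalldir}(.*)"
          replaceWith="$1" sourceColName="fileAbsolutePath"/>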

Regards Fergus.
-- 

=======
Fergus McMenemie   Email:fer...@twig.me.uk
Techmore Ltd   Phone:(UK) 07721 376021

Unix/Mac/Intranets Analyst Programmer
===


Re: DIH FileListEntityProcessor recursion and fileName clash

2009-02-02 Thread Fergus McMenemie
Shalin,

OK!

I got myself a JIRA account and opened solr-1000 and followed the
wiki instructions on creating a patch which I have now uploaded! Only
problem is that while the fix seems fine the test case I added to
TestFileListEntityProcessor.java fails. I need somebody who knows 
what they are doing to point out what I am doing wrong and/or how
to debug test failures.

It would also be nice if I knew how to run or debug one JUnit
test rather than all of them, which takes almost 8min.



  @Test
  public void testRECURSION() throws IOException {
long time = System.currentTimeMillis();
File childdir = new File("." + time + "/child" );
childdir.mkdirs();
childdir.deleteOnExit();
createFile(childdir, "a.xml", "a.xml".getBytes(), true);
createFile(childdir, "b.xml", "b.xml".getBytes(), true);
createFile(childdir, "c.props", "c.props".getBytes(), true);
Map attrs = AbstractDataImportHandlerTest.createMap(
FileListEntityProcessor.FILE_NAME, "^.*\\.xml$",
FileListEntityProcessor.BASE_DIR, childdir.getAbsolutePath(),
FileListEntityProcessor.RECURSIVE, true);
Context c = AbstractDataImportHandlerTest.getContext(null,
new VariableResolverImpl(), null, 0, Collections.EMPTY_LIST, attrs);
FileListEntityProcessor fileListEntityProcessor = new 
FileListEntityProcessor();
fileListEntityProcessor.init(c);
List fList = new ArrayList();
while (true) {
  // add the documents to the index
  Map f = fileListEntityProcessor.nextRow();
  if (f == null)
break;
  fList.add((String) f.get(FileListEntityProcessor.ABSOLUTE_FILE));
}
System.out.println("List of files indexed -- " + fList);
    Assert.assertEquals(3, fList.size());
  }

Regards Fergus.

>On Mon, Feb 2, 2009 at 2:36 AM, Fergus McMenemie  wrote:
>
>> Hello
>>
>> I have been trying to find out why DIH in FileListEntityProcessor
>> mode did not appear to be recursing into subdirectories. Going through
>> FileListEntityProcessor.java I eventually tumbled to the fact that my
>> filename filter setting from data-config.xml also applied to directory
>> names.
>
>
>Hmm, not good.
>
>
>>
>>
>>>   processor="FileListEntityProcessor"
>>   fileName=".*\.xml"
>>   newerThan="'NOW-1000DAYS'"
>>   recursive="true"
>>   rootEntity="false"
>>   dataSource="null"
>>   baseDir="/Volumes/spare/ts/stuff/ford">
>>
>> Now, I feel that the fieldName filter should be applied to files fed
>> into the parser, it should not be applied to the directory names we are
>> recursing through. I bodged the code as follows to adjust the behavior
>> so  that the "FileName" and "excludes" attributes of "entity" only
>> apply to filenames and not directory names.
>
>
>I agree with you.
>
>Perhaps we can have separate filters for directories and files but let's
>hold on till the need comes up.
>
>>
>>
>> It now recurses though my directory tree only indexing the appropriate
>> files! I think the new behavior is more standard.
>>
>> Is this a change valid?
>
>
>Absolutely. Can you please create an issue and attach the patch? Thanks!
>
>-- 
>Regards,
>Shalin Shekhar Mangar.

-- 

===
Fergus McMenemie   Email:fer...@twig.me.uk
Techmore Ltd   Phone:(UK) 07721 376021

Unix/Mac/Intranets Analyst Programmer
===


Re: DIH using values from solrconfig.xml inside data-config.xml

2009-02-02 Thread Fergus McMenemie
The solr data field is populated properly. So I guess that bit works. 
I really wish I could use xpath="//para"

>A separate problem: when I used the DIH in December, the xpath
>implementation had few features.  '[...@qualifier='Date']' may not be
>supported.
>
>  dateTimeFormat="MMdd"   />
>
>
>On Mon, Feb 2, 2009 at 9:24 AM, Noble Paul <noble.p...@gmail.com> wrote:
>
>> this patch must help
>>
>> On Mon, Feb 2, 2009 at 10:49 PM, Shalin Shekhar Mangar
>>  wrote:
>> > On Mon, Feb 2, 2009 at 10:34 PM, Fergus McMenemie 
>> wrote:
>> >
>> >>
>> >> Is there some simple escape or other syntax to be used or is
>> >> this an enhancement?
>> >>
>> >
>> > I guess the problem is that we are creating the regex Pattern without
>> first
>> > resolving the variable. So we need to call VariableResolver.resolve on
>> the
>> > 'regex' attribute's value before creating the Pattern object.
>> >
>> > Please raise an issue for this change. Nice use-case though. I guess we
>> > never thought someone would need to use a variable in the regex attribute
>> :)
>> >
>> > --
>> > Regards,
>> > Shalin Shekhar Mangar.
>> >
>>
>>
>>
>> --
>> --Noble Paul
>>
>
>
>
>-- 
>Lance Norskog
>goks...@gmail.com
>650-922-8831 (US)

-- 

===
Fergus McMenemie   Email:fer...@twig.me.uk
Techmore Ltd   Phone:(UK) 07721 376021

Unix/Mac/Intranets Analyst Programmer
===


Re: DIH using values from solrconfig.xml inside data-config.xml

2009-02-04 Thread Fergus McMenemie
>: > The solr data field is populated properly. So I guess that bit works.
>: > I really wish I could use xpath="//para"
>
>: The limitation comes from streaming the XML instead of creating a DOM.
>: XPathRecordReader is a custom streaming XPath parser implementation and
>: streaming is easy only because we limit the syntax. You can use
>: PlainTextEntityProcessor which gives the XML as a string to a  custom
>: Transformer. This Transformer can create a DOM, run your XPath query and
>: populate the fields. It's more expensive but it is an option.
>
>Maybe it's just me, but it seems like i'm noticing that as DIH gets used 
>more, many people are noting that the XPath processing in DIH doesn't work 
>the way they expect because it's a custom XPath parser/engine designed for 
>streaming.  
>
>It seems like it would be helpful to have an alternate processor for 
>people who don't need the streaming support (ie: are dealing with small 
>enough docs that they can load the full DOM tree into memory) that would 
>use the default Java XPath engine (and have fewer caveats/surprises) ... i 
>would think it would probably even make sense for this new XPath processor 
>to be the one we suggest for new users, and only suggest the existing 
>(stream based) processor if they have really big xml docs to deal with.
>
>(In hindsight XPathEntityProcessor and XPathRecordReader should probably 
>have been named StreamingXPathEntityProcessor and 
>StreamingXPathRecordReader)
>
Four thoughts!

1) My use case involves a few million XML documents ranging in size
   from a few K to 500K. 95% of the documents are under 25KBytes, 
   5% of the documents are around 0.5MBytes. So.. sod it, I think I
   need a streaming parser.

2) "streaming XPath parser"? I only half understand all this stuff,
   but, and this is based on the little bit of SAX stuff I have written,
   I would have thought that //para was trivial for any kind of
   streaming XML parser.

3) Much of the confusion may be arising because the DIH wiki page is
   not too clear on what is and is not allowed. We need better,
   more explicit examples. What seems to be allowed is:-
 


   I will add these to the wiki. Just to be sure, I tested 
   xpath="//para". It does not work!

4) XML documents are either well structured, with good separation of
   data and presentation, in which case absolute xpaths work fine;
   or they are older documents (in my case text documents) which have
   been forced into XML format with poor structure, where the data and
   presentation are all mixed up. I suspect that the addition of //para would
   cover many of the use cases, and what was left could be covered
   by a preceding XSLT transform. 
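
   As an aside, the DOM-based fallback mentioned earlier in the thread could
   be sketched roughly as a custom Transformer along these lines (assuming
   the whole document arrives in a "plainText" field from
   PlainTextEntityProcessor; the class and field names are illustrative):

    import java.io.StringReader;
    import java.util.*;
    import javax.xml.parsers.DocumentBuilderFactory;
    import javax.xml.xpath.*;
    import org.apache.solr.handler.dataimport.Context;
    import org.apache.solr.handler.dataimport.Transformer;
    import org.w3c.dom.Document;
    import org.w3c.dom.NodeList;
    import org.xml.sax.InputSource;

    public class ParaDomTransformer extends Transformer {
      public Object transformRow(Map<String, Object> row, Context context) {
        try {
          // Build a DOM from the raw XML handed over by the entity processor
          String xml = (String) row.get("plainText");
          Document doc = DocumentBuilderFactory.newInstance()
              .newDocumentBuilder()
              .parse(new InputSource(new StringReader(xml)));
          // Run a full XPath query that the streaming parser cannot handle
          NodeList paras = (NodeList) XPathFactory.newInstance().newXPath()
              .evaluate("//para", doc, XPathConstants.NODESET);
          List<String> values = new ArrayList<String>();
          for (int i = 0; i < paras.getLength(); i++)
            values.add(paras.item(i).getTextContent());
          row.put("para", values);   // maps onto a multiValued field
        } catch (Exception e) {
          // a production transformer would report this properly
        }
        return row;
      }
    }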
-- 

===
Fergus McMenemie   Email:fer...@twig.me.uk
Techmore Ltd   Phone:(UK) 07721 376021

Unix/Mac/Intranets Analyst Programmer
===


Re: DIH, assigning multiple xpaths to the same solr field: solved

2009-02-04 Thread Fergus McMenemie
Thanks Shalin,

Using the following appears to work properly!
   
   
   
   
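   Each field maps a different xpath onto the same multiValued "para" field
   via a distinct column, along these lines (the xpaths are illustrative,
   echoing the made-up paths quoted below):

   <field column="para1" name="para" xpath="/record/para"      />
   <field column="para2" name="para" xpath="/record/sect/para" />
   <field column="para3" name="para" xpath="/record/a/b/c"     />
   <field column="para4" name="para" xpath="/record/d/e/f/g"   />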

Regards Fergus

>On Wed, Feb 4, 2009 at 1:35 AM, Fergus McMenemie  wrote:
>
>>   >  dataSource="myfilereader"
>>  processor="XPathEntityProcessor"
>>  url="${jc.fileAbsolutePath}"
>>  stream="false"
>>  forEach="/record">
>>   
>>   
>>   
>>   
>>
>> Below is the line from my schema.xml
>>
>>   >  multiValued="true"/>
>>
>> Now a given document will only have one style of layout, and of course
>> the /a/b/c /d/e/f/g  stuff is made up. For a document that has a single
>> Hello world element I see search results as follows, the
>> one  string seems to have been entered into the index four times.
>> I only saw duplicate results before adding the extra made-up stuff.
>>
>>
>I think there is something fishy with the XPathEntityProcessor. For now, I
>think you can work around by giving each field a different 'column' and
>attribute 'name=para' on each of them.
>
>-- 
>Regards,
>Shalin Shekhar Mangar.

-- 

===
Fergus McMenemie   Email:fer...@twig.me.uk
Techmore Ltd   Phone:(UK) 07721 376021

Unix/Mac/Intranets Analyst Programmer
===


DIH fails to import after svn update

2009-02-11 Thread Fergus McMenemie
Hello,

I had a nice working version of SOLR building from trunk, I think
it was from about 2-4th Feb, On the 7th I performed a "svn update"
and it now fails as follows when performing 

get 'http://localhost:8080/apache-solr-1.4-dev/dataimport?command=full-import'

I have performed a "svn update" on the 11th (today) again. It still
fails.

Feb 11, 2009 4:27:34 AM org.apache.solr.update.DirectUpdateHandler2 deleteAll
INFO: [] REMOVING ALL DOCUMENTS FROM INDEX
Feb 11, 2009 4:27:34 AM org.apache.solr.core.SolrDeletionPolicy onInit
INFO: SolrDeletionPolicy.onInit: commits:num=2

commit{dir=/Volumes/spare/ts/solrnightlyjanes/data/index,segFN=segments_1,version=1234326438927,generation=1,filenames=[segments_1]

commit{dir=/Volumes/spare/ts/solrnightlyjanes/data/index,segFN=segments_2,version=1234326438928,generation=2,filenames=[segments_2]
Feb 11, 2009 4:27:34 AM org.apache.solr.core.SolrDeletionPolicy updateCommits
INFO: last commit = 1234326438928
Feb 11, 2009 4:27:34 AM org.apache.solr.handler.dataimport.DataImporter 
doFullImport
SEVERE: Full Import failed
java.lang.NoSuchFieldError: docCount
at 
org.apache.solr.handler.dataimport.SolrWriter.getDocCount(SolrWriter.java:231)
at 
org.apache.solr.handler.dataimport.DataImportHandlerException.(DataImportHandlerException.java:42)
at 
org.apache.solr.handler.dataimport.FileListEntityProcessor.init(FileListEntityProcessor.java:81)
at 
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:293)
at 
org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:222)
at 
org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:155)
at 
org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:324)
at 
org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:384)
at 
org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:365)
Feb 11, 2009 4:27:34 AM org.apache.solr.update.DirectUpdateHandler2 rollback
INFO: start rollback
Feb 11, 2009 4:27:34 AM org.apache.solr.update.DirectUpdateHandler2 rollback
INFO: end_rollback
Feb 11, 2009 4:27:34 AM org.apache.solr.update.DirectUpdateHandler2 commit
INFO: start commit(optimize=false,waitFlush=false,waitSearcher=true)
Feb 11, 2009 4:27:34 AM org.apache.solr.search.SolrIndexSearcher 


Regards to all.
-- 

=======
Fergus McMenemie   Email:fer...@twig.me.uk
Techmore Ltd   Phone:(UK) 07721 376021

Unix/Mac/Intranets Analyst Programmer
===


"ant dist" of a nightly download fails

2009-02-11 Thread Fergus McMenemie
Hi,

I have been looking at the nightly downloads, trying to work
backwards through the nightlies till my code starts working 
again!

I have downloaded all the available nightlies and they all fail
to "ant dist" as follows:-


>root: ant dist
>Buildfile: build.xml
>
>init-forrest-entities:
>
>compile-solrj:
>
>make-manifest:
>
>dist-solrj:
>  [jar] Building jar: 
> /Volumes/spare/ts/apache-solr-nightly/dist/apache-solr-solrj-1.4-dev.jar
>
>compile:
>
>dist-jar:
>  [jar] Building jar: 
> /Volumes/spare/ts/apache-solr-nightly/dist/apache-solr-core-1.4-dev.jar
>
>dist-contrib:
>
>init:
>
>init-forrest-entities:
>
>compile-solrj:
>
>compile:
>
>make-manifest:
>
>compile:
>
>build:
>  [jar] Building jar: 
> /Volumes/spare/ts/apache-solr-nightly/contrib/dataimporthandler/target/apache-solr-dataimporthandler-1.4-dev.jar
>
>dist:
> [copy] Copying 2 files to /Volumes/spare/ts/apache-solr-nightly/build/web
>[mkdir] Created dir: 
> /Volumes/spare/ts/apache-solr-nightly/build/web/WEB-INF/lib
> [copy] Copying 1 file to 
> /Volumes/spare/ts/apache-solr-nightly/build/web/WEB-INF/lib
> [copy] Copying 1 file to /Volumes/spare/ts/apache-solr-nightly/dist
>
>init:
>
>init-forrest-entities:
>
>compile-solrj:
>
>compile:
>
>make-manifest:
>
>compile:
>
>build:
>  [jar] Building jar: 
> /Volumes/spare/ts/apache-solr-nightly/contrib/extraction/build/apache-solr-cell-1.4-dev.jar
>
>dist:
> [copy] Copying 1 file to /Volumes/spare/ts/apache-solr-nightly/dist
>
>clean:
>   [delete] Deleting directory 
> /Volumes/spare/ts/apache-solr-nightly/contrib/javascript/dist
>
>create-dist-folder:
>[mkdir] Created dir: 
> /Volumes/spare/ts/apache-solr-nightly/contrib/javascript/dist
>
>concat:
>
>docs:
>[mkdir] Created dir: 
> /Volumes/spare/ts/apache-solr-nightly/contrib/javascript/dist/doc
> [java] Exception in thread "main" java.lang.NoClassDefFoundError: 
> org/mozilla/javascript/tools/shell/Main
> [java]at JsRun.main(Unknown Source)
>
>BUILD FAILED
>/Volumes/spare/ts/apache-solr-nightly/common-build.xml:338: The following 
>error occurred while executing this line:
>/Volumes/spare/ts/apache-solr-nightly/common-build.xml:215: The following 
>error occurred while executing this line:
>/Volumes/spare/ts/apache-solr-nightly/contrib/javascript/build.xml:74: Java 
>returned: 1
>
>Total time: 3 seconds
>root: 

Performing "ant test" is fine. Removing the javascript contrib directory
allows the "ant dist" to complete and I have a usable war file. However
I suspect this may not represent best practice; "ant test" is still
fine.


What does removal of this contrib function lose me? I was wondering if
it went with the DIH ScriptTransformer?
 
Regards Fergus.

-- 

===
Fergus McMenemie   Email:fer...@twig.me.uk
Techmore Ltd   Phone:(UK) 07721 376021

Unix/Mac/Intranets Analyst Programmer
===


Re: DIH fails to import after svn update

2009-02-11 Thread Fergus McMenemie
Thanks, 

That fixed it.

>On Wed, Feb 11, 2009 at 4:19 PM, Fergus McMenemie  wrote:
>
>
>> java.lang.NoSuchFieldError: docCount
>>at
>> org.apache.solr.handler.dataimport.SolrWriter.getDocCount(SolrWriter.java:231)
>>at
>> org.apache.solr.handler.dataimport.DataImportHandlerException.(DataImportHandlerException.java:42)
>>at
>> org.apache.solr.handler.dataimport.FileListEntityProcessor.init(FileListEntityProcessor.java:81)
>>at
>> org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:293)
>>at
>> org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:222)
>>at
>> org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:155)
>>at
>> org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:324)
>>at
>> org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:384)
>>at
>> org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:365)
>>
>
>Seems like this was not a clean compile. The AtomicInteger field docCount
>was changed to a AtomicLong.
>
>Can you please do a "ant clean dist"?
>
>-- 
>Regards,
>Shalin Shekhar Mangar.

-- 

===
Fergus McMenemie   Email:fer...@twig.me.uk
Techmore Ltd   Phone:(UK) 07721 376021

Unix/Mac/Intranets Analyst Programmer
===


Is this DIH entity forEach expression OK?

2009-02-12 Thread Fergus McMenemie
Hello,

I am having bother with forEach. I have XML source documents containing
many embedded images within mediaBlock elements. Each image has an
associated caption. I want to implement a separate image search function
which searches the captions and brings back the associated image.

 

 
 

Is it OK to have an xpath expression within forEach which is a child 
of another of the forEach xpath expressions?

Or.. is there a better way of doing this?

Regards
-- 

===
Fergus McMenemie   Email:fer...@twig.me.uk
Techmore Ltd   Phone:(UK) 07721 376021

Unix/Mac/Intranets Analyst Programmer
===


Re: Is this DIH entity forEach expression OK? ... yes

2009-02-13 Thread Fergus McMenemie
>Hello,
>
>I am having bother with forEach. I have XML source documents containing
>many embedded images within mediaBlock elements. Each image has an
>associated caption. I want to implement a separate image search function
>which searches the captions and brings back the associated image.
>
> dataSource="myfilereader"
>processor="XPathEntityProcessor"
>url="${jc.fileAbsolutePath}"
>stream="false"
>forEach="/record | /record/mediaBlock"
>>
>
>  xpath="/record/mediaBlock/mediaObject/@vurl" />
>  xpath="/record/mediaBlock/caption"  />
>
>Is it OK to have an xpath expression within forEach which is a child 
>of another of the forEach xpath expressions?
>
Yes. It works fine, duplicate "uniqueKey"s were making it appear otherwise.

But
-- 

===
Fergus McMenemie   Email:fer...@twig.me.uk
Techmore Ltd   Phone:(UK) 07721 376021

Unix/Mac/Intranets Analyst Programmer
===


Problem using DIH templatetransformer to create uniqueKey

2009-02-13 Thread Fergus McMenemie
Hello,

templatetransformer behaves rather ungracefully if one of the replacement
fields is missing.

I am parsing a single XML document into multiple separate solr documents.
It turns out that none of the source documents fields can be used to create
a uniqueKey alone. I need to combine two, using template transformer as
follows:



  
  
  
  

The trouble is that vurl is only defined as a child of "/record/mediaBlock"
so my attempt to create id, the uniqueKey fails for the parent document 
"/record"

I am hacking around with "TemplateTransformer.java" to sort this but was
wondering if there was a good reason for this behavior.

Regards.
-- 

=======
Fergus McMenemie   Email:fer...@twig.me.uk
Techmore Ltd   Phone:(UK) 07721 376021

Unix/Mac/Intranets Analyst Programmer
===


Re: Problem using DIH templatetransformer to create uniqueKey

2009-02-13 Thread Fergus McMenemie
>Hello,
>
>templatetransformer behaves rather ungracefully if one of the replacement
>fields is missing.

Looking at TemplateString.java I see that left to itself fillTokens would 
replace a missing variable with "". It is an extra check in TemplateTransformer
that is throwing the warning and stopping the row being returned. Commenting
out the check seems to solve my problem.

Having done this, an undefined replacement string in TemplateTransformer
is replaced with "". However a neater fix would probably involve making 
use of the default value which can be assigned to a field in schema.xml. 

>I am parsing a single XML document into multiple separate solr documents.
>It turns out that none of the source documents fields can be used to create
>a uniqueKey alone. I need to combine two, using template transformer as
>follows:
>
>  dataSource="myfilereader"
>  processor="XPathEntityProcessor"
>  url="${jc.fileAbsolutePath}"
>  rootEntity="true"
>  stream="false"
>  forEach="/record | /record/mediaBlock"
>  transformer="DateFormatTransformer,TemplateTransformer,RegexTransformer"
>   >
>
>  
>   regex="${dataimporter.request.installdir}(.*)" replaceWith="/ford$1" 
> sourceColName="fileAbsolutePath"/>
>   template="${jc.fileAbsolutePath}${x.vurl}" />
>   xpath="/record/mediaBlock/mediaObject/@vurl" />
>
>The trouble is that vurl is only defined as a child of "/record/mediaBlock"
>so my attempt to create id, the uniqueKey fails for the parent document 
>"/record"
>
>I am hacking around with "TemplateTransformer.java" to sort this but was
>wondering if there was a good reason for this behavior.
>

-- 

===
Fergus McMenemie   Email:fer...@twig.me.uk
Techmore Ltd   Phone:(UK) 07721 376021

Unix/Mac/Intranets Analyst Programmer
===


Re: Problem using DIH templatetransformer to create uniqueKey

2009-02-13 Thread Fergus McMenemie
Paul,

Following up your usenet suggestion:

 

and to add more to what I was thinking...

if the field is undefined in the input document, but the schema.xml
does allow a default value, then TemplateTransformer can use the 
default value. If there is no default value defined in schema.xml 
then it can fail as at present. This would allow "" or any other
value to be fed into TemplateTransformer, and still enable avoidance
of the partial strings you referred to.

Regards Fergus.

>>Hello,
>>
>>templatetransformer behaves rather ungracefully if one of the replacement
>>fields is missing.
>
>Looking at TemplateString.java I see that left to itself fillTokens would 
>replace a missing variable with "". It is an extra check in TemplateTransformer
>that is throwing the warning and stopping the row being returned. Commenting
>out the check seems to solve my problem.
>
>Having done this, an undefined replacement string in TemplateTransformer
>is replaced with "". However a neater fix would probably involve making 
>use of the default value which can be assigned to a row? in schema.xml. 
>
>>I am parsing a single XML document into multiple separate solr documents.
>>It turns out that none of the source documents fields can be used to create
>>a uniqueKey alone. I need to combine two, using template transformer as
>>follows:
>>
>>>  dataSource="myfilereader"
>>  processor="XPathEntityProcessor"
>>  url="${jc.fileAbsolutePath}"
>>  rootEntity="true"
>>  stream="false"
>>  forEach="/record | /record/mediaBlock"
>>  transformer="DateFormatTransformer,TemplateTransformer,RegexTransformer"
>>   >
>>
>>  
>>  > regex="${dataimporter.request.installdir}(.*)" replaceWith="/ford$1" 
>> sourceColName="fileAbsolutePath"/>
>>  > template="${jc.fileAbsolutePath}${x.vurl}" />
>>  > xpath="/record/mediaBlock/mediaObject/@vurl" />
>>
>>The trouble is that vurl is only defined as a child of "/record/mediaBlock"
>>so my attempt to create id, the uniqueKey fails for the parent document 
>>"/record"
>>
>>I am hacking around with "TemplateTransformer.java" to sort this but was
>>wondering if there was a good reason for this behavior.
>>

-- 

===
Fergus McMenemie   Email:fer...@twig.me.uk
Techmore Ltd   Phone:(UK) 07721 376021

Unix/Mac/Intranets Analyst Programmer
===


Re: Problem using DIH templatetransformer to create uniqueKey

2009-02-13 Thread Fergus McMenemie

Hmmm. Just gave that a go! No luck
But how many layers of defaults do we need?


Rgds Fergus

>What about having the template transformer support ${field:default}  
>syntax?  I'm assuming it doesn't support that currently right?  The  
>replace stuff in the config files does though.
>
>   Erik
>
>
>On Feb 13, 2009, at 8:17 AM, Fergus McMenemie wrote:
>
>> Paul,
>>
>> Following up your usenet sussgetion:
>>
>> > ignoreMissingVariables="true"/>
>>
>> and to add more to what I was thinking...
>>
>> if the field is undefined in the input document, but the schema.xml
>> does allow a default value, then TemplateTransformer can use the
>> default value. If there is no default value defined in schema.xml
>> then it can fail as at present. This would allow "" or any other
>> value to be fed into TemplateTransformer, and still enable avoidance
>> of the partial strings you referred to.
>>
>> Regards Fergus.
>>
>>>> Hello,
>>>>
>>>> templatetransformer behaves rather ungracefully if one of the  
>>>> replacement
>>>> fields is missing.
>>>
>>> Looking at TemplateString.java I see that left to itself fillTokens  
>>> would
>>> replace a missing variable with "". It is an extra check in  
>>> TemplateTransformer
>>> that is throwing the warning and stopping the row being returned.  
>>> Commenting
>>> out the check seems to solve my problem.
>>>
>>> Having done this, an undefined replacement string in  
>>> TemplateTransformer
>>> is replaced with "". However a neater fix would probably involve  
>>> making
>>> use of the default value which can be assigned to a row? in  
>>> schema.xml.
>>>
>>>> I am parsing a single XML document into multiple separate solr  
>>>> documents.
>>>> It turns out that none of the source documents fields can be used  
>>>> to create
>>>> a uniqueKey alone. I need to combine two, using template  
>>>> transformer as
>>>> follows:
>>>>
>>>> >>> dataSource="myfilereader"
>>>> processor="XPathEntityProcessor"
>>>> url="${jc.fileAbsolutePath}"
>>>> rootEntity="true"
>>>> stream="false"
>>>> forEach="/record | /record/mediaBlock"
>>>> transformer 
>>>> ="DateFormatTransformer,TemplateTransformer,RegexTransformer"
>>>>>
>>>>
>>>> 
>>>> >>> sourceColName="fileAbsolutePath"/>
>>>> 
>>>> 
>>>>
>>>> The trouble is that vurl is only defined as a child of "/record/ 
>>>> mediaBlock"
>>>> so my attempt to create id, the uniqueKey fails for the parent  
>>>> document "/record"
>>>>
>>>> I am hacking around with "TemplateTransformer.java" to sort this  
>>>> but was
>>>> wondering if there was a good reason for this behavior.
>>>>
>>
>> -- 
>>
>> ===
>> Fergus McMenemie   Email:fer...@twig.me.uk
>> Techmore Ltd   Phone:(UK) 07721 376021
>>
>> Unix/Mac/Intranets Analyst Programmer
>> ===

-- 

===
Fergus McMenemie   Email:fer...@twig.me.uk
Techmore Ltd   Phone:(UK) 07721 376021

Unix/Mac/Intranets Analyst Programmer
===


DIH transformers

2009-02-16 Thread Fergus McMenemie
Hello.

I have been beating my head around the data-config.xml listed
at the end of this message. It breaks in a few different ways.

  1) I have bodged TemplateTransformer to allow it to return a row
     when one of the variables is undefined. This ensures my
     uniqueKey is always defined. But thinking more on
     Noble's comments there is use in having it work both ways,
     i.e. leaving the column undefined or replacing the variable
     with "". I still like my idea about using the default
     value of a solr field from schema.xml, but I can't figure
     out how/where to best implement it. 

  2) Having used TemplateTransformer to assign a value to an 
 entity column that column cannot be used in other 
 TemplateTransformer operations. In my project I am 
 attempting to reuse "x.fileWebPath". To fix this, the 
 last line of transformRow() in TemplateTransformer.java
 needs replaced with the following which as well as 
 'putting' the templated-ed string in 'row' also saves it
 into the 'resolver'.

 **originally**
  row.put(column, resolver.replaceTokens(expr));
  }

 **new**
  String columnName = map.get(DataImporter.COLUMN);
  expr=resolver.replaceTokens(expr);
  row.put(columnName, expr);
  resolverMapCopy.put(columnName, expr);
  }

 As an aside I think I ran into the issues covered by 
     SOLR-993. It took a while to figure out I could not add
     a single column-name/value pair to the resolver. I had instead
 to add to the map that was already stored within the
 resolver.

  3) No entity column names can be used within RegexTransformer.
 I guess all the stuff that was added to TemplateTransformer
 to allow column names to be used in templates needs re-added
 into RegexTransformer. I am doing that now... but am confused
 by the fragment of code which copies from resolverMap into
 resolverMapCopy. As best I can see resolverMap is always 
 empty; but I am barely able to follow the code! Can somebody
 explain when/why resolverMap would be populated.

 Also, I begin to understand comments made by Noble in
     SOLR-1001 about resolving "entity attributes in 
 ContextImpl.getEntityAttribute" and I guess Shalin was
 right as well. However it also seems wrong that at the
 top of every transformer we are going to repeat the
 same code to load the resolver with information about the 
 entity.

  4) In that I am reusing template output within other templates
 the order of execution becomes important. Can I assume that
 the explicitly listed columns in an entity are processed by
 the various transformers in the order they appear within
     data-config.xml? I *think* that the list of columns within
     an entity as returned by getAllEntityFields() is actually
     an ArrayList, which I think is order dependent. Is this
     correct?

  5) Should I raise this as a single JIRA issue?

  6) Having played with this stuff, I was going to add a bit
 more to the wiki highlighting some of the possibilities
 and issues with transformers. But want to check with the 
 list first!


   
   




















   
   
   


Regards Fergus.

-- 

===
Fergus McMenemie   Email:fer...@twig.me.uk
Techmore Ltd   Phone:(UK) 07721 376021

Unix/Mac/Intranets Analyst Programmer
===



Re: DIH transformers - sect 2

2009-02-17 Thread Fergus McMenemie
>On Mon, Feb 16, 2009 at 3:22 PM, Fergus McMenemie  wrote:
>>
>>  2) Having used TemplateTransformer to assign a value to an
>> entity column that column cannot be used in other
>> TemplateTransformer operations. In my project I am
>> attempting to reuse "x.fileWebPath". To fix this, the
>> last line of transformRow() in TemplateTransformer.java
>> needs replaced with the following which as well as
>> 'putting' the templated-ed string in 'row' also saves it
>> into the 'resolver'.
>>
>> **originally**
>>  row.put(column, resolver.replaceTokens(expr));
>>  }
>>
>> **new**
>>  String columnName = map.get(DataImporter.COLUMN);
>>  expr=resolver.replaceTokens(expr);
>>  row.put(columnName, expr);
>>  resolverMapCopy.put(columnName, expr);
>>  }
>
>isn't it better to write a custom transformer to achieve this. I did
>not want a standard component to change the state of the
>VariableResolver .
>
>I am not sure what is the best way.
>

Noble, (Good to have email working :-)

Hmm not sure why this requires a custom transformer. Why is this not 
more in the nature of a bug fix? Also the current behavior temporarily
adds all the column names into the resolver for the duration of the 
TemplateTransformer's operation, removing them again at the end. I
do not think there is any permanent change to the state of the 
VariableResolver.

Surely if we have defined a value for a column, that value should be
temporarily available in subsequent template or regexp operations?

Fergus.

>>
>>
>>   
>>   
>>
>>>   processor="FileListEntityProcessor"
>>   fileName="^.*\.xml$"
>>   newerThan="'NOW-1000DAYS'"
>>   recursive="true"
>>   rootEntity="false"
>>   dataSource="null"
>>   baseDir="/Volumes/spare/ts/solr/content"
>>   >
>>>  dataSource="myfilereader"
>>  processor="XPathEntityProcessor"
>>  url="${jc.fileAbsolutePath}"
>>  rootEntity="true"
>>  stream="false"
>>  forEach="/record | /record/mediaBlock"
>>  
>> transformer="DateFormatTransformer,TemplateTransformer,RegexTransformer">
>>
>> 
>> > replaceWith="/ford$1" sourceColName="fileAbsolutePath"/>
>> 
>> 
>> 
>> > xpath="/record/metadata/da...@qualifier='pubDate']" 
>> dateTimeFormat="MMdd"   />
>>
>> > xpath="/record/mediaBlock/mediaObject/@vurl" />
>> > template="${dataimporter.request.fordinstalldir}" />
>> 
>>
>> > template="${dataimporter.request.contentinstalldir}" />
>> 
>> > replaceWith="$1/imagery/${x.vurl}s.jpg" sourceColName="fileWebPath"/>
>> > replaceWith="$1/imagery/${x.vurl}.jpg"  sourceColName="fileWebPath"/>
>> > template="${jc.fileAbsolutePath}#${x.vurl}" />
>>   
>>   
>>   
>>

-- 

===
Fergus McMenemie   Email:fer...@twig.me.uk
Techmore Ltd   Phone:(UK) 07721 376021

Unix/Mac/Intranets Analyst Programmer
===


Re: DIH transformers - sect 2 - SOLR-1033

2009-02-21 Thread Fergus McMenemie
I have created SOLR-1033 in JIRA to address this issue.

At 13:32 + 21/2/09, Fergus McMenemie wrote:
>>On Mon, Feb 16, 2009 at 3:22 PM, Fergus McMenemie  wrote:
>>>
>>>  2) Having used TemplateTransformer to assign a value to an
>>> entity column that column cannot be used in other
>>> TemplateTransformer operations. In my project I am
>>> attempting to reuse "x.fileWebPath". To fix this, the
>>> last line of transformRow() in TemplateTransformer.java
>>> needs replaced with the following which as well as
>>> 'putting' the templated-ed string in 'row' also saves it
>>> into the 'resolver'.
>>>
>>> **originally**
>>>  row.put(column, resolver.replaceTokens(expr));
>>>  }
>>>
>>> **new**
>>>  String columnName = map.get(DataImporter.COLUMN);
>>>  expr=resolver.replaceTokens(expr);
>>>  row.put(columnName, expr);
>>>  resolverMapCopy.put(columnName, expr);
>>>  }
>>
>>isn't it better to write a custom transformer to achieve this. I did
>>not want a standard component to change the state of the
>>VariableResolver .
>>
>>I am not sure what is the best way.
>>
>
>Noble, (Good to have email working :-)
>
>Hmm not sure why this requires a custom transformer. Why is this not 
>more in the nature of a bug fix? Also the current behavior temporarily
>adds all the column names into the resolver for the duration of the 
>TemplateTransformer's operation, removing them again at the end. I
>do not think there is any permanent change to the state of the 
>VariableResolver.
>
>Surely if we have defined a value for a column, that value should be
>temporarily available in subsequent template or regexp operations?
>
>Fergus.
>
>>>
>>>
>>>   
>>>   
>>>
>>>>>   processor="FileListEntityProcessor"
>>>   fileName="^.*\.xml$"
>>>   newerThan="'NOW-1000DAYS'"
>>>   recursive="true"
>>>   rootEntity="false"
>>>   dataSource="null"
>>>   baseDir="/Volumes/spare/ts/solr/content"
>>>   >
>>>>>  dataSource="myfilereader"
>>>  processor="XPathEntityProcessor"
>>>  url="${jc.fileAbsolutePath}"
>>>  rootEntity="true"
>>>  stream="false"
>>>  forEach="/record | /record/mediaBlock"
>>>  
>>> transformer="DateFormatTransformer,TemplateTransformer,RegexTransformer">
>>>
>>> 
>>> >> replaceWith="/ford$1" sourceColName="fileAbsolutePath"/>
>>> 
>>> 
>>> 
>>> >> xpath="/record/metadata/da...@qualifier='pubDate']" 
>>> dateTimeFormat="MMdd"   />
>>>
>>> >> xpath="/record/mediaBlock/mediaObject/@vurl" />
>>> >> template="${dataimporter.request.fordinstalldir}" />
>>> >> />
>>>
>>> >> template="${dataimporter.request.contentinstalldir}" />
>>> 
>>> >> replaceWith="$1/imagery/${x.vurl}s.jpg" sourceColName="fileWebPath"/>
>>> >> replaceWith="$1/imagery/${x.vurl}.jpg"  sourceColName="fileWebPath"/>
>>> >> template="${jc.fileAbsolutePath}#${x.vurl}" />
>>>   
>>>   
>>>   
>>>

-- 

===
Fergus McMenemie   Email:fer...@twig.me.uk
Techmore Ltd   Phone:(UK) 07721 376021

Unix/Mac/Intranets Analyst Programmer
===


passing parameters into the XSLTResponseWriter: particularly hostname

2009-02-27 Thread Fergus McMenemie
Hello all,

I was wondering if there was a way of passing parameters into 
the XSLTResponseWriter writer.

I always like the option of formatting my search results as an 
RSS feed. Users can therefore configure their phone, browser etc
to automatically redo a search every so often and have new items
in the result set highlighted to them.

However many RSS clients require links to the underlying content 
to be absolute. So I need to pass in the full hostname, of the
machine serving the results, to the transform generating my RSS
feed. How do I do this?

Regards Fergus
-- 

===
Fergus McMenemie   Email:fer...@twig.me.uk
Techmore Ltd   Phone:(UK) 07721 376021

Unix/Mac/Intranets Analyst Programmer
===


Re: passing parameters into the XSLTResponseWriter: particularly hostname

2009-03-09 Thread Fergus McMenemie
>: I was wondering if there was a way of passing parameters into 
>: the XSLTResponseWriter writer.
>
>I don't think there's anyway to pass input in the traditional  
>sense, but you can set default/invariant params along with echoParams=all 
>to get the values you want into the XML doc itself where your stylesheet 
>has access to it.
>
>
>-Hoss

Doh! of course.

Thanks.
-- 

=======
Fergus McMenemie   Email:fer...@twig.me.uk
Techmore Ltd   Phone:(UK) 07721 376021

Unix/Mac/Intranets Analyst Programmer
===


a new DIH manifestEnityProcessor

2009-03-09 Thread Fergus McMenemie
Hello,

I have almost finished a new DIH EntityProcessor which
I am calling the manifestEnityProcessor. It is designed
around the idea that whatever daemon is used to maintain
your set of a few 100,000 xml documents, it is likely to
drop a report or log file explaining what has been changed
within your content store. This assumes a file based
content repository.

The manifestEnityProcessor is used as follows

       <entity processor="ManifestEntityProcessor"
               baseDir="/Volumes/Techmore/ts/aaa/schema/data"
               rootEntity="false"
               dataSource="null"
               allowRegex="^.*\.xml$"
               manifestFileName="/Volumes/ts/man-find.txt"
               manifestAddRegex="(.*)$"
               >

The idea is you have a log file or other report, perhaps
from tar or zip, and you wish to use this to control the
indexing of the new content. The new entity fields are as
follows.
 
manifestFileName is the name of the manifest file. If
                 this value is relative, it is assumed to
                 be relative to baseDir. Required.

manifestAddRegex is a required regex to identify lines
                 which, when matched, should cause docs to
                 be added to the index.

manifestDelRegex is an optional regex to identify
                 documents which, when matched, should
                 be deleted from the index. **PLANNED**

allowRegex       is a required regex to identify the portion
                 of the ADD/DELete line identified above
                 which contains the file or pathname to be
                 ADDed or DELeted. If the resulting value is
                 relative, it is assumed to be relative to
                 baseDir.
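
For example, a captured "tar xvf" listing such as

    data/news/fdw2008/jn71796.xml
    data/news/fdw2008/jn71797.jpg
    data/news/fdw2008/

(paths here are illustrative) would be handled with manifestAddRegex="(.*)$"
to pick up every line, while allowRegex="^.*\.xml$" keeps only the XML files
and discards the image and the directory stub.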

What do I do next?
   Raise a JIRA issue and add the code?
   Is DIH the right place to add this?
   Suggestions for a different name?
   Suggestions on how to do the delete bitty from within an entity?

Regards Fergus.



-- 

===
Fergus McMenemie   Email:fer...@twig.me.uk
Techmore Ltd   Phone:(UK) 07721 376021

Unix/Mac/Intranets Analyst Programmer
===


Re: a new DIH manifestEnityProcessor

2009-03-09 Thread Fergus McMenemie
>manifest processing has a very limited usecase. Why can't it be
>processed using a PlainTextEntityProcessor and write a Tranformer to
>read lines using regex?
>
Ehmmm Ok. The PlainTextEntityProcessor docs do not give me enough
insight to see how this could be used to index each of the files
listed by a 'tar xvf' report. Can you explain further?

About the limited usecase. Verity thought it was useful enough
to have their own "bulk insert file" or bif file format that
did the same and was far less flexible.

In my experience we generally start off with some kind of
file walker or crawler looking after file repositories. But
these always proved slow and unreliable and over time they
were always replaced with some kind of manifest-based
control of the indexer. Where we could get a report of changes
we always used it, and only relied on walkers or crawlers
where we had to.

Fergus

>
>--Noble
>
>On Mon, Mar 9, 2009 at 8:30 PM, Fergus McMenemie  wrote:
>> Hello,
>>
>> I have almost finished a new DIH EntityProcessor which
>> I am calling the manifestEnityProcessor. It is designed
>> around the idea that whatever demon is used to maintain
>> your set of a few 100,000 xml documents it is likely to
>> drop a report or log file explaining what has been changed
>> within your content store. This assumes a file based
>> content repository.
>>
>> The manifestEnityProcessor is used as follows
>>
>>       >               processor="ManifestEntityProcessor"
>>               baseDir="/Volumes/Techmore/ts/aaa/schema/data"
>>               rootEntity="false"
>>               dataSource="null"
>>
>>               allowRegex="^.*\.xml$"
>>               manifestFileName="/Volumes/ts/man-find.txt"
>>               manifestAddRegex="(.*)$"
>>               >
>>
>> The idea is you have a log file or other report, perhaps
>> from tar or zip, and you wish to use this to control the
>> indexing of the new content. The new entity fields are as
>> follows.
>>
>> manifestFileName is the name of the manifest file. If
>>                 this value is relative, it assumed to
>>                 be relative to baseDir. Required.
>>
>> manifestAddRegex is a required regex to identify lines
>>                 which when matched should cause docs to
>>                 be added to the index.
>>
>> manifestDelRegex is an optional value of a regex to
>>                 identify documents which when matched should
>>                 be deleted from the index **PLANNED**
>>
>> allowRegex       a required regex to identify the portion
>>                 of the ADD/DELete line identified above
>>                 which contains the file or pathname to
>>                 ADDed or DELeted. If the resulting value
>>                 relative, it assumed to be relative to
>>                 baseDir.
>>
>> What do I do next?
>>   Raise a JIRA issue and add the code?
>>   Is DIH the right place to add this?
>>   Suggestions for a different name?
>>   Suggestions on how to do the delete bitty from within an entity?
>>
>> Regards Fergus.
>--Noble Paul

-- 

===
Fergus McMenemie   Email:fer...@twig.me.uk
Techmore Ltd   Phone:(UK) 07721 376021

Unix/Mac/Intranets Analyst Programmer
===


Re: a new DIH manifestEnityProcessor

2009-03-09 Thread Fergus McMenemie
>Hi Fergus,
>The idea is that we have something generic which can be applicable to
>a large set of users. If the manifest is a text file it can be read in
>somestandard way (say line by line). So we can have an EntityProcessor
>which reads a text file line and filer it by a regex like the way
>'grep' works.
Yes. That is what I have written. It is just an alternate form of the
FileListEntityProcessor except that rather than walking the file system
it reads from a file, line by line, and identifies the portion of the
line containing the filename using a regexp.


>
>On Mon, Mar 9, 2009 at 10:44 PM, Fergus McMenemie  wrote:
>>>manifest processing has a very limited usecase. Why can't it be
>>>processed using a PlainTextEntityProcessor and write a Tranformer to
>>>read lines using regex?
>>>
>> Ehmmm Ok. The PlainTextEntityProcessor docs do not give me enough
>> insight to see how this could be used to index each of the files
>> listed by a 'tar xvf' report. Can you explain further?
>>
>> About the limited usecase. Verity thought it was useful enough
>> to have there own "bulk insert file" or bif file format that
>> did the same and was far less flexible.
>>
>> In my experience we generally start off with some kind of
>> file walker or crawler looking after file repositories. But
>> these always proved slow and unreliable and over time they
>> were always replaced it with some kind of manifest based
>> control of the indexer. Where we could get a report of changes
>> we always used it, and only relied on walkers or crawlers
>> where we had to.
>>
>> Fergus
>>
>>>
>>>--Noble
>>>
>>>On Mon, Mar 9, 2009 at 8:30 PM, Fergus McMenemie  wrote:
>>>> Hello,
>>>>
>>>> I have almost finished a new DIH EntityProcessor which
>>>> I am calling the manifestEnityProcessor. It is designed
>>>> around the idea that whatever demon is used to maintain
>>>> your set of a few 100,000 xml documents it is likely to
>>>> drop a report or log file explaining what has been changed
>>>> within your content store. This assumes a file based
>>>> content repository.
>>>>
>>>> The manifestEnityProcessor is used as follows
>>>>
>>>>       >>>               processor="ManifestEntityProcessor"
>>>>               baseDir="/Volumes/Techmore/ts/aaa/schema/data"
>>>>               rootEntity="false"
>>>>               dataSource="null"
>>>>
>>>>               allowRegex="^.*\.xml$"
>>>>               manifestFileName="/Volumes/ts/man-find.txt"
>>>>               manifestAddRegex="(.*)$"
>>>>               >
>>>>
>>>> The idea is you have a log file or other report, perhaps
>>>> from tar or zip, and you wish to use this to control the
>>>> indexing of the new content. The new entity fields are as
>>>> follows.
>>>>
>>>> manifestFileName is the name of the manifest file. If
>>>>                 this value is relative, it assumed to
>>>>                 be relative to baseDir. Required.
>>>>
>>>> manifestAddRegex is a required regex to identify lines
>>>>                 which when matched should cause docs to
>>>>                 be added to the index.
>>>>
>>>> manifestDelRegex is an optional value of a regex to
>>>>                 identify documents which when matched should
>>>>                 be deleted from the index **PLANNED**
>>>>
>>>> allowRegex       a required regex to identify the portion
>>>>                 of the ADD/DELete line identified above
>>>>                 which contains the file or pathname to
>>>>                 ADDed or DELeted. If the resulting value
>>>>                 relative, it assumed to be relative to
>>>>                 baseDir.
>>>>
>>>> What do I do next?
>>>>   Raise a JIRA issue and add the code?
>>>>   Is DIH the right place to add this?
>>>>   Suggestions for a different name?
>>>>   Suggestions on how to do the delete bitty from within an entity?
>>>>
>>>> Regards Fergus.
>>>--Noble Paul
>>
>> --
>>
>> ===
>> Fergus McMenemie               Email:fer...@twig.me.uk
>> Techmore Ltd                   Phone:(UK) 07721 376021
>>
>> Unix/Mac/Intranets             Analyst Programmer
>> ===
>>
>
>
>
>-- 
>--Noble Paul

-- 

===
Fergus McMenemie   Email:fer...@twig.me.uk
Techmore Ltd   Phone:(UK) 07721 376021

Unix/Mac/Intranets Analyst Programmer
===


Re: DIH with a list of changed documents?

2009-03-09 Thread Fergus McMenemie
>Hello List,
>
>how would I implement entity-processor if I were able to get the list  
>of recently changed documents of our sites?
>
>thanks for hints.
>
>paul
>
>Attachment converted: OSX:smime 65.p7s (/) (00213A09)



Hmmm, this sounds like a job for my manifestEnityProcessor
see if you can find the thread titled:-
 
   "a new DIH manifestEnityProcessor"

is your list of changed documents a list of additions and
updates only, or does it contain deletes as well?

Fergus.

-- 

=======
Fergus McMenemie   Email:fer...@twig.me.uk
Techmore Ltd   Phone:(UK) 07721 376021

Unix/Mac/Intranets Analyst Programmer
===


Re: DIH with a list of changed documents?

2009-03-09 Thread Fergus McMenemie
>On 09-Mar-09 at 22:29, Fergus McMenemie wrote:
>>> how would I implement entity-processor if I were able to get the list
>>> of recently changed documents of our sites?
>>
>> H, this sounds like a job for my manifestEnityProcessor
>> see if you can find the thread titled:-
>>
>>   "a new DIH manifestEnityProcessor"
>>
>> is your list of changed documents a list of additions and
>> updates only, or does it contain deletes as well?
>
>Fergus,
>
>I think you should then rename it... Manifest is not the right name to  
>me (manifest refers to something such as the manifest of a jar or of  
>an IMS-content-package, both are metadata of the data).

It's all in the jargon, I guess. Our content repositories are changed
by update kits, some of the kits come with manifests or in other cases
we capture the output from un-tar or un-zip commands and we call these
manifests. The name is up for grabs if a better suggestion comes along;
I would have used FileListEntityProcessor except the name was taken;-)


>I looked at your original description and I could not read anything  
>about the changed files.
>The regex approach is a nice one for sure...

Yep, our "manifest"s quite often include jpegs, avis etc which we
do not want indexed. And if it's a tar output it will contain
directory stubs as well.

>I think a useful DIH Entity-processor that would maintain its deltas  
>well would have as parameters, url to a list of recently updated urls,  
>url to a list of recently deleted urls. Is this yours?

URLs, huh! Never thought of that, I was just assuming it would be a local
file. However I guess that could be added... so "manifestFileName" would
become "manifestURL"? In my use cases some of the "manifests" are along
the  lines of 

   ADD -checksum-xxx  --pathname_1--
   DEL --pathname_b--

Hence "manifestAddRegex" and "manifestDelRegex". I also, in other 
cases, have separate files, one for adding another for deleting.
This I was going to deal with as two separate DIH imports.
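
For that ADD/DEL format the two regexes would be along these lines (the
checksum field is skipped over; the exact patterns depend on the real layout):

    manifestAddRegex="^ADD\s+\S+\s+(.*)$"
    manifestDelRegex="^DEL\s+(.*)$"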

>I would have one for URLs with the list of recent things basically  
>from an RSS; the transformer is custom in all cases.

The output from my manifestEnityProcessor is fed to an
XPathEntityProcessor

>
>paul
>
Fergus.
-- 

===
Fergus McMenemie   Email:fer...@twig.me.uk
Techmore Ltd   Phone:(UK) 07721 376021

Unix/Mac/Intranets Analyst Programmer
===


Re: a new DIH manifestEnityProcessor SOLR-1060 on jira

2009-03-10 Thread Fergus McMenemie
OK, SOLR-1060 created.

>To this requirement I would add the basic requirement that this file  
>(what Fergus calls the manifest to which I still don't agree)  
>represents an update-set and that there should be a delete-set as well.
>
>ChangeSetEntityProcessor, on there I would jump with two feet.
>
>paul
>
>
>On 10-Mar-09 at 05:40, Noble Paul wrote:
>
>> Hi Fergus open a JIRA issue anyway. put in your thoughts and we can
>> refine the requirements as a part of the discussion.
>>
>> Basically the requirements are ,
>> 1)read a file line by line
>> 2) filter out lines (include or exclude ) based on a regex
>> 3) extract parts (named parts) from the line using another regex
>>
>> Noble
>>
>>
>> On Tue, Mar 10, 2009 at 1:50 AM, Fergus McMenemie  
>>  wrote:
>>>> Hi Fergus,
>>>> The idea is that we have something generic which can be applicable  
>>>> to
>>>> a large set of users. If the manifest is a text file it can be  
>>>> read in
>>>> somestandard way (say line by line). So we can have an  
>>>> EntityProcessor
>>>> which reads a text file line and filer it by a regex like the way
>>>> 'grep' works.
>>> Yes. That is what I have written. It is just an alternate form of the
>>> FileListEntityProcessor except that rather than walking the file  
>>> system
>>> it reads from a file, line by line, and identifies the portion of the
>>> line containing the filename using a regexp.
>>>
>>>
>>>>
>>>> On Mon, Mar 9, 2009 at 10:44 PM, Fergus McMenemie  
>>>>  wrote:
>>>>>> manifest processing has a very limited usecase. Why can't it be
>>>>>> processed using a PlainTextEntityProcessor and write a  
>>>>>> Tranformer to
>>>>>> read lines using regex?
>>>>>>
>>>>> Ehmmm Ok. The PlainTextEntityProcessor docs do not give me enough
>>>>> insight to see how this could be used to index each of the files
>>>>> listed by a 'tar xvf' report. Can you explain further?
>>>>>
>>>>> About the limited usecase. Verity thought it was useful enough
>>>>> to have there own "bulk insert file" or bif file format that
>>>>> did the same and was far less flexible.
>>>>>
>>>>> In my experience we generally start off with some kind of
>>>>> file walker or crawler looking after file repositories. But
>>>>> these always proved slow and unreliable and over time they
>>>>> were always replaced it with some kind of manifest based
>>>>> control of the indexer. Where we could get a report of changes
>>>>> we always used it, and only relied on walkers or crawlers
>>>>> where we had to.
>>>>>
>>>>> Fergus
>>>>>
>>>>>>
>>>>>> --Noble
>>>>>>
>>>>>> On Mon, Mar 9, 2009 at 8:30 PM, Fergus McMenemie >>>>> > wrote:
>>>>>>> Hello,
>>>>>>>
>>>>>>> I have almost finished a new DIH EntityProcessor which
>>>>>>> I am calling the manifestEnityProcessor. It is designed
>>>>>>> around the idea that whatever demon is used to maintain
>>>>>>> your set of a few 100,000 xml documents it is likely to
>>>>>>> drop a report or log file explaining what has been changed
>>>>>>> within your content store. This assumes a file based
>>>>>>> content repository.
>>>>>>>
>>>>>>> The manifestEnityProcessor is used as follows
>>>>>>>
>>>>>>>   >>>>>>   processor="ManifestEntityProcessor"
>>>>>>>   baseDir="/Volumes/Techmore/ts/aaa/schema/data"
>>>>>>>   rootEntity="false"
>>>>>>>   dataSource="null"
>>>>>>>
>>>>>>>   allowRegex="^.*\.xml$"
>>>>>>>   manifestFileName="/Volumes/ts/man-find.txt"
>>>>>>>   manifestAddRegex="(.*)$"
>>>>>>>   >
>>>>>>>
>>>>>>> The idea is you have a log file or other

Re: Problem using DIH templatetransformer to create uniqueKey: solved

2009-03-12 Thread Fergus McMenemie
Folks,

The template transformer will fail to return a row if a variable is undefined;
however, the regex transformer does still return. So where the
following would fail:-



This can be used instead:-



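Illustratively -- reusing field names from earlier in this thread -- a
TemplateTransformer field of the form

    <field column="id" template="${jc.fileAbsolutePath}#${x.vurl}" />

drops the row when ${x.vurl} is undefined, while a RegexTransformer field of
the form

    <field column="id" regex="(.*)" replaceWith="$1#${x.vurl}"
           sourceColName="fileAbsolutePath" />

still comes through.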
So I guess we have the best of both worlds!

Fergus.

>Hmmm. Just gave that a go! No luck
>But how many layers of defaults do we need?
>
>
>Rgds Fergus
>
>>What about having the template transformer support ${field:default}  
>>syntax?  I'm assuming it doesn't support that currently right?  The  
>>replace stuff in the config files does though.
>>
>>  Erik
>>
>>
>>On Feb 13, 2009, at 8:17 AM, Fergus McMenemie wrote:
>>
>>> Paul,
>>>
>>> Following up your usenet sussgetion:
>>>
>>> >> ignoreMissingVariables="true"/>
>>>
>>> and to add more to what I was thinking...
>>>
>>> if the field is undefined in the input document, but the schema.xml
>>> does allow a default value, then TemplateTransformer can use the
>>> default value. If there is no default value defined in schema.xml
>>> then it can fail as at present. This would allow "" or any other
>>> value to be fed into TemplateTransformer, and still enable avoidance
>>> of the partial strings you referred to.
>>>
>>> Regards Fergus.
>>>
>>>>> Hello,
>>>>>
>>>>> templatetransformer behaves rather ungracefully if one of the  
>>>>> replacement
>>>>> fields is missing.
>>>>
>>>> Looking at TemplateString.java I see that left to itself fillTokens  
>>>> would
>>>> replace a missing variable with "". It is an extra check in  
>>>> TemplateTransformer
>>>> that is throwing the warning and stopping the row being returned.  
>>>> Commenting
>>>> out the check seems to solve my problem.
>>>>
>>>> Having done this, an undefined replacement string in  
>>>> TemplateTransformer
>>>> is replaced with "". However a neater fix would probably involve  
>>>> making
>>>> use of the default value which can be assigned to a row? in  
>>>> schema.xml.
>>>>
>>>>> I am parsing a single XML document into multiple separate solr  
>>>>> documents.
>>>>> It turns out that none of the source documents fields can be used  
>>>>> to create
>>>>> a uniqueKey alone. I need to combine two, using template  
>>>>> transformer as
>>>>> follows:
>>>>>
>>>>> >>>> dataSource="myfilereader"
>>>>> processor="XPathEntityProcessor"
>>>>> url="${jc.fileAbsolutePath}"
>>>>> rootEntity="true"
>>>>> stream="false"
>>>>> forEach="/record | /record/mediaBlock"
>>>>> transformer 
>>>>> ="DateFormatTransformer,TemplateTransformer,RegexTransformer"
>>>>>>
>>>>>
>>>>> 
>>>>> >>>> sourceColName="fileAbsolutePath"/>
>>>>> 
>>>>> 
>>>>>
>>>>> The trouble is that vurl is only defined as a child of "/record/ 
>>>>> mediaBlock"
>>>>> so my attempt to create id, the uniqueKey fails for the parent  
>>>>> document "/record"
>>>>>
>>>>> I am hacking around with "TemplateTransformer.java" to sort this  
>>>>> but was
>>>>> wondering if there was a good reason for this behavior.
>>>>>

-- 

===
Fergus McMenemie   Email:fer...@twig.me.uk
Techmore Ltd   Phone:(UK) 07721 376021

Unix/Mac/Intranets Analyst Programmer
===


DIH use of the ?command=full-import entity= command option

2009-03-12 Thread Fergus McMenemie
Hello,

Can anybody describe the intended purpose, or provide a 
few examples, of how the DIH entity= command option works.

Am I supposed to build a data-conf.xml file which contains
many different alternate entities.. or 

Regards 
-- 

===
Fergus McMenemie   Email:fer...@twig.me.uk
Techmore Ltd   Phone:(UK) 07721 376021

Unix/Mac/Intranets Analyst Programmer
===


Re: DIH use of the ?command=full-import entity= command option

2009-03-12 Thread Fergus McMenemie
If my data-config.xml contains multiple root level entities
what is the expected action if I call full-import without an
entity=XXX sub-command?

Does it process all entities one after the other or only the
first? (It would be useful IMHO if it only did the first.)

>On Fri, Mar 13, 2009 at 3:17 AM, Fergus McMenemie  wrote:
>
>> Hello,
>>
>> Can anybody describe the intended purpose, or provide a
>> few examples, of how the DIH entity= command option works.
>>
>> Am I supposed to build a data-conf.xml file which contains
>> many different alternate entities.. or 
>>
>
>With the entity parameter you can specify the name of any root entity and
>import only that one. You can specify multiple entity parameters too. For
>example:
>/dataimport?command=full-import&entity=x&entity=y
>
>You may need to specify preImportDeleteQuery separately on each entity to
>make sure all documents are not deleted.
>-- 
>Regards,
>Shalin Shekhar Mangar.

-- 

===
Fergus McMenemie   Email:fer...@twig.me.uk
Techmore Ltd   Phone:(UK) 07721 376021

Unix/Mac/Intranets Analyst Programmer
===


Problem encoding ':' char in a solr query

2009-03-18 Thread Fergus McMenemie
Hello 

I have a solr field:-



which an unrelated query reveals is populated with:-


file:///Volumes/spare/ts/ford/schema/data/news/fdw2008/jn71796.xml


however when I try and query for that exact document explicitly:-

http://localhost:8080/apache-solr-1.4-dev/select?q=fileAbsolutePath:file%3a///Volumes/spare/ts/ford/schema/data/news/fdw2008/jn71796.xml&wt=xml

it fails. 

HTTP Status 400 - org.apache.lucene.queryParser.ParseException: Cannot parse 
'fileAbsolutePath:file:///Volumes/spare/ts/ford/schema/data/news/fdw2008/jn71796.xml':
 Encountered " ":" ": "" at line 1, column 21. Was expecting one of:  
 ...  ...  ... "+" ... "-" ... "(" ... "*" ... "^" ...  
...  ...  ...  ...  ... "[" ... "{" ... 
 ... 

My encoding did not work! Help!
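
PS: thinking about it, the %3a probably gets decoded back to ':' before the
query parser ever sees it, so the escaping needs to be done for the query
parser rather than for HTTP -- either backslash-escape the colon in the value:

    fileAbsolutePath:file\:///Volumes/spare/ts/ford/schema/data/news/fdw2008/jn71796.xml

or quote the whole value as a phrase:

    fileAbsolutePath:"file:///Volumes/spare/ts/ford/schema/data/news/fdw2008/jn71796.xml"

(with the backslash or quotes themselves URL-encoded in the request).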
-- 

===
Fergus McMenemie   Email:fer...@twig.me.uk
Techmore Ltd   Phone:(UK) 07721 376021

Unix/Mac/Intranets Analyst Programmer
===


Re: DIH - read datasource param values from property file or configure JNDI datasource

2009-03-19 Thread Fergus McMenemie
>I am looking for a implementation of DIH feature: It also takes in a 
>properties file for the data source configuration 
>(http://issues.apache.org/jira/browse/SOLR-469)
>
>I want to externalize the data source parameters like driver, url, user and 
>password to property file outside the solr. My aim to hide the parameters from 
>developer code in Production environment. So that admin can enter these values.
>
>Or else can DIH read JNDI data source from server environment.
>
>Let me know the best practice to follow in production environment?
>
>Thanks
>Shyamsunder
>

This is an idea rather than a recommendation.

But, as per the DIH FAQ, you can pass in extra arguments on the URL
used to invoke DIH and use these arguments within data-config.xml.
Not too sure to what extent they can be used with the various dataSources.
But if you are happy to pass the information in via the URL, or even
solrconfig.xml, then this may be the route to go down.
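
For example (illustrative only -- whether the request variables are resolved
inside dataSource attributes may depend on the Solr version):

    http://localhost:8080/apache-solr-1.4-dev/dataimport?command=full-import&jdbcurl=...&jdbcuser=...&jdbcpassword=...

    <dataSource driver="com.mysql.jdbc.Driver"
                url="${dataimporter.request.jdbcurl}"
                user="${dataimporter.request.jdbcuser}"
                password="${dataimporter.request.jdbcpassword}" />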

Fergus.
-- 


Re: Scheduling DIH

2009-03-26 Thread fergus mcmenemie

Hmmm, my tuppence worth!

IMHO I do not think this should be built into solr. Doing it properly 
leads to all kinds of nasty platform dependent issues... will we then 
want to add notification features on success/failure? via email?


Ideally, all the scheduled activities on a system should be centralised 
in one place such as cron, or as few places as possible. From a system 
administration point of view there is then a single location from which 
everything can be viewed and controlled. There are generally 
dependencies between different activities, and having to chase around and 
configure many separate proprietary schedulers is a nuisance as well as 
being error prone.
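
For Tricia's example -- a delta-import at 7am every Sunday -- a one-line
crontab entry along these lines is all it takes (host, port and handler path
are whatever your installation uses):

    0 7 * * 0  curl -s 'http://localhost:8080/apache-solr-1.4-dev/dataimport?command=delta-import' > /dev/null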


Fergus.


Tricia Williams wrote:

Hello,

   Is there a best way to schedule the DataImportHandler?  The idea 
being to schedule a delta-import every Sunday morning at 7am or perhaps 
every hour without human intervention.  Writing a cron job to do this 
wouldn't be difficult.  I'm just wondering is this a built in feature?


Tricia




Clarifying use of

2009-03-27 Thread fergus mcmenemie
Hello,

Due to limitations with the way my content is organised and DIH I have
to add “-imgCaption:[* TO *]” to some of my queries. I discovered the
name=”appends” functionality tucked away inside solrconfig.xml. This
looks like a very useful feature, and I created a new requestHandler to deal
with my problem queries. I tried adding the following to my alternate
requestHandler:-

   <str name="q">-imgCaption:[* TO *]</str>

which did not work; however

   <str name="fq">-imgCaption:[* TO *]</str>

worked fine and is also more efficient. I guess I was caught by the
“identify values which should be appended to the list of ***multi-val
params from the query” portion of the comment within solrconfig.xml.
I am now wondering how do I know which query params are "multi-val" or
not? Is this documented anywhere?
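
So, the configuration that works is along these lines (the handler name is
illustrative):

    <requestHandler name="/generalsearch" class="solr.SearchHandler">
      <lst name="appends">
        <str name="fq">-imgCaption:[* TO *]</str>
      </lst>
    </requestHandler>

The reason seems to be that fq is read as a multi-valued parameter -- every
appended value simply becomes another filter -- whereas q is effectively
single-valued, so an appended q is never picked up.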

Regards Fergus.





Re: [solr-user] Upgrade from 1.2 to 1.3 gives 3x slowdown

2009-03-30 Thread Fergus McMenemie
Grant,

After all my playing about at boot camp, I gave things a rest. It
was not till months later that got back to looking at solr again.
So after 643465 (2008-Apr-01)  the next version I tried was 694377 
from (2008-Sep-11). Nothing in between. Yep so 643465 is the latest
version I tried that still performs. Every later revision is slower.

However I need to repeat the tests using 643465, 694377 and whatever
is the latest version. On my macbook I am only seeing a 2x slowdown
of 643465 vis today, where as I had been seeing a 3x slowdown using
my Imac.

Fergus


>Fregus,
>
>Is rev 643465 the absolute latest you tried that still performs?  i.e.  
>every revision after is slower?
>
>-Grant
>
>On Mar 30, 2009, at 12:45 PM, Grant Ingersoll wrote:
>
>> Fergus,
>>
>> I think the problem may actually be due to something that was  
>> introduced by a change to Solr's StopFilterFactory and the way it  
>> loads the stop words set.  See 
>> https://issues.apache.org/jira/browse/SOLR-1095
>>
>> I am in the process of testing it out and will let you know.
>>
>> -Grant
>>
>> On Mar 28, 2009, at 11:00 AM, Grant Ingersoll wrote:
>>
>>> Hey Fergus,
>>>
>>> Finally got a chance to run your scripts, etc. per the thread:
>>> http://www.lucidimagination.com/search/document/5c3de15a4e61095c/upgrade_from_1_2_to_1_3_gives_3x_slowdown_script#8324a98d8840c623
>>>
>>> I can reproduce your slowdown.
>>>
>>> One oddity with rev 643465 is:
>>>
>>> On the old version, there is an exception during startup:
>>> Mar 28, 2009 10:44:31 AM org.apache.solr.common.SolrException log
>>> SEVERE: java.lang.NullPointerException
>>>   at  
>>> org 
>>> .apache 
>>> .solr 
>>> .handler 
>>> .component.SearchHandler.handleRequestBody(SearchHandler.java:129)
>>>   at  
>>> org 
>>> .apache 
>>> .solr 
>>> .handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java: 
>>> 125)
>>>   at org.apache.solr.core.SolrCore.execute(SolrCore.java:953)
>>>   at org.apache.solr.core.SolrCore.execute(SolrCore.java:968)
>>>   at  
>>> org 
>>> .apache 
>>> .solr.core.QuerySenderListener.newSearcher(QuerySenderListener.java: 
>>> 50)
>>>   at org.apache.solr.core.SolrCore$3.call(SolrCore.java:797)
>>>   at java.util.concurrent.FutureTask 
>>> $Sync.innerRun(FutureTask.java:303)
>>>   at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>>>   at java.util.concurrent.ThreadPoolExecutor 
>>> $Worker.runTask(ThreadPoolExecutor.java:885)
>>>   at java.util.concurrent.ThreadPoolExecutor 
>>> $Worker.run(ThreadPoolExecutor.java:907)
>>>   at java.lang.Thread.run(Thread.java:637)
>>>
>>> I see two things in CHANGES.txt that might apply, but I'm not sure:
>>> 1. I think commons-csv was upgraded
>>> 2. The CSV loader stuff was refactored to share common code
>>>
>>> I'm still investigating.
>>>
>>> -Grant
>>
>> --
>> Grant Ingersoll
>> http://www.lucidimagination.com/
>>
>> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
>> using Solr/Lucene:
>> http://www.lucidimagination.com/search

-- 

===
Fergus McMenemie   Email:fer...@twig.me.uk
Techmore Ltd   Phone:(UK) 07721 376021

Unix/Mac/Intranets Analyst Programmer
===


Re: [solr-user] Upgrade from 1.2 to 1.3 gives 3x slowdown

2009-03-30 Thread Fergus McMenemie
>Can you verify that rev 701485 still performs reasonably well?  This  
>is from October 2008 and I get similar results to the earlier rev. 
>Am now trying some other versions between October and when you first  
>reported the issue in November.

OK. Can you tell me how to get a hold of revision 701485. What is the
magic svn line?
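
My best guess so far (untested, and the repository layout may well
differ) is something along the lines of:

   svn checkout -r 701485 http://svn.apache.org/repos/asf/lucene/solr/trunk solr-r701485

but please correct me if there is more to it.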


>On Mar 30, 2009, at 3:37 PM, Grant Ingersoll wrote:
>
>> Fregus,
>>
>> Is rev 643465 the absolute latest you tried that still performs?   
>> i.e. every revision after is slower?
>>
>> -Grant
>>
>> On Mar 30, 2009, at 12:45 PM, Grant Ingersoll wrote:
>>
>>> Fergus,
>>>
>>> I think the problem may actually be due to something that was  
>>> introduced by a change to Solr's StopFilterFactory and the way it  
>>> loads the stop words set.  See 
>>> https://issues.apache.org/jira/browse/SOLR-1095
>>>
>>> I am in the process of testing it out and will let you know.
>>>
>>> -Grant
>>>
>>> On Mar 28, 2009, at 11:00 AM, Grant Ingersoll wrote:
>>>
>>>> Hey Fergus,
>>>>
>>>> Finally got a chance to run your scripts, etc. per the thread:
>>>> http://www.lucidimagination.com/search/document/5c3de15a4e61095c/upgrade_from_1_2_to_1_3_gives_3x_slowdown_script#8324a98d8840c623
>>>>
>>>> I can reproduce your slowdown.
>>>>
>>>> One oddity with rev 643465 is:
>>>>
>>>> On the old version, there is an exception during startup:
>>>> Mar 28, 2009 10:44:31 AM org.apache.solr.common.SolrException log
>>>> SEVERE: java.lang.NullPointerException
>>>>  at  
>>>> org 
>>>> .apache 
>>>> .solr 
>>>> .handler 
>>>> .component.SearchHandler.handleRequestBody(SearchHandler.java:129)
>>>>  at  
>>>> org 
>>>> .apache 
>>>> .solr 
>>>> .handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java: 
>>>> 125)
>>>>  at org.apache.solr.core.SolrCore.execute(SolrCore.java:953)
>>>>  at org.apache.solr.core.SolrCore.execute(SolrCore.java:968)
>>>>  at  
>>>> org 
>>>> .apache 
>>>> .solr 
>>>> .core.QuerySenderListener.newSearcher(QuerySenderListener.java:50)
>>>>  at org.apache.solr.core.SolrCore$3.call(SolrCore.java:797)
>>>>  at java.util.concurrent.FutureTask 
>>>> $Sync.innerRun(FutureTask.java:303)
>>>>  at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>>>>  at java.util.concurrent.ThreadPoolExecutor 
>>>> $Worker.runTask(ThreadPoolExecutor.java:885)
>>>>  at java.util.concurrent.ThreadPoolExecutor 
>>>> $Worker.run(ThreadPoolExecutor.java:907)
>>>>  at java.lang.Thread.run(Thread.java:637)
>>>>
>>>> I see two things in CHANGES.txt that might apply, but I'm not sure:
>>>> 1. I think commons-csv was upgraded
>>>> 2. The CSV loader stuff was refactored to share common code
>>>>
>>>> I'm still investigating.
>>>>
>>>> -Grant
>>>
>>> --
>>> Grant Ingersoll
>>> http://www.lucidimagination.com/
>>>
>>> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
>>> using Solr/Lucene:
>>> http://www.lucidimagination.com/search
>>>
>>
>> --
>> Grant Ingersoll
>> http://www.lucidimagination.com/
>>
>> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
>> using Solr/Lucene:
>> http://www.lucidimagination.com/search
>>
>
>--
>Grant Ingersoll
>http://www.lucidimagination.com/
>
>Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
>using Solr/Lucene:
>http://www.lucidimagination.com/search

-- 

===
Fergus McMenemie   Email:fer...@twig.me.uk
Techmore Ltd   Phone:(UK) 07721 376021

Unix/Mac/Intranets Analyst Programmer
===


Re: DIH; Hardcode field value/replacement based on source column

2009-03-31 Thread Fergus McMenemie
Hmmm, I am sure I have seen this as well!

 

I get the #${x.imgvurl} added twice.

Fergus.

>On 3/31/09 11:50 AM, "Wesley Small"  wrote:
>
>> I am trying to find a clean way to *hardcode* a field/column to a specific
>> value during the DIH process.  It does seems to be possible but I am getting
>> an slightly invalid constant value in my index.
>> 
>> > replaceWith="Video" />
>> 
>> However, the value in the index was set to "VideoVideo" for all documents.
>> 
>> Any idea why this DIH instruction would see constant value appear twice??
>> 

-- 

===
Fergus McMenemie   Email:fer...@twig.me.uk
Techmore Ltd   Phone:(UK) 07721 376021

Unix/Mac/Intranets Analyst Programmer
===


Re: [solr-user] Upgrade from 1.2 to 1.3 gives 3x slowdown

2009-03-31 Thread Fergus McMenemie
Grant,

I am messing with the script, and with your tip I expect I can
make it recurse over as many releases as needed.

I did run it again using the full file, this time using my Imac:-
643465  took  22min 14sec   2008-04-01
734796        73min 58sec   2009-01-15
758795        70min 55sec   2009-03-26
I then ran it again using only the first 1M records:-
643465  took  2m51.516s     2008-04-01
734796        7m29.326s     2009-01-15
758795        8m18.403s     2009-03-26
this time with commit=true.
643465  took  2m49.200s     2008-04-01
734796        8m27.414s     2009-01-15
758795        9m32.459s     2009-03-26
this time with commit=false&overwrite=false.
643465  took  2m46.149s     2008-04-01
734796        3m29.909s     2009-01-15
758795        3m26.248s     2009-03-26

Just read your latest post. I will apply the patches and retest
the above.

>Can you try adding &overwrite=false and running against the latest  
>version?  My current working theory is that Solr/Lucene has changed  
>how deletes are handled such that work that was deferred before is now  
>not deferred as often.  In fact, you are not seeing this cost paid (or  
>at least not noticing it) because you are not committing, but I  
>believe you do see it when you are closing down Solr, which is why it  
>takes so long to exit.
It can take ages! (>15min to get tomcat to quit). Also my script does
have the separate commit step, which does not take any time!

>I also think that Lucene adding fsync() into  
>the equation may cause some slow down, but that is a penalty we are  
>willing to pay as it gives us higher data integrity.
Data integrity is always good. However, if performance seems
unreasonable, users/customers tend to take things into their
own hands and kill the process or machine. This tends to be
very bad for data integrity.

>So, depending on how you have your data, I think a workaround is to:
>Add a field that contains a single term identifying the data type for  
>this particular CSV file, i.e. something like field: type, value:  
>fergs-csv
>Then, before indexing, you can issue a Delete By Query: type:fergs-csv  
>and then add your CSV file using overwrite=false.  This amounts to a  
>batch delete followed by a batch add, but without the add having to  
>issue deletes for each add.
Ok.. but... for these test cases I am starting off with an empty
index. The script does a "rm -rf solr/data" before tomcat is launched.
So I do not understand how the above helps. UNLESS there are duplicate
gaz entries.

>In the meantime, I'm trying to see if I can pinpoint down a specific  
>change and see if there is anything that might help it perform better.
>
>-Grant
>

-- 

===
Fergus McMenemie   Email:fer...@twig.me.uk
Techmore Ltd   Phone:(UK) 07721 376021

Unix/Mac/Intranets Analyst Programmer
===


Re: [solr-user] Upgrade from 1.2 to 1.3 gives 3x slowdown

2009-04-01 Thread Fergus McMenemie
Grant,

Redoing the work with your patch applied does not seem to 
make a difference! Is this the expected result?

I did run it again using the full file, this time using my Imac:-
643465     took  22min 14sec   2008-04-01
734796           73min 58sec   2009-01-15
758795           70min 55sec   2009-03-26
Again using only the first 1M records with commit=false&overwrite=true:-
643465     took  2m51.516s     2008-04-01
734796           7m29.326s     2009-01-15
758795           8m18.403s     2009-03-26
SOLR-1095        7m41.699s
this time with commit=true&overwrite=true.
643465     took  2m49.200s     2008-04-01
734796           8m27.414s     2009-01-15
758795           9m32.459s     2009-03-26
SOLR-1095        7m58.825s
this time with commit=false&overwrite=false.
643465     took  2m46.149s     2008-04-01
734796           3m29.909s     2009-01-15
758795           3m26.248s     2009-03-26
SOLR-1095        2m49.997s


-- 

===
Fergus McMenemie   Email:fer...@twig.me.uk
Techmore Ltd   Phone:(UK) 07721 376021

Unix/Mac/Intranets Analyst Programmer
===


Re: [solr-user] Upgrade from 1.2 to 1.3 gives 3x slowdown

2009-04-02 Thread Fergus McMenemie
>On Apr 1, 2009, at 9:39 AM, Fergus McMenemie wrote:
>
>> Grant,
>>
>> Redoing the work with your patch applied does not seem to
>
>>
>> make a difference! Is this the expected result?
>
>No, I didn't expect Solr 1095 to fix the problem. Overwrite = false +  
>1095, does, however, AFAICT by your last line, right?
>
>>
>>
>> I did run it again using the full file, this time using my Imac:-
>>  643465took  22min 14sec 2008-04-01
>>  734796  73min 58sec 2009-01-15
>>  758795  70min 55sec 2009-03-26
>> Again using only the first 1M records with  
>> commit=false&overwrite=true:-
>>  643465took  2m51.516s   2008-04-01
>>  734796  7m29.326s   2009-01-15
>>  758795  8m18.403s   2009-03-26
>>  SOLR-1095   7m41.699s
>> this time with commit=true&overwrite=true.
>>  643465took  2m49.200s   2008-04-01
>>  734796  8m27.414s   2009-01-15
>>  758795  9m32.459s   2009-03-26
>>  SOLR-1095   7m58.825s
>> this time with commit=false&overwrite=false.
>>  643465took  2m46.149s   2008-04-01
>>  734796  3m29.909s   2009-01-15
>>  758795  3m26.248s   2009-03-26
>>  SOLR-1095   2m49.997s
>>
Grant,

Hmmm, the big difference is made by &overwrite=false. But can you
explain why &overwrite=false makes such a difference? I am starting
off with an empty index and I have checked the content; there are no
duplicates in the uniqueKey field.

I guess if &overwrite=false then a few checks can be removed from the
indexing process, and if I am confident that my content contains no
duplicates then this is a good speed-up.
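
By way of illustration, loading such a tab-separated file with
overwrite=false looks something like this (host, port and handler path
invented):

   curl 'http://localhost:8080/solr/update/csv?commit=false&overwrite=false&separator=%09' \
        --data-binary @geonames.txt -H 'Content-type:text/plain; charset=utf-8'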

http://wiki.apache.org/solr/UpdateCSV says that if overwrite 
is true (the default) then overwrite documents based on the
uniqueKey. However what will solr/lucene do if the uniqueKey
is not unique and overwrite=false?  

fergus: perl -nlaF"\t" -e 'print "$F[2]";' geonames.txt | wc -l
 100
fergus: perl -nlaF"\t" -e 'print "$F[2]";' geonames.txt | sort -u | wc -l
 100
fergus: /usr/bin/head geonames.txt
RC  UFI UNI LAT LONGDMS_LAT DMS_LONGMGRSJOG 
FC  DSG PC  CC1 ADM1ADM2POP ELEVCC2 NT  
LC  SHORT_FORM  GENERIC SORT_NAME   FULL_NAME   FULL_NAME_ND
MODIFY_DATE
1   -130782860524   12.47   -69.9   122800  -695400 
19PDP0219578323 ND19-14 T   MT  AA  00  
PALUMARGA   Palu Marga  Palu Marga  1995-03-23
1   -1307756-189172012.5-70.016667  123000  -700100 
19PCP8952982056 ND19-14 P   PPLX

PS. do you want me to do some kind of chop through the
different versions to see where the slow down happened
or are you happy you have nailed it?
-- 

===
Fergus McMenemie   Email:fer...@twig.me.uk
Techmore Ltd   Phone:(UK) 07721 376021

Unix/Mac/Intranets Analyst Programmer
===


Problem using ExtractingRequestHandler with tomcat

2009-04-02 Thread Fergus McMenemie
Hello all,

I can't get ExtractingRequestHandler to work with tomcat. Using the
latest version from svn and then a "make clean dist" and copying the
war file to a clean tomcat does not work.

Adding the following to solrconfig.xml and restarting tomcat I get

>  <requestHandler name="/update/extract"
>      class="org.apache.solr.handler.extraction.ExtractingRequestHandler">
>    <lst name="defaults">
>      <str name="ext.map.Last-Modified">last_modified</str>
>      <bool name="ext.ignore.und.fl">true</bool>
>    </lst>
>  </requestHandler>


>Apr 2, 2009 9:20:02 AM org.apache.solr.util.plugin.AbstractPluginLoader load
>INFO: created /update/javabin: 
>org.apache.solr.handler.BinaryUpdateRequestHandler
>Apr 2, 2009 9:20:02 AM org.apache.solr.common.SolrException log
>SEVERE: org.apache.solr.common.SolrException: Error loading class 
>'org.apache.solr.handler.extraction.ExtractingRequestHandler'
>   at 
> org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:310)
>   at 
> org.apache.solr.core.SolrResourceLoader.newInstance(SolrResourceLoader.java:325)
>   at 
> org.apache.solr.util.plugin.AbstractPluginLoader.create(AbstractPluginLoader.java:84)
>   at 
> org.apache.solr.core.RequestHandlers$1.create(RequestHandlers.java:154)
>   at 
> org.apache.solr.core.RequestHandlers$1.create(RequestHandlers.java:163)

Any ideas?
-- 

===
Fergus McMenemie   Email:fer...@twig.me.uk
Techmore Ltd   Phone:(UK) 07721 376021

Unix/Mac/Intranets Analyst Programmer
===


Re: Problem using ExtractingRequestHandler with tomcat

2009-04-02 Thread Fergus McMenemie
>On Apr 2, 2009, at 4:26 AM, Fergus McMenemie wrote:
>> I cant get ExtractingRequestHandler to work with tomcat. Using the
>> latest version from svn and then a "make clean dist" and copying the
>> war file to a clean tomcat does not work.
>
>make?!  :)
Oops!

>
>try "ant example" to see if that gets it working - it copies the  
>ExtractingRequestHandler JAR and dependencies to /lib
>
>   Erik
>
Thanks. Copying all those jar files to my solr/lib directory was
the trick. But why do I have to do this; is it by design or 
because ExtractingRequestHandler is yet to be fully incorporated 
into Solr?

Regards Fergus.
-- 

===
Fergus McMenemie   Email:fer...@twig.me.uk
Techmore Ltd   Phone:(UK) 07721 376021

Unix/Mac/Intranets Analyst Programmer
===


Using ExtractingRequestHandler to index a large PDF

2009-04-02 Thread Fergus McMenemie
Hello,

Sorry if this is a FAQ; I suspect it could be. But how do I work around the 
following:-

INFO: [] webapp=/apache-solr-1.4-dev path=/update/extract 
params={ext.def.fl=text&ext.literal.id=factbook/reference_maps/pdf/oceania.pdf} 
status=0 QTime=318 
Apr 2, 2009 11:17:46 AM org.apache.solr.common.SolrException log
SEVERE: 
org.apache.commons.fileupload.FileUploadBase$SizeLimitExceededException: the 
request was rejected because its size (4585774) exceeds the configured maximum 
(2097152)
at 
org.apache.commons.fileupload.FileUploadBase$FileItemIteratorImpl.(FileUploadBase.java:914)
at 
org.apache.commons.fileupload.FileUploadBase.getItemIterator(FileUploadBase.java:331)
at 
org.apache.commons.fileupload.FileUploadBase.parseRequest(FileUploadBase.java:349)
at 
org.apache.commons.fileupload.servlet.ServletFileUpload.parseRequest(ServletFileUpload.java:126)
at 
org.apache.solr.servlet.MultipartRequestParser.parseParamsAndFillStreams(SolrRequestParsers.java:343)
at 
org.apache.solr.servlet.StandardRequestParser.parseParamsAndFillStreams(SolrRequestParsers.java:396)
at 
org.apache.solr.servlet.SolrRequestParsers.parse(SolrRequestParsers.java:114)
at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:217)
at 
org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:202)
at 
org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:173)
at 
org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:213)
at 
org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:178)
at 
org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:126)
at 
org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:105)

Although the PDF is big, it contains very little text; it is a map. 

   "java -jar solr/lib/tika-0.3.jar -g" appears to have no bother with it.

Fergus...
-- 

=======
Fergus McMenemie   Email:fer...@twig.me.uk
Techmore Ltd   Phone:(UK) 07721 376021

Unix/Mac/Intranets Analyst Programmer
===


Re: [solr-user] Upgrade from 1.2 to 1.3 gives 3x slowdown

2009-04-02 Thread Fergus McMenemie
Grant,



>I should note, however, that the speed difference you are seeing may  
>not be as pronounced as it appears.  If I recall during ApacheCon, I  
>commented on how long it takes to shutdown your Solr instance when  
>exiting it.  That time it takes is in fact Solr doing the work that  
>was put off by not committing earlier and having all those deletes  
>pile up.
>
I am confused about "work that was put off" vs committing. My script
was doing a commit right after the CSV import, and you are right
about the massive times required to shut tomcat down. But in my tests
the time taken to do the commit was under a second, yet I had to allow
300secs for tomcat shutdown. Also I don't have any duplicates. So
what sort of work was being done at shutdown that was not being done
by a commit? Optimise!

Thanks for the all the help.

Fergus.
-- 

=======
Fergus McMenemie   Email:fer...@twig.me.uk
Techmore Ltd   Phone:(UK) 07721 376021

Unix/Mac/Intranets Analyst Programmer
===


Re: Additive filter queries

2009-04-03 Thread Fergus McMenemie
>I have a design question for all of those who might be willing to provide an
>answer.
>
>We are looking for a way to do a type of additive filters.  Our documents
>are comprised of a single item of a specified color.  We will use shoes as
>an example.  Each document contains a multivalued ³size² field with all
>sizes and a multivalued ³width² field for all widths available for a given
>color.  Our issue is that the values are not linked to each other.  This
>issue can be seen when a user chooses a size (e.g. 7) and we filter the
>options down to only size 7.  When the width facet is displayed it will have
>all widths available for all documents that match on size 7 even though most
>don¹t come in a wide width.  We are looking for strategies to filter facets
>based on other facets in separate queries.
>
>-- 
>Jeff Newburn
>Software Engineer, Zappos.com
>jnewb...@zappos.com - 702-943-7562

Ditto!

As best I understand, you somehow need to arrange for each different
combination of colour, size and width to be indexed as a separate solr
document.

-- 

=======
Fergus McMenemie   Email:fer...@twig.me.uk
Techmore Ltd   Phone:(UK) 07721 376021

Unix/Mac/Intranets Analyst Programmer
===


Re: DIH API for specifying a either specific or all configurations imported

2009-04-06 Thread Fergus McMenemie
>Good Morning,
>
>Is there any way to specify or debug a specific DIH configuration via the
>API/http request?
>
>I have the following:
>
>
>dih_pc_default_feed.xml
>
>
>dih_pc_cms_article_feed.xml
>
>
>dih_pc_local_event_feed.xml
>
>
>For example, is there any to specific only the "pc_local_event" be process
>(imported)?
>
>Another questions, if command=full-import, this should effectively mean that
>all DIH configuration are executed in sequential order.  Is that correct?  I
>am not seeing that behaviour at present.
>

Wesley,

I do not think the above is valid syntactically.

I am still coming up to speed on DIH, however I have taken to storing all
my DIH import configurations in a single file. Each of your different
configurations would be within its own top-level entity tag, each of which
MUST be named. It is also a good idea to explicitly name each of your
dataSource descriptions, and then have the entities reference their
dataSource by name. I can then invoke only that entity from the URL as
follows:-

http://localhost:8080/apache-solr-1.4-dev/dataimport?command=full-import&entity=jc
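
By way of illustration only (the entity names, dataSource settings,
paths and SQL below are invented), such a combined data-config.xml
looks something like this:

   <dataConfig>
     <dataSource name="ds-files" type="FileDataSource" encoding="UTF-8"/>
     <dataSource name="ds-db"    type="JdbcDataSource"
                 driver="org.postgresql.Driver"
                 url="jdbc:postgresql://localhost/feeds"
                 user="feeds" password="secret"/>
     <document>
       <entity name="jc" dataSource="ds-files"
               processor="XPathEntityProcessor"
               url="/data/jc/feed.xml" forEach="/records/record">
         <field column="id"    xpath="/records/record/id"/>
         <field column="title" xpath="/records/record/title"/>
       </entity>
       <entity name="events" dataSource="ds-db"
               query="select id, title from events"/>
     </document>
   </dataConfig>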

See the docs at:-

http://wiki.apache.org/solr/DataImportHandler#head-1582242c1bfc1f3e89f4025bf2055791848acefb

Fergus.

-- 

===
Fergus McMenemie   Email:fer...@twig.me.uk
Techmore Ltd   Phone:(UK) 07721 376021

Unix/Mac/Intranets Analyst Programmer
===


Re: Using ExtractingRequestHandler to index a large PDF ~solved

2009-04-06 Thread Fergus McMenemie
Hmmm,

Not sure how this all hangs together. But editing my solrconfig.xml as follows
sorted the problem:-


   <requestParsers multipartUploadLimitInKB="2048" />
to
   <requestParsers multipartUploadLimitInKB="20048" />


Also, my initial report of the issue was misled by the log messages. The mention
of "oceania.pdf" refers to a previous successful tika extract. There is no mention
of the filename that was rejected in the logs, or any information that would help
me identify it!

Regards Fergus.

>Sorry if this is a FAQ; I suspect it could be. But how do I work around the 
>following:-
>
>INFO: [] webapp=/apache-solr-1.4-dev path=/update/extract 
>params={ext.def.fl=text&ext.literal.id=factbook/reference_maps/pdf/oceania.pdf}
> status=0 QTime=318 
>Apr 2, 2009 11:17:46 AM org.apache.solr.common.SolrException log
>SEVERE: 
>org.apache.commons.fileupload.FileUploadBase$SizeLimitExceededException: the 
>request was rejected because its size (4585774) exceeds the configured maximum 
>(2097152)
>   at 
> org.apache.commons.fileupload.FileUploadBase$FileItemIteratorImpl.(FileUploadBase.java:914)
>   at 
> org.apache.commons.fileupload.FileUploadBase.getItemIterator(FileUploadBase.java:331)
>   at 
> org.apache.commons.fileupload.FileUploadBase.parseRequest(FileUploadBase.java:349)
>   at 
> org.apache.commons.fileupload.servlet.ServletFileUpload.parseRequest(ServletFileUpload.java:126)
>   at 
> org.apache.solr.servlet.MultipartRequestParser.parseParamsAndFillStreams(SolrRequestParsers.java:343)
>   at 
> org.apache.solr.servlet.StandardRequestParser.parseParamsAndFillStreams(SolrRequestParsers.java:396)
>   at 
> org.apache.solr.servlet.SolrRequestParsers.parse(SolrRequestParsers.java:114)
>   at 
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:217)
>   at 
> org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:202)
>   at 
> org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:173)
>   at 
> org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:213)
>   at 
> org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:178)
>   at 
> org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:126)
>   at 
> org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:105)
>
>Although the PDF is big, it contains very little text; it is a map. 
>
>   "java -jar solr/lib/tika-0.3.jar -g" appears to have no bother with it.
>
>Fergus...
>-- 
>
>===
>Fergus McMenemie   Email:fer...@twig.me.uk
>Techmore Ltd   Phone:(UK) 07721 376021
>
>Unix/Mac/Intranets     Analyst Programmer
>===

-- 

===
Fergus McMenemie   Email:fer...@twig.me.uk
Techmore Ltd   Phone:(UK) 07721 376021

Unix/Mac/Intranets Analyst Programmer
===


Re: Searching on mulit-core Solr

2009-04-06 Thread Fergus McMenemie
vivek,

404 from the URL you provided in the message! Similar URLs work
OK for me.

Hmm, try http://localhost:8080/solr/admin/cores?action=status and see
if that gives a 404.

Also are you running a nightly build or a svn checkout? Using tomcat?
Perhaps it should be

http://localhost:8080/apache-solr-1.4-dev/admin/cores?action=status

Fergus.

>Hi,
>
>  Any help on this. I've looked at DistributedSearch on Wiki, but that
>doesn't seem to be working for me on multi-core and multiple Solr
>instances on the same box.
>
>Scenario,
>
>1) Two boxes (localhost, 10.4.x.x)
>2) Two Solr instances on each box (8080 and 8085 ports)
>3) Two cores on each instance (core0, core1)
>
>I'm not sure how to construct my search on the above setup if I need
>to search across all the cores on all the boxes. Here is what I'm
>trying,
>
>http://localhost:8080/solr/core0/select?shards=localhost:8080/solr/core0,localhost:8085/solr/core0,localhost:8080/solr/core1,localhost:8085/solr/core1,10.4.x.x:8080/solr/core0,10.4.x.x:8085/solr/core0,10.4.x.x:8080/solr/core1,10.4.x.x:8085/solr/core1&indent=true&q=vivek+japan
>
>I get 404 error. Is this the right URL construction for my setup? How
>else can I do this?
>
>Thanks,
>-vivek
>
>On Fri, Apr 3, 2009 at 1:02 PM, vivek sar  wrote:
>> Hi,
>>
>>  I've a multi-core system (one core per day), so there would be around
>> 30 cores in a month on a box running one Solr instance. We have two
>> boxes running the Solr instance and input data is feeded to them in
>> round-robin fashion. Each box can have up to 30 cores in a month. Here
>> are questions,
>>
>>  1) How would I search for a term in multiple cores on same box?
>>
>>  Single core I'm able to search like,
>>   http://localhost:8080/solr/20090402/select?q=*:*
>>
>> 2) How would I search for a term in multiple cores on both boxes at
>> the same time?
>>
>> 3) Is it possible to have two Solr instances on one box with one doing
>> the indexing and other perform only searches on that index? The idea
>> is have two JVMs with each doing its own task - I'm not sure whether
>> the indexer process needs to know about searcher process - like do
>> they need to have the same solr.xml (for multicore etc). We don't want
>> to replicate the indexes also (we got very light search traffic, but
>> very high indexing traffic) so they need to use the same index.
>>
>>
>> Thanks,
>> -vivek
>>

-- 

===
Fergus McMenemie   Email:fer...@twig.me.uk
Techmore Ltd   Phone:(UK) 07721 376021

Unix/Mac/Intranets Analyst Programmer
===


Re: How could I avoid reindexing same files?

2009-04-07 Thread Fergus McMenemie
Veselin,

Well, as far as solr is concerned, there are two issues here:-

1) To stop the same document ending up in the indexes twice, use the document
   pathname as the unique ID. Then if you do index it twice, the previous index
   information will be discarded. Not very efficient, but it may be tolerable.
   IMHO using pathname as the unique ID is often best practice.

2) To stop a document even being submitted to solr, you need to implement some
   middleware that either performs a search/lookup using a document's pathname
   to see if it is already indexed, or, after examining timestamps, only submits
   documents which have changed since the last folder scan.
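
A minimal sketch of option 2 (field name, path and port are only
examples) would be to look the pathname up before posting:

   curl 'http://localhost:8080/solr/select?q=id:"docs/file.pdf"&fl=id&rows=0'

and only submit the file when numFound comes back as 0.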

Fergus.
>Hello Paul,
>I'm indexing with "curl http://localhost... -F myfi...@file.pdf" 
>
>Regards,
>Veselin K
>
>
>On Mon, Apr 06, 2009 at 02:56:20PM +0530, Noble Paul ?  
>?? wrote:
>> how are you indexing?
>> 
>> On Mon, Apr 6, 2009 at 2:54 PM, Veselin Kantsev
>>  wrote:
>> > Hello,
>> > apologies for the basic question.
>> >
>> > How can I avoid double indexing files?
>> >
>> > In case all my files are in one folder which is scanned frequently, is
>> > there a Solr feature of checking and skipping a file if it has already 
>> > been indexed
>> > and not changed since?
>> >
>> >
>> > Thank you.
>> >
>> > Regards,
>> > Veselin K
>> >
>> >
>> 
>> 
>> 
>> -- 
>> --Noble Paul

-- 

===
Fergus McMenemie   Email:fer...@twig.me.uk
Techmore Ltd   Phone:(UK) 07721 376021

Unix/Mac/Intranets Analyst Programmer
===


Re: How could I avoid reindexing same files?

2009-04-07 Thread Fergus McMenemie
>Thank you much Fergus,
>
>I was considering implementing a database which would hold a path name
>and an MD5 sum of each file.
Snap. That is close to what we did. However, due to our previous
duff full text search engine we had to hold this information in
a separate checksums file. Solr is much better at allowing you
to add extra meta information as the document is being submitted
for indexing.

curl http://localhost...update/extract 
   -F "myfi...@file.pdf;ext.literal.id=file.pdg;ext.literal.chksum=X"

>Then as a part of Solr indexing, one could check against the DB if a
>file path exists, if Yes, then compare MD5 and only index if different.
Using solr you could hold the checksum and pathname as solr fields;
then rather than looking up a DB you would look up solr. Having
everything in the one place is better for consistency and quality. You
could also dump all checksums and pathnames from solr if/when you wanted
to validate your folder structure and/or indexes.
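(Something as simple as a query along the lines of
   http://localhost:8080/solr/select?q=*:*&fl=id,chksum&rows=1000000
would pull the whole lot back out for that sort of audit; the field
names here are obviously whatever you chose when indexing.)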

>Regards,
>Veselin K
>
>On Tue, Apr 07, 2009 at 09:01:31AM +0100, Fergus McMenemie wrote:
>> Veselin,
>> 
>> Well, as far as solr is concerned, there is two issues here:-
>> 
>> 1) To stop the same document ending up in the indexes twice, use the document
>>pathname as the unique ID. Then if you do index it twice, the previous 
>> index
>>information will be discarded. Not very efficient, but it may be 
>> tolerable.
>>IMHO using pathname as the unique ID is often best practice.
>> 
>> 2) To stop a document even being submitted to solr. You need to implement 
>> some
>>middle ware that either performs a search/lookup using a documents 
>> pathname
>>to see if it is already indexed. Or, after examining timestampts, only 
>> submits
>>documents which have changed since the last folder scan.
>> 
>> Fergus.
>> >Hello Paul,
>> >I'm indexing with "curl http://localhost... -F myfi...@file.pdf" 
>> >
>> >Regards,
>> >Veselin K
>> >
>> >
>> >On Mon, Apr 06, 2009 at 02:56:20PM +0530, Noble Paul ?  
>> >?? wrote:
>> >> how are you indexing?
>> >> 
>> >> On Mon, Apr 6, 2009 at 2:54 PM, Veselin Kantsev
>> >>  wrote:
>> >> > Hello,
>> >> > apologies for the basic question.
>> >> >
>> >> > How can I avoid double indexing files?
>> >> >
>> >> > In case all my files are in one folder which is scanned frequently, is
>> >> > there a Solr feature of checking and skipping a file if it has already 
>> >> > been indexed
>> >> > and not changed since?
>> >> >
>> >> >
>> >> > Thank you.
>> >> >
>> >> > Regards,
>> >> > Veselin K

>> >> --Noble Paul
-- 

===
Fergus McMenemie   Email:fer...@twig.me.uk
Techmore Ltd   Phone:(UK) 07721 376021

Unix/Mac/Intranets Analyst Programmer
===


Re: DIH; Hardcode field value/replacement based on source column

2009-04-08 Thread Fergus McMenemie
>: Indeed. I wrote the following test:
>: 
>: Pattern p = Pattern.compile("(.*)");
>: Matcher m = p.matcher("xyz");
>: Assert.assertEquals("", "Video", m.replaceAll("Video"));
>: 
>: The test fails. It gives "VideoVideo" as the actual result. I guess there is
>: something about Matcher.replaceAll that I don't know. Off to read the
>: javadocs then.
>
>".*" matches the empty string (for that matter any regex clause with the 
>"*" modifier applied matches the empty string), and iterating over pattern 
>matches (ie: what happens if you call Matcher.find() or 
>Matcher.replaceAll()) always advances to "first character not matched by 
>[the previous] match." (ie: let prev = m.end(); if (m.find) then prev <= 
>m.start()).
>
>So ".*" always matches twice on any given String x ... once when it 
>matches from 0 to x.length()-1, and one when it matches the empty string 
>starting and ending at x.length()-1.
>
>That's why using "^.*" doesn't have this problem ... "*" is greedy so it 
>only matches once at the start of the string and then there can't be any 
>more matches.  Conversly: ".*$" and ".*\z" will still have this problem, 
>because any number of matches can have the same ending offset.
>
>
>-Hoss

Hmmm, given the chance perl behaves the same. Although attempting
to use  /*/ fails. Another lesson learnt!

#! /usr/local/bin/perl
use strict;
my($s)="cat mat rat hat";
my($c)=0;

print " a-match", ++$c, "='$1'\n" while( $s =~ m/(at)/g ); 
$c=0;
print " b-match", ++$c, "='$1'\n" while( $s =~ m/(.*)/g );
$c=0;
print " c-match", ++$c, "='$1'\n" while( $s =~ m/^(.*)/g );
$c=0;
print " d-match", ++$c, "='$1'\n" while( $s =~ m/(.*)$/g );

-- 

===
Fergus McMenemie   Email:fer...@twig.me.uk
Techmore Ltd   Phone:(UK) 07721 376021

Unix/Mac/Intranets Analyst Programmer
===


Re: How could I avoid reindexing same files?

2009-04-08 Thread Fergus McMenemie
>Hi Fergus,
>
>On Tue, Apr 07, 2009 at 05:06:23PM +0100, Fergus McMenemie wrote:
>> >Thank you much Fergus,
>> >
>> >I was considering implementing a database which would hold a path name
>> >and an MD5 sum of each file.
>> Snap. That is close to what we did. However due to our pervious
>> duff full text search engine we had to hold this information in
>> a separate checksums file. Solr is much better at allowing you
>> to add extra meta information as the document is being submitted
>> for indexing.
>> 
>> curl http://localhost...update/extract 
>>-F "myfi...@file.pdf;ext.literal.id=file.pdf;ext.literal.chksum=X"
>
>- Great idea, simpler and cleaner!
>
> 
>> >Then as a part of Solr indexing, one could check against the DB if a
>> >file path exists, if Yes, then compare MD5 and only index if different.
>> Using solr you could hold the checksum and pathname as solr fields,
>> then rather than looking up a DB you would look up solr. Having every
>> thing in the one place is better for consistency and quality. You
>> could also dump all checksums and pathnames from solr if/when you wanted
>> to validate your folder structure and or indexes.
>
>- What kind of query could I use with Solr, to check for a specific
>  filename/checksum and get an answer as close to "TRUE or FALSE" as possible?

Some thought needs to be given to this to make sure that
the performance is adequate. But at its simplest:-

curl 'http://localhost.../select?q=id:file.pdf&fl=id,chksum'
-- 

===
Fergus McMenemie   Email:fer...@twig.me.uk
Techmore Ltd   Phone:(UK) 07721 376021

Unix/Mac/Intranets Analyst Programmer
===


Re: Searching on mulit-core Solr

2009-04-09 Thread Fergus McMenemie
Valve.invoke(StandardContextValve.java:191)
>>        at 
>> org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:128)
>>        at 
>> org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
>>        at 
>> org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
>>        at 
>> org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:286)
>>        at 
>> org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:845)
>>        at 
>> org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:583)
>>        at 
>> org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:447)
>>        at java.lang.Thread.run(Thread.java:637)
>>
>>
>> Any tips on how can I search on multicore on same solr instance?
>>
>> Thanks,
>> -vivek
>>
>> On Mon, Apr 6, 2009 at 2:40 PM, Fergus McMenemie  wrote:
>>> vivek,
>>>
>>> 404 from the URL you provided in the message! Similar URLs work
>>> OK for me.
>>>
>>> hmm try http://localhost:8080/solr/admin/cores?action=status and see
>>> if that gives a 404.
>>>
>>> Also are you running a nightly build or a svn checkout? Using tomcat?
>>> Perhaps it should be
>>>
>>> http://localhost:8080/apache-solr-1.4-dev/admin/cores?action=status
>>>
>>> Fergus.
>>>
>>>>Hi,
>>>>
>>>>  Any help on this. I've looked at DistributedSearch on Wiki, but that
>>>>doesn't seem to be working for me on multi-core and multiple Solr
>>>>instances on the same box.
>>>>
>>>>Scenario,
>>>>
>>>>1) Two boxes (localhost, 10.4.x.x)
>>>>2) Two Solr instances on each box (8080 and 8085 ports)
>>>>3) Two cores on each instance (core0, core1)
>>>>
>>>>I'm not sure how to construct my search on the above setup if I need
>>>>to search across all the cores on all the boxes. Here is what I'm
>>>>trying,
>>>>
>>>>http://localhost:8080/solr/core0/select?shards=localhost:8080/solr/core0,localhost:8085/solr/core0,localhost:8080/solr/core1,localhost:8085/solr/core1,10.4.x.x:8080/solr/core0,10.4.x.x:8085/solr/core0,10.4.x.x:8080/solr/core1,10.4.x.x:8085/solr/core1&indent=true&q=vivek+japan
>>>>
>>>>I get 404 error. Is this the right URL construction for my setup? How
>>>>else can I do this?
>>>>
>>>>Thanks,
>>>>-vivek
>>>>
>>>>On Fri, Apr 3, 2009 at 1:02 PM, vivek sar  wrote:
>>>>> Hi,
>>>>>
>>>>>  I've a multi-core system (one core per day), so there would be around
>>>>> 30 cores in a month on a box running one Solr instance. We have two
>>>>> boxes running the Solr instance and input data is feeded to them in
>>>>> round-robin fashion. Each box can have up to 30 cores in a month. Here
>>>>> are questions,
>>>>>
>>>>>  1) How would I search for a term in multiple cores on same box?
>>>>>
>>>>>  Single core I'm able to search like,
>>>>>   http://localhost:8080/solr/20090402/select?q=*:*
>>>>>
>>>>> 2) How would I search for a term in multiple cores on both boxes at
>>>>> the same time?
>>>>>
>>>>> 3) Is it possible to have two Solr instances on one box with one doing
>>>>> the indexing and other perform only searches on that index? The idea
>>>>> is have two JVMs with each doing its own task - I'm not sure whether
>>>>> the indexer process needs to know about searcher process - like do
>>>>> they need to have the same solr.xml (for multicore etc). We don't want
>>>>> to replicate the indexes also (we got very light search traffic, but
>>>>> very high indexing traffic) so they need to use the same index.
>>>>>
>>>>>
>>>>> Thanks,
>>>>> -vivek
>>>>>
>>>
>>> --
>>>
>>> ===
>>> Fergus McMenemie               Email:fer...@twig.me.uk
>>> Techmore Ltd                   Phone:(UK) 07721 376021
>>>
>>> Unix/Mac/Intranets             Analyst Programmer
>>> ===
>>>
>>

-- 

===
Fergus McMenemie   Email:fer...@twig.me.uk
Techmore Ltd   Phone:(UK) 07721 376021

Unix/Mac/Intranets Analyst Programmer
===


Re: Using ExtractingRequestHandler to index a large PDF ~solved

2009-04-14 Thread Fergus McMenemie
>On Apr 6, 2009, at 10:16 AM, Fergus McMenemie wrote:
>
>> Hmmm,
>>
>> Not sure how this all hangs together. But editing my solrconfig.xml  
>> as follows
>> sorted the problem:-
>>
>>> multipartUploadLimitInKB="2048" />
>> to
>>
>>> multipartUploadLimitInKB="20048" />
>>
>
>We should document this on the wiki or in the config, if it isn't  
>already.

As best I could tell it is not documented. I stumbled across
the idea of changing multipartUploadLimitInKB after reviewing 
http://wiki.apache.org/solr/UpdateRichDocuments. But this leads
on to wondering whether streaming files from a local disk is in some
way also available via enableRemoteStreaming for the solr-cell
feature? With 20:20 hindsight I see that
http://wiki.apache.org/solr/SolrConfigXml does briefly refer
to "file upload size".

I feel that the requestDispatcher section of solrconfig.xml
needs a more complete description. I get the impression it
acts as a filter on *any* URL sent to Solr? What does it do?

I will mark up the wiki when this is clarified.
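
For context, the section I am puzzling over looks roughly like this in
my local copy of the example solrconfig.xml (treat the exact values as
illustrative):

   <requestDispatcher handleSelect="true">
     <requestParsers enableRemoteStreaming="false"
                     multipartUploadLimitInKB="20048" />
     <httpCaching never304="true"/>
   </requestDispatcher>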


>
>> Also, my initial report of the issue was misled by the log messages.  
>> The mention
>> of "oceania.pdf" refers to a previous successful tika extract. There  
>> no mention
>> of the filename that was rejected in the logs or any information  
>> that would help
>> me identify it!
>
>We should fix this so it at least spits out a meaningful message.  Can  
>you open a JIRA?
>

OK SOLR-1113 raised.

>>
>> Regards Fergus.
>>
>>> Sorry if this is a FAQ; I suspect it could be. But how do I work  
>>> around the following:-
>>>
>>> INFO: [] webapp=/apache-solr-1.4-dev path=/update/extract  
>>> params={ext.def.fl=text&ext.literal.id=factbook/reference_maps/pdf/ 
>>> oceania.pdf} status=0 QTime=318
>>> Apr 2, 2009 11:17:46 AM org.apache.solr.common.SolrException log
>>> SEVERE: org.apache.commons.fileupload.FileUploadBase 
>>> $SizeLimitExceededException: the request was rejected because its  
>>> size (4585774) exceeds the configured maximum (2097152)
>>> at org.apache.commons.fileupload.FileUploadBase 
>>> $FileItemIteratorImpl.(FileUploadBase.java:914)
>>> at  
>>> org 
>>> .apache 
>>> .commons 
>>> .fileupload.FileUploadBase.getItemIterator(FileUploadBase.java:331)
>>> at  
>>> org 
>>> .apache 
>>> .commons.fileupload.FileUploadBase.parseRequest(FileUploadBase.java: 
>>> 349)
>>> at  
>>> org 
>>> .apache 
>>> .commons 
>>> .fileupload 
>>> .servlet.ServletFileUpload.parseRequest(ServletFileUpload.java:126)
>>> at  
>>> org 
>>> .apache 
>>> .solr 
>>> .servlet 
>>> .MultipartRequestParser 
>>> .parseParamsAndFillStreams(SolrRequestParsers.java:343)
>>> at  
>>> org 
>>> .apache 
>>> .solr 
>>> .servlet 
>>> .StandardRequestParser 
>>> .parseParamsAndFillStreams(SolrRequestParsers.java:396)
>>> at  
>>> org 
>>> .apache 
>>> .solr.servlet.SolrRequestParsers.parse(SolrRequestParsers.java:114)
>>> at  
>>> org 
>>> .apache 
>>> .solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java: 
>>> 217)
>>> at  
>>> org 
>>> .apache 
>>> .catalina 
>>> .core 
>>> .ApplicationFilterChain 
>>> .internalDoFilter(ApplicationFilterChain.java:202)
>>> at  
>>> org 
>>> .apache 
>>> .catalina 
>>> .core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java: 
>>> 173)
>>> at  
>>> org 
>>> .apache 
>>> .catalina 
>>> .core.StandardWrapperValve.invoke(StandardWrapperValve.java:213)
>>> at  
>>> org 
>>> .apache 
>>> .catalina 
>>> .core.StandardContextValve.invoke(StandardContextValve.java:178)
>>> at  
>>> org 
>>> .apache 
>>> .catalina.core.StandardHostValve.invoke(StandardHostValve.java:126)
>>> at  
>>> org 
>>> .apache 
>>> .catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:105)
>>>
>>> Although the PDF is big, it contains very little text; it is a map.
>>>
>>>  "java -jar solr/lib/tika-0.3.jar -g" appears to have no bother  
>>> with it.
>>>
>>> Fergus...
>
>--
>Grant Ingersoll
>http://www.lucidimagination.com/
>
>Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
>using Solr/Lucene:
>http://www.lucidimagination.com/search

-- 

===
Fergus McMenemie   Email:fer...@twig.me.uk
Techmore Ltd   Phone:(UK) 07721 376021

Unix/Mac/Intranets Analyst Programmer
===


Re: indexing txt file

2009-04-15 Thread Fergus McMenemie
>Hi all,
>I'm trying to use solr1.3 and trying to index a text file.  I wrote a
>schema.xsd and a xml file.

Just to make sure I understand things:

Do you just have one of these text files, containing many reports?
   Or
Do you have many of these text files each containing one report?

Also, is the report a single line that has been wrapped for email?

Fergus.

>
>*The content of my text file is *
>#src   dstprotook
>sportdportpktsbytesflowsfirst
>atest
>192.168.220.13526.147.238.1466  13283980
>6  463  1  1237333861.4657640001237333861.664701000
>
>*schema file is *
>
>
>http://www.w3.org/2001/XMLSchema";>
>
>
>
>
>
>type="xs:string" use="required"/>
>use="required"/>
>use="required"/>
>type="xs:string" use="required"/>
>use="required"/>
>use="required"/>
>type="xs:string" use="required"/>
>use="required"/>
>use="required"/>
>type="xs:string" use="required"/>
>use="required"/>
>
>
>
>
>
>
>
>
>*and my xml file is *
>
>
>http://www.w3.org/2001/XMLSchema-instance";
>xsi:noNamespaceSchemaLocation="C:\DOCUME~1\tpham\Desktop\networkTraffic.xsd">
>protocolPortNumber="6" ok="1" sourcePort="32439" destinationPort="80"
>packets="6" bytes="463" flows="1" initialTimestamp="1237963861.465764000"
>terminationTimestamp="1237963861.664701000"/>
>protocolPortNumber="17" ok="1" sourcePort="32439" destinationPort="80"
>packets="6" bytes="463" flows="1" initialTimestamp="1237963861.465764000"
>terminationTimestamp="1237963861.664701000"/>
>protocolPortNumber="6" ok="1" sourcePort="32139" destinationPort="80"
>packets="6" bytes="463" flows="1" initialTimestamp="1237963861.465764000"
>terminationTimestamp="1237963861.664701000"/>
>protocolPortNumber="6" ok="1" sourcePort="32839" destinationPort="80"
>packets="6" bytes="463" flows="1" initialTimestamp="1237963861.465764000"
>terminationTimestamp="1237963861.664701000"/>
>protocolPortNumber="17" ok="1" sourcePort="32839" destinationPort="80"
>packets="6" bytes="463" flows="1" initialTimestamp="1237963861.465764000"
>terminationTimestamp="1237963861.664701000"/>
>protocolPortNumber="17" ok="1" sourcePort="32439" destinationPort="80"
>packets="6" bytes="463" flows="1" initialTimestamp="1237963861.465764000"
>terminationTimestamp="1237963861.664701000"/>
>protocolPortNumber="6" ok="1" sourcePort="36839" destinationPort="80"
>packets="6" bytes="463" flows="1" initialTimestamp="1237963861.465764000"
>terminationTimestamp="1237963861.664701000"/>
>protocolPortNumber="6" ok="1" sourcePort="32839" destinationPort="80"
>packets="6" bytes="463" flows="1" initialTimestamp="1237963861.465764000"
>terminationTimestamp="1237963861.664701000"/>
>
>
>
>
>Can someone please show me where do I put these files?  I'm aware that the
>schema.xsd file goes into the directory conf. What about my xml file, and
>txt file?
>
>Thank you,
>Alex
>
>
>On Tue, Apr 14, 2009 at 12:37 AM, Alejandro Gonzalez <
>alejandrogonzalezd...@gmail.com> wrote:
>
>> you should construct the xml containing the fields defined in your
>> schema.xml and give them the values from the text files. for example if you
>> have an schema defining two fields "title" and "text" you should construct
>> an xml with a field "title" and its value and another called "text"
>> containing the body of your doc. then you can post it to Solr you have
>> deployed and make a commit an it's done. it's possible to construct an xml
>> defining more than jus t a doc
>>
>>
>> 
>> 
>> "doc1 title"
>> "doc1 text"
>> 
>> .
>> .
>> .
>> 
>> "docn title"
>> "docn text"
>> 
>> 
>>
>>
>>
>> 2009/4/14 Noble Paul ?? Â Ë³Ë 
>>
>> > what is the cntent of your text file?
>> > Solr does not directly index files
>> > --Noble
>> >
>> > On Tue, Apr 14, 2009 at 3:54 AM, Alex Vu  wrote:
>> > > Hi all,
>> > >
>> > > Currently I wrote an xml file and schema.xml file.  What is the next
>> step
>> > to
>> > > index a txt file?  Where should I put my txt file I want to index?
>> > >
>> > > thank you,
>> > > Alex V.
>> > >
>> >
>> >
>> >
>> > --
>> > --Noble Paul
>> >
>>

-- 

===
Fergus McMenemie   Email:fer...@twig.me.uk
Techmore Ltd   Phone:(UK) 07721 376021

Unix/Mac/Intranets Analyst Programmer
===


Re: [solr-user] Upgrade from 1.2 to 1.3 gives 3x slowdown

2009-04-15 Thread Fergus McMenemie
>On Apr 2, 2009, at 9:23 AM, Fergus McMenemie wrote:
>
>> Grant,
>>
>>
>>
>>> I should note, however, that the speed difference you are seeing may
>>> not be as pronounced as it appears.  If I recall during ApacheCon, I
>>> commented on how long it takes to shutdown your Solr instance when
>>> exiting it.  That time it takes is in fact Solr doing the work that
>>> was put off by not committing earlier and having all those deletes
>>> pile up.
>>>
>> I am confused about "work that was put off" vs committing. My script
>> was doing a commit right after the CVS import, and you are right
>> about the massive times required to shut tomcat down. But in my tests
>> the time taken to do the commit was under a second, yet I had to allow
>> 300secs for tomcat shutdown. Also I dont have any duplicates. So
>> what sort of work was being done at shutdown that was not being done
>> by a commit? Optimise!
>>
>
>The work being done is addressing the deletes, AIUI, but of course  
>there are other things happening during shutdown, too.
There are no deletes to do. It was a clean index to begin with
and there were no duplicates.

>How long is the shutdown if you do a commit first and then a shutdown?
Still very long, sometimes 300sec. My script always did a commit!

>At any rate, I don't know that there is a satisfying answer to the  
>larger issue due to the things like the fsync stuff, which is an  
>overall win for Lucene/Solr despite it being more slower.  Have you  
>tried running the tests on other machines (non-Mac?)
Nope. Although next week I will have a real "PC" running Vista, so
I could try it there.

I think we should knock this on the head and move on. I rarely
need to index this content and I can take the performance hit,
and of course your workaround provides a good speed-up.

Regards Fergus.
-- 

===
Fergus McMenemie   Email:fer...@twig.me.uk
Techmore Ltd   Phone:(UK) 07721 376021

Unix/Mac/Intranets Analyst Programmer
===

