Duplicate items in distributed search
Hi,

I'm after a bit of clarification about the 'limitations' section of the distributed search page on the wiki. The first two limitations say:

* Documents must have a unique key and the unique key must be stored (stored="true" in schema.xml)
* When duplicate doc IDs are received, Solr chooses the first doc and discards subsequent ones

Does 'doc ID' in the second point refer to the unique key in the first point, or does it refer to the internal Lucene document ID?

Cheers,

Andrew.
Re: Duplicate items in distributed search
Mark Miller-3 wrote:
> The 'doc ID' in the second point refers to the unique key in the first point.

I thought so but thanks for clarifying. Maybe a wording change on the wiki would be good?

Cheers,

Andrew.
Using symlinks to alias cores
Another question... I have a series of cores representing historical data, only the most recent of which gets indexed to. I'd like to alias the most recent one to 'current' so that when they roll over I can just change the alias, and the cron jobs etc. which manage indexing don't have to change.

However, the wiki recommends against using the ALIAS command in CoreAdmin in a couple of places, and SOLR-1637 says it's been removed now anyway.

If I can't use ALIAS safely, is it okay to just symlink the most recent core's instance (or data) directory to 'current', and bring it up in Solr as a separate core? Will this be safe, as long as all index writing happens via the 'current' core? Or will it cause Solr to get confused and do horrible things to the index?

Thanks!

Andrew.
Re: Duplicate items in distributed search
Mark Miller-3 wrote:
> On 7/4/10 12:49 PM, Andrew Clegg wrote:
>> I thought so but thanks for clarifying. Maybe a wording change on the wiki
>
> Sounds like a good idea - go ahead and make the change if you'd like.

That page seems to be marked immutable...
Re: Using symlinks to alias cores
Chris Hostetter-3 wrote:
> a cleaner way to deal with this would be to use something like
> RewriteRule -- either in your appserver (if it supports a feature like
> that) or in a proxy sitting in front of Solr.

I think we'll go with this -- seems like the most bulletproof way.

Cheers,

Andrew.
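For reference, a minimal sketch of the proxy approach, assuming Apache httpd with mod_rewrite and mod_proxy in front of Solr -- the core name and port here are made up for illustration:

  RewriteEngine on
  # forward the stable 'current' alias to whichever core is live right now
  RewriteRule ^/solr/current/(.*)$ http://localhost:8983/solr/core20100701/$1 [P]

When the cores roll over, only the target core name in the RewriteRule changes; the cron jobs keep talking to /solr/current.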
SolrCloud in production?
Is anyone using ZooKeeper-based Solr Cloud in production yet? Any war stories? Any problematic missing features?

Thanks,

Andrew.
maxMergeDocs and performance tuning
Hi,

I'm a little confused about how the tuning params in solrconfig.xml actually work. My index currently has mergeFactor=25 and maxMergeDocs=2147483647. So this means that up to 25 segments can be created before a merge happens, and each segment can have up to 2bn docs in, right?

But this page:

http://www.ibm.com/developerworks/java/library/j-solr2/

says "Smaller values [of maxMergeDocs] (< 10,000) are best for applications with a large number of updates." Our system does indeed have frequent updates. But if we set maxMergeDocs=1, what happens when we reach 25 segments with one doc in each? Is the mergeFactor just ignored, so we start a new segment anyway?

More generally, what would be reasonable params for a large index consisting of many small docs, updated frequently? I think a few different use-case examples like this would be a great addition to the wiki.

Thanks!

Andrew.
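For reference, both of these knobs live in the indexDefaults section of solrconfig.xml; a sketch matching the values described above, with other elements omitted:

  <indexDefaults>
    <mergeFactor>25</mergeFactor>
    <maxMergeDocs>2147483647</maxMergeDocs>
  </indexDefaults>

mergeFactor limits how many segments of a similar size can accumulate before they are merged into one larger segment; maxMergeDocs caps how many documents a segment may contain before it is excluded from further merging.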
Re: maxMergeDocs and performance tuning
Okay, thanks Marc. I don't really have any complaints about performance (yet!) but I'm still wondering how the mechanics work, e.g. when you have a number of segments equal to mergeFactor, and each contains maxMergeDocs documents. The docs are a bit fuzzy on this...
Duplicate docs when mergin
Duplicate docs when merging indices?
Hi,

First off, sorry about the previous accidental post -- had a sausage-fingered moment. Anyway...

If I merge two indices with CoreAdmin, as detailed here:

http://wiki.apache.org/solr/MergingSolrIndexes

What happens to duplicate documents between the two, i.e. those that have the same unique key? What decides which copy takes precedence? Will documents get indexed multiple times, or will the second one just get skipped?

Also, does the behaviour vary between CoreAdmin and IndexMergeTool? This thread from a couple of years ago:

http://web.archiveorange.com/archive/v/AAfXfQIiBU7vyQBt6qdk

suggests that IndexMergeTool can result in dupes, unless I'm misinterpreting.

Thanks!

Andrew.
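For reference, the CoreAdmin merge that wiki page describes is invoked roughly like this -- core names and index paths here are illustrative:

  curl 'http://localhost:8983/solr/admin/cores?action=mergeindexes&core=core0&indexDir=/opt/solr/core1/data/index&indexDir=/opt/solr/core2/data/index'

The merge happens at the Lucene segment level, below where Solr's unique-key handling lives, which is exactly why the duplicate question arises.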
Replication snapshot, tar says "file changed as we read it"
(Many apologies if this appears twice, I tried to send it via Nabble first but it seems to have got stuck, and is fairly urgent/serious.)

Hi,

I'm trying to use the replication handler to take snapshots, then archive them and ship them off-site.

Just now I got a message from tar that worried me:

tar: snapshot.20110115035710/_70b.tis: file changed as we read it
tar: snapshot.20110115035710: file changed as we read it

The relevant bit of script that does it looks like this (error checking removed):

curl 'http://localhost:8983/solr/core1/replication?command=backup'
PREFIX=''
if [[ "$START_TIME" =~ 'Sun' ]]
then
  PREFIX='weekly.'
fi
cd $SOLR_DATA_DIR
for snapshot in `ls -d -1 snapshot.*`
do
  TARGET="${LOCAL_BACKUP_DIR}/${PREFIX}${snapshot}.tar.bz2"
  echo "Archiving ${snapshot} into $TARGET"
  tar jcf $TARGET $snapshot
  echo "Deleting ${snapshot}"
  rm -rf $snapshot
done

I was under the impression that files in the snapshot were guaranteed to never change, right? Otherwise what's the point of the replication backup command?

I tried putting in a 30-second sleep after the snapshot and before the tar, but the error occurred again anyway.

There was a message from Lance N. with a similar error in, years ago:

http://www.mail-archive.com/solr-user@lucene.apache.org/msg06104.html

but that would be pre-replication anyway, right?

This is on Ubuntu 10.10 using java 1.6.0_22 and Solr 1.4.0.

Thanks,

Andrew.

--

:: http://biotext.org.uk/ :: http://twitter.com/andrew_clegg/ ::
Re: Replication snapshot, tar says "file changed as we read it"
PS one other point I didn't mention is that this server has a very fast autocommit limit (2 seconds max time).

But I don't know if this is relevant -- I thought the files in the snapshot wouldn't be committed to again. Please correct me if this is a huge misunderstanding.

On 16 January 2011 12:30, Andrew Clegg wrote:
> [original message quoted in full -- see above]

--

:: http://biotext.org.uk/ :: http://twitter.com/andrew_clegg/ ::
Re: Replication snapshot, tar says "file changed as we read it"
Sorry to re-open an old thread, but this just happened to me again, even with a 30 second sleep between taking the snapshot and starting to tar it up. Then, even more strangely, the snapshot was removed again before tar completed.

Archiving snapshot.20110320113401 into /var/www/mesh/backups/weekly.snapshot.20110320113401.tar.bz2
tar: snapshot.20110320113401/_neqv.fdt: file changed as we read it
tar: snapshot.20110320113401/_neqv.prx: File removed before we read it
tar: snapshot.20110320113401/_neqv.fnm: File removed before we read it
tar: snapshot.20110320113401: Cannot stat: No such file or directory
tar: Exiting with failure status due to previous errors

Has anybody seen this before, or been able to replicate it themselves? (no pun intended) Or, is anyone else using replication snapshots for backup? Have I misunderstood them? I thought the point of a snapshot was that once taken it was immutable.

If it's important, this is on a machine configured as a replication master, but with no slaves attached to it (it's basically a failover and backup machine). The replication config is:

<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="master">
    <str name="replicateAfter">startup</str>
    <str name="replicateAfter">commit</str>
    <str name="confFiles">admin-extra.html,elevate.xml,protwords.txt,schema.xml,scripts.conf,solrconfig_slave.xml:solrconfig.xml,stopwords.txt,synonyms.txt</str>
    <str name="commitReserveDuration">00:00:10</str>
  </lst>
</requestHandler>

Thanks,

Andrew.

On 16 January 2011 12:55, Andrew Clegg wrote:
> [previous two messages quoted in full -- see above]

--

:: http://biotext.org.uk/ :: http://twitter.com/andrew_clegg/ ::
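One way to at least detect a snapshot that is still being written -- a rough sketch only, and no explanation for the deletion -- is to wait until the directory's size stops changing before tarring, instead of a fixed sleep:

  PREV=-1
  SIZE=$(du -s "$snapshot" | cut -f1)
  while [ "$SIZE" != "$PREV" ]
  do
    PREV=$SIZE
    sleep 10
    SIZE=$(du -s "$snapshot" | cut -f1)
  done
  tar jcf "$TARGET" "$snapshot"

The loop exits once two successive du readings, ten seconds apart, agree.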
NullPointerException in DataImportHandler
First of all, apologies if you get this twice. I posted it by email an hour ago but it hasn't appeared in any of the archives, so I'm worried it's got junked somewhere.

I'm trying to use a DataImportHandler to merge some data from a database with some other fields from a collection of XML files, rather like the example in the Architecture section here:

http://wiki.apache.org/solr/DataImportHandler

... so a given document is built from some fields from the database and some from the XML. My dataconfig.xml looks like this:

[data config not preserved by the archive]

This works if I comment out the inner entity, but when I uncomment it, I get this error:

30-Jul-2009 14:32:50 org.apache.solr.handler.dataimport.DocBuilder buildDocument
SEVERE: Exception while processing: domain document : SolrInputDocument[{id=id(1.0)={1s32D00}, title=title(1.0)={PDB code 1s32, chain D, domain 00}, keywords=keywords(1.0)={some keywords go here}, pdb_code=pdb_code(1.0)={1s32}, doc_type=doc_type(1.0)={domain}, related_ids=related_ids(1.0)={1s32 1s32D}}]
org.apache.solr.handler.dataimport.DataImportHandlerException: java.lang.NullPointerException
    at org.apache.solr.handler.dataimport.SqlEntityProcessor.initQuery(SqlEntityProcessor.java:64)
    at org.apache.solr.handler.dataimport.SqlEntityProcessor.nextRow(SqlEntityProcessor.java:71)
    at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:237)
    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:344)
    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:372)
    at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:225)
    at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:167)
    at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:333)
    at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:393)
    at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:372)
Caused by: java.lang.NullPointerException
    at java.io.File.<init>(File.java:222)
    at org.apache.solr.handler.dataimport.FileDataSource.getData(FileDataSource.java:75)
    at org.apache.solr.handler.dataimport.FileDataSource.getData(FileDataSource.java:44)
    at org.apache.solr.handler.dataimport.SqlEntityProcessor.initQuery(SqlEntityProcessor.java:58)
    ... 9 more

I have checked that the file /cath/people/cathdata/v3_3_0/pdb-XML-noatom/1s32-noatom.xml is readable, so maybe the full path to the file isn't being constructed properly or something? I also tried with the full path template for the file in the entity url attribute, instead of using a basePath in the dataSource, but I get exactly the same exception.

This is with the 2009-07-30 nightly build. See attached for schema.

http://www.nabble.com/file/p24739580/schema.xml schema.xml

Any ideas? Thanks in advance!

Andrew.

--

:: http://biotext.org.uk/ ::
Re: NullPointerException in DataImportHandler
Chantal Ackermann wrote:
> Hi Andrew,
>
> your inner entity uses an XML type datasource. The default entity
> processor is the SQL one, however.
>
> For your inner entity, you have to specify the correct entity processor
> explicitly. You do that by adding the attribute "processor", and the
> value is the classname of the processor you want to use.
>
> e.g. processor="XPathEntityProcessor"

Thanks -- I was also missing a forEach expression -- in my case, just "/" since each XML file contains the information for no more than one document.

However, I'm now getting a different exception:

30-Jul-2009 16:48:52 org.apache.solr.handler.dataimport.DocBuilder buildDocument
SEVERE: Exception while processing: domain document : SolrInputDocument[{id=id(1.0)={1udaA02}, title=title(1.0)={PDB code 1uda, chain A, domain 02}, pdb_code=pdb_code(1.0)={1uda}, doc_type=doc_type(1.0)={domain}, related_ids=related_ids(1.0)={1uda,1udaA}}]
org.apache.solr.handler.dataimport.DataImportHandlerException: Exception while reading xpaths for fields Processing Document # 1
    at org.apache.solr.handler.dataimport.XPathEntityProcessor.initXpathReader(XPathEntityProcessor.java:135)
    at org.apache.solr.handler.dataimport.XPathEntityProcessor.init(XPathEntityProcessor.java:76)
    at org.apache.solr.handler.dataimport.EntityProcessorWrapper.init(EntityProcessorWrapper.java:71)
    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:307)
    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:372)
    at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:225)
    at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:167)
    at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:333)
    at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:393)
    at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:372)
Caused by: java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
    at java.util.LinkedList.entry(LinkedList.java:365)
    at java.util.LinkedList.get(LinkedList.java:315)
    at org.apache.solr.handler.dataimport.XPathRecordReader.addField0(XPathRecordReader.java:71)
    at org.apache.solr.handler.dataimport.XPathRecordReader.<init>(XPathRecordReader.java:50)
    at org.apache.solr.handler.dataimport.XPathEntityProcessor.initXpathReader(XPathEntityProcessor.java:121)
    ... 9 more

My data config now looks like this:

[data config not preserved by the archive]

Thanks in advance, again :-)

Andrew.
Re: NullPointerException in DataImportHandler
Erik Hatcher wrote:
> On Jul 30, 2009, at 11:54 AM, Andrew Clegg wrote:
>> <entity ... url="${domain.pdb_code}-noatom.xml" processor="XPathEntityProcessor" forEach="/">
>>   <field ... xpath="//*[local-name()='structCategory']/*[local-name()='struct']/*[local-name()='title']" />
>
> The XPathEntityProcessor doesn't support that fancy of an xpath - it
> supports only a limited subset. Try /structCategory/struct/title perhaps?

Sadly not... I tried with:

[first xpath stripped by the archive] (full path from root)

and

[second xpath stripped by the archive]

Same ArrayIndex error each time. Doesn't it use javax.xml then? I was using the complex local-name expressions to make it namespace-agnostic -- is it agnostic anyway?

Thanks,

Andrew.
Re: NullPointerException in DataImportHandler
Chantal Ackermann wrote:
> my experience with XPathEntityProcessor is non-existent. ;-)

Don't worry -- your hints put me on the right track :-) I got it working with:

[working config not preserved by the archive -- see the sketch below]

Now, to get it to ignore missing files without an error... Hmm...

Cheers,

Andrew.
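The working config didn't survive the archive; judging from the rest of the thread it looked something along these lines -- the entity and dataSource names, and the exact xpath value, are assumptions:

  <entity name="pdbxml" processor="XPathEntityProcessor"
          dataSource="pdbfiles" forEach="/"
          url="${domain.pdb_code}-noatom.xml">
    <field column="title" xpath="/datablock/structCategory/struct/title" />
  </entity>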
Questions about XPath in data import handler
A couple of questions about the DIH XPath syntax... The docs say it supports:

xpath="/a/b/subject[@qualifier='fullTitle']"
xpath="/a/b/subject/@qualifier"
xpath="/a/b/c"

Does the second one mean "select the value of the attribute called qualifier in the /a/b/subject element"? e.g. for this document:

[example document stripped by the archive -- restated in the follow-up below]

I would get the result "some text"?

Also... Can I select a non-leaf node and get *ALL* the text underneath it? e.g. /a/b in this example?

Thanks!

Andrew.
Re: Questions about XPath in data import handler
Andrew Clegg wrote:
> [original question quoted above]

Sorry, Nabble swallowed my XML example. That was supposed to be:

<a>
  <b>
    <subject qualifier="some text" />
  </b>
</a>

Andrew.
Re: Questions about XPath in data import handler
Noble Paul നോബിള് नोब्ळ्-2 wrote:
> On Thu, Aug 13, 2009 at 6:35 PM, Andrew Clegg wrote:
>> Does the second one mean "select the value of the attribute called
>> qualifier in the /a/b/subject element"?
>
> yes you are right. Isn't that the semantics of standard xpath syntax?

Yes, just checking since the DIH XPath engine is a little different. Do you know what I would get in this case?

>> Also... Can I select a non-leaf node and get *ALL* the text underneath it?
>> e.g. /a/b in this example?

Cheers,

Andrew.
Re: Questions about XPath in data import handler
Noble Paul നോബിള് नोब्ळ्-2 wrote:
> yes. look at the 'flatten' attribute in the field. It should give you
> all the text (not attributes) under a given node.

I missed that one -- many thanks.

Andrew.
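For anyone searching later, a sketch of what that looks like in a DIH entity -- the column name is made up:

  <field column="b_text" xpath="/a/b" flatten="true" />

With flatten="true" the XPathEntityProcessor concatenates all the text under /a/b into a single value.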
'Connection reset' in DataImportHandler Development Console
Hi folks,

I'm trying to use the Debug Now button in the development console to test the effects of some changes in my data import config (see attached). However, each time I click it, the right-hand frame fails to load -- it just gets replaced with the standard 'connection reset' message from Firefox, as if the server's dropped the HTTP connection.

Everything else seems okay -- I can run queries in Solr Admin without any problems, and all the other buttons in the dev console work -- status, document count, reload config etc. There's nothing suspicious in Tomcat's catalina.out either. If I hit Reload Config, then Status, then Debug Now, I get this:

17-Aug-2009 13:12:12 org.apache.solr.handler.dataimport.DataImportHandler processConfiguration
INFO: Processing configuration from solrconfig.xml: {config=dataconfig.xml}
17-Aug-2009 13:12:12 org.apache.solr.handler.dataimport.DataImporter loadDataConfig
INFO: Data Configuration loaded successfully
17-Aug-2009 13:12:12 org.apache.solr.handler.dataimport.DataImporter verifyWithSchema
INFO: id is a required field in SolrSchema . But not found in DataConfig
17-Aug-2009 13:12:12 org.apache.solr.handler.dataimport.DataImporter verifyWithSchema
INFO: title is a required field in SolrSchema . But not found in DataConfig
17-Aug-2009 13:12:12 org.apache.solr.handler.dataimport.DataImporter verifyWithSchema
INFO: doc_type is a required field in SolrSchema . But not found in DataConfig
[the three verifyWithSchema messages above repeat four times in total]
17-Aug-2009 13:12:12 org.apache.solr.core.SolrCore execute
INFO: [] webapp=/solr path=/select params={clean=false&command=reload-config&commit=true&qt=/dataimport} status=0 QTime=5
17-Aug-2009 13:12:21 org.apache.solr.core.SolrCore execute
INFO: [] webapp=/solr path=/select params={clean=false&command=status&commit=true&qt=/dataimport} status=0 QTime=0

(The warnings are because the doc_type field comes out of the JDBC result set automatically by column name -- this isn't a problem.)
Also, there's no entry in the Tomcat access log for the debug request either, just the first two:

[17/Aug/2009:13:12:12 +0100] HTTP/1.1 cookie:- request:- GET /solr/select 200 ?clean=false&commit=true&qt=%2Fdataimport&command=reload-config GET /solr/select?clean=false&commit=true&qt=%2Fdataimport&command=reload-config HTTP/1.1
[17/Aug/2009:13:12:21 +0100] HTTP/1.1 cookie:- request:- GET /solr/select 200 ?clean=false&commit=true&qt=%2Fdataimport&command=status GET /solr/select?clean=false&commit=true&qt=%2Fdataimport&command=status HTTP/1.1

PS... Nightly build, 30th of July.

Thanks,

Andrew.

http://www.nabble.com/file/p25005850/dataconfig.xml dataconfig.xml
Re: 'Connection reset' in DataImportHandler Development Console
Noble Paul നോബിള് नोब्ळ्-2 wrote:
> apparently I do not see any command full-import, delta-import being
> fired. Is that true?

It seems that way -- they're not appearing in the logs. I've tried Debug Now with both full and delta selected from the dropdown, no difference either way. If I click the Full Import button it starts an import okay.

I don't have to Full Import manually every time I want to debug a config change, do I? That's not what the docs say. (A full import takes about 6 or 7 hours...)

Thanks,

Andrew.
Re: Adding a prefix to fields
ahammad wrote:
> Is it possible to add a prefix to the data in a Solr field? For example,
> right now, I have a field called "id" that gets data from a DB through the
> DataImportHandler. The DB returns a 4-character string like "ag5f". Would
> it be possible to add a prefix to the data that is received?
>
> In this specific case, the data relates to articles. So effectively, if
> the DB has "ag5f" as an ID, I want it to be stored as "Article_ag5f". Is
> there a way to define a prefix of "Article_" for a certain field?

I have exactly this situation and I just handle it by adding the prefixes in the SQL query:

select 'Article_' || id as id from articles

etc. I wrap all these up as views and store them in the DB, so Solr just has to select * from each view.

Andrew.
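A sketch of the view approach, with made-up table and column names:

  create view solr_articles as
  select 'Article_' || id as id, title, body
  from articles;

The DIH entity then just runs "select * from solr_articles", and the prefix logic lives in one place, in the database.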
Re: Solr Range Query Anomalies
Try a sdouble or sfloat field type? The plain float/double types index the raw string form of the number, so range queries compare lexicographically -- which is why "9.0" appears to fall outside [9.0 TO 20.0] territory unless both endpoints are padded to the same length. The sortable types encode values so that string order matches numeric order.

Andrew.

johan.sjoberg wrote:
> Hi,
>
> we're performing range queries of a field which is of type double. Some
> queries which should generate results do not, and I think it's best
> explained by the following examples; data is expected to exist in all
> ranges:
>
> ?q=field:[10.0 TO 20.0]  // OK
> ?q=field:[9.0 TO 20.0]   // NOT OK
> ?q=field:[09.0 TO 20.0]  // OK
>
> Interesting here is that the range query only works if both ends of the
> interval are of equal length (hence 09-to-20 works, but not 9-20).
> Unfortunately, this logic does not work for ranges in the 100s.
>
> ?q=field:[* TO 500]          // OK
> ?q=field:[100.0 TO 500.0]    // OK
> ?q=field:[90.0 TO 500.0]     // NOT OK
> ?q=field:[090.0 TO 500.0]    // NOT OK
>
> Any ideas to this very strange behaviour?
>
> Regards,
> Johan
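A sketch of the suggested schema change, using the Solr 1.x sortable types -- the field name just follows the examples in the question:

  <fieldType name="sdouble" class="solr.SortableDoubleField" omitNorms="true" />
  <field name="field" type="sdouble" indexed="true" stored="true" />

The encoding happens at index time, so the field needs reindexing after the type change.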
Re: Wildcard seaches?
Paul Tomblin wrote:
> Is there such a thing as a wildcard search? If I have a simple
> solr.StrField with no analyzer defined, can I query for "foo*" or
> "foo.*" and get everything that starts with "foo" such as "foobar" and
> "foobaz"?

Yes. foo* is fine even on a simple string field.

Andrew.
Re: can solr accept other tag other than field?
You can use the Data Import Handler to pull data out of any XML or SQL data source:

http://wiki.apache.org/solr/DataImportHandler

Andrew.

Elaine Li wrote:
> Hi,
>
> I am new solr user. I want to use solr search to run query against
> many xml files I have. I have set up the solr server to run query
> against the example files.
>
> One problem is my xml does not have the <field> tag and "name" attribute.
> My format is rather easy:
>
> [example XML stripped by the archive]
>
> I looked at the schema.xml file and realized I can only customize (add)
> attribute names.
>
> Is there a way to let Solr accept my xml w/o me changing my xml into
> the <add><doc><field> format?
>
> Thanks.
>
> Elaine
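A minimal sketch of a DIH config for standalone XML files -- the file path, forEach expression, and element names are invented, since Elaine's example didn't survive the archive:

  <dataConfig>
    <dataSource type="FileDataSource" encoding="UTF-8" />
    <document>
      <entity name="rec" processor="XPathEntityProcessor"
              url="/path/to/data.xml" forEach="/records/record">
        <field column="id" xpath="/records/record/@id" />
        <field column="text" xpath="/records/record/text" />
      </entity>
    </document>
  </dataConfig>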
Problem getting Solr home from JNDI in Tomcat
Hi all, I'm having problems getting Solr to start on Tomcat 6.

Tomcat is installed in /opt/apache-tomcat , solr is in /opt/apache-tomcat/webapps/solr , and my Solr home directory is /opt/solr . My config file is in /opt/solr/conf/solrconfig.xml .

I have a Solr-specific context file in /opt/apache-tomcat/conf/Catalina/localhost/solr.xml which looks like this:

[context file stripped by the archive -- a Context element containing a solr/home Environment entry pointing at /opt/solr]

But when I start Solr and browse to it, it tells me:

java.lang.RuntimeException: Can't find resource 'solrconfig.xml' in classpath or 'solr/conf/', cwd=/
    at org.apache.solr.core.SolrResourceLoader.openResource(SolrResourceLoader.java:194)
    at org.apache.solr.core.SolrResourceLoader.openConfig(SolrResourceLoader.java:162)
    at org.apache.solr.core.Config.<init>(Config.java:100)
    at org.apache.solr.core.SolrConfig.<init>(SolrConfig.java:113)
    at org.apache.solr.core.SolrConfig.<init>(SolrConfig.java:70)
    at org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:117)
    at org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:69)
    at org.apache.catalina.core.ApplicationFilterConfig.getFilter(ApplicationFilterConfig.java:275)
    at org.apache.catalina.core.ApplicationFilterConfig.setFilterDef(ApplicationFilterConfig.java:397)
    at org.apache.catalina.core.ApplicationFilterConfig.<init>(ApplicationFilterConfig.java:108)
    at org.apache.catalina.core.StandardContext.filterStart(StandardContext.java:3709)
    at org.apache.catalina.core.StandardContext.start(StandardContext.java:4356)
    at org.apache.catalina.manager.ManagerServlet.start(ManagerServlet.java:1244)
    at org.apache.catalina.manager.HTMLManagerServlet.start(HTMLManagerServlet.java:604)
    at org.apache.catalina.manager.HTMLManagerServlet.doGet(HTMLManagerServlet.java:129)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:690)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:803)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:290)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
    at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
    at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:175)
    at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:525)
    at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:568)
    at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:128)
    at org.jstripe.tomcat.probe.Tomcat55AgentValve.invoke(Tomcat55AgentValve.java:20)
    at org.jstripe.tomcat.probe.Tomcat55AgentValve.invoke(Tomcat55AgentValve.java:20)
    at org.jstripe.tomcat.probe.Tomcat55AgentValve.invoke(Tomcat55AgentValve.java:20)
    at org.jstripe.tomcat.probe.Tomcat55AgentValve.invoke(Tomcat55AgentValve.java:20)
    at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
    at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
    at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:286)
    at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:844)
    at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:583)
    at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:447)
    at java.lang.Thread.run(Thread.java:619)

Weirdly, the exact same context file works fine on a different machine. I've tried giving Context a docBase element (both absolute and relative paths) but it makes no difference -- Solr still isn't seeing the right home directory.
I also tried setting debug="1" but didn't see any more useful info anywhere.

Any ideas? This is a total show-stopper for me as this is our production server. (Otherwise I'd think about taking it down and hardwiring the Solr home path into the server's context...)

Yours hopefully,

Andrew.
Re: Problem getting Solr home from JNDI in Tomcat
Constantijn Visinescu wrote:
> This might be a bit of a hack but i got this in the web.xml of my
> application and it works great.
>
> <env-entry>
>   <env-entry-name>solr/home</env-entry-name>
>   <env-entry-value>/Solr/WebRoot/WEB-INF/solr</env-entry-value>
>   <env-entry-type>java.lang.String</env-entry-type>
> </env-entry>

That worked, thanks. You're right though, it is a bit of a hack -- I'd prefer to set the path from *outside* the app so it won't get overwritten when I upgrade.

Now I've got a completely different error: "org.apache.lucene.index.CorruptIndexException: Unknown format version: -9". I think it might be time for a fresh install...

Cheers,

Andrew.
Re: Problem getting Solr home from JNDI in Tomcat
hossman wrote:
> : Hi all, I'm having problems getting Solr to start on Tomcat 6.
>
> which version of Solr?

Sorry -- a nightly build from about a month ago. Re. your other message, I was sure the two machines had the same version on, but maybe not -- when I'm back in the office tomorrow I'll upgrade them both to a fresh nightly.

hossman wrote:
> : Tomcat is installed in /opt/apache-tomcat , solr is in
> : /opt/apache-tomcat/webapps/solr , and my Solr home directory is /opt/solr .
>
> if "solr is in /opt/apache-tomcat/webapps/solr" means that you put the
> solr.war in /opt/apache-tomcat/webapps/ and tomcat expanded it into
> /opt/apache-tomcat/webapps/solr then that is your problem -- tomcat isn't
> even looking at your context file (it only looks at the context files to
> resolve URLs that it can't resolve looking in the webapps directory)

Yes, it's auto-expanded from a war in webapps. I have to admit to being a bit baffled though -- I can't find this rule anywhere in the Tomcat docs, but I'm a beginner really and they're not the clearest :-)

hossman wrote:
> This is why the examples of using context files on the wiki talk about
> keeping the war *outside* of the webapps directory, and using docBase in
> your Context declaration...
> http://wiki.apache.org/solr/SolrTomcat

Great, I'll try it this way and see if it clears up. Is it okay to keep the war file *inside* the Solr home directory (/opt/solr in my case) so it's all self-contained?

Many thanks,

Andrew.
Re: Problem getting Solr home from JNDI in Tomcat
Andrew Clegg wrote:
> hossman wrote:
>> This is why the examples of using context files on the wiki talk about
>> keeping the war *outside* of the webapps directory, and using docBase in
>> your Context declaration...
>> http://wiki.apache.org/solr/SolrTomcat
>
> Great, I'll try it this way and see if it clears up. Is it okay to keep
> the war file *inside* the Solr home directory (/opt/solr in my case) so
> it's all self-contained?

For the benefit of future searchers -- I tried it this way and it works fine. Thanks again to everyone for helping.

Andrew.
Quotes in query string cause NullPointerException
Hi folks,

I'm using the 2009-09-30 build, and any single or double quotes in the query string cause an NPE. Is this normal behaviour? I never tried it with my previous installation.

Example:

http://myserver:8080/solr/select/?title:%22Creatine+kinase%22

(I've also tried without the URL encoding, no difference)

Response:

HTTP Status 500 - null java.lang.NullPointerException
    at java.io.StringReader.<init>(StringReader.java:33)
    at org.apache.lucene.queryParser.QueryParser.parse(QueryParser.java:173)
    at org.apache.solr.search.LuceneQParser.parse(LuceneQParserPlugin.java:78)
    at org.apache.solr.search.QParser.getQuery(QParser.java:131)
    at org.apache.solr.handler.component.QueryComponent.prepare(QueryComponent.java:89)
    at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:174)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
    at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
    at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
    at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:175)
    at org.apache.catalina.valves.RequestFilterValve.process(RequestFilterValve.java:269)
    at org.apache.catalina.valves.RemoteAddrValve.invoke(RemoteAddrValve.java:81)
    at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:568)
    at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:128)
    at org.jstripe.tomcat.probe.Tomcat55AgentValve.invoke(Tomcat55AgentValve.java:20)
    at org.jstripe.tomcat.probe.Tomcat55AgentValve.invoke(Tomcat55AgentValve.java:20)
    at org.jstripe.tomcat.probe.Tomcat55AgentValve.invoke(Tomcat55AgentValve.java:20)
    at org.jstripe.tomcat.probe.Tomcat55AgentValve.invoke(Tomcat55AgentValve.java:20)
    at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
    at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
    at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:286)
    at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:844)
    at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:583)
    at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:447)
    at java.lang.Thread.run(Thread.java:619)

Single quotes have the same effect.

Is there another way to specify exact phrases?

Thanks,

Andrew.
Re: Quotes in query string cause NullPointerException
Sorry! I'm officially a complete idiot. Personally I'd try to catch things like that and rethrow a 'QueryParseException' or something -- but don't feel under any obligation to listen to me because, well, I'm an idiot.

Thanks :-)

Andrew.

Erik Hatcher-4 wrote:
> don't forget q=... :)
>
> Erik
>
> On Oct 1, 2009, at 9:49 AM, Andrew Clegg wrote:
>> [original message quoted in full -- see above]
Result missing from query, but match shows in Field Analysis tool
Hi,

I have a field in my index called related_ids, indexed and stored, with the following field type: [definition stripped by the archive; it had positionIncrementGap="100" and a pattern tokenizer with pattern="\W*\s+\W*"]

Several records in my index contain the token 1cuk in the related_ids field, but only *some* of them are returned when I query on this. e.g. if I send a query like this:

http://localhost:8080/solr/select/?q=id:2.40.50+AND+related_ids:1cuk&version=2.2&start=0&rows=20&indent=on&fl=id,title,related_ids

I get a single hit for the record with id:2.40.50 . But if I try this, on a different record with id:2.40 :

http://localhost:8080/solr/select/?q=id:2.40+AND+related_ids:1cuk&version=2.2&start=0&rows=20&indent=on&fl=id,title,related_ids

I get no hits. However, if I just query for id:2.40 ...

http://localhost:8080/solr/select/?q=id:2.40&version=2.2&start=0&rows=20&indent=on&fl=id,title,related_ids

I can clearly see the token "1cuk" in the related_ids field.

Not only that, but if I copy and paste record 2.40's related_ids field into the Field Analysis tool in the admin interface, and search on "1cuk", the term 1cuk is visible in the index analyzer's term list, and highlighted! So Field Analysis thinks that I *should* be getting a hit for this term.

Can anyone suggest how I'd go about diagnosing this? I'm kind of hitting a brick wall here.

If it makes any difference, related_ids for the culprit record 2.40 is large-ish but not enormous (31000 terms). Also I've tried stopping and restarting Solr in case it was some weird caching thing.

Thanks in advance,

Andrew.
Re: Result missing from query, but match shows in Field Analysis tool
That's probably it! It is quite near the end of the field. I'll try upping it and re-indexing.

Thanks :-)

Erick Erickson wrote:
> I'm really reaching here, but lucene only indexes the first 10,000 terms by
> default (you can up the limit). Is there a chance that you're hitting that
> limit? That 1cuk is past the 10,000th term in record 2.40?
>
> For this to be possible, I have to assume that the FieldAnalysis
> tool ignores this limit.
>
> FWIW
> Erick
>
> On Fri, Oct 23, 2009 at 12:01 PM, Andrew Clegg wrote:
>> [original message quoted in full -- see above]
Solr ignoring maxFieldLength?
Morning,

Last week I was having a problem with terms visible in my search results in large documents not causing query hits:

http://www.nabble.com/Result-missing-from-query%2C-but-match-shows-in-Field-Analysis-tool-td26029040.html#a26029351

Erick suggested it might be related to maxFieldLength, so I set this to 2147483647 in my solrconfig.xml and reindexed over the weekend.

Unfortunately I'm having the same problem now, even though Erick appears to be right! I've narrowed it down to a single document for testing purposes, and I can get it returned by querying for a term near the beginning, but terms near the end cause no hit. I can even find the point part way through the document after which none of the remaining terms seem to cause a hit.

The document is about 32000 terms long, most of which is in a single field called related_ids of about 31000 terms. My first thought was that the text was being chopped up into so many tokens that it was going over the maxFieldLength anyway, but 2147483647/32000 = 67109, and it seems very unlikely that 67109 tokens would be generated per term!

I've tried undeploying and redeploying the whole web app from Tomcat in case the new maxFieldLength hadn't been read, but no difference. If I go to

http://localhost:8080/solr/admin/file/?file=solrconfig.xml

I can see

<maxFieldLength>2147483647</maxFieldLength>

as expected.

Does anyone have any more ideas? This could potentially be a showstopper for us as we have quite a few long-ish documents to index. (32K words doesn't seem that long to me, but still...)

I've tried it with today's nightly build (2009-10-26) and it makes no difference. If this sounds like a bug, I'll open a JIRA and attach tars of my config and data directories. Any thoughts?

Thanks,

Andrew.
Re: Solr ignoring maxFieldLength?
Yep, I just re-indexed it again to make double sure -- same problem unfortunately. My solrconfig.xml and schema.xml are attached.

In case you want to see it in action on the same data I've got, I've tarred up my data and conf directories here:

http://biotext.org.uk/static/solr-issue-example.tar.gz

That should be enough to reproduce it with.

Thanks!

Andrew.

Yonik Seeley-2 wrote:
> Yes, please show us your solrconfig.xml, and verify that you reindexed
> the document after changing maxFieldLength and restarting solr.
>
> I'll also see if I can reproduce a problem with maxFieldLength being
> ignored.
>
> -Yonik
> http://www.lucidimagination.com
>
> On Mon, Oct 26, 2009 at 7:11 AM, Andrew Clegg wrote:
>> [original message quoted in full -- see above]

http://www.nabble.com/file/p26060882/solrconfig.xml solrconfig.xml
http://www.nabble.com/file/p26060882/schema.xml schema.xml
Re: Solr ignoring maxFieldLength?
Yonik Seeley-2 wrote:
> Sorry Andrew, this is something that's bitten people before.
> search for maxFieldLength and you will see *2* of them in your config
> - one for indexDefaults and one for mainIndex.
> The one in mainIndex is set at 10000 and hence overrides the one in
> indexDefaults.

Sorry -- schoolboy error. Glad I'm not the only one though. Yes, that seems to have fixed it...

Cheers,

Andrew.
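The gotcha in solrconfig.xml form -- a sketch with everything else left out:

  <indexDefaults>
    <maxFieldLength>2147483647</maxFieldLength>
  </indexDefaults>

  <mainIndex>
    <!-- this one silently wins over the indexDefaults value -->
    <maxFieldLength>10000</maxFieldLength>
  </mainIndex>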
Re: Solr ignoring maxFieldLength?
Yonik Seeley-2 wrote:
> If you could, it would be great if you could test commenting out the
> one in mainIndex and see if it inherits correctly from
> indexDefaults... if so, I can comment it out in the example and remove
> one other little thing that people could get wrong.

Yep, it seems perfectly happy like this. I'm going to try commenting out all of mainIndex to see if it can successfully inherit everything from indexDefaults -- since I have lockType "single" I won't need an unlockOnStartup entry, which doesn't appear in indexDefaults (at least in any of the config files I've seen). So...

<useCompoundFile>false</useCompoundFile>
<mergeFactor>10</mergeFactor>
<maxMergeDocs>2147483647</maxMergeDocs>
<maxFieldLength>2147483647</maxFieldLength>
<writeLockTimeout>1000</writeLockTimeout>
<!-- one more element, archived only as "1"; the original tag was lost -->
<lockType>single</lockType>

If the big overnight indexing job fails with these settings, I'll let you know.

Cheers,

Andrew.
Greater-than and less-than in data import SQL queries
Hi,

If I have a DataImportHandler query with a greater-than sign in, like this:

<entity name="..." query="select *, title as keywords from cathnode_text where node_depth > 4">

Everything's fine. However, if it contains a less-than sign:

<entity name="..." query="select *, title as keywords from cathnode_text where node_depth < 4">

I get this exception:

INFO: Processing configuration from solrconfig.xml: {config=dataconfig.xml}
[Fatal Error] :240:129: The value of attribute "query" associated with an element type "null" must not contain the '<' character.
27-Oct-2009 15:30:49 org.apache.solr.handler.dataimport.DataImportHandler inform
SEVERE: Exception while loading DataImporter
org.apache.solr.handler.dataimport.DataImportHandlerException: Exception occurred while initializing context
    at org.apache.solr.handler.dataimport.DataImporter.loadDataConfig(DataImporter.java:184)
    at org.apache.solr.handler.dataimport.DataImporter.<init>(DataImporter.java:101)
    at org.apache.solr.handler.dataimport.DataImportHandler.inform(DataImportHandler.java:113)
    at org.apache.solr.core.SolrResourceLoader.inform(SolrResourceLoader.java:424)
    at org.apache.solr.core.SolrCore.<init>(SolrCore.java:588)
    at org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:137)
    at org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:83)
    at org.apache.catalina.core.ApplicationFilterConfig.getFilter(ApplicationFilterConfig.java:275)
    at org.apache.catalina.core.ApplicationFilterConfig.setFilterDef(ApplicationFilterConfig.java:397)
    at org.apache.catalina.core.ApplicationFilterConfig.<init>(ApplicationFilterConfig.java:108)
    at org.apache.catalina.core.StandardContext.filterStart(StandardContext.java:3709)
    at org.apache.catalina.core.StandardContext.start(StandardContext.java:4356)
    at org.apache.catalina.manager.ManagerServlet.start(ManagerServlet.java:1244)
    at org.apache.catalina.manager.HTMLManagerServlet.start(HTMLManagerServlet.java:604)
    at org.apache.catalina.manager.HTMLManagerServlet.doGet(HTMLManagerServlet.java:129)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:690)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:803)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:290)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
    at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
    at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:175)
    at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:525)
    at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:568)
    at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:128)
    at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
    at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
    at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:286)
    at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:844)
    at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:583)
    at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:447)
    at java.lang.Thread.run(Thread.java:619)
Caused by: org.xml.sax.SAXParseException: The value of attribute "query" associated with an element type "null" must not contain the '<' character.
    at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(DOMParser.java:239)
    at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:283)
    at org.apache.solr.handler.dataimport.DataImporter.loadDataConfig(DataImporter.java:172)
    ... 30 more
30 more Is this fixable, or an unavoidable feature of Xerces? If the latter, perhaps the docs could benefit from a note to say "use NOT a >= b" or something? Speaking of, I found this in the wiki examples for the DIH: Shouldn't that be one equals sign: deltaImportQuery="select * from item where ID='${dataimporter.delta.id}'" Or is it doing something clever with Java operators? Cheers, Andrew. -- View this message in context: http://www.nabble.com/Greater-than-and-less-than-in-data-import-SQL-queries-tp26080100p26080100.html Sent from the Solr - User mailing list archive at Nabble.com.
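For reference, the escaped form of the failing entity would look something like this (a sketch -- the entity's other attributes aren't visible in the archived message, so only the query is taken from the thread):

<entity query="select *, title as keywords from cathnode_text where node_depth &lt; 4">

The > character happens to be legal inside an XML attribute value, which is why the greater-than version parsed fine; < always has to be escaped as &lt;.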
Re: Greater-than and less-than in data import SQL queries
Heh, eventually I decided "where 4 > node_depth" was the most pleasing (if slightly WTF-ish) way of writing it... Cheers, Andrew.

Erik Hatcher-4 wrote:
>
> Use &lt; instead of < in that attribute. That should fix the issue.
> Remember, it's an XML file, so it has to obey XML encoding rules which
> make it ugly but whatcha gonna do?
>
> Erik
>
> On Oct 27, 2009, at 11:50 AM, Andrew Clegg wrote:
>
>> Hi,
>>
>> If I have a DataImportHandler query with a greater-than sign in, like this:
>>
>> <entity query="select *, title as keywords from cathnode_text where node_depth > 4">
>>
>> Everything's fine. However, if it contains a less-than sign:
>>
>> <entity query="select *, title as keywords from cathnode_text where node_depth < 4">
>>
>> I get this exception:
>>
>> [same stack trace as in the previous message -- snipped]
Faceting within one document
Hi, If I give a query that matches a single document, and facet on a particular field, I get a list of all the terms in that field which appear in that document. (I also get some with a count of zero, I don't really understand where they come from... ?) Is it possible with faceting, or a similar mechanism, to get a count of how many times each term appears within that document? This would be really useful for building a list of top keywords within a long document, for summarization purposes. I can do it on the client side but it'd be nice to know if there's a quicker way. Thanks! Andrew. -- View this message in context: http://www.nabble.com/Faceting-within-one-document-tp26099278p26099278.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Faceting within one document
Isn't the TermVectorComponent more for one document at a time, and the TermsComponent for the whole index? Actually -- having done some digging... What I'm really after is the most informative terms in a given document, which should take into account global document frequency as well as term frequency in the document at hand. I think I can use the MoreLikeThisHandler to do this, with a bit of experimentation... Thanks for the facet mincount tip BTW. Andrew. Avlesh Singh wrote: > > For facets - > http://wiki.apache.org/solr/SimpleFacetParameters#facet.mincount > For terms - http://wiki.apache.org/solr/TermsComponent > > Helps? > > Cheers > Avlesh > > On Wed, Oct 28, 2009 at 11:32 PM, Andrew Clegg > wrote: > >> >> Hi, >> >> If I give a query that matches a single document, and facet on a >> particular >> field, I get a list of all the terms in that field which appear in that >> document. >> >> (I also get some with a count of zero, I don't really understand where >> they >> come from... ?) >> >> Is it possible with faceting, or a similar mechanism, to get a count of >> how >> many times each term appears within that document? >> >> This would be really useful for building a list of top keywords within a >> long document, for summarization purposes. I can do it on the client side >> but it'd be nice to know if there's a quicker way. >> >> Thanks! >> >> Andrew. >> >> -- >> View this message in context: >> http://www.nabble.com/Faceting-within-one-document-tp26099278p26099278.html >> Sent from the Solr - User mailing list archive at Nabble.com. >> >> > > -- View this message in context: http://www.nabble.com/Faceting-within-one-document-tp26099278p26099847.html Sent from the Solr - User mailing list archive at Nabble.com.
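For anyone following along, a request of roughly this shape (host, port, and document ID illustrative) shows what MoreLikeThis considers the top terms for a single document, without fetching any matches:

http://localhost:8983/solr/mlt?q=id:SOME_ID&rows=0&mlt.fl=keywords&mlt.mintf=1&mlt.mindf=1&mlt.interestingTerms=details

mlt.interestingTerms=details returns each selected term with its relative boost, which makes it easier to judge the "most informative terms" question than the plain list format does.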
dismax and query analysis
Morning, Can someone clarify how dismax queries work under the hood? I couldn't work this particular point out from the documentation... I get that they pretty much issue the user's query against all of the fields in the schema -- or rather, all of the fields you've specified in the qf parameter in the config or the request. But, does each of these 'sub'-queries get analyzed according to the normal analysis rules for the field it's getting sent to? Or are they passed through verbatim? I'm hoping it's the former, as we have a variety of different field types with radically different tokenization and filtering... Also, is there any plan to implement wildcards in dismax, or is this unfeasible? Thanks once again :-) Andrew. -- View this message in context: http://www.nabble.com/dismax-and-query-analysis-tp26111465p26111465.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: dismax and query analysis
Thanks, that demonstrates it really nicely. Now if only dismax did wildcards too... :-) Cheers, Andrew. ANithian wrote: > > The best way to get started with answering this is to pass the > &debugQuery=true and to scroll down the results page. Here, you will see a > breakdown of how the query you entered in the q field is being parsed and > sent to lucene via the pf,qf, and bf. You can also see how the weights > affect the different score and why one document was ranked higher than > another. > > The text of the query will be analyzed depending on the set of analyzers > assigned to that particular field for queries (as opposed to indexing). > For > example, if "test" is matched against a "string" vs "text" field, > different > analyzers may be applied to "string" or "text" > > Hope that helps > Amit > > On Thu, Oct 29, 2009 at 4:39 AM, Andrew Clegg > wrote: > >> >> Morning, >> >> Can someone clarify how dismax queries work under the hood? I couldn't >> work >> this particular point out from the documentation... >> >> I get that they pretty much issue the user's query against all of the >> fields >> in the schema -- or rather, all of the fields you've specified in the qf >> parameter in the config or the request. >> >> But, does each of these 'sub'-queries get analyzed according to the >> normal >> analysis rules for the field it's getting sent to? Or are they passed >> through verbatim? >> >> I'm hoping it's the former, as we have a variety of different field types >> with radically different tokenization and filtering... >> >> Also, is there any plan to implement wildcards in dismax, or is this >> unfeasible? >> >> Thanks once again :-) >> >> Andrew. >> >> -- >> View this message in context: >> http://www.nabble.com/dismax-and-query-analysis-tp26111465p26111465.html >> Sent from the Solr - User mailing list archive at Nabble.com. >> >> > > -- View this message in context: http://www.nabble.com/dismax-and-query-analysis-tp26111465p26118506.html Sent from the Solr - User mailing list archive at Nabble.com.
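To make the thread concrete, a dismax handler wired up along these lines (handler and field names are illustrative, not from the original posts) sends the analyzed query to each field in qf with that field's own query-time analyzer:

<requestHandler name="dismax" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">dismax</str>
    <str name="qf">title^2.0 keywords^1.5 body</str>
    <str name="pf">title^3.0</str>
  </lst>
</requestHandler>

Appending &debugQuery=true to a request against such a handler shows the resulting per-field DisjunctionMaxQuery and the score breakdown Amit describes.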
Re: Faceting within one document
Are you sure? I've *never* explicitly deleted a document, I only ever rebuild the entire index with the data import handler's "full import with cleaning" operation. Lance Norskog-2 wrote: > > 0-value facets are left behind by docs which you have deleted. If you > optimize, there should be no 0-value facets. > > On Wed, Oct 28, 2009 at 11:36 AM, Andrew Clegg > wrote: >> >> >> Isn't the TermVectorComponent more for one document at a time, and the >> TermsComponent for the whole index? >> >> Actually -- having done some digging... What I'm really after is the most >> informative terms in a given document, which should take into account >> global >> document frequency as well as term frequency in the document at hand. I >> think I can use the MoreLikeThisHandler to do this, with a bit of >> experimentation... >> >> Thanks for the facet mincount tip BTW. >> >> Andrew. >> >> >> Avlesh Singh wrote: >>> >>> For facets - >>> http://wiki.apache.org/solr/SimpleFacetParameters#facet.mincount >>> For terms - http://wiki.apache.org/solr/TermsComponent >>> >>> Helps? >>> >>> Cheers >>> Avlesh >>> >>> On Wed, Oct 28, 2009 at 11:32 PM, Andrew Clegg >>> wrote: >>> >>>> >>>> Hi, >>>> >>>> If I give a query that matches a single document, and facet on a >>>> particular >>>> field, I get a list of all the terms in that field which appear in that >>>> document. >>>> >>>> (I also get some with a count of zero, I don't really understand where >>>> they >>>> come from... ?) >>>> >>>> Is it possible with faceting, or a similar mechanism, to get a count of >>>> how >>>> many times each term appears within that document? >>>> >>>> This would be really useful for building a list of top keywords within >>>> a >>>> long document, for summarization purposes. I can do it on the client >>>> side >>>> but it'd be nice to know if there's a quicker way. >>>> >>>> Thanks! >>>> >>>> Andrew. >>>> >>>> -- >>>> View this message in context: >>>> http://www.nabble.com/Faceting-within-one-document-tp26099278p26099278.html >>>> Sent from the Solr - User mailing list archive at Nabble.com. >>>> >>>> >>> >>> >> >> -- >> View this message in context: >> http://www.nabble.com/Faceting-within-one-document-tp26099278p26099847.html >> Sent from the Solr - User mailing list archive at Nabble.com. >> >> > > > > -- > Lance Norskog > goks...@gmail.com > > -- View this message in context: http://www.nabble.com/Faceting-within-one-document-tp26099278p26119536.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Faceting within one document
Actually Avlesh pointed me at that, earlier in the thread. But thanks :-) Yonik Seeley-2 wrote: > > On Wed, Oct 28, 2009 at 2:02 PM, Andrew Clegg > wrote: >> If I give a query that matches a single document, and facet on a >> particular >> field, I get a list of all the terms in that field which appear in that >> document. >> >> (I also get some with a count of zero, I don't really understand where >> they >> come from... ?) > > By default, solr has a facet.mincount of zero, so it includes terms > that don't match your set of documents. > Try facet.mincount=1 > > -Yonik > http://www.lucidimagination.com > > >> Is it possible with faceting, or a similar mechanism, to get a count of >> how >> many times each term appears within that document? >> >> This would be really useful for building a list of top keywords within a >> long document, for summarization purposes. I can do it on the client side >> but it'd be nice to know if there's a quicker way. >> >> Thanks! >> >> Andrew. >> >> -- >> View this message in context: >> http://www.nabble.com/Faceting-within-one-document-tp26099278p26099278.html >> Sent from the Solr - User mailing list archive at Nabble.com. >> >> > > -- View this message in context: http://www.nabble.com/Faceting-within-one-document-tp26099278p26120291.html Sent from the Solr - User mailing list archive at Nabble.com.
NullPointerException with TermVectorComponent
Hi, I've recently added the TermVectorComponent as a separate handler, following the example in the supplied config file, i.e.:

<searchComponent name="tvComponent" class="org.apache.solr.handler.component.TermVectorComponent"/>

<requestHandler name="/tvrh" class="org.apache.solr.handler.component.SearchHandler">
  <lst name="defaults">
    <bool name="tv">true</bool>
  </lst>
  <arr name="last-components">
    <str>tvComponent</str>
  </arr>
</requestHandler>

It works, but with one quirk. When you use tv.all=true, you get the tf*idf scores in the output, just fine (along with tf and df). But if you use tv.tf_idf=true you get an NPE:

http://server:8080/solr/tvrh/?q=1cuk&version=2.2&indent=on&tv.tf_idf=true

HTTP Status 500 - null java.lang.NullPointerException
at org.apache.solr.handler.component.TermVectorComponent$TVMapper.getDocFreq(TermVectorComponent.java:253)
at org.apache.solr.handler.component.TermVectorComponent$TVMapper.map(TermVectorComponent.java:245)
at org.apache.lucene.index.TermVectorsReader.readTermVector(TermVectorsReader.java:522)
at org.apache.lucene.index.TermVectorsReader.readTermVectors(TermVectorsReader.java:401)
at org.apache.lucene.index.TermVectorsReader.get(TermVectorsReader.java:378)
at org.apache.lucene.index.SegmentReader.getTermFreqVector(SegmentReader.java:1253)
at org.apache.lucene.index.DirectoryReader.getTermFreqVector(DirectoryReader.java:474)
at org.apache.solr.search.SolrIndexReader.getTermFreqVector(SolrIndexReader.java:244)
at org.apache.solr.handler.component.TermVectorComponent.process(TermVectorComponent.java:125)
at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:195)
at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
at (etc.)

Is this a bug, or am I doing it wrong? Cheers, Andrew. -- View this message in context: http://old.nabble.com/NullPointerException-with-TermVectorComponent-tp26156903p26156903.html Sent from the Solr - User mailing list archive at Nabble.com.
Highlighting is very slow
Hi everyone, I'm experimenting with highlighting for the first time, and it seems shockingly slow for some queries. For example, this query: http://server:8080/solr/select/?q=transferase&qt=dismax&version=2.2&start=0&rows=10&indent=on takes 313ms. But when I add highlighting: http://server:8080/solr/select/?q=transferase&qt=dismax&version=2.2&start=0&rows=10&indent=on&hl=true&hl.fl=*&fl=id it takes 305212ms = 5mins! Some of my documents are slightly large -- the 10 hits for that query contain between 362 bytes and 1.4 megabytes of text each. All fields are stored and indexed, and most are termvectored. But this doesn't seem excessively large! Has anyone else seen this sort of behaviour before? This is with a nightly from 2009-10-26. All suggestions would be appreciated. My schema and config files are attached... http://old.nabble.com/file/p26160216/schema.xml schema.xml http://old.nabble.com/file/p26160216/solrconfig.xml solrconfig.xml Thanks (once again), Andrew. -- View this message in context: http://old.nabble.com/Highlighting-is-very-slow-tp26160216p26160216.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Highlighting is very slow
We're on 1.6 already. Any chance you could share your GC settings? Thanks, Andrew. PS apologies for the duplicate message yesterday, Nabble threw an exception when I posted the first one. And the second one actually. Jaco-4 wrote: > > Hi, > > We had a similar case once (although not with those really long response > times). Fixed by moving to JRE 1.6 and tuning garbage collection. > > Bye, > > Jaco. > > 2009/11/3 Andrew Clegg > >> >> Hi everyone, >> >> I'm experimenting with highlighting for the first time, and it seems >> shockingly slow for some queries. >> >> For example, this query: >> >> >> http://server:8080/solr/select/?q=transferase&qt=dismax&version=2.2&start=0&rows=10&indent=on >> >> takes 313ms. But when I add highlighting: >> >> >> http://server:8080/solr/select/?q=transferase&qt=dismax&version=2.2&start=0&rows=10&indent=on&hl=true&hl.fl=*&fl=id >> >> it takes 305212ms = 5mins! >> >> Some of my documents are slightly large -- the 10 hits for that query >> contain between 362 bytes and 1.4 megabytes of text each. All fields are >> stored and indexed, and most are termvectored. But this doesn't seem >> excessively large! >> >> Has anyone else seen this sort of behaviour before? This is with a >> nightly >> from 2009-10-26. >> >> All suggestions would be appreciated. My schema and config files are >> attached... >> >> http://old.nabble.com/file/p26160217/schema.xml schema.xml >> http://old.nabble.com/file/p26160217/solrconfig.xml solrconfig.xml >> >> Thanks (once again), >> >> Andrew. >> >> -- >> View this message in context: >> http://old.nabble.com/Highlighting-is-very-slow-tp26160217p26160217.html >> Sent from the Solr - User mailing list archive at Nabble.com. >> >> > > -- View this message in context: http://old.nabble.com/Highlighting-is-very-slow-tp26160217p26194384.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Highlighting is very slow
Nicolas Dessaigne wrote: > > Alternatively, you could use a copyfield with a maxChars limit as your > highlighting field. Works well in my case. > Thanks for the tip. We did think about doing something similar (only enabling highlighting for certain shorter fields) but we decided that perhaps users would be confused if search terms were sometimes snippeted+highlighted and sometimes not. (A brief run through with a single user suggested this, although that's not statistically significant...) So we decided to avoid highlighting altogether until we can do it across the board. Cheers, Andrew. -- View this message in context: http://old.nabble.com/Highlighting-is-very-slow-tp26160216p26267441.html Sent from the Solr - User mailing list archive at Nabble.com.
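A sketch of the copyField approach Nicolas describes (field and type names are illustrative): store a truncated copy of each large field and highlight only that.

<field name="content" type="text" indexed="true" stored="true"/>
<field name="content_hl" type="text" indexed="true" stored="true"/>
<copyField source="content" dest="content_hl" maxChars="10000"/>

Queries would then use hl=true&hl.fl=content_hl rather than hl.fl=*, so the highlighter never has to re-analyze megabytes of stored text.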
Selection of terms for MoreLikeThis
Hi, If I run a MoreLikeThis query like the following:

http://www.cathdb.info/solr/mlt?q=id:3.40.50.720&rows=0&mlt.interestingTerms=list&mlt.match.include=false&mlt.fl=keywords&mlt.mintf=1&mlt.mindf=1

one of the hits in the results is "and" (I don't do any stopword removal on this field). However if I look inside that document with the TermVectorComponent:

http://www.cathdb.info/solr/select/?q=id:3.40.50.720&tv=true&tv.all=true&tv.fl=keywords

I see that "and" has a measly tf.idf of 7.46E-4. But there are other terms with *much* higher tf.idf scores, e.g. one with:

<int name="tf">1</int>
<int name="df">10</int>
<double name="tf-idf">0.1</double>

that *don't* appear in the MoreLikeThis list. (I tried adding &mlt.maxwl=999 to the end of the MLT query but it makes no difference.)

What's going on? Surely something with tf.idf = 0.1 is a far better candidate for a MoreLikeThis query than something with tf.idf = 7.46E-4? Or does MoreLikeThis do some other heuristic magic to select good candidates, and sometimes get it wrong?

BTW the keywords field is indexed, stored, multi-valued and term-vectored. Thanks, Andrew. -- :: http://biotext.org.uk/ :: -- View this message in context: http://old.nabble.com/Selection-of-terms-for-MoreLikeThis-tp26286005p26286005.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Arguments for Solr implementation at public web site
Lukáš Vlček wrote: > > I am looking for good arguments to justify implementation a search for > sites > which are available on the public internet. There are many sites in > "powered > by Solr" section which are indexed by Google and other search engines but > still they decided to invest resources into building and maintenance of > their own search functionality and not to go with [user_query site: > my_site.com] google search. Why? > You're assuming that Solr is just used in these cases to index discrete web pages which Google etc. would be able to access via following navigational links. I would imagine that in a lot of cases, Solr is used to index database entities which are used to build [parts of] pages dynamically, and which might be viewable in different forms in various different pages. Plus, with stored fields, you have the option of actually driving a website off Solr instead of directly off a database, which might make sense from a speed perspective in some cases. And further, going back to page-only indexing -- you have no guarantee when Google will decide to recrawl your site, so there may be a delay before changes show up in their index. With an in-house search engine you can reindex as often as you like. Andrew. -- View this message in context: http://old.nabble.com/Arguments-for-Solr-implementation-at-public-web-site-tp26333987p26334734.html Sent from the Solr - User mailing list archive at Nabble.com.
Data import problem with child entity from different database
Morning all, I'm having problems with joining a child entity from one database to a parent from another... My entity definitions (names changed for brevity) boil down to a parent entity against db1 whose query selects, among other columns, c and d, with a nested child entity against db2 whose query selects d for the matching parent row -- see the sketch after this message.

c is getting indexed fine (it's stored, I can see field 'c' in the search results) but child.d isn't. I know the child table has data for the corresponding parent rows, and I've even watched the SQL queries against the child table appearing in Oracle's sqldeveloper as the DataImportHandler runs. But no content for child.d gets into the index. My schema contains a definition for a field called d, using the keywords_ids type (keywords_ids is a conservatively-analyzed text type which has worked fine in other contexts).

Two things occur to me. 1. db1 is PostgreSQL and db2 is Oracle, although the d field in both tables is just a char(4), nothing fancy. Could something weird with character encodings be happening? 2. d isn't a primary key in either parent or child, but this shouldn't matter should it?

Additional data points -- I also tried using the CachedSqlEntityProcessor to do in-memory table caching of child, but it didn't work then either. I got a lot of error messages like this:

No value available for the cache key : d in the entity : child

If anyone knows whether this is a known limitation (if so I can work round it), or an unexpected case (if so I'll file a bug report), please shout. I'm using 1.4. Yet again, many thanks :-) Andrew. -- View this message in context: http://old.nabble.com/Data-import-problem-with-child-entity-from-different-database-tp26334948p26334948.html Sent from the Solr - User mailing list archive at Nabble.com.
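For illustration, the shape of the config being described is roughly this (a reconstruction with invented table and entity names -- the original wasn't preserved in the archive, and the poster had changed the names for brevity anyway):

<dataConfig>
  <dataSource name="db1" driver="org.postgresql.Driver"
              url="jdbc:postgresql://host1/somedb"/>
  <dataSource name="db2" driver="oracle.jdbc.driver.OracleDriver"
              url="jdbc:oracle:thin:@host2:1521:somesid"/>
  <document>
    <entity name="parent" dataSource="db1"
            query="select id, c, d from parent_table">
      <entity name="child" dataSource="db2"
              query="select d from child_table where d = '${parent.d}'"/>
    </entity>
  </document>
</dataConfig>

The puzzle is why field c from the parent makes it into the index while child.d does not.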
Re: Arguments for Solr implementation at public web site
Lukáš Vlček wrote: > > When you need to search for something Lucene or Solr related, which one do > you use: > - generic Google > - go to a particular mail list web site and search from here (if there is > any search form at all) > Both of these (Nabble in the second case) in case any recent posts have appeared which Google hasn't picked up. Andrew. -- View this message in context: http://old.nabble.com/Arguments-for-Solr-implementation-at-public-web-site-tp26333987p26334980.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Selection of terms for MoreLikeThis
Any ideas on this? Is it worth sending a bug report? Those links are live, by the way, in case anyone wants to verify that MLT is returning suggestions with very low tf.idf. Cheers, Andrew.

Andrew Clegg wrote:
>
> Hi,
>
> If I run a MoreLikeThis query like the following:
>
> http://www.cathdb.info/solr/mlt?q=id:3.40.50.720&rows=0&mlt.interestingTerms=list&mlt.match.include=false&mlt.fl=keywords&mlt.mintf=1&mlt.mindf=1
>
> one of the hits in the results is "and" (I don't do any stopword removal
> on this field).
>
> However if I look inside that document with the TermVectorComponent:
>
> http://www.cathdb.info/solr/select/?q=id:3.40.50.720&tv=true&tv.all=true&tv.fl=keywords
>
> I see that "and" has a measly tf.idf of 7.46E-4. But there are other terms
> with *much* higher tf.idf scores, e.g. one with:
>
> <int name="tf">1</int>
> <int name="df">10</int>
> <double name="tf-idf">0.1</double>
>
> that *don't* appear in the MoreLikeThis list. (I tried adding
> &mlt.maxwl=999 to the end of the MLT query but it makes no difference.)
>
> What's going on? Surely something with tf.idf = 0.1 is a far better
> candidate for a MoreLikeThis query than something with tf.idf = 7.46E-4?
> Or does MoreLikeThis do some other heuristic magic to select good
> candidates, and sometimes get it wrong?
>
> BTW the keywords field is indexed, stored, multi-valued and term-vectored.
>
> Thanks,
>
> Andrew.
>

-- View this message in context: http://old.nabble.com/Selection-of-terms-for-MoreLikeThis-tp26286005p26335061.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Data import problem with child entity from different database
Noble Paul നോബിള് नोब्ळ्-2 wrote: > > no obvious issues. > you may post your entire data-config.xml > Here it is, exactly as last attempt but with usernames etc. removed. Ignore the comments and the unused FileDataSource... http://old.nabble.com/file/p26335171/dataimport.temp.xml dataimport.temp.xml Noble Paul നോബിള് नोब्ळ्-2 wrote: > > do w/o CachedSqlEntityProcessor first and then apply that later > Yep, that was just a bit of a wild stab in the dark to see if it made any difference. Thanks, Andrew. -- View this message in context: http://old.nabble.com/Data-import-problem-with-child-entity-from-different-database-tp26334948p26335171.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Selection of terms for MoreLikeThis
Chantal Ackermann wrote:
>
> no idea, I'm afraid - but could you send the output of
> interestingTerms=details?
> This at least would show what MoreLikeThis uses, in comparison to the
> TermVectorComponent you've already pasted.
>

I can, but I'm afraid they're not very illuminating!

http://www.cathdb.info/solr/mlt?q=id:3.40.50.720&rows=0&mlt.interestingTerms=details&mlt.match.include=false&mlt.fl=keywords&mlt.mintf=1&mlt.mindf=1

Apart from the response header (status 0, QTime 59), all I get back is the list of interesting terms, all twenty-five of them with a boost of exactly 1.0.

Cheers, Andrew. -- View this message in context: http://old.nabble.com/Selection-of-terms-for-MoreLikeThis-tp26286005p26336558.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Selection of terms for MoreLikeThis
Chantal Ackermann wrote:
>
> your URL does not include the parameter mlt.boost. Setting that to
> "true" made a noticeable difference for my queries.
>

Hmm, I'm really not sure if this is doing the right thing either. When I add it I get the terms back with these boosts:

1.0, 0.60737264, 0.27599618, 0.2476748, 0.24487767, 0.23969446, 0.1990452, 0.18447271, 0.13297324, 0.1233415, 0.11993817, 0.11789705, 0.117194556, 0.11164951, 0.10744005, 0.09943076, 0.097062066, 0.09287166, 0.0877542, 0.0864609, 0.08362857, 0.07988805, 0.079598725, 0.07747293, 0.075560644

"and" scores far more highly than much more discriminative words like "chloroplast" and "glyoxylate", both of which have *much* higher tf.idf scores than "and" according to the TermVectorComponent:

chloroplast: tf=8, df=1887, tf-idf=0.0042395336512983575
glyoxylate: tf=7, df=1111, tf-idf=0.0063006300630063005
and: tf=45, df=60316, tf-idf=7.460706943431262E-4

In fact an order of magnitude higher.

Chantal Ackermann wrote:
>
> If not, there is also the parameter
> mlt.minwl
> "minimum word length below which words will be ignored."
>
> All your other terms seem longer than 3, so it would help in this case?
> But seems a bit like work around.
>

Yeah, I could do that, or add a stopword list to that field. But there are some other common terms in the list like "protein" or "enzyme" that are long and not really stopwords, but have a similarly low tf.idf to "and":

protein: tf=43, df=189541, tf-idf=2.2686384476181933E-4
enzyme: tf=15, df=16712, tf-idf=8.975586404978459E-4

Plus, of course, I'm curious to know exactly how MLT is identifying those terms as important, and if it's a bug or my fault... Thanks for your help though! Do any of the Solr devs have an idea of the mechanism at work here? Andrew. -- View this message in context: http://old.nabble.com/Selection-of-terms-for-MoreLikeThis-tp26286005p26337677.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: 'Connection reset' in DataImportHandler Development Console
aerox7 wrote: > > Hi Andrew, > I download the last build of solr (1.4) and i have the same probleme with > DebugNow in Dataimport dev Console. have you found a solution ? > Sorry about slow reply, I've been on holiday. No, I never found a solution, it worked in some nightlies but not in others, if I remember correctly. I haven't tried it in 1.4 yet, I got around my problem another way. Andrew. -- View this message in context: http://old.nabble.com/%27Connection-reset%27-in-DataImportHandler-Development-Console-tp25005850p26779966.html Sent from the Solr - User mailing list archive at Nabble.com.
Filtering near-duplicates using TextProfileSignature
Hi, I'm interested in near-dupe removal as mentioned (briefly) here: http://wiki.apache.org/solr/Deduplication However the link for TextProfileSignature hasn't been filled in yet. Does anyone have an example of using TextProfileSignature that demonstrates the tunable parameters mentioned in the wiki? Thanks! Andrew. -- View this message in context: http://old.nabble.com/Filtering-near-duplicates-using-TextProfileSignature-tp27127151p27127151.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Filtering near-duplicates using TextProfileSignature
Thanks Erik, but I'm still a little confused as to exactly where in the Solr config I set these parameters. The example on the wiki page uses Lookup3Signature which (presumably) takes no parameters, so there's no indication in the XML examples of where you would set them. Unless I'm missing something. Thanks again, Andrew.

Erik Hatcher-4 wrote:
>
> On Jan 12, 2010, at 7:56 AM, Andrew Clegg wrote:
>> I'm interested in near-dupe removal as mentioned (briefly) here:
>>
>> http://wiki.apache.org/solr/Deduplication
>>
>> However the link for TextProfileSignature hasn't been filled in yet.
>>
>> Does anyone have an example of using TextProfileSignature that demonstrates
>> the tunable parameters mentioned in the wiki?
>
> There are some comments in the source code*, but they weren't made
> class-level. I'm fixing that and committing it now, but here's the
> comment:
>
> /**
>  * This implementation is copied from Apache Nutch.
>  * An implementation of a page signature. It calculates an MD5 hash
>  * of a plain text "profile" of a page.
>  * The algorithm to calculate a page "profile" takes the plain text version of
>  * a page and performs the following steps:
>  *
>  * remove all characters except letters and digits, and bring all characters
>  * to lower case,
>  * split the text into tokens (all consecutive non-whitespace characters),
>  * discard tokens equal or shorter than MIN_TOKEN_LEN (default 2 characters),
>  * sort the list of tokens by decreasing frequency,
>  * round down the counts of tokens to the nearest multiple of QUANT
>  * (QUANT = QUANT_RATE * maxFreq, where QUANT_RATE is 0.01f
>  * by default, and maxFreq is the maximum token frequency). If
>  * maxFreq is higher than 1, then QUANT is always higher than 2 (which
>  * means that tokens with frequency 1 are always discarded).
>  * tokens, which frequency after quantization falls below QUANT, are discarded.
>  * create a list of tokens and their quantized frequency, separated by spaces,
>  * in the order of decreasing frequency.
>  *
>  * This list is then submitted to an MD5 hash calculation.
>  */
>
> There are two parameters this implementation takes:
>
> quantRate = params.getFloat("quantRate", 0.01f);
> minTokenLen = params.getInt("minTokenLen", 2);
>
> Hope that helps.
>
> Erik
>
> * http://svn.apache.org/repos/asf/lucene/solr/trunk/src/java/org/apache/solr/update/processor/TextProfileSignature.java
>

-- View this message in context: http://old.nabble.com/Filtering-near-duplicates-using-TextProfileSignature-tp27127151p27128173.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Filtering near-duplicates using TextProfileSignature
Erik Hatcher-4 wrote:
>
> On Jan 12, 2010, at 9:15 AM, Andrew Clegg wrote:
>> Thanks Erik, but I'm still a little confused as to exactly where in
>> the Solr config I set these parameters.
>
> You'd configure them within the <processor> element, something like
> this:
>
> <str name="minTokenLen">5</str>
>

OK, thanks. (Should that really be str though, and not int or something?)

Erik Hatcher-4 wrote:
>
> Perhaps you could update the wiki with an example once you get it
> working?
>
> I'm flying a little blind here, just going off the source code, not
> trying it out for real.
>

Sure -- it won't be til next week at the earliest though. Cheers, Andrew. -- View this message in context: http://old.nabble.com/Filtering-near-duplicates-using-TextProfileSignature-tp27127151p27128493.html Sent from the Solr - User mailing list archive at Nabble.com.
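Pulling Erik's pieces together, a complete chain would presumably look something like this (untested sketch -- the signature field, source field, and parameter values are illustrative):

<updateRequestProcessorChain name="dedupe">
  <processor class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="signatureField">signature</str>
    <bool name="overwriteDupes">true</bool>
    <str name="fields">content</str>
    <str name="signatureClass">org.apache.solr.update.processor.TextProfileSignature</str>
    <!-- TextProfileSignature's tunables, per the source comment Erik quoted -->
    <str name="minTokenLen">3</str>
    <str name="quantRate">0.2</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>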
Skipping duplicates in DataImportHandler based on uniqueKey
Hi, Is there a way to get the DataImportHandler to skip already-seen records rather than reindexing them? The UpdateHandler has a capability which (as I understand it) means that a document whose uniqueKey matches one already in the index will be skipped instead of overwritten. Can the DIH be made to behave this way? If not, would it be an easy patch? This is using the XPathEntityProcessor by the way. Thanks, Andrew. -- :: http://biotext.org.uk/ :: -- View this message in context: http://lucene.472066.n3.nabble.com/Skipping-duplicates-in-DataImportHandler-based-on-uniqueKey-tp771559p771559.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Skipping duplicates in DataImportHandler based on uniqueKey
Marc Sturlese wrote: > > You can use deduplication to do that. Create the signature based on the > unique field or any field you want. > Cool, thanks, I hadn't thought of that. -- View this message in context: http://lucene.472066.n3.nabble.com/Skipping-duplicates-in-DataImportHandler-based-on-uniqueKey-tp771559p773268.html Sent from the Solr - User mailing list archive at Nabble.com.
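A sketch of Marc's suggestion (untested; field names are illustrative): compute the signature from the unique key field, so a re-imported record hashes to the same signature as the copy already in the index.

<processor class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory">
  <bool name="enabled">true</bool>
  <str name="signatureField">sig</str>
  <bool name="overwriteDupes">false</bool>
  <str name="fields">id</str>
  <str name="signatureClass">org.apache.solr.update.processor.Lookup3Signature</str>
</processor>

Whether a re-seen record ends up skipped or simply overwritten depends on how signatureField interacts with the schema's uniqueKey, so this is worth verifying on a test core first.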
ClassNotFoundException: org.apache.solr.response.VelocityResponseWriter
Hi, I'm trying to get the Velocity / Solritas feature to work for one core of a two-core Solr instance, but it's not playing nice. I know the right jars are being loaded, because I can see them mentioned in the log, but still I get a class not found exception:

09-May-2010 15:34:02 org.apache.solr.core.SolrResourceLoader replaceClassLoader
INFO: Adding 'file:/var/www/smesh/current/config/solr/twitter/lib/apache-solr-velocity-1.4.1-dev.jar' to classloader
09-May-2010 15:34:02 org.apache.solr.core.SolrResourceLoader replaceClassLoader
INFO: Adding 'file:/var/www/smesh/current/config/solr/twitter/lib/velocity-1.6.1.jar' to classloader
09-May-2010 15:34:02 org.apache.solr.core.SolrResourceLoader replaceClassLoader
...
SEVERE: org.apache.solr.common.SolrException: Error loading class 'org.apache.solr.response.VelocityResponseWriter'
at org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:373)
at org.apache.solr.core.SolrCore.createInstance(SolrCore.java:413)
...

I've attached the whole log, as it's quite big, and Nabble thinks it's spam because it has "too many 'anal' words" ;-) http://n3.nabble.com/file/n787256/solr.log solr.log

Here is the appropriate part of my solrconfig.xml for the core which is attempting to load Velocity:

<queryResponseWriter name="velocity" class="org.apache.solr.response.VelocityResponseWriter"/>

<requestHandler name="/itas" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="v.template">browse</str>
    <str name="v.properties">velocity.properties</str>
    <str name="v.contentType">text/html;charset=UTF-8</str>
    <str name="title">Solritas</str>
    <str name="wt">velocity</str>
    <str name="defType">standard</str>
    <str name="q.alt">*:*</str>
    <str name="rows">10</str>
    <str name="fl">*,score</str>
    <str name="facet">on</str>
    <str name="facet.field">author_name</str>
    <str name="facet.mincount">1</str>
  </lst>
</requestHandler>

Any ideas? Many thanks, once again! Andrew. -- View this message in context: http://lucene.472066.n3.nabble.com/ClassNotFoundException-org-apache-solr-response-VelocityResponseWriter-tp787256p787256.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: ClassNotFoundException: org.apache.solr.response.VelocityResponseWriter
Erik Hatcher-4 wrote:
>
> What version of Solr? Try switching to
> class="solr.VelocityResponseWriter", and if that doesn't work use
> class="org.apache.solr.request.VelocityResponseWriter". The first
> form is the recommended way to do it. The actual package changed in
> trunk not too long ago.
>

Hi Erik, This is with vanilla Solr 1.4. I got it working with solr.VelocityResponseWriter -- thanks.

However, I'm having trouble with the defType parameter. I want to use the standard query type so people can use nested booleans etc. in the queries. When I tried this in solrconfig:

<str name="defType">standard</str>

I got this from the Solritas page:

HTTP ERROR: 400
Unknown query type 'standard'
RequestURI=/solr/twitter/itas
Powered by Jetty://

However when I tried this:

<str name="defType">standard</str>

I got this exception:

HTTP ERROR: 500
null
java.lang.NullPointerException
at java.io.StringReader.<init>(StringReader.java:33)
at org.apache.lucene.queryParser.QueryParser.parse(QueryParser.java:197)
at org.apache.solr.search.LuceneQParser.parse(LuceneQParserPlugin.java:78)
at org.apache.solr.search.QParser.getQuery(QParser.java:131)
at org.apache.solr.handler.component.QueryComponent.prepare(QueryComponent.java:89)
at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:174)
at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1089)
at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:365)
at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:712)
at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:405)
at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:211)
at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:139)
at org.mortbay.jetty.Server.handle(Server.java:285)
at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:502)
at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:821)
at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:513)
at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:208)
at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:378)
at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:226)
at org.mortbay.thread.BoundedThreadPool$PoolThread.run(BoundedThreadPool.java:442)

RequestURI=/solr/twitter/itas
Powered by Jetty://

Do you know what the right way to do this is? Thanks, Andrew. -- View this message in context: http://lucene.472066.n3.nabble.com/ClassNotFoundException-org-apache-solr-response-VelocityResponseWriter-tp787256p787487.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: ClassNotFoundException: org.apache.solr.response.VelocityResponseWriter
Sorry -- in the second of those error messages (the NPE) I meant lucene not standard.

Andrew Clegg wrote:
>
> Hi Erik, This is with vanilla Solr 1.4. I got it working with
> solr.VelocityResponseWriter -- thanks.
>
> However, I'm having trouble with the defType parameter. [rest of
> quoted message and stack trace snipped]

-- View this message in context: http://lucene.472066.n3.nabble.com/ClassNotFoundException-org-apache-solr-response-VelocityResponseWriter-tp787256p787490.html Sent from the Solr - User mailing list archive at Nabble.com.
Fixed: Solritas on multicore Solr, using standard query handler (was Re: ClassNotFoundException: org.apache.solr.response.VelocityResponseWriter)
Don't worry Erik -- I figured this one out. For the benefit of future searchers, you need:

<str name="defType">lucene</str>

And to avoid the NullPointerException from the /solr/CORENAME/itas page, you actually need to supply a ?q=blah initial query. I just assumed it would give you a blank search page if you didn't supply a query.

N.B. In case this catches anyone out -- there's also a few places where you need to put the core name into the templates in the conf/velocity directory for the core. They don't pick this up automatically so you need to find any references to /solr/admin or /solr/itas and insert your core name in the middle. (Does anyone know if there'd be a simple way to make that automatic?)

Andrew Clegg wrote:
>
> Hi Erik, This is with vanilla Solr 1.4. I got it working with
> solr.VelocityResponseWriter -- thanks.
>
> However, I'm having trouble with the defType parameter. [rest of
> quoted message and stack trace snipped]

-- :: http://biotext.org.uk/ :: -- View this message in context: http://lucene.472066.n3.nabble.com/ClassNotFoundException-org-apache-solr-response-VelocityResponseWriter-tp787256p787589.html Sent from the Solr - User mailing list archive at Nabble.com.
How bad is stopping Solr with SIGKILL?
Hi folks, I had a Solr instance (in Jetty on Linux) taken down by a process monitoring tool (God) with a SIGKILL recently. How bad is this? Can it cause index corruption if it's in the middle of indexing something? Or will it just lose uncommitted changes? What if the signal arrives in the middle of the commit process? Unfortunately I can't tell exactly what it was doing at the time as someone's deleted the logfile :-( Thanks, Andrew. -- View this message in context: http://lucene.472066.n3.nabble.com/How-bad-is-stopping-Solr-with-SIGKILL-tp858119p858119.html Sent from the Solr - User mailing list archive at Nabble.com.
Indexing link targets in HTML fragments
Hi Solr gurus, I'm wondering if there is an easy way to keep the targets of hyperlinks from a field which may contain HTML fragments, while stripping the HTML. e.g. if I had a field that looked like this:

"This is the entire content of my field, but <a href="http://example.com/">some of the words</a> are a hyperlink."

Then I'd like to keep "http://example.com/" as a single token (along with all of the actual words) but not the "a" and "href", giving me:

"This is the entire content of my field but http://example.com/ some of the words are a hyperlink"

I'm thinking that since we're dealing with individual fragments rather than entire HTML pages, Tika/SolrCell may be poorly suited and/or too heavyweight -- but please correct me if I'm wrong. Maybe something using regular expressions? Does anyone have a code snippet they could share? Many thanks, Andrew. -- View this message in context: http://lucene.472066.n3.nabble.com/Indexing-link-targets-in-HTML-fragments-tp874547p874547.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Indexing link targets in HTML fragments
Lance Norskog-2 wrote:
>
> The PatternReplace and HTMLStrip tokenizers might be the right bet.
> The easiest way to go about this is to make a bunch of text fields
> with different analysis stacks and investigate them in the Schema
> Browser. You can paste an HTML document into the text box and see
> exactly how the words & markup get torn apart.
>

Thanks Lance, I'll experiment. For reference, for anyone else who comes across this thread -- the html in my original post might have got munged on the way into or out of the list server. It was supposed to look like this:

This is the entire content of my field, but [a href="http://example.com/"]some of the words[/a] are a hyperlink.

(but with real html tags instead of the square brackets) and I am just trying to extract the words and the link target but lose the rest of the markup. Cheers, Andrew. -- View this message in context: http://lucene.472066.n3.nabble.com/Indexing-link-targets-in-HTML-fragments-tp874547p875503.html Sent from the Solr - User mailing list archive at Nabble.com.
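One regex-based sketch of Lance's suggestion (untested, and assuming your Solr version ships PatternReplaceCharFilterFactory): rewrite anchor tags so the href target survives as a bare token, then strip whatever markup is left before tokenizing.

<fieldType name="text_links" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- turn an opening anchor tag into its bare href target, e.g. " http://example.com/ " -->
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="&lt;a\s[^&gt;]*href=&quot;([^&quot;]*)&quot;[^&gt;]*&gt;"
                replacement=" $1 "/>
    <!-- then strip the remaining markup -->
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

Pasting a sample fragment into the admin analysis page for this field type, as Lance suggests, is the quickest way to check that the URL comes through as a single token.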
Re: Indexing link targets in HTML fragments
findbestopensource wrote:
>
> Could you tell us your schema used for indexing. In my opinion, using
> standard analyzer / Snowball analyzer will do the best. They will not break
> the URLs. Add href, and other related html tags as part of stop words and
> they will be removed while indexing.
>

This project's still in the planning stages -- I haven't designed the pipeline yet. But you're right, maybe starting with everything and just stopping out the tag and attribute names is the most fail-safe approach. Then at least if I get something wrong I won't miss anything. Worst case scenario, I just end up with some extra terms in the index. Thanks, Andrew. -- View this message in context: http://lucene.472066.n3.nabble.com/Indexing-link-targets-in-HTML-fragments-tp874547p876343.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Filtering near-duplicates using TextProfileSignature
Neeb wrote: > > Just wondering if you ever managed to run TextProfileSignature based > deduplication. I would appreciate it if you could send me the code > fragment for it from solrconfig. > Actually the project that was for got postponed and I got distracted by other things, for now at least. Re. your config, I don't see a minTokenLength in the wiki page for deduplication, is this a recent addition that's not documented yet? It looks okay to me though -- perhaps you could do some empirical tests to see if it's working? i.e. add some near-dupes to a collection manually and see if it finds them? Andrew. -- View this message in context: http://lucene.472066.n3.nabble.com/Filtering-near-duplicates-using-TextProfileSignature-tp479039p880379.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Filtering near-duplicates using TextProfileSignature
Andrew Clegg wrote: > > Re. your config, I don't see a minTokenLength in the wiki page for > deduplication, is this a recent addition that's not documented yet? > Sorry about this -- stupid question -- I should have read back through the thread and refreshed my memory. -- View this message in context: http://lucene.472066.n3.nabble.com/Filtering-near-duplicates-using-TextProfileSignature-tp479039p880385.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Filtering near-duplicates using TextProfileSignature
Markus Jelsma wrote: > > Well, it got me too! KMail didn't properly order this thread. Can't seem > to > find Hatcher's reply anywhere. ??!!? > Whole thread here: http://lucene.472066.n3.nabble.com/Filtering-near-duplicates-using-TextProfileSignature-tt479039.html -- View this message in context: http://lucene.472066.n3.nabble.com/Filtering-near-duplicates-using-TextProfileSignature-tp479039p881797.html Sent from the Solr - User mailing list archive at Nabble.com.