Re: SOLR 1.2 - Updates sent containing fields that are not on the Schema fail silently

2007-11-29 Thread Daniel Alheiros
Hi Hoss.

Well, I'll enable this ignore option for fields that aren't declared in my
schema. Thanks.

Exactly, you can try it really easily, just remove one of your fields on the
example schema config and try to add content using the Java client API...
Well I'm using SOLRJ and it returns no error code for me. But anyway don't
you think the server should also have some logging informing that documents
are being discarded?

Cheers,
Daniel


On 28/11/07 19:25, "Chris Hostetter" <[EMAIL PROTECTED]> wrote:

> 
> : I didn't know that trick.
> 
> erik is referring to this in the example schema.xml...
> 
>
>
> 
> ...but it sounds like you are having some other problem ... you said that
> when you POST your documents with "extra" fields you get a 200
> response but the documents aren't getting indexed at all correct?
> 
> that is not supposed to happen, Solr should be generating an error.  can
> you give us more info on your setup: what does your schema.xml look like,
> what does your update code look like (you said you were using SolrJ i
> believe?) what does Solr log when these updates happen, etc...
> 
> 
> 
> -Hoss
> 


http://www.bbc.co.uk/
This e-mail (and any attachments) is confidential and may contain personal 
views which are not the views of the BBC unless specifically stated.
If you have received it in error, please delete it from your system.
Do not use, copy or disclose the information in any way nor act in reliance on 
it and notify the sender immediately.
Please note that the BBC monitors e-mails sent or received.
Further communication will signify your consent to this.



Re: Schema class configuration syntax

2007-11-29 Thread Ryan McKinley

Norskog, Lance wrote:

Hi-
 
What is the  element in an  element that will load

this class:
 
org.apache.lucene.analysis.cn.ChineseFilter
 
This did not work:
 
 


This is in Solr 1.2.
 


the class needs to point to a FilterFactory (not a Filter)

1.3-dev adds FilterFactories for all the Lucene contrib filters.  Using
1.2, add a jar file with this class and you should be all set:


http://svn.apache.org/repos/asf/lucene/solr/trunk/src/java/org/apache/solr/analysis/ChineseFilterFactory.java

ryan


Re: SOLR 1.2 - Updates sent containing fields that are not on the Schema fail silently

2007-11-29 Thread Ryan McKinley
To be clear:  solr *should* fail with an error if you send an unknown 
field.


I just tested this with a clean checkout of 1.3-dev and 1.2 and in both 
cases I get an error 400 "unknown field 'asgasdgasgd'"


The suggestion to look at the "ignore option" is to make sure you don't
have one -- this should be the only way to add an arbitrary unknown field
without an error.


From a clean 1.2/1.3-dev install, how can you reproduce the error?

I tried:
$ ant example
$ cd example/
$ java -jar start.jar

another terminal:
edit mem.xml to add: <field name="asgasdgasgd">5</field>
$ cd example/exampledocs
$ ./post.sh mem.xml

this gives:
HTTP ERROR: 400
ERROR:unknown field 'asgasdgasgd'

running either 1.2 or 1.3

ryan


Daniel Alheiros wrote:

Hi Hoss.

Well, I'll enable this ignore option for fields that aren't declared in my
schema. Thanks.

Exactly, you can try it really easily, just remove one of your fields on the
example schema config and try to add content using the Java client API...
Well I'm using SOLRJ and it returns no error code for me. But anyway don't
you think the server should also have some logging informing that documents
are being discarded?

Cheers,
Daniel


On 28/11/07 19:25, "Chris Hostetter" <[EMAIL PROTECTED]> wrote:


: I didn't know that trick.

erik is referring to this in the example schema.xml...

   

   

...but it sounds like you are having some other problem ... you said that
when you POST your documents with "extra" fields you get a 200
response but the documents aren't getting indexed at all correct?

that is not supposed to happen, Solr should be generating an error.  can
you give us more info on your setup: what does your schema.xml look like,
what does your update code look like (you said you were using SolrJ i
believe?) what does Solr log when these updates happen, etc...



-Hoss










Re: LowerCaseFilterFactory and spellchecker

2007-11-29 Thread Sean Timm
It seems the best thing to do would be to do a case-insensitive 
spellcheck, but provide the suggestion preserving the original case that 
the user provided--or at least make this an option.  Users are often 
lazy about capitalization, especially with search where they've learned 
from web search engines that case (typically) doesn't matter.


So, for example, Thurne would return Thorne, but thurne would return thorne.
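
A minimal sketch of that behaviour in Python, assuming a hypothetical lower-cased suggestion table in place of the real spellcheck index (the names `SUGGESTIONS`, `suggest`, `restore_case` and `spellcheck` are illustrative only and are not part of Solr):

```python
# Hypothetical lower-cased dictionary standing in for the spellcheck index.
SUGGESTIONS = {"thurne": "thorne"}


def suggest(term):
    """Look up a suggestion case-insensitively; fall back to the lower-cased term."""
    lowered = term.lower()
    return SUGGESTIONS.get(lowered, lowered)


def restore_case(original, suggestion):
    """Copy the capitalization pattern of the user's input onto the suggestion."""
    if original.isupper():
        return suggestion.upper()
    if original[:1].isupper():
        return suggestion[:1].upper() + suggestion[1:]
    return suggestion


def spellcheck(term):
    """Case-insensitive lookup, case-preserving result."""
    return restore_case(term, suggest(term))


print(spellcheck("Thurne"))
print(spellcheck("thurne"))
```

This only handles all-caps, leading-capital and lower-case input; mixed-case input would need a per-character mapping.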

-Sean

John Stewart wrote:

Rob,

Let's say it worked as you want it to in the first place.  If the
query is for Thurne, wouldn't you get thorne (lower-case 't') as the
suggestion?  This may look weird for proper names.

jds
  


How much disc space Solr consumes?

2007-11-29 Thread Evgeniy Strokin
Hello. If the index size is 100 GB and I want to run the optimize command, how
much more space do I need for this?
Also, if I run snapshooter, does it take more space while taking the snapshot
than the finished snapshot itself?
Thank you
Gene

can I do *thing* substring searches at all?

2007-11-29 Thread Brian Whitman
With a fieldtype of string, can I do any sort of *thing* search? I  
can do thing* but not *thing or *thing*. Workarounds?







Re: can I do *thing* substring searches at all?

2007-11-29 Thread Charles Hornberger
Store a copy with the string reversed in another field. Then you can
search that field for gniht* ...
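
A rough illustration of the reversed-field trick in Python, outside of Solr. The document list and the helper names (`reverse_term`, `suffix_search`) are made up for the example; in Solr the reversed copy would live in a second field populated at index time:

```python
def reverse_term(value):
    """Return the string reversed, e.g. 'thing' -> 'gniht'."""
    return value[::-1]


# Hypothetical field values; the reversed copy plays the role of the second field.
docs = ["something", "thing", "nothing else", "think"]
reversed_field = {doc: reverse_term(doc) for doc in docs}


def suffix_search(suffix):
    """Emulate *suffix by running a prefix match against the reversed copy."""
    prefix = reverse_term(suffix)
    return sorted(doc for doc, rev in reversed_field.items() if rev.startswith(prefix))


print(suffix_search("thing"))
```

Note that this covers *thing only; a true *thing* (infix) search needs a different approach.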

Also, I believe I saw some comments about prefix wildcards being
available in some upcoming release (1.3?) ... sorry I can't remember
any better than that. Google may help ...

-Charlie

On Nov 29, 2007 2:51 PM, Brian Whitman <[EMAIL PROTECTED]> wrote:
> With a fieldtype of string, can I do any sort of *thing* search? I
> can do thing* but not *thing or *thing*. Workarounds?
>
>
>
>
>


Document field data not getting indexed

2007-11-29 Thread Phillip Farber


Hi,

I have 22 documents. I index these by posting them using LWP::UserAgent 
all with http status 200 OK.


One of my documents (id=44) contains the word "Campeau" in the "ocr" 
field.  But according to luke this term does not appear in the index. 
Yet when I delete the index (delete by query *:* or restart the server after
deleting /index) and index just document id=44, its ocr field data does
appear in the index according to luke.


Also I notice that the numTerms for 22 documents is 5579 and for just 
the doc id=44 it's 2194.  Hard to believe that 22 documents only 
increase the number of terms by so little.


Why/how could this be happening?

Thanks,

Phil

---

My schema.xml:

  required="true"/>
   required="true"/>
   required="true"/>


where "mytext" is

 
  
  
  
  
  


Indexing 22 docs:
-


22
22
5579
1196382086904
true
true
false
2007-11-30T00:22:06Z



mytext
IT---
(unstored field)
22
5513

[...]
22
22
22  ???
22
22


Indexing just doc id=44:



1
1
2194
1196381821086
true
true
false
2007-11-30T00:17:21Z



mytext
IT---
(unstored field)
1
2191

[...]
1
1
1
1  <<
1
1
1





Distribution without SSH?

2007-11-29 Thread Justin Knoll

Hello,
I recently set up Solr with distribution on a couple of servers. I  
just learned that our network policies do not permit us to use SSH  
with passphraseless keys, and the snappuller script uses SSH to  
examine the master Solr instance's state before it pulls the newest  
index via rsync.


We plan to attempt to rewrite the snappuller (and possibly other
distribution scripts, as required) to eliminate this dependency on
SSH. I thought I'd ask the list in case anyone has experience with this
same situation or any insights into the reasoning behind requiring
SSH access to the master instance.


Thanks,
Justin Knoll


Re: Document field data not getting indexed

2007-11-29 Thread Yonik Seeley
On Nov 29, 2007 7:29 PM, Phillip Farber <[EMAIL PROTECTED]> wrote:
> One of my documents (id=44) contains the word "Campeau" in the "ocr"
> field.  But according to luke this term does not appear in the index.

AFAIK the Luke handler lists the top terms, not necessarily all of them.
Do a search for ocr:Campeau and see if it returns anything.

-Yonik


Re: Document field data not getting indexed

2007-11-29 Thread Chris Hostetter

see yonik's comments regarding Luke and whether or not your term is
indexed.  as for this point

: Also I notice that the numTerms for 22 documents is 5579 and for just the doc
: id=44 it's 2194.  Hard to believe that 22 documents only increase the number
: of terms by so little.

this is not surprising.  numTerms is the number of *unique* terms,
independent of how many documents each term appears in -- if the word
"eclipse" appears in the ocr field of 17 documents a total of 457 times,
it is still only counted once in numTerms.
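
A small Python sketch of why the count grows slowly. The three documents and the whitespace tokenization are invented for the example and are much simpler than a real analyzer:

```python
# Hypothetical documents; splitting on whitespace stands in for real analysis.
docs = {
    "doc1": "eclipse of the moon",
    "doc2": "the eclipse eclipse",
    "doc3": "moon landing",
}


def num_terms(documents):
    """Count unique terms across all documents, ignoring how often each occurs."""
    unique = set()
    for text in documents.values():
        unique.update(text.lower().split())
    return len(unique)


print(num_terms({"doc1": docs["doc1"]}))  # one document on its own
print(num_terms(docs))                    # all three documents together
```

The first document alone contributes four terms, and adding two more documents raises the total by only one, because most of their words are already present.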


-Hoss



Re: LowerCaseFilterFactory and spellchecker

2007-11-29 Thread Chris Hostetter

: think i'm just doing something wrong...
: 
: was experimenting with the spellcheck handler with the nightly
: checkout from 11-28; seems my spellchecking is case-sensitive, even
: tho i think i'm adding the LowerCaseFilterFactory to both the index
: and query analyzers.

I'm not very familiar with the SpellCheckerRequestHandler, but i don't 
think you are doing anything wrong.

a quick skim of the code indicates that the "q" param isn't being analyzed
by that handler, so the raw input string is passed to the
SpellChecker.suggestSimilar method. This may or may not have been
intentional.

I personally can't think of any reason why it wouldn't make sense to get
the query analyzer for the termSourceField and use it to analyze the q
param before getting suggestions.



-Hoss



Re: SOLR/Lucene sorting - Question/ requesting suggestion

2007-11-29 Thread Ryan McKinley

Kasi Sankaralingam wrote:

When we have the following set of data, they are first sorted based on capital
letters and then lower case. Is there a way to make them sort regardless of
character case?

Avaneesh
Bruce
Veda
caroleY
jonathan
junit

So carole would come after Bruce. Thanks



sorting is based on the token, not the stored field.  Use a fieldType 
that includes the LowerCaseFilterFactory


Check the 'alphaOnlySort' fieldType in the example schema.xml -- that 
makes a token with lowercase and tosses any non letters (you can get rid 
of the PatternReplaceFilterFactory but it is a good example)
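
To see the effect outside of Solr, here is a minimal Python sketch using the names from the question. The `sort_key` helper is hypothetical and only approximates the behaviour described above (lowercase, drop non-letters); it is not the actual alphaOnlySort definition:

```python
import re

names = ["Avaneesh", "Bruce", "Veda", "caroleY", "jonathan", "junit"]


def sort_key(value):
    """Lower-case the value and drop anything that is not a letter a-z."""
    return re.sub(r"[^a-z]", "", value.lower())


# Plain sort compares raw characters, so capitals come before lower case.
print(sorted(names))

# Sorting on the normalized key ignores case.
print(sorted(names, key=sort_key))
```

With the normalized key, caroleY lands right after Bruce and Veda moves to the end.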


ryan


Re: Distribution without SSH?

2007-11-29 Thread Matt Kangas
Your company's network policies seem to be a good thing. I've worked  
at places with this same policy, for good reason. But it does tend to  
complicate operations sometimes. Some options you might pursue:


* Set up ssh-agent on the clients and use passphrase-protected keys.  
Downside to this, someone on your ops team will be inevitably awoken  
at 4am to type in the password.
* Try to get an exception to the policy by running Solr under a new  
user account inside a jail. Use a restricted login shell to make sure  
it can do only what you intend. So when the key is compromised,  
damage is contained.


Or, write a custom server/client running on a different port. In this  
case you lose over-the-wire encryption, and if your server is buggy,  
you get pwn3d anyway.


--Matt

On Nov 29, 2007, at 7:48 PM, Justin Knoll wrote:


Hello,
I recently set up Solr with distribution on a couple of servers. I  
just learned that our network policies do not permit us to use SSH  
with passphraseless keys, and the snappuller script uses SSH to  
examine the master Solr instance's state before it pulls the newest  
index via rsync.


We plan to attempt to rewrite the snappuller (and possibly other
distribution scripts, as required) to eliminate this dependency on
SSH. I thought I'd ask the list in case anyone has experience with
this same situation or any insights into the reasoning behind
requiring SSH access to the master instance.


Thanks,
Justin Knoll


--
Matt Kangas / [EMAIL PROTECTED]




SOLR/Lucene sorting - Question/ requesting suggestion

2007-11-29 Thread Kasi Sankaralingam
When we have the following set of data, they are first sorted based on capital
letters and then lower case. Is there a way to make them sort regardless of
character case?

Avaneesh
Bruce
Veda
caroleY
jonathan
junit

So carole would come after Bruce. Thanks



Re: LowerCaseFilterFactory and spellchecker

2007-11-29 Thread Mike Klaas

On 29-Nov-07, at 5:40 PM, Chris Hostetter wrote:



> I'm not very familiar with the SpellCheckerRequestHandler, but i don't
> think you are doing anything wrong.
>
> a quick skim of the code indicates that the "q" param isn't being analyzed
> by that handler, so the raw input string is passed to the
> SpellChecker.suggestSimilar method. This may or may not have been
> intentional.
>
> I personally can't think of any reason why it wouldn't make sense to get
> the query analyzer for the termSourceField and use it to analyze the q
> param before getting suggestions.


It does make some sense, but I'm not sure that it should be blindly  
analyzed without adding logic to handle certain cases (like the  
QueryParser does).  What happens if the analyzer produces two  
tokens?  The spellchecker has to deal with this appropriately.  Spell  
checkers should be able to "reverse analyze" the suggestions as well,  
so "Pyhton" gets corrected to "Python" and not "python".  Similarly,  
"ad-hco" should probably suggest "ad-hoc" and not "adhoc".


-Mike


Re: SOLR 1.2 - Updates sent containing fields that are not on the Schema fail silently

2007-11-29 Thread Chris Hostetter

: Exactly, you can try it really easily, just remove one of your fields on the
: example schema config and try to add content using the Java client API...
: Well I'm using SOLRJ and it returns no error code for me. But anyway don't
: you think the server should also have some logging informing that documents
: are being discarded?

As someone who is not very familiar with SolrJ, I can imagine that perhaps
it has a bug where it might not return an error code in situations like
this (it would surprise me, but i can imagine it). However, I'm really
confused by your comment that the server isn't logging that documents are
being discarded.  If you try to index a document with a field Solr doesn't
recognize, it logs quite a big exception.  This is easily reproducible
using post.jar and the example schema (unchanged).  Running this
command...

java -Ddata=args -jar post.jar 'hoss'

...triggers these log messages in Solr...

Nov 29, 2007 6:09:28 PM org.apache.solr.common.SolrException log
SEVERE: org.apache.solr.common.SolrException: ERROR:unknown field 'hoss'
    at org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:245)
    at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:66)
    at org.apache.solr.handler.XmlUpdateRequestHandler.processUpdate(XmlUpdateRequestHandler.java:196)
...


...which leads me to suspect there's something wonky with your setup.

exactly which version of Solr are you using, what does your SolrJ code
look like, and what log messages do you see when a document is
*successfully* indexed?  you should see something like...

INFO: {add=[SOLR1000]} 0 102

...where the uniqueKey of your doc is in the [].  If you don't see those 
messages, then you aren't looking in the right place for Solr's log 
messages.





-Hoss