RE: problems with arabic search

2007-10-10 Thread Heba Farouk
I'm developing a Java application using Solr; this application works with
English search.

Yes, I have tried querying solr directly for Arabic and it's working

Any suggestions ??

-Original Message-
From: Chris Hostetter [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, October 10, 2007 5:50 AM
To: solr-user@lucene.apache.org
Subject: Re: problems with arabic search


FYI: you don't need to resend your question just because you didn't get a
reply within a day; either people haven't had a chance to reply, or they
don't know the answer.

: XML Parsing Error: mismatched tag. Expected: .
: 
: Location:
http://localhost:8080/solrServlet/searchServlet?query=%D9%85%D8%AD%D9%85
%D8%AF&cmdSearch=Search%21

this doesn't look like a query error ... and that doesn't look like a solr
URL, this looks like something you have in front of Solr.

: HTTP Status 400 - Query parsing error: Cannot parse 
: '': '*' or '?' not allowed as first character in 

that looks like a Solr error.  i'm guessing that your app isn't dealing
with the UTF-8 correctly: something is substituting "?" characters in place
of any character it doesn't understand - and Solr thinks you are trying to
do a wildcard query.

have you tried querying solr directly (in your browser or using curl) for
your arabic word?


-Hoss



Problems with mySolr Wiki

2007-10-10 Thread Christian Klinger

Hi Solr-Users,

I'm trying to follow the instructions [1] from the Solr wiki
to build my custom Solr server.

First I created the directory structure.

mySolr
--solr
  --conf
--schema.xml
--solrconfig.xml
--solr.xml <-- Where i can find this file?
--build.xml <-- copy & paste from the wiki

Then I run the command

$ ant mysolr.dist
Buildfile: build.xml

BUILD FAILED
/root/buildouts/mySolr/build.xml:10: Cannot find 
${env.SOLR_HOME}/build.xml imported from /root/buildouts/mySolr/build.xml


Total time: 0 seconds


Maybe someone has a tip for me?

Thanks for your help.
Christian


[1] http://wiki.apache.org/solr/mySolr



unlockOnStartup does not work in embedded solr?

2007-10-10 Thread Alexey Shakov
Hi *,

I use Solr as an embedded solution.
I have set unlockOnStartup to "true" in my solrconfig.xml,
but it seems that this option is ignored by embedded Solr.

Any ideas?

Thanks in advance,

Alexey




Manage multiple indexes with Solr

2007-10-10 Thread ycrux
Hi guys !

Is it possible to configure Solr to manage different indexes depending on the 
added documents ?

For example:
* document "1", with uniq ID "ui1" will be indexed in the "indexA"
* document "2", with uniq ID "ui2" will be indexed in the "indexB"
* document "3", with uniq ID "ui1" will be indexed in the "indexA"

Thus documents "1" and "3" are stored in index "indexA" and document "2" in 
index "indexB".
In this case "indexA" and "indexB" are completely separate indexes on disk.

Thanks in advance

cheers
Y.



Re: Manage multiple indexes with Solr

2007-10-10 Thread ycrux
Sorry, there's a mistake in my previous example.

Please read this:

* document "1", with uniq ID "ui1" will be indexed in the "indexA"
* document "2", with uniq ID "ui2" will be indexed in the "indexB"
* document "3", with uniq ID "ui3" will be indexed in the "indexA"

Thanks

cheers
Y.

Message d'origine
>De: [EMAIL PROTECTED]
>A: solr-user@lucene.apache.org
>Sujet: Manage multiple indexes with Solr
>Date: Wed, 10 Oct 2007 11:18:02 +0200
>
>Hi guys !
>
>Is it possible to configure Solr to manage different indexes depending on the 
>added documents ?
>
>For example:
>* document "1", with uniq ID "ui1" will be indexed in the "indexA"
>* document "2", with uniq ID "ui2" will be indexed in the "indexB"
>* document "3", with uniq ID "ui1" will be indexed in the "indexA"
>
>Thus documents "1" and "3" are stored in index "indexA" and document "2" in 
>index "indexB".
>In this case "indexA" and "indexB" are completely separate indexes on disk.
>
>Thanks in advance
>
>cheers
>Y.
>



Re: Manage multiple indexes with Solr

2007-10-10 Thread Venkatraman S
I would be interested to know about both the cases:

Case 1 :
* document "1", with uniq ID "ui1" will be indexed in the "indexA"
* document "2", with uniq ID "ui2" will be indexed in the "indexB"
* document "3", with uniq ID "ui3" will be indexed in the "indexA"

Case 2 :
* document "1", with uniq ID "ui1" will be indexed in the "indexA"
* document "2", with uniq ID "ui2" will be indexed in the "indexB"
* document "3", with uniq ID "ui1" will be indexed in the "indexA"

-vEnKAt


Different search results for (german) singular/plural searches - looking for a solution

2007-10-10 Thread Martin Grotzke
Hello,

with our application we have the issue that we get different
results for singular and plural searches (German language).

E.g. for "hose" we get 1.000 documents back, but for "hosen"
we get 10.000 docs. The same applies to "t-shirt" and "t-shirts",
or e.g. "hut" and "hüte" - lots of cases :)

This is absolutely correct according to the schema.xml, as right
now we do not have any stemming or synonyms included.

Now we want to have similar search results for these singular/plural
searches. I'm thinking of a solution for this, and want to ask, what
are your experiences with this.

Basically I see two options: stemming and the usage of synonyms. Are
there others?

My concern with stemming is that it might produce unexpected results,
so that docs are found that do not match the query from the user's point
of view. I assume that this needs a lot of testing with different data.

The issue with synonyms is that we would have to create a file
containing all synonyms, so we would have to figure out all cases, in
contrast to a solution that is based on an algorithm.
The advantage of this approach is IMHO, that it is very predictable
which results will be returned for a certain query.

Some background information:
Our documents contain products (id, name, brand, category, producttype,
description, color etc). The singular/plural issue basically applies to
the fields name, category and producttype, so we would like to restrict
the solution to these fields.

Do you have suggestions how to handle this?

Thanx in advance for sharing your experiences,
cheers,
Martin

-
Extracts of our schema.xml:

  (the fieldType and field definitions were not preserved in the archive)
-






Re: Different search results for (german) singular/plural searches - looking for a solution

2007-10-10 Thread Thomas Traeger

in short: use stemming

Try the SnowballPorterFilterFactory with German2 as the language attribute
first, and use synonyms for compound words, e.g. "Herrenhose" => "Herren",
"Hose".


By using stemming you will maybe get some "interesting" results, but it
is much better to live with them than to get no or far fewer results ;o)


Find more info on the Snowball stemming algorithms here:

http://snowball.tartarus.org/

Also have a look at the StopFilterFactory; here is a sample stopword list
for the German language:


http://snowball.tartarus.org/algorithms/german/stop.txt

Good luck,

Tom


Martin Grotzke schrieb:

Hello,

with our application we have the issue, that we get different
results for singular and plural searches (german language).

E.g. for "hose" we get 1.000 documents back, but for "hosen"
we get 10.000 docs. The same applies to "t-shirt" or "t-shirts",
of e.g. "hut" and "hüte" - lots of cases :)

This is absolutely correct according to the schema.xml, as right
now we do not have any stemming or synonyms included.

Now we want to have similar search results for these singular/plural
searches. I'm thinking of a solution for this, and want to ask, what
are your experiences with this.

Basically I see two options: stemming and the usage of synonyms. Are
there others?

My concern with stemming is, that it might produce unexpected results,
so that docs are found that do not match the query from the users point
of view. I asume that this needs a lot of testing with different data.

The issue with synonyms is, that we would have to create a file
containing all synonyms, so we would have to figure out all cases, in
contrast to a solutions that is based on an algorithm.
The advantage of this approach is IMHO, that it is very predictable
which results will be returned for a certain query.

Some background information:
Our documents contain products (id, name, brand, category, producttype,
description, color etc). The singular/plural issue basically applied to
the fields name, category and producttype, so we would like to restrict
the solution to these fields.

Do you have suggestions how to handle this?

Thanx in advance for sharing your experiences,
cheers,
Martin


  


showing results per facet-value efficiently

2007-10-10 Thread Britske

First of all, I just wanted to say that I just started working with Solr and
really like the results I'm getting from Solr (in terms of performance,
flexibility) as well as the good responses I'm getting from this group.
Hopefully I will be able to contribute in one way or another to this
wonderful application in the future!

The current issue that I'm having is the following (I tried not to be
long-winded, but somehow that didn't work out :-) ):

I'm extending StandardRequestHandler to not only show the counts per
facet-value but also the top-N results per facet-value (where N is
configurable). 
(See http://www.nabble.com/Result-grouping-options-tf4522284.html#a12900630
for where I got the idea from). 
I quickly implemented this by fetching a doclist for each of my facet-values
and appending these to the result as suggested in the referred post, no
problems there.

However, I realized that for calculating the count for each of the
facetvalues, the original StandardRequestHandler already loops the doclist
to check for matches. Therefore my implementation actually does double work,
since it gets doclists for each of the facetvalues again. 

My question: 
is there a way to get to the already calculated doclist per facetvalue from
a subclassed StandardRequestHandler, and so get a nice speedup?  This
facet calculation seems to go deep into the core of Solr
(SimpleFacets.getFacetTermEnumCounts) and seems not very sensible to alter
for just this requirement. Opinions appreciated.

Some additional info:

I have a  requirement to be able to limit the result to explicitly specified
facet-values. For that I do something like: 
select?
 qt=toplist
&q=name:A OR name:B OR  name:C 
&sort=sortfield asc 
&facet=true
&facet.field=name
&facet.limit=1
&rows=2

This all works okay and results in a faceting/grouping by field: 'name', 
where for each facetvalue (A, B, C)
2 results are shown (ordered by sortfield). 

The relevant code from the subclassed StandardRequestHandler is below. As
can be seen I alter the query by adding the facetvalue to FQ (which is
almost guaranteed to already exist in FQ btw.)

Therefore a second question is: 
will there be a noticeable speedup when pursuing the above, since the request
that is done per facet-value is nothing more than giving the ordered result
of the intersection of the overall query (which is in the querycache) and
the facetvalue itself (which is almost certainly in the filtercache). 

As a last and somewhat related question: 
is there a way to explicitly specify facet-values that I want to include in
the faceting without (ab)using Q? This is relevant for me since the perfect
solution would be to have the ability to orthogonally get multiple toplists
in 1 query. Given the current implementation, this orthogonality is now
'corrupted' as injection of a fieldvalue in Q for one facetfield influences
the outcome of another facetfield. 

kind regards, 
Geert-Jan



---
if (true) // TODO: this needs facetinfo as a precondition.
{
    NamedList facetFieldList = (NamedList) facetInfo.get("facet_fields");
    for (int i = 0; i < facetFieldList.size(); i++)
    {
        NamedList facetValList = (NamedList) facetFieldList.getVal(i);
        for (int j = 0; j < facetValList.size(); j++)
        {
            NamedList facetValue = new SimpleOrderedMap();
            // facetValue.add("count", valList.getVal(j));

            DocListAndSet resultList = new DocListAndSet();
            Query facetq = QueryParsing.parseQuery(
                facetFieldList.getName(i) + ":" + facetValList.getName(j),
                req.getSchema());
            resultList.docList = s.getDocList(query, facetq, sort,
                p.getInt(CommonParams.START, 0),
                p.getInt(CommonParams.ROWS, 3));

            facetValue.add("results", resultList.docList);
            facetValList.setVal(j, facetValue);
        }
    }
    rsp.add("facet_results", facetFieldList);
}



RE: Solr and KStem

2007-10-10 Thread Wagner,Harry
Hi Piete,
Good idea. Thanks.  One other change that should probably be made is to
change the package statement from org.oclc.solr.analysis to
org.apache.solr.analysis.  Thanks again.

Cheers!
harry

-Original Message-
From: Pieter Berkel [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, October 09, 2007 9:10 PM
To: solr-user@lucene.apache.org
Subject: Re: Solr and KStem

Hi Harry,

I re-discovered this thread last week and have made some minor changes to
the code (removing deprecation warnings) so that it compiles with trunk. I
think it would be quite useful to get this stemmer into Solr once all the
legal / licensing issues are resolved.  If there are no objections, I'll
open a JIRA ticket and upload my changes so we can make sure we're all
working with the same code.

cheers,
Piete



On 11/09/2007, Wagner,Harry <[EMAIL PROTECTED]> wrote:
>
> Bill,
> Currently it is a plug-in.  Put the lower case filter ahead of kstem,
> just as for porter (example below).  You can use it with porter, but I
> can't imagine why you would want to.  At least not in the same
analyzer.
> Hope this helps.
>
> 
>   
> 
>  words="stopwords.txt"/>
>  generateWordParts="1" generateNumberParts="1" catenateWords="1"
> catenateNumbers="1" catenateAll="0"/>
> 
>  cacheSize="2"/>
> 
>   
>   
> 
>  synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
>  words="stopwords.txt"/>
>  generateWordParts="1" generateNumberParts="1" catenateWords="0"
> catenateNumbers="0" catenateAll="0"/>
> 
>  cacheSize="2"/>
> 
>   
> 
>
> Cheers... harry
>
>


Re: problems with arabic search

2007-10-10 Thread Grant Ingersoll
Can you give more detail about what you have done?  What character  
encoding do you have your browser set to?  In Firefox, do View ->  
Character Encoding to see what it is set to when you are on the input  
page?  Internet Explorer and other browsers have other options.  Are  
you sending the query directly to Solr or is it going through some  
other servlet?  If you are doing this, and _IF_ I recall correctly, I  
believe you need to tell your servlet the input is UTF-8 before doing  
anything else with the request.


See http://kickjava.com/src/filters/SetCharacterEncodingFilter.java.htm
for a Servlet Filter that does this (it's even Apache licensed!)  You will
need to hook it up in your web.xml.
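
The core of such a filter is just a call to setCharacterEncoding() before
anything reads the request parameters; a minimal sketch along those lines
(the class name here is made up, and it still needs a filter-mapping entry
in web.xml):

  import java.io.IOException;
  import javax.servlet.Filter;
  import javax.servlet.FilterChain;
  import javax.servlet.FilterConfig;
  import javax.servlet.ServletException;
  import javax.servlet.ServletRequest;
  import javax.servlet.ServletResponse;

  // Forces UTF-8 on every incoming request before any parameter is read.
  public class Utf8EncodingFilter implements Filter {
      public void init(FilterConfig config) throws ServletException {}

      public void doFilter(ServletRequest request, ServletResponse response,
                           FilterChain chain) throws IOException, ServletException {
          if (request.getCharacterEncoding() == null) {
              request.setCharacterEncoding("UTF-8");
          }
          chain.doFilter(request, response);
      }

      public void destroy() {}
  }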


On Oct 10, 2007, at 2:59 AM, Heba Farouk wrote:


I'm developing a java application using solr, this application is
working with English search

Yes, I have tried querying solr directly for Arabic and it's working

Any suggestions ??

-Original Message-
From: Chris Hostetter [mailto:[EMAIL PROTECTED]
Sent: Wednesday, October 10, 2007 5:50 AM
To: solr-user@lucene.apache.org
Subject: Re: problems with arabic search


FYI: you don't need to resend your question just because you didn't  
get

a
reply within a day, either people haven't had a chance to reply, or  
they


don't know the answer.

: XML Parsing Error: mismatched tag. Expected: .
:
: Location:
http://localhost:8080/solrServlet/searchServlet?query=%D9%85%D8%AD% 
D9%85

%D8%AF&cmdSearch=Search%21

this doesn't look like a query error .. and that doesn't look like a
solr
URL, this looks something you have in front of Solr.

: HTTP Status 400 - Query parsing error: Cannot parse
: '': '*' or '?' not allowed as first character in

that looks like a Solr error.  i'm guessing that your app isn't  
dealing

with the UTF8 correctly, something is substituting "?" characters in
place
of any character it doesn't understand - and Solr thinks you are  
trying

to
do a wildcard query.

have you tried querying solr directly (in your browser or using curl)
for
your arabic word?


-Hoss



--
Grant Ingersoll
http://lucene.grantingersoll.com

Lucene Boot Camp Training:
ApacheCon Atlanta, Nov. 12, 2007.  Sign up now!  http://www.apachecon.com




RE: problems with arabic search

2007-10-10 Thread Heba Farouk
In Firefox, the character encoding is set to UTF-8.
Yes, I'm sending the query directly to Solr using Apache HttpClient, and
I set the HTTP request header content type to Content-Type="text/html;
charset=UTF-8".

Any suggestions

Thanks in advance
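
For comparison, a minimal sketch of sending the term itself as UTF-8 with
commons-httpclient (host, port and handler path are illustrative). Note
that a Content-Type header on a GET request has no effect on how the query
string is decoded; what matters is that the term is percent-encoded as
UTF-8 bytes and that the response is read as UTF-8:

  import java.io.InputStreamReader;
  import java.net.URLEncoder;
  import org.apache.commons.httpclient.HttpClient;
  import org.apache.commons.httpclient.methods.GetMethod;

  // URL-encodes the Arabic term as UTF-8 and reads the response as UTF-8,
  // so no "?" substitution can happen on either side of the request.
  public class ArabicQuery {
      public static void main(String[] args) throws Exception {
          String word = "\u0645\u062D\u0645\u062F";   // sample Arabic term
          String url = "http://localhost:8080/solr/select?q="
                     + URLEncoder.encode(word, "UTF-8");
          HttpClient client = new HttpClient();
          GetMethod get = new GetMethod(url);
          try {
              client.executeMethod(get);
              InputStreamReader reader =
                  new InputStreamReader(get.getResponseBodyAsStream(), "UTF-8");
              // ... parse the XML response from the reader ...
          } finally {
              get.releaseConnection();
          }
      }
  }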

-Original Message-
From: Grant Ingersoll [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, October 10, 2007 2:43 PM
To: solr-user@lucene.apache.org
Subject: Re: problems with arabic search

Can you give more detail about what you have done?  What character  
encoding do you have your browser set to?  In Firefox, do View ->  
Character Encoding to see what it is set to when you are on the input  
page?  Internet Explorer and other browsers have other options.  Are  
you sending the query directly to Solr or is it going through some  
other servlet?  If you are doing this, and _IF_ I recall correctly, I  
believe you need to tell your servlet the input is UTF-8 before doing  
anything else with the request.

See http://kickjava.com/src/filters/ 
SetCharacterEncodingFilter.java.htm for a Servlet Filter that does  
this (it's even Apache licensed!)  You will need to hook it up in  
your web.xml.

On Oct 10, 2007, at 2:59 AM, Heba Farouk wrote:

> I'm developing a java application using solr, this application is
> working with English search
>
> Yes, I have tried querying solr directly for Arabic and it's working
>
> Any suggestions ??
>
> -Original Message-
> From: Chris Hostetter [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, October 10, 2007 5:50 AM
> To: solr-user@lucene.apache.org
> Subject: Re: problems with arabic search
>
>
> FYI: you don't need to resend your question just because you didn't  
> get
> a
> reply within a day, either people haven't had a chance to reply, or  
> they
>
> don't know the answer.
>
> : XML Parsing Error: mismatched tag. Expected: .
> :
> : Location:
> http://localhost:8080/solrServlet/searchServlet?query=%D9%85%D8%AD% 
> D9%85
> %D8%AF&cmdSearch=Search%21
>
> this doesn't look like a query error .. and that doesn't look like a
> solr
> URL, this looks something you have in front of Solr.
>
> : HTTP Status 400 - Query parsing error: Cannot parse
> : '': '*' or '?' not allowed as first character in
>
> that looks like a Solr error.  i'm guessing that your app isn't  
> dealing
> with the UTF8 correctly, something is substituting "?" characters in
> place
> of any character it doesn't understand - and Solr thinks you are  
> trying
> to
> do a wildcard query.
>
> have you tried querying solr directly (in your browser or using curl)
> for
> your arabic word?
>
>
> -Hoss
>

--
Grant Ingersoll
http://lucene.grantingersoll.com

Lucene Boot Camp Training:
ApacheCon Atlanta, Nov. 12, 2007.  Sign up now!  http://www.apachecon.com



Re: problems with arabic search

2007-10-10 Thread Grant Ingersoll
Hmmm, by the looks of your query, it doesn't seem like it is a Solr  
query, but I admit I don't have all the parameters memorized.  What  
request handler, etc. are you using?  Have you tried debugging?


And you say you have tried a query with the Solr Admin query page,  
right?  And that works?  So what is the difference between that page  
(form.jsp in the Solr source) and your page?


Please give more details about your application.

-Grant


On Oct 10, 2007, at 8:49 AM, Heba Farouk wrote:


In firefox, character encoding is set to UTF-8
Yes, I'm sending the query directly to solr using apache httpclient  
and
I set the http request header content type to : Content-Type="text/ 
html;

charset=UTF-8"

Any suggestions

Thanks in advance

-Original Message-
From: Grant Ingersoll [mailto:[EMAIL PROTECTED]
Sent: Wednesday, October 10, 2007 2:43 PM
To: solr-user@lucene.apache.org
Subject: Re: problems with arabic search

Can you give more detail about what you have done?  What character
encoding do you have your browser set to?  In Firefox, do View ->
Character Encoding to see what it is set to when you are on the input
page?  Internet Explorer and other browsers have other options.  Are
you sending the query directly to Solr or is it going through some
other servlet?  If you are doing this, and _IF_ I recall correctly, I
believe you need to tell your servlet the input is UTF-8 before doing
anything else with the request.

See http://kickjava.com/src/filters/
SetCharacterEncodingFilter.java.htm for a Servlet Filter that does
this (it's even Apache licensed!)  You will need to hook it up in
your web.xml.

On Oct 10, 2007, at 2:59 AM, Heba Farouk wrote:


I'm developing a java application using solr, this application is
working with English search

Yes, I have tried querying solr directly for Arabic and it's working

Any suggestions ??

-Original Message-
From: Chris Hostetter [mailto:[EMAIL PROTECTED]
Sent: Wednesday, October 10, 2007 5:50 AM
To: solr-user@lucene.apache.org
Subject: Re: problems with arabic search


FYI: you don't need to resend your question just because you didn't
get
a
reply within a day, either people haven't had a chance to reply, or
they

don't know the answer.

: XML Parsing Error: mismatched tag. Expected: .
:
: Location:
http://localhost:8080/solrServlet/searchServlet?query=%D9%85%D8%AD%
D9%85
%D8%AF&cmdSearch=Search%21

this doesn't look like a query error .. and that doesn't look like a
solr
URL, this looks something you have in front of Solr.

: HTTP Status 400 - Query parsing error: Cannot  
parse

: '': '*' or '?' not allowed as first character in

that looks like a Solr error.  i'm guessing that your app isn't
dealing
with the UTF8 correctly, something is substituting "?" characters in
place
of any character it doesn't understand - and Solr thinks you are
trying
to
do a wildcard query.

have you tried querying solr directly (in your browser or using curl)
for
your arabic word?


-Hoss



--
Grant Ingersoll
http://lucene.grantingersoll.com

Lucene Boot Camp Training:
ApacheCon Atlanta, Nov. 12, 2007.  Sign up now!  http://www.apachecon.com



--
Grant Ingersoll
http://www.grantingersoll.com/
http://lucene.grantingersoll.com
http://www.paperoftheweek.com/




Re: Availability Issues

2007-10-10 Thread Otis Gospodnetic
Hi,

- Original Message 
From: David Whalen <[EMAIL PROTECTED]>

On that note -- I've read that Jetty isn't the best servlet
container to use in these situations, is that your experience?

OG: In which situations?  Jetty is great, actually! (the pretty high traffic 
site in my sig runs Jetty)

Otis 

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Simpy -- http://www.simpy.com/  -  Tag  -  Search  -  Share



> -Original Message-
> From: Chris Hostetter [mailto:[EMAIL PROTECTED] 
> Sent: Monday, October 08, 2007 11:20 PM
> To: solr-user
> Subject: RE: Availability Issues
> 
> 
> : My logs don't look anything like that.  They look like HTTP
> : requests.  Am I looking in the wrong place?
> 
> what servlet container are you using?  
> 
> every servlet container handles application logs differently 
> -- it's especially tricky because even the format can be 
> changed, the examples i gave before are in the default format 
> you get if you use the jetty setup in the solr example (which 
> logs to stdout), but many servlet containers won't include 
> that much detail by default (they typically leave out the 
> classname and method name).  there's also typically a setting 
> that controls the verbosity -- so in some configurations only 
> the SEVERE messages are logged and in others the INFO 
> messages are logged ... you're going to want at least the 
> INFO level to debug stuff.
> 
> grep all the log files you can find for "Solr home set to" 
> ... that's one of the first messages Solr logs.  if you can 
> find that, you'll find the other messages i was talking about.
> 
> 
> -Hoss
> 
> 
> 





getting number of stored documents via rest api

2007-10-10 Thread Stefan Rinner

Hi

for some tests I need to know how many documents are stored in the  
index - is there a fast & easy way to retrieve this number (instead  
of searching for "*:*" and counting the results)?
I already took a look at the stats.jsp code - but there the number of  
documents is retrieved via an api call to SolrInfoRegistry and not  
the webservice.


thanks

- stefan


Re: getting number of stored documents via rest api

2007-10-10 Thread climbingrose
I think searching for "*:*" is the optimal way to do it. I don't think you can
do anything faster.

On 10/11/07, Stefan Rinner <[EMAIL PROTECTED]> wrote:
>
> Hi
>
> for some tests I need to know how many documents are stored in the
> index - is there a fast & easy way to retrieve this number (instead
> of searching for "*:*" and counting the results)?
> I already took a look at the stats.jsp code - but there the number of
> documents is retrieved via an api call to SolrInfoRegistry and not
> the webservice.
>
> thanks
>
> - stefan
>



-- 
Regards,

Cuong Hoang


Re: Problems with mySolr Wiki

2007-10-10 Thread Chris Hostetter

i'm not very familiar with that wiki, but note the line in the example ant 
script...

  

...

: --solr.xml <-- Where i can find this file?

according to the wiki page...

> First we will setup a basic directory structure (assuming we only want to 
> change some fields) and copy the attached build.xml and solr.xml:

...i assume it is referring to what you named build.xml ... if you name it 
solr.xml i think you would need to run "ant -f solr.xml mysolr.dist" ... 
but like i said, i'm not too familiar with that wiki, so i could be wrong.


-Hoss



Re: getting number of stored documents via rest api

2007-10-10 Thread Chris Hostetter

: there a fast & easy way to retrieve this number (instead of searching for
: "*:*" and counting the results)?

NOTE: you don't have to count the results to know the total number of 
docs matching any query ... just use the numFound attribute of the 
<result> block.
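
With the solrj client this is a couple of lines; a rough sketch (class
names follow the solrj trunk of the time and may differ between nightlies):

  import org.apache.solr.client.solrj.SolrQuery;
  import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
  import org.apache.solr.client.solrj.response.QueryResponse;

  // Matches every document but fetches none of them; numFound carries the count.
  public class DocCount {
      public static void main(String[] args) throws Exception {
          CommonsHttpSolrServer server =
              new CommonsHttpSolrServer("http://localhost:8983/solr");
          SolrQuery query = new SolrQuery("*:*");
          query.setRows(0);
          QueryResponse response = server.query(query);
          System.out.println("numFound: " + response.getResults().getNumFound());
      }
  }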

: I already took a look at the stats.jsp code - but there the number of
: documents is retrieved via an api call to SolrInfoRegistry and not the
: webservice.

stats.jsp returns well-formed XML (not HTML), so why not just hit that to 
extract the numDocs?



-Hoss



Re: getting number of stored documents via rest api

2007-10-10 Thread Chris Hostetter

: I think search for "*:*" is the optimal code to do it. I don't think you can
: do anything faster.

FYI: getting the data from the XML returned by stats.jsp is definitely 
faster in the case where you really want all docs.

if you want the total number from some other query however, don't "count" 
them yourself in the client ... use numFound.


-Hoss



WebException (ServerProtocolViolation) with SolrSharp

2007-10-10 Thread Filipe Correia
Hello,

I am trying to run SolrSharp's example application but am getting a
WebException with a ServerProtocolViolation status message.

After some debugging I found out this is happening with a call to:
http://localhost:8080/solr/update/

And using fiddler[1] found out that solr is actually throwing the
following exception:
org.apache.solr.core.SolrException: Error while creating field
'weight{type=sfloat,properties=indexed,stored,omitNorms,sortMissingLast}'
from value '1,234'
at org.apache.solr.schema.FieldType.createField(FieldType.java:173)
at org.apache.solr.schema.SchemaField.createField(SchemaField.java:94)
at 
org.apache.solr.update.DocumentBuilder.addSingleField(DocumentBuilder.java:57)
at 
org.apache.solr.update.DocumentBuilder.addField(DocumentBuilder.java:73)
at 
org.apache.solr.update.DocumentBuilder.addField(DocumentBuilder.java:83)
at 
org.apache.solr.update.DocumentBuilder.addField(DocumentBuilder.java:77)
at 
org.apache.solr.handler.XmlUpdateRequestHandler.readDoc(XmlUpdateRequestHandler.java:339)
at 
org.apache.solr.handler.XmlUpdateRequestHandler.update(XmlUpdateRequestHandler.java:162)
at 
org.apache.solr.handler.XmlUpdateRequestHandler.handleRequestBody(XmlUpdateRequestHandler.java:84)
at 
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:77)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:658)
at 
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:191)
at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:159)
at 
org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
at 
org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
at 
org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
at 
org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:175)
at 
org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:128)
at 
org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
at 
org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
at 
org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:263)
at 
org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:844)
at 
org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:584)
at 
org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:447)
at java.lang.Thread.run(Unknown Source)
Caused by: java.lang.NumberFormatException: For input string: "1,234"
at sun.misc.FloatingDecimal.readJavaFormatString(Unknown Source)
at java.lang.Float.parseFloat(Unknown Source)
at 
org.apache.solr.util.NumberUtils.float2sortableStr(NumberUtils.java:80)
at 
org.apache.solr.schema.SortableFloatField.toInternal(SortableFloatField.java:50)
at org.apache.solr.schema.FieldType.createField(FieldType.java:171)
... 24 more
type Status report
message Error while creating field
'weight{type=sfloat,properties=indexed,stored,omitNorms,sortMissingLast}'
from value '1,234'

I am just starting to try Solr, and might be missing some
configurations, but I have no clue where to begin to investigate this
further without digging into Solr's source, which I would really like
to avoid for now. Any thoughts?

thank you in advance,
Filipe Correia

[1] http://www.fiddlertool.com/



RE: Facets and running out of Heap Space

2007-10-10 Thread David Whalen
It looks now like I can't use facets the way I was hoping
to because the memory requirements are impractical.

So, as an alternative I was thinking I could get counts
by doing rows=0 and using filter queries.  

Is there a reason to think that this might perform better?
Or, am I simply moving the problem to another step in the
process?

DW
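
For concreteness, the rows=0 / filter-query approach described above would
look roughly like this in solrj (the field and values are invented, and
each value costs one request):

  import org.apache.solr.client.solrj.SolrQuery;
  import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

  // One request per candidate facet value: restrict with fq, fetch no rows,
  // and read the count from numFound.
  public class FilterQueryCounts {
      public static void main(String[] args) throws Exception {
          CommonsHttpSolrServer server =
              new CommonsHttpSolrServer("http://localhost:8983/solr");
          String[] mediaTypes = { "blog", "news", "video", "image" };
          for (String value : mediaTypes) {
              SolrQuery query = new SolrQuery("*:*");
              query.setRows(0);
              query.addFilterQuery("media_type:" + value);
              long count = server.query(query).getResults().getNumFound();
              System.out.println(value + ": " + count);
          }
      }
  }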

  

> -Original Message-
> From: Stu Hood [mailto:[EMAIL PROTECTED] 
> Sent: Tuesday, October 09, 2007 10:53 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Facets and running out of Heap Space
> 
> > Using the filter cache method on the things like media type and 
> > location; this will occupy ~2.3MB of memory _per unique value_
> 
> Mike, how did you calculate that value? I'm trying to tune my 
> caches, and any equations that could be used to determine 
> some balanced settings would be extremely helpful. I'm in a 
> memory limited environment, so I can't afford to throw a ton 
> of cache at the problem.
> 
> (I don't want to thread-jack, but I'm also wondering whether 
> anyone has any notes on how to tune cache sizes for the 
> filterCache, queryResultCache and documentCache).
> 
> Thanks,
> Stu
> 
> 
> -Original Message-
> From: Mike Klaas <[EMAIL PROTECTED]>
> Sent: Tuesday, October 9, 2007 9:30pm
> To: solr-user@lucene.apache.org
> Subject: Re: Facets and running out of Heap Space
> 
> On 9-Oct-07, at 12:36 PM, David Whalen wrote:
> 
> >(snip)
> > I'm sure we could stop storing many of these columns, 
> especially  if 
> >someone told me that would make a big difference.
> 
> I don't think that it would make a difference in memory 
> consumption, but storage is certainly not necessary for 
> faceting.  Extra stored fields can slow down search if they 
> are large (in terms of bytes), but don't really occupy extra 
> memory, unless they are polluting the doc cache.  Does 'text' 
> need to be stored?
> >
> >> what does the LukeReqeust Handler tell you about the # of distinct 
> >> terms in each field that you facet on?
> >
> > Where would I find that?  I could probably estimate that 
> myself on a 
> > per-column basis.  it ranges from 4 distinct values for 
> media_type to 
> > 30-ish for location to 200-ish for country_code to almost 
> 10,000 for 
> > site_id to almost 100,000 for journalist_id.
> 
> Using the filter cache method on the things like media type 
> and location; this will occupy ~2.3MB of memory _per unique 
> value_, so it should be a net win for those (although quite 
> close in space requirements for a 30-ary field on your index size).
> 
> -Mike
> 
> 


start tag not allowed in epilog

2007-10-10 Thread BrendanD

Hi,

I've got an xml update document that I'm sending to solr's update handler
with deletes and adds in it. For example:

12345678


And I'm getting the following exception in the catalina.out log:

Oct 10, 2007 12:58:22 PM org.apache.solr.common.SolrException log
SEVERE: javax.xml.stream.XMLStreamException: ParseError at
[row,col]:[1,4003]
Message: start tag not allowed in epilog but got a
at com.bea.xml.stream.MXParser.parseEpilog(MXParser.java:2112)
at com.bea.xml.stream.MXParser.nextImpl(MXParser.java:1945)
at com.bea.xml.stream.MXParser.next(MXParser.java:1333)
at
org.apache.solr.handler.XmlUpdateRequestHandler.processUpdate(XmlUpdateRequestHandler.java:148)
at
org.apache.solr.handler.XmlUpdateRequestHandler.handleRequestBody(XmlUpdateRequestHandler.java:123)
at
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:78)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:807)
at
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:206)


Is this not allowed? I looked at the code in XmlUpdateRequestHandler.java
and it's dying on this line (148):

  int event = parser.next();

Does anyone know how to correct this? Is it not possible to have multiple
different top-level tags in the same update xml file? It seems to me like it
should work, but perhaps there's something inherently bad about this from
the XMLStreamReader's point of view.

Thanks,

Brendan



Re: start tag not allowed in epilog

2007-10-10 Thread Chris Hostetter

: Does anyone know how to correct this? Is it not possible to have multiple
: different top-level tags in the same update xml file? It seems to me like it
: should work, but perhaps there's something inherently bad about this from
: the XMLStreamReader's point of view.

it's inherently bad from the XML spec's point of view -- as in the XML 
spec says you can only have one "top level" tag per "XML document".

incidentally: what is your use case for even trying this?  you know that with 
a uniqueKey declaration you don't need to delete a doc before adding a new 
one with the same uniqueKey, right? ... you can just add them all.

we may/should allow you to specify multiple <id> or <query> blocks inside 
of a delete, but i don't imagine anyone is planning on adding syntax to 
support <add> and <delete> ops in the same update command ... they are 
radically different.


-Hoss



Re: start tag not allowed in epilog

2007-10-10 Thread BrendanD

We simply process a queue of updates from a database table. Some of the
updates are deletes, some are adds. Sometimes you can have many deletes in a
row, sometimes many adds in a row, and sometimes a mixture of deletes and
adds. We're trying to batch our updates and were hoping to avoid having to
manage separate files for adds and deletes.

Perhaps a single top-level wrapper tag could contain
deletes and adds in the same document?

Thanks,

Brendan


hossman wrote:
> 
> 
> it's inherently bad from the XML spec's point of view -- as in the XML 
> spec says you can only have one "top level" tag per "XML document".
> 
> incidently: what is your use case for even trying this?  you kow that with 
> a uniqueKey declaration you don't need to delete a doc before adding a new 
> one withthe same uniqueKey right? ... you cna just add them all.
> 
> we may/should allow you to specify multiple  or  blocks inside 
> of a delete, but i don't imagine anyone is plannining on adding syntax to 
> support  and  ops in the same update command ... they are 
> radically different. 
> 
> 
> -Hoss
> 
> 
> 




Re: WebException (ServerProtocolViolation) with SolrSharp

2007-10-10 Thread Jeff Rodenburg
Hi Filipe -

The issue you're encountering is a problem with the data format being passed
to the solr server.  If you follow the stack trace that you posted, you'll
notice that the solr field is looking for a value that's a float, but the
passed value is "1,234".

I'm guessing this is caused by one of two possibilities:

(1) there's a typo in your example code, where "1,234" should actually be "
1.234", or
(2) there's a culture settings difference on your server that's converting "
1.234" to "1,234"

Assuming it's the latter, add this line in the ExampleIndexDocument
constructor:

CultureInfo MyCulture = new CultureInfo("en-US");

Please let me know if this fixes the issue; I've been looking at this
previously and would like to confirm it.

thanks,
jeff r.


On 10/10/07, Filipe Correia <[EMAIL PROTECTED]> wrote:
>
> Hello,
>
> I am trying to run SolrSharp's example application but am getting a
> WebException with a ServerProtocolViolation status message.
>
> After some debugging I found out this is happening with a call to:
> http://localhost:8080/solr/update/
>
> And using fiddler[1] found out that solr is actually throwing the
> following exception:
> org.apache.solr.core.SolrException: Error while creating field
> 'weight{type=sfloat,properties=indexed,stored,omitNorms,sortMissingLast}'
> from value '1,234'
> at org.apache.solr.schema.FieldType.createField(FieldType.java
> :173)
> at org.apache.solr.schema.SchemaField.createField(SchemaField.java
> :94)
> at org.apache.solr.update.DocumentBuilder.addSingleField(
> DocumentBuilder.java:57)
> at org.apache.solr.update.DocumentBuilder.addField(
> DocumentBuilder.java:73)
> at org.apache.solr.update.DocumentBuilder.addField(
> DocumentBuilder.java:83)
> at org.apache.solr.update.DocumentBuilder.addField(
> DocumentBuilder.java:77)
> at org.apache.solr.handler.XmlUpdateRequestHandler.readDoc(
> XmlUpdateRequestHandler.java:339)
> at org.apache.solr.handler.XmlUpdateRequestHandler.update(
> XmlUpdateRequestHandler.java:162)
> at
> org.apache.solr.handler.XmlUpdateRequestHandler.handleRequestBody(
> XmlUpdateRequestHandler.java:84)
> at org.apache.solr.handler.RequestHandlerBase.handleRequest(
> RequestHandlerBase.java:77)
> at org.apache.solr.core.SolrCore.execute(SolrCore.java:658)
> at org.apache.solr.servlet.SolrDispatchFilter.execute(
> SolrDispatchFilter.java:191)
> at org.apache.solr.servlet.SolrDispatchFilter.doFilter(
> SolrDispatchFilter.java:159)
> at
> org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(
> ApplicationFilterChain.java:235)
> at org.apache.catalina.core.ApplicationFilterChain.doFilter(
> ApplicationFilterChain.java:206)
> at org.apache.catalina.core.StandardWrapperValve.invoke(
> StandardWrapperValve.java:233)
> at org.apache.catalina.core.StandardContextValve.invoke(
> StandardContextValve.java:175)
> at org.apache.catalina.core.StandardHostValve.invoke(
> StandardHostValve.java:128)
> at org.apache.catalina.valves.ErrorReportValve.invoke(
> ErrorReportValve.java:102)
> at org.apache.catalina.core.StandardEngineValve.invoke(
> StandardEngineValve.java:109)
> at org.apache.catalina.connector.CoyoteAdapter.service(
> CoyoteAdapter.java:263)
> at org.apache.coyote.http11.Http11Processor.process(
> Http11Processor.java:844)
> at
> org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(
> Http11Protocol.java:584)
> at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(
> JIoEndpoint.java:447)
> at java.lang.Thread.run(Unknown Source)
> Caused by: java.lang.NumberFormatException: For input string:
> "1,234"
> at sun.misc.FloatingDecimal.readJavaFormatString(Unknown Source)
> at java.lang.Float.parseFloat(Unknown Source)
> at org.apache.solr.util.NumberUtils.float2sortableStr(
> NumberUtils.java:80)
> at org.apache.solr.schema.SortableFloatField.toInternal(
> SortableFloatField.java:50)
> at org.apache.solr.schema.FieldType.createField(FieldType.java
> :171)
> ... 24 more
> type Status report
> message Error while creating field
> 'weight{type=sfloat,properties=indexed,stored,omitNorms,sortMissingLast}'
> from value '1,234'
>
> I am just starting to try Solr, and might be missing some
> configurations, but I have no clue where to begin to investigate this
> further without digging into Solr's source, which I would really like
> to avoid for now. Any thoughts?
>
> thank you in advance,
> Filipe Correia
>
> [1] http://www.fiddlertool.com/
>


quick allowDups questions

2007-10-10 Thread Charlie Jackson
Normally this is the type of thing I'd just scour through the online
docs or the source code for, but I'm under the gun a bit. 

 

Anyway, I need to update some docs in my index because my client program
wasn't accurately putting these docs in (values for one of the fields
were missing). I'm hoping I won't have to write additional code to go
through and delete each existing doc before I add the new one, and I
think setting allowDups on the add command to false will allow me to do
this. I seem to recall something in the update handler code that goes
through and deletes all but the last copy of the doc if allowDups is
false - does that sound accurate?

 

If so, I just need to make sure that solrj properly sets that flag,
which leads me to my next question. Does solrj default allowDups to
false? If not, what do I need to do to make sure allowDups is set to
false when I'm adding these docs? 



Re: Facets and running out of Heap Space

2007-10-10 Thread Mike Klaas

On 10-Oct-07, at 12:19 PM, David Whalen wrote:


It looks now like I can't use facets the way I was hoping
to because the memory requirements are impractical.


I can't remember if this has been mentioned, but upping the  
HashDocSet size is one way to reduce memory consumption.  Whether  
this will work well depends greatly on the cardinality of your facet  
sets.  facet.enum.cache.minDf set high is another option (will not  
generate a bitset for any value whose facet set is less that this  
value).


Both options have performance implications.


So, as an alternative I was thinking I could get counts
by doing rows=0 and using filter queries.

Is there a reason to think that this might perform better?
Or, am I simply moving the problem to another step in the
process?


Running one query per unique facet value seems impractical, if that  
is what you are suggesting.  Setting minDf to a very high value  
should always outperform such an approach.


-Mike


DW




-Original Message-
From: Stu Hood [mailto:[EMAIL PROTECTED]
Sent: Tuesday, October 09, 2007 10:53 PM
To: solr-user@lucene.apache.org
Subject: Re: Facets and running out of Heap Space


Using the filter cache method on the things like media type and
location; this will occupy ~2.3MB of memory _per unique value_


Mike, how did you calculate that value? I'm trying to tune my
caches, and any equations that could be used to determine
some balanced settings would be extremely helpful. I'm in a
memory limited environment, so I can't afford to throw a ton
of cache at the problem.

(I don't want to thread-jack, but I'm also wondering whether
anyone has any notes on how to tune cache sizes for the
filterCache, queryResultCache and documentCache).

Thanks,
Stu


-Original Message-
From: Mike Klaas <[EMAIL PROTECTED]>
Sent: Tuesday, October 9, 2007 9:30pm
To: solr-user@lucene.apache.org
Subject: Re: Facets and running out of Heap Space

On 9-Oct-07, at 12:36 PM, David Whalen wrote:


(snip)
I'm sure we could stop storing many of these columns,

especially  if

someone told me that would make a big difference.


I don't think that it would make a difference in memory
consumption, but storage is certainly not necessary for
faceting.  Extra stored fields can slow down search if they
are large (in terms of bytes), but don't really occupy extra
memory, unless they are polluting the doc cache.  Does 'text'
need to be stored?



what does the LukeReqeust Handler tell you about the # of distinct
terms in each field that you facet on?


Where would I find that?  I could probably estimate that

myself on a

per-column basis.  it ranges from 4 distinct values for

media_type to

30-ish for location to 200-ish for country_code to almost

10,000 for

site_id to almost 100,000 for journalist_id.


Using the filter cache method on the things like media type
and location; this will occupy ~2.3MB of memory _per unique
value_, so it should be a net win for those (although quite
close in space requirements for a 30-ary field on your index size).

-Mike






Re: quick allowDups questions

2007-10-10 Thread Mike Klaas

On 10-Oct-07, at 1:11 PM, Charlie Jackson wrote:

Anyway, I need to update some docs in my index because my client  
program

wasn't accurately putting these docs in (values for one of the fields
was missing). I'm hoping I won't have to write additional code to go
through and delete each existing doc before I add the new one, and I
think setting allowDups on the add command to false will allow me  
to do

this. I seem to recall something in the update handler code that goes
through and deletes all but the last copy of the doc if allowDups is
false - does that sound accurate?


Yes.  But you need to define a uniqueKey in schema and make sure it  
is the same for docs you want overwritten.  This is how solr detects  
"dups".




If so, I just need to make sure that solrj properly sets that flag,
which leads me to my next question. Does solrj default allowDups to
false? If not, what do I need to do to make sure allowDups is set to
false when I'm adding these docs?


It is the normal mode of operation for Solr, so I'd be surprised if  
it wasn't the default in solrj (but I don't actually know).


-Mike


Re: start tag not allowed in epilog

2007-10-10 Thread Mike Klaas

On 10-Oct-07, at 12:49 PM, BrendanD wrote:



We simply process a queue of updates from a database table. Some of  
the
updates are deletes, some are adds. Sometimes you can have many  
deletes in a
row, sometimes many adds in a row, and sometimes a mixture of  
deletes and
adds. We're trying to batch our updates and were hoping to avoid  
having to

manage separate files for adds and deletes.

Perhaps a single top-level tag e.g.  could  
contain

deletes and adds in the same document?


This would be very complicated from a standpoint of returning errors  
to the client.


Keep in mind the <delete> can never be batched, regardless.  The 
only command that supports batching is <add> (and it is 1 <add> 
with multiple <doc>s, not multiple <add>s).


If you keep a persistent connection open to solr, I don't see why one  
command per request should be limiting.


Note also that you can batch on your end.  If the deletes are doc 
ids, then you can collect a bunch at once and do
<delete><query>id:xxx id:yyy id:zzz id:aaa id:bbb</query></delete> to 
perform them all at once.
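
Through solrj the same batched delete is a single call; a sketch (the ids
are placeholders):

  import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

  // Collect the queued ids and delete them with one delete-by-query call.
  public class BatchDelete {
      public static void main(String[] args) throws Exception {
          CommonsHttpSolrServer server =
              new CommonsHttpSolrServer("http://localhost:8983/solr");
          server.deleteByQuery("id:xxx id:yyy id:zzz id:aaa id:bbb");
          server.commit();
      }
  }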


-Mike


RE: quick allowDups questions

2007-10-10 Thread Charlie Jackson
Thanks for the response, Mike. A quick test using the example app
confirms your statement. 

As for Solrj, you're probably right, but I'm not going to take any
chances for the time being. The server.add method has an optional
Boolean flag named "overwrite" that defaults to true. Without knowing
for sure what it does, I'm not going to mess with it. 

For the purposes of my problem, I've got an upper and lower bound of
affected docs, so I'm just going to delete them all and then initiate a
re-index of those specific ids from my source. 

Thanks again for the help!


-Original Message-
From: Mike Klaas [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, October 10, 2007 3:58 PM
To: solr-user@lucene.apache.org
Subject: Re: quick allowDups questions

On 10-Oct-07, at 1:11 PM, Charlie Jackson wrote:

> Anyway, I need to update some docs in my index because my client  
> program
> wasn't accurately putting these docs in (values for one of the fields
> was missing). I'm hoping I won't have to write additional code to go
> through and delete each existing doc before I add the new one, and I
> think setting allowDups on the add command to false will allow me  
> to do
> this. I seem to recall something in the update handler code that goes
> through and deletes all but the last copy of the doc if allowDups is
> false - does that sound accurate?

Yes.  But you need to define a uniqueKey in schema and make sure it  
is the same for docs you want overwritten.  This is how solr detects  
"dups".

>
> If so, I just need to make sure that solrj properly sets that flag,
> which leads me to my next question. Does solrj default allowDups to
> false? If not, what do I need to do to make sure allowDups is set to
> false when I'm adding these docs?

It is the normal mode of operation for Solr, so I'd be surprised if  
it wasn't the default in solrj (but I don't actually know).

-Mike


RE: Facets and running out of Heap Space

2007-10-10 Thread David Whalen
According to Yonik I can't use minDf because I'm faceting
on a string field.  I'm thinking of changing it to a tokenized
type so that I can utilize this setting, but then I'll have to
rebuild my entire index.

Unless there's some way around that?


  

> -Original Message-
> From: Mike Klaas [mailto:[EMAIL PROTECTED] 
> Sent: Wednesday, October 10, 2007 4:56 PM
> To: solr-user@lucene.apache.org
> Cc: stuhood
> Subject: Re: Facets and running out of Heap Space
> 
> On 10-Oct-07, at 12:19 PM, David Whalen wrote:
> 
> > It looks now like I can't use facets the way I was hoping 
> to because 
> > the memory requirements are impractical.
> 
> I can't remember if this has been mentioned, but upping the 
> HashDocSet size is one way to reduce memory consumption.  
> Whether this will work well depends greatly on the 
> cardinality of your facet sets.  facet.enum.cache.minDf set 
> high is another option (will not generate a bitset for any 
> value whose facet set is less that this value).
> 
> Both options have performance implications.
> 
> > So, as an alternative I was thinking I could get counts by doing 
> > rows=0 and using filter queries.
> >
> > Is there a reason to think that this might perform better?
> > Or, am I simply moving the problem to another step in the process?
> 
> Running one query per unique facet value seems impractical, 
> if that is what you are suggesting.  Setting minDf to a very 
> high value should always outperform such an approach.
> 
> -Mike
> 
> > DW
> >
> >
> >
> >> -Original Message-
> >> From: Stu Hood [mailto:[EMAIL PROTECTED]
> >> Sent: Tuesday, October 09, 2007 10:53 PM
> >> To: solr-user@lucene.apache.org
> >> Subject: Re: Facets and running out of Heap Space
> >>
> >>> Using the filter cache method on the things like media type and 
> >>> location; this will occupy ~2.3MB of memory _per unique value_
> >>
> >> Mike, how did you calculate that value? I'm trying to tune 
> my caches, 
> >> and any equations that could be used to determine some balanced 
> >> settings would be extremely helpful. I'm in a memory limited 
> >> environment, so I can't afford to throw a ton of cache at the 
> >> problem.
> >>
> >> (I don't want to thread-jack, but I'm also wondering 
> whether anyone 
> >> has any notes on how to tune cache sizes for the filterCache, 
> >> queryResultCache and documentCache).
> >>
> >> Thanks,
> >> Stu
> >>
> >>
> >> -Original Message-
> >> From: Mike Klaas <[EMAIL PROTECTED]>
> >> Sent: Tuesday, October 9, 2007 9:30pm
> >> To: solr-user@lucene.apache.org
> >> Subject: Re: Facets and running out of Heap Space
> >>
> >> On 9-Oct-07, at 12:36 PM, David Whalen wrote:
> >>
> >>> (snip)
> >>> I'm sure we could stop storing many of these columns,
> >> especially  if
> >>> someone told me that would make a big difference.
> >>
> >> I don't think that it would make a difference in memory 
> consumption, 
> >> but storage is certainly not necessary for faceting.  Extra stored 
> >> fields can slow down search if they are large (in terms of bytes), 
> >> but don't really occupy extra memory, unless they are 
> polluting the 
> >> doc cache.  Does 'text'
> >> need to be stored?
> >>>
>  what does the LukeRequest Handler tell you about the # 
> of distinct 
>  terms in each field that you facet on?
> >>>
> >>> Where would I find that?  I could probably estimate that
> >> myself on a
> >>> per-column basis.  it ranges from 4 distinct values for
> >> media_type to
> >>> 30-ish for location to 200-ish for country_code to almost
> >> 10,000 for
> >>> site_id to almost 100,000 for journalist_id.
> >>
> >> Using the filter cache method on the things like media type and 
> >> location; this will occupy ~2.3MB of memory _per unique 
> value_, so it 
> >> should be a net win for those (although quite close in space 
> >> requirements for a 30-ary field on your index size).
> >>
> >> -Mike
> >>
> >>
> 
> 
> 


Re: Different search results for (german) singular/plural searches - looking for a solution

2007-10-10 Thread Daniel Naber
On Wednesday 10 October 2007 12:00, Martin Grotzke wrote:

> Basically I see two options: stemming and the usage of synonyms. Are
> there others?

A large list of German words and their forms is available from a Windows 
software called Morphy 
(http://www.wolfganglezius.de/doku.php?id=public:cl:morphy). You can use 
it for mapping fullforms to base forms (Häuser -> Haus). You can also have 
a look at www.languagetool.org which uses this data in a Java software. 
LanguageTool also comes with jWordSplitter, which can find a compound's 
parts (Autowäsche -> Auto + Wäsche).

Regards
 Daniel

-- 
http://www.danielnaber.de


Re: start tag not allowed in epilog

2007-10-10 Thread BrendanD

I've re-written the code to generate separate files: one for adds and one for
deletes. This is working well for us now. Thanks.



Mike Klaas wrote:
> 
> 
> This would be very complicated from a standpoint of returning errors  
> to the client.
> 
> Keep in mind the <delete> command can never be batched, regardless.  The  
> only command that supports batching is <add> (and it is one <add>   
> with multiple <doc>s, not multiple <add>s).
> 
> If you keep a persistent connection open to solr, I don't see why one  
> command per request should be limiting.
> 
> Note also that you can batch on your end.  If the deletes are doc  
> ids, then you can collect a bunch at once and do
> <delete><query>id:xxx id:yyy id:zzz id:aaa id:bbb</query></delete> to  
> perform them all at once.
> 
> -Mike
> 
> 
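For what it's worth, here's roughly what our batched delete-by-query now looks
like in solrj. This is only a sketch: the server URL and ids are illustrative,
and CommonsHttpSolrServer is simply the client implementation we happen to use.

import java.util.Arrays;
import java.util.List;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

public class BatchedDelete {
    public static void main(String[] args) throws Exception {
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

        // collect a batch of doc ids, then remove them in one request
        // by OR-ing them into a single delete-by-query
        List<String> ids = Arrays.asList("xxx", "yyy", "zzz", "aaa", "bbb");
        StringBuilder q = new StringBuilder();
        for (String id : ids) {
            if (q.length() > 0) q.append(" OR ");
            q.append("id:").append(id);
        }
        server.deleteByQuery(q.toString());
        server.commit();
    }
}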

-- 
View this message in context: 
http://www.nabble.com/start-tag-not-allowed-in-epilog-tf4602869.html#a13145693
Sent from the Solr - User mailing list archive at Nabble.com.



Internal Server Error and waitSearcher="false" for commit/optimize

2007-10-10 Thread Jason Rennie
Hello,

We're using solr 1.2 and a nightly build of the solrj client code.  We very
occasionally see things like this:

org.apache.solr.client.solrj.SolrServerException: Error executing query
at org.apache.solr.client.solrj.request.QueryRequest.process(
QueryRequest.java:86)
at org.apache.solr.client.solrj.impl.BaseSolrServer.query(
BaseSolrServer.java:99)
...
Caused by: org.apache.solr.common.SolrException: Internal Server Error

I'm guessing that this might be due to solr being in the middle of a commit
or optimize.  Could solr throw an exception like that in this case?

We also occasionally see solr taking too long to respond.  We currently make
our commit/optimize calls without any arguments.  I'm wondering whether
setting waitSearcher="false" might allow search queries to be served while a
commit/optimize is being run.  I found this in an old message from this
list:

> Yes, it looks like there is no difference... the code to make commit
> totally asynchronous was never put in (so you can't really get commit
> to return instantly, it will always wait until the IndexWriter is closed).


This isn't a problem for us as the thread making the commit/optimize call is
separate from thread(s) making queries.  Is waitSearcher="false" designed to
allow queries to be processed while a commit/optimize is being run?  Are
there any negative side effects to this setting (other than a query being
slightly out-of-date :)?
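For context, this is roughly what our commit thread does today versus what
we're considering. It's only a sketch: the server URL is illustrative, and I'm
assuming solrj's two-argument commit maps to (waitFlush, waitSearcher).

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

public class CommitExample {
    public static void main(String[] args) throws Exception {
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

        // what we do now: blocks until the new searcher is registered
        server.commit();

        // what we're considering: waitFlush=true, waitSearcher=false,
        // so the call should return before the new searcher is warmed/registered
        server.commit(true, false);
    }
}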

Thanks,

Jason


Re: Facets and running out of Heap Space

2007-10-10 Thread Mike Klaas

On 10-Oct-07, at 2:40 PM, David Whalen wrote:


According to Yonik I can't use minDf because I'm faceting
on a string field.  I'm thinking of changing it to a tokenized
type so that I can utilize this setting, but then I'll have to
rebuild my entire index.

Unless there's some way around that?


For the fields that matter (many unique values), this is likely to  
result in a performance regression.


It might be better to try storing less unique data.  For instance,  
faceting on the blog_url field, or create_date in your schema would  
cause problems (they probably have millions of unique values).


It would be helpful to know which field is causing the problem.  One  
way would be to do a sorted query on a quiescent index for each  
field, and see if there are any suspiciously large jumps in memory  
usage.
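Something like the following solrj loop would do it. This is just a sketch:
the URL is illustrative and the field names are the ones from the schema
posted earlier in this thread. Sorting populates the Lucene FieldCache for
each field, so watch the Solr server's heap after each query.

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

public class FieldCacheProbe {
    public static void main(String[] args) throws Exception {
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
        String[] fields = { "media_type", "location", "country_code",
                            "site_id", "journalist_id" };
        for (String field : fields) {
            SolrQuery q = new SolrQuery("*:*");
            q.setRows(1);
            q.set("sort", field + " asc");  // sorting loads the FieldCache for this field
            server.query(q);
            System.out.println("sorted on " + field
                + " -- now check heap usage on the Solr server (e.g. with jconsole)");
        }
    }
}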


-Mike






-Original Message-
From: Mike Klaas [mailto:[EMAIL PROTECTED]
Sent: Wednesday, October 10, 2007 4:56 PM
To: solr-user@lucene.apache.org
Cc: stuhood
Subject: Re: Facets and running out of Heap Space

On 10-Oct-07, at 12:19 PM, David Whalen wrote:


It looks now like I can't use facets the way I was hoping

to because

the memory requirements are impractical.


I can't remember if this has been mentioned, but upping the
HashDocSet size is one way to reduce memory consumption.
Whether this will work well depends greatly on the
cardinality of your facet sets.  facet.enum.cache.minDf set
high is another option (will not generate a bitset for any
value whose facet set is less than this value).

Both options have performance implications.


So, as an alternative I was thinking I could get counts by doing
rows=0 and using filter queries.

Is there a reason to think that this might perform better?
Or, am I simply moving the problem to another step in the process?


Running one query per unique facet value seems impractical,
if that is what you are suggesting.  Setting minDf to a very
high value should always outperform such an approach.

-Mike


DW




-Original Message-
From: Stu Hood [mailto:[EMAIL PROTECTED]
Sent: Tuesday, October 09, 2007 10:53 PM
To: solr-user@lucene.apache.org
Subject: Re: Facets and running out of Heap Space


Using the filter cache method on the things like media type and
location; this will occupy ~2.3MB of memory _per unique value_


Mike, how did you calculate that value? I'm trying to tune

my caches,

and any equations that could be used to determine some balanced
settings would be extremely helpful. I'm in a memory limited
environment, so I can't afford to throw a ton of cache at the
problem.

(I don't want to thread-jack, but I'm also wondering

whether anyone

has any notes on how to tune cache sizes for the filterCache,
queryResultCache and documentCache).

Thanks,
Stu


-Original Message-
From: Mike Klaas <[EMAIL PROTECTED]>
Sent: Tuesday, October 9, 2007 9:30pm
To: solr-user@lucene.apache.org
Subject: Re: Facets and running out of Heap Space

On 9-Oct-07, at 12:36 PM, David Whalen wrote:


(snip)
I'm sure we could stop storing many of these columns,

especially  if

someone told me that would make a big difference.


I don't think that it would make a difference in memory

consumption,

but storage is certainly not necessary for faceting.  Extra stored
fields can slow down search if they are large (in terms of bytes),
but don't really occupy extra memory, unless they are

polluting the

doc cache.  Does 'text'
need to be stored?



what does the LukeRequest Handler tell you about the #

of distinct

terms in each field that you facet on?


Where would I find that?  I could probably estimate that

myself on a

per-column basis.  it ranges from 4 distinct values for

media_type to

30-ish for location to 200-ish for country_code to almost

10,000 for

site_id to almost 100,000 for journalist_id.


Using the filter cache method on the things like media type and
location; this will occupy ~2.3MB of memory _per unique

value_, so it

should be a net win for those (although quite close in space
requirements for a 30-ary field on your index size).

-Mike










Re: quick allowDups questions

2007-10-10 Thread Ryan McKinley

the default solrj implementation should do what you need.



As for Solrj, you're probably right, but I'm not going to take any
chances for the time being. The server.add method has an optional
Boolean flag named "overwrite" that defaults to true. Without knowing
for sure what it does, I'm not going to mess with it. 



direct solr update allows a few extra fields: allowDups, 
overwritePending, and overwriteCommited -- the future of overwritePending and 
overwriteCommited is in doubt (SOLR-60), so I did not want to bake that 
into the solrj API.


internally,

 allowDups = !overwrite; (the one field you can set)
 overwritePending = !allowDups;
 overwriteCommited = !allowDups;
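so with plain solrj you normally don't touch any of this. A minimal sketch of
the default behaviour (assuming a uniqueKey field named "id" and an
illustrative server URL): adding two docs with the same id and committing
leaves a single, updated document.

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class OverwriteDefault {
    public static void main(String[] args) throws Exception {
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

        SolrInputDocument first = new SolrInputDocument();
        first.addField("id", "doc-1");
        first.addField("title", "original version");
        server.add(first);

        SolrInputDocument second = new SolrInputDocument();
        second.addField("id", "doc-1");
        second.addField("title", "replacement version");
        server.add(second);  // default: overwrite=true, i.e. allowDups=false

        server.commit();
        // only the "replacement version" doc remains for id doc-1
    }
}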


ryan


RE: Facets and running out of Heap Space

2007-10-10 Thread David Whalen
I'll see what I can do about that.

Truthfully, the most important facet we need is the one on
media_type, which has only 4 unique values.  The second
most important one to us is location, which has about 30
unique values.

So, it would seem like we actually need a counter-intuitive
solution.  That's why I thought Field Queries might be the
solution.
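To make that concrete, here's roughly the kind of thing I had in mind (just a
sketch; the media_type values and the server URL are made up):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

public class FilterQueryCounts {
    public static void main(String[] args) throws Exception {
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

        // one request per value: rows=0 so only the count comes back
        String[] mediaTypes = { "news", "blog", "forum", "video" };  // hypothetical values
        for (String mt : mediaTypes) {
            SolrQuery q = new SolrQuery("*:*");
            q.setRows(0);
            q.addFilterQuery("media_type:" + mt);
            long count = server.query(q).getResults().getNumFound();
            System.out.println(mt + " = " + count);
        }
    }
}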

Is there some reason to avoid setting multiValued to true
here?  It sounds like it would be the true cure-all

Thanks again!

dave


  

> -Original Message-
> From: Mike Klaas [mailto:[EMAIL PROTECTED] 
> Sent: Wednesday, October 10, 2007 6:20 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Facets and running out of Heap Space
> 
> On 10-Oct-07, at 2:40 PM, David Whalen wrote:
> 
> > According to Yonik I can't use minDf because I'm faceting 
> on a string 
> > field.  I'm thinking of changing it to a tokenized type so 
> that I can 
> > utilize this setting, but then I'll have to rebuild my entire index.
> >
> > Unless there's some way around that?
> 
> For the fields that matter (many unique values), this is 
> likely to result in a performance regression.
> 
> It might be better to try storing less unique data.  For 
> instance, faceting on the blog_url field, or create_date in 
> your schema would cause problems (they probably have millions 
> of unique values).
> 
> It would be helpful to know which field is causing the 
> problem.  One way would be to do a sorted query on a 
> quiescent index for each field, and see if there are any 
> suspiciously large jumps in memory usage.
> 
> -Mike
> 
> >
> >
> >
> >> -Original Message-
> >> From: Mike Klaas [mailto:[EMAIL PROTECTED]
> >> Sent: Wednesday, October 10, 2007 4:56 PM
> >> To: solr-user@lucene.apache.org
> >> Cc: stuhood
> >> Subject: Re: Facets and running out of Heap Space
> >>
> >> On 10-Oct-07, at 12:19 PM, David Whalen wrote:
> >>
> >>> It looks now like I can't use facets the way I was hoping
> >> to because
> >>> the memory requirements are impractical.
> >>
> >> I can't remember if this has been mentioned, but upping the
> >> HashDocSet size is one way to reduce memory consumption.
> >> Whether this will work well depends greatly on the
> >> cardinality of your facet sets.  facet.enum.cache.minDf set
> >> high is another option (will not generate a bitset for any
> >> value whose facet set is less than this value).
> >>
> >> Both options have performance implications.
> >>
> >>> So, as an alternative I was thinking I could get counts by doing
> >>> rows=0 and using filter queries.
> >>>
> >>> Is there a reason to think that this might perform better?
> >>> Or, am I simply moving the problem to another step in the process?
> >>
> >> Running one query per unique facet value seems impractical,
> >> if that is what you are suggesting.  Setting minDf to a very
> >> high value should always outperform such an approach.
> >>
> >> -Mike
> >>
> >>> DW
> >>>
> >>>
> >>>
>  -Original Message-
>  From: Stu Hood [mailto:[EMAIL PROTECTED]
>  Sent: Tuesday, October 09, 2007 10:53 PM
>  To: solr-user@lucene.apache.org
>  Subject: Re: Facets and running out of Heap Space
> 
> > Using the filter cache method on the things like media type and
> > location; this will occupy ~2.3MB of memory _per unique value_
> 
>  Mike, how did you calculate that value? I'm trying to tune
> >> my caches,
>  and any equations that could be used to determine some balanced
>  settings would be extremely helpful. I'm in a memory limited
>  environment, so I can't afford to throw a ton of cache at the
>  problem.
> 
>  (I don't want to thread-jack, but I'm also wondering
> >> whether anyone
>  has any notes on how to tune cache sizes for the filterCache,
>  queryResultCache and documentCache).
> 
>  Thanks,
>  Stu
> 
> 
>  -Original Message-
>  From: Mike Klaas <[EMAIL PROTECTED]>
>  Sent: Tuesday, October 9, 2007 9:30pm
>  To: solr-user@lucene.apache.org
>  Subject: Re: Facets and running out of Heap Space
> 
>  On 9-Oct-07, at 12:36 PM, David Whalen wrote:
> 
> > (snip)
> > I'm sure we could stop storing many of these columns,
>  especially  if
> > someone told me that would make a big difference.
> 
>  I don't think that it would make a difference in memory
> >> consumption,
>  but storage is certainly not necessary for faceting.  
> Extra stored
>  fields can slow down search if they are large (in terms 
> of bytes),
>  but don't really occupy extra memory, unless they are
> >> polluting the
>  doc cache.  Does 'text'
>  need to be stored?
> >
> >> what does the LukeRequest Handler tell you about the #
> >> of distinct
> >> terms in each field that you facet on?
> >
> > Where would I find that?  I could probably estimate that
>  myself on a
> > per-column basis.  it ranges from 4 distinct values for
> 

Re: Facets and running out of Heap Space

2007-10-10 Thread Mike Klaas

On 10-Oct-07, at 3:46 PM, David Whalen wrote:


I'll see what I can do about that.

Truthfully, the most important facet we need is the one on
media_type, which has only 4 unique values.  The second
most important one to us is location, which has about 30
unique values.

So, it would seem like we actually need a counter-intuitive
solution.  That's why I thought Field Queries might be the
solution.

Is there some reason to avoid setting multiValued to true
here?  It sounds like it would be the true cure-all


Should work.  It would cost about 100 MB on a 25m corpus for those  
two fields.


Have you tried setting multivalued=true without reindexing?  I'm not  
sure, but I think it will work.


-Mike





Re: [ADMIN] - Spam problems?

2007-10-10 Thread Chris Hostetter

: Around Sept. 20 I started getting Japanese spam to this account. This is
: a special  account I only use for the Solr and Lucene user mailing
: lists. Did anybody else get these, starting around 9/20?

Note that many mailing list archives leave the sender emails in plain text 
(which results in easy harvesting by spam bots).  Even the archives that 
strip or obfuscate email addresses in headers frequently do nothing about 
email addresses in the body of the message.  ie: when someone replies to a 
post and their mail client does something like...

>> On Wed, 10 Oct 2007, Norskog, Lance <[EMAIL PROTECTED]> wrote:

or...

>> Date: Wed, 10 Oct 2007 15:50:34 -0400
>> From: "Norskog, Lance" <[EMAIL PROTECTED]>

one solution is to configure your spam filter to automatically reject mail 
where your address is in a recipient header (as opposed to the list 
addresses) ... the trade-off being you'll never see private 
replies, but hey: this is open source, everything should be discussed in 
the open :)


-Hoss



Syntax for newSearcher query

2007-10-10 Thread BrendanD

Hi,

The examples that I've found in the solrconfig.xml file and on this site are
fairly basic for pre-warming specific queries. I have some rather complex
looking queries that I'm not quite sure how to specify in my solrconfig.xml
file in the newSearcher section.

Here's an example of 3 queries that I'd like to pre-warm. The category ids
will change with each query (there are 977 different category_ids):

rows=20&start=0&facet.query=attribute_id:1003278&facet.query=attribute_id:1003928&sort=merchant_count+desc&facet=true&facet.field=min_price_cad_rounded_to_tens&facet.field=manufacturer_id&facet.field=merchant_id&facet.field=has_coupon&facet.field=has_bundle&facet.field=has_sale_price&facet.field=has_promo&fq=product_is_active:true&fq=product_status_code:complete&fq=category_id:"1001143"&qt=sti_dismax_en&f.min_price_cad_rounded_to_tens.facet.limit=-1

rows=0&start=0&sort=merchant_count+desc&f.attribute_id_decimal_value_pair.facet.limit=-1&facet=true&facet.field=attribute_id_decimal_value_pair&fq=product_is_active:true&fq=product_status_code:complete&fq=category_id:"1001143"&qt=sti_dismax_en&f.attribute_id_decimal_value_pair.facet.prefix=1003278

rows=0&start=0&sort=merchant_count+desc&f.attribute_id_value_en_pair.facet.prefix=1003928&facet=true&f.attribute_id_value_en_pair.facet.limit=-1&facet.field=attribute_id_value_en_pair&fq=product_is_active:true&fq=product_status_code:complete&fq=category_id:"1001143"&qt=sti_dismax_en


I'm not sure if it's necessary to have all those parameters in my query for
pre-warming, but those are just the queries I see in my catalina.out file
when the user clicks on a specific category. I'd like to pre-warm the first
page of results from all of my categories.

Thanks,

Brendan



-- 
View this message in context: 
http://www.nabble.com/Syntax-for-newSearcher-query-tf4604487.html#a13147569
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Syntax for newSearcher query

2007-10-10 Thread Chris Hostetter

: looking queries that I'm not quite sure how to specify in my solrconfig.xml
: file in the newSearcher section.

: 
rows=20&start=0&facet.query=attribute_id:1003278&facet.query=attribute_id:1003928&sort=merchant_count+desc&facet=true&facet.field=min_price_cad_rounded_to_tens&facet.field=manufacturer_id&facet.field=merchant_id&facet.field=has_coupon&facet.field=has_bundle&facet.field=has_sale_price&facet.field=has_promo&fq=product_is_active:true&fq=product_status_code:complete&fq=category_id:"1001143"&qt=sti_dismax_en&f.min_price_cad_rounded_to_tens.facet.limit=-1

all you have to do is put each key=val pair as a <str name="key">val</str> 

it doesn't matter what the param is, or if it's a param that has multiple 
values, just list each of them the same way...


  <arr name="queries">
    <lst>
      <str name="rows">20</str> 
      <str name="start">0</str> 
      <str name="facet.query">attribute_id:1003278</str> 
      <str name="facet.query">attribute_id:1003928</str> 
      ...
    </lst>
    <lst>
      ...
    </lst>
  </arr>

-Hoss



Re: Syntax for newSearcher query

2007-10-10 Thread BrendanD

Awesome! Thanks!


hossman wrote:
> 
> 
> : looking queries that I'm not quite sure how to specify in my
> solrconfig.xml
> : file in the newSearcher section.
> 
> :
> rows=20&start=0&facet.query=attribute_id:1003278&facet.query=attribute_id:1003928&sort=merchant_count+desc&facet=true&facet.field=min_price_cad_rounded_to_tens&facet.field=manufacturer_id&facet.field=merchant_id&facet.field=has_coupon&facet.field=has_bundle&facet.field=has_sale_price&facet.field=has_promo&fq=product_is_active:true&fq=product_status_code:complete&fq=category_id:"1001143"&qt=sti_dismax_en&f.min_price_cad_rounded_to_tens.facet.limit=-1
> 
> all you have to do is put each key=val pair as a <str name="key">val</str> 
> 
> it doesn't matter what the param is, or if it's a param that has multiple 
> values, just list each of them the same way...
> 
> 
>   <arr name="queries">
>     <lst>
>       <str name="rows">20</str> 
>       <str name="start">0</str> 
>       <str name="facet.query">attribute_id:1003278</str> 
>       <str name="facet.query">attribute_id:1003928</str> 
>       ...
>     </lst>
>     <lst>
>       ...
>     </lst>
>   </arr>
> 
> -Hoss
> 
> 
> 

-- 
View this message in context: 
http://www.nabble.com/Syntax-for-newSearcher-query-tf4604487.html#a13148914
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Facets and running out of Heap Space

2007-10-10 Thread Yonik Seeley
On 10/10/07, Mike Klaas <[EMAIL PROTECTED]> wrote:
> Have you tried setting multivalued=true without reindexing?  I'm not
> sure, but I think it will work.

Yes, that will work fine.
One thing that will change is the response format for stored fields:
<arr><str>val1</str></arr>
instead of
<str>val1</str>
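On the client side that just means reading the field as a collection; a rough
solrj sketch (the field name is illustrative, and the exact convenience
methods depend on your solrj version):

import java.util.Collection;
import org.apache.solr.common.SolrDocument;

public class MultiValuedRead {
    // doc would come from queryResponse.getResults().get(i)
    static void printMediaType(SolrDocument doc) {
        // with multiValued=true the stored value comes back as a collection...
        Collection<Object> all = doc.getFieldValues("media_type");
        // ...but getFirstValue is handy if you only ever store one value per doc
        Object first = doc.getFirstValue("media_type");
        System.out.println(all + " / " + first);
    }
}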

Hopefully in the future we can specify a faceting method w/o having to
change the schema.

-Yonik


Re: Spell Check Handler

2007-10-10 Thread scott.tabar
Hoss,

I had a feeling someone would be quoting Yonik's Law of Patches!  ;-)

For now, this is done.

I created the changes, created JavaDoc comments on the various settings 
and their expected output, created a JUnit test for the 
SpellCheckerRequestHandler
which tests various components of the handler, and I also created the
supporting configuration files for the JUnit tests (schema and solrconfig 
files).

I attached the patch to the JIRA issue so now we just have to wait until it gets
added back into the main code stream.

For anyone who is interested, here is a link to the JIRA:
https://issues.apache.org/jira/browse/SOLR-375

Could someone please drop me a hint on how to update the wiki or any other 
documentation that could benefit from being updated; I'd like to help out as much
as possible, but first I need to know "how". ;-)

When these changes do get committed back into the daily build, please
review the generated JavaDoc for information on how to utilize these new 
features.
If anyone has any questions, or comments, please do not hesitate to ask.

As a general note of self-critique on these changes, I am not 100% sure of 
the way I implemented the "nested" structure when the "multiWords" parameter 
is used.  My interest is that it should work smoothly with other technology 
such as Prototype using the JSON output type.  Unfortunately, I will not get 
a chance to start on that coding until next week, so it is up in the air as 
to whether this structure will be conducive or not.  I am planning on 
providing more details in the documentation on how to utilize these 
modifications in Prototype and Ajax when I get a chance (and even provide 
links to a production site so you can see it in action and view the source 
if interested).  So stay tuned... 

   Thanks for everyones time,
  Scott Tabar

 Chris Hostetter <[EMAIL PROTECTED]> wrote: 

: If you like, I can post the source code changes that I made to the 
: SpellCheckerRequestHandler, but at this time I am not ready to open a 
: JIRA issue and submit the changes back through the subversion.  I will 
: need to do a little more testing, documentation, and create some unit 
: tests to cover all of these changes, but what I have been able to 
: perform, it is working very well.

Keep in mind "Yonik's Law Of Patches" ...

"A half-baked patch in Jira, with no documentation, no tests 
and no backwards compatibility is better than no patch at all."
http://wiki.apache.org/solr/HowToContribute

...even if you don't think the code is "solid" yet, if you want to 
eventually make it available to people, making a "rough" version available 
to people early gives other people the opportunity to help you make it 
solid (by writing unit tests, fixing bugs, and adding documentation).


-Hoss