Re: Collection Distribution in Windows

2007-05-03 Thread Maarten . De . Vilder
damn, there goes the platform independence ...

is there anybody with a little more experience when it comes to collection 
distribution on Windows?

tnx in advance!





"Bill Au" <[EMAIL PROTECTED]> 
wrote on 02/05/2007 15:09:

The collection distribution scripts rely on hard links and rsync.  It
seems that both may be available on Windows:

hard links:
http://www.microsoft.com/resources/documentation/windows/xp/all/proddocs/en-us/fsutil_hardlink.mspx?mfr=true


rsync:
http://samba.anu.edu.au/rsync/download.html

I say maybe because I don't know if hard links on Windows work the same way
as hard links on Linux/Unix.

You will also need something like cygwin to run the bash scripts.
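
One quick way to check how hard links behave on a given platform is a small experiment like the sketch below. It is not part of the distribution scripts (those are bash); the function name and the fake segment file are made up for illustration, and it only shows the general idea of linking every file of a directory into a snapshot directory.

```python
import os
import tempfile


def snapshot_with_hard_links(index_dir, snapshot_dir):
    """Hard-link every regular file in index_dir into a new snapshot_dir.

    Hypothetical helper for experimentation only; the real scripts are bash.
    """
    os.makedirs(snapshot_dir)
    for name in sorted(os.listdir(index_dir)):
        source = os.path.join(index_dir, name)
        if os.path.isfile(source):
            # os.link creates a second directory entry for the same file data.
            os.link(source, os.path.join(snapshot_dir, name))
    return snapshot_dir


# Demonstration on a throwaway directory with one fake index file.
with tempfile.TemporaryDirectory() as root:
    index_dir = os.path.join(root, "index")
    os.makedirs(index_dir)
    segment = os.path.join(index_dir, "segment.bin")
    with open(segment, "wb") as handle:
        handle.write(b"fake index data")

    snapshot_dir = snapshot_with_hard_links(
        index_dir, os.path.join(root, "snapshot.1"))
    linked = os.path.join(snapshot_dir, "segment.bin")

    # Both names should refer to the same underlying file.
    same_file = os.path.samefile(segment, linked)
    link_count = os.stat(segment).st_nlink
```

If `same_file` is true and `link_count` is 2 on the target file system, hard links there behave the way the snapshot approach expects, at least for this simple case.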

Bill

On 5/2/07, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:
>
> i know this is a stupid question, but are there any collection
> distribution scripts for windows available ?
>
> thanks !



Look ahead queries

2007-05-03 Thread Ge, Yao \(Y.\)
I am planning to develop look-ahead queries with Solr so that as the user
types query terms, a list of related terms is shown in a popup window
(similar to Google Suggest). It will make small AJAX-style calls to Solr
with wildcards. So if the user types "fuel", a look-ahead query will be sent
to Solr in the form "fuel *". The user will end up seeing relevant terms like
"fuel consumption", "fuel leaks", "fuel tank", etc. In this
case, I will likely limit queries to certain fields only, and some
post-processing is required to get a final list of suggestions. Let me
know if someone has already done this and whether there are better ways or
suggestions to accomplish this. I figured Solr's caching will make this
type of application more efficient than a straight Lucene integration.

Thanks.

-Yao


Using fields to create new queries

2007-05-03 Thread r tompkins

Hello all,
I am trying to reuse fields to create new search queries.

Let me explain

when Solr is processing the fields to figure out where to put the doc in the
tree, it processes the indexable fields and removes the stop words. This
stripped-down set of keywords is passed to some other function which places
it in the correct part of the search tree.

I would like to have solr sort those keywords (i.e. with stop words removed)
and store them as another field (keywords).

This will allow me to use those keywords to find other documents related to
that document.

Is this possible, and if so, where should I start?

Thanks in advance


Snippet Generation at Punctuation Marks?

2007-05-03 Thread Jack L
Snippet generation uses hl.fragsize to determine the size
of the snippets. This works very well. However, the snippets
often have half of a sentence at the beginning, and half
at the end. Is there a parameter I can use to tell the
snippet generation code to cut at punctuation marks when
possible?

-- 
Best regards,
Jack



Re: Snippet Generation at Punctuation Marks?

2007-05-03 Thread Brian Whitman

On May 3, 2007, at 11:39 AM, Jack L wrote:

Snippet generation use hl.fragsize to determine the size
of the snippets. This works very well. However, the snippets
often have half of a sentence at the beginning, and half
at the end. Is there a parameter I can use to tell the
snippet generation code to cut at punctuation marks when
possible?



We are working on this and hope to have a solr patch soon. Doing  
simple splitting on punctuation is a new fragmenter, which trunk solr  
does not support yet. But we're hoping to fix that asap.


-brian


Index corruptions?

2007-05-03 Thread Charlie Jackson
I have a couple of questions regarding index corruptions. 

 

1) Has anyone using Solr in a production environment ever experienced an
index corruption? If so, how frequently do they occur?

 

2) It seems like the CollectionDistribution setup would be a good way to
put in place a recovery plan for (or at least have some viable backups
of) the index. However, I have a small concern that if the index gets
corrupted on the master server, the corruption would propagate down to
the slave servers as well. Is this concern unfounded? Also, each of the
snapshots taken by snapshooter is a viable full index, correct? If so,
that means I'd have a backup of the index each and every time a commit
(or optimize for that matter) is done, which would be awesome.

 

One of our biggest requirements for the indexing process is to have a
good backup/recover strategy in place and I want to make sure Solr will
be able to provide that. 

 

Thanks in advance!

 

Charlie



Re: Index corruptions?

2007-05-03 Thread Bill Au

In addition to snapshots, you can also make backup copies of your Solr
index using the backup script.
Backups are created the same way as snapshots, using hard links.  Each one is
a viable full index.

Bill





Re: Searchproblem composite words

2007-05-03 Thread Walter Underwood
I agree that multi-word synonyms are an excellent way to do this.

This may sound like a hack, but you'd end up doing this even if
you had dedicated linguistic compound decomposition software.
Those usually use a dictionary of common words and the dictionary
rarely has all the words that are important for your site.

I'll be doing this for my site to handle things like "dreamgirls"
and "dream girls".

wunder

On 5/2/07 11:58 AM, "Chris Hostetter" <[EMAIL PROTECTED]> wrote:

> 
> : For example I have the composite word "wishlist" in my document. I can
> : easily find the document by using the search string "wishlist" or "wish*"
> : but I don't get any result with "list".
> 
> what you are describing is basically a substring search problem ...
> sometimes this can be dealt with by using something like the
> WordDelimiterFilter -- but only if people are using "WishList" in their
> documents.
> 
> Another approach would be to use an NGram based tokenizer (built in
> support for this will probably be added soon) but then searches for things
> like "able" will match words like "cable" ... which may not be what you
> want (yes it is a substring, but it is not what anyone would consider a
> "composite word")
> 
> the best way to match what you want extremely accurately would be to use
> the SynonymFilter and enumerate every composite word you care about in the
> Synonym list ... tedious yes, but also very accurate.
> 
> -Hoss




Re[2]: Snippet Generation at Punctuation Marks?

2007-05-03 Thread Jack L
Thanks. Looking forward to it!

> We are working on this and hope to have a solr patch soon. Doing  
> simple splitting on punctuation is a new fragmenter, which trunk solr
> does not support yet. But we're hoping to fix that asap.

> -brian



Phrase Query fetch no results - verify

2007-05-03 Thread solruser

Hi Everyone,

A question on phrase queries. Will a phrase query return only the
documents in which all terms match?
To illustrate, take a phrase query such as "How do I look for supply".
The result I expect from a phrase query with a valid slop is the
documents with an exact match first, followed by the remaining documents
that match any or a few of the terms in the phrase. Here are the terms with
their number of occurrences in the index:

1. How - 2000
2. do - 1500
3. I - 3000
4. look - 100
5. for - 1400
6. supply - 10

When I query the index with the same phrase and a slop, there are no
results. The schema for the field uses:

Tokenizer - WhitespaceTokenizerFactory
Filter - StopFilterFactory
Filter - WordDelimiterFilterFactory
Filter - LowerCaseFilterFactory
Filter - EnglishPorterFilterFactory
Filter - RemoveDuplicatesTokenFilterFactory

The way I need the results is:
exact matches first, then documents matching the rarest
terms in the index, followed by any other documents.

So I wonder whether Solr has a default analyzer for such a
scenario, or whether this is a specific requirement that calls for a custom
analyzer. If the latter, I would appreciate it if someone could share a code
snippet for writing a custom analyzer.

Thanks,
Amit




Re: Snippet Generation at Punctuation Marks?

2007-05-03 Thread Mike Klaas

On 5/3/07, Brian Whitman <[EMAIL PROTECTED]> wrote:

On May 3, 2007, at 11:39 AM, Jack L wrote:
> Snippet generation use hl.fragsize to determine the size
> of the snippets. This works very well. However, the snippets
> often have half of a sentence at the beginning, and half
> at the end. Is there a parameter I can use to tell the
> snippet generation code to cut at punctuation marks when
> possible?


We are working on this and hope to have a solr patch soon. Doing
simple splitting on punctuation is a new fragmenter, which trunk solr
does not support yet. But we're hoping to fix that asap.


See http://issues.apache.org/jira/browse/SOLR-102 for my solution to
this problem.  The idea is that you'd like to split at sentence
boundaries, but also not stray too far from the desired fragment size.
It would be great to get comments on/improvements to this approach.

-Mike


Sorting in Solr

2007-05-03 Thread Scott Matthews
   I've been attempting to use Solr 1.1's sorting feature using the 
syntax provided on the wiki [q=; (asc|desc)] but I'm 
having issues with it.  Some fields work and some do not, and my results seem 
to suggest that it doesn't work when there are any non-alphanumeric 
values in the fields.  Can someone out there either confirm this or let 
me know what I may be doing wrong?  Is it a matter of using a different 
analyzer setting or filter factory than the default setting for text?


Re: Sorting in Solr

2007-05-03 Thread Chris Hostetter

: having issues with it.  Some fields work some do not and my results seem
: to suggest that it doesn't work when there are any non-alphaNumeric
: values in the fields.  Can someone out there either confirm this or let
: me know what I may be doing wrong?  is it a matter of using a different
: analyzer setting or filter factory than the default setting for text.

Sorting requires that there be a single Term/Token per doc ... most
Analyzers do not have this behavior, so you need to use copyField to
create a String version of the field that you use for sorting.

the example schema in the trunk shows this using the name and nameSort
fields ... in the 1.1 release there is a comment about the manu_exact
field.

I've added this as a FAQ.



-Hoss



Re: related multivalued fields

2007-05-03 Thread Chris Hostetter

one approach would be to have a single field called "citation" and use a
custom Analyzer that will put a "medium" sized gap between a person's name
and their organization, and a "large" gap between each person ... so
citation:"John ACME"~10 will give you articles by people named John who
work for companies named ACME.
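
To make the gap idea concrete, here is a simplified model of token positions, not an Analyzer. The gap sizes, the sample people, and both function names are invented for illustration, and the matching rule only covers two terms appearing in query order (real sloppy phrase matching is more general).

```python
MEDIUM_GAP = 5    # hypothetical gap between a name and its organization
LARGE_GAP = 100   # hypothetical gap between one person and the next


def citation_positions(people):
    """Return (token, position) pairs for a list of (name, organization)."""
    positions = []
    pos = 0
    for name, organization in people:
        for token in name.split():
            positions.append((token, pos))
            pos += 1
        pos += MEDIUM_GAP          # name -> organization gap
        for token in organization.split():
            positions.append((token, pos))
            pos += 1
        pos += LARGE_GAP           # person -> next person gap
    return positions


def in_order_match(positions, first, second, slop):
    """True if `second` follows `first` with at most `slop` positions between."""
    firsts = [p for token, p in positions if token == first]
    seconds = [p for token, p in positions if token == second]
    return any(0 <= b - a - 1 <= slop for a in firsts for b in seconds)


people = [("John Smith", "ACME"), ("Jane Doe", "Initech")]
positions = citation_positions(people)

john_acme = in_order_match(positions, "John", "ACME", 10)        # same person
john_initech = in_order_match(positions, "John", "Initech", 10)  # across people
```

With these numbers, a slop of 10 is enough to bridge the medium gap inside one person's entry but far too small to cross the large gap into the next person's entry.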

if you really want to get creative, there was talk a while back about
Phrase/Span Queries that could know about "parallel" fields ... where the
terms in one field "line up" with the terms in another field .. not true
"hierarchical" queries, but good enough for the class of problems you are
talking about...

http://www.nabble.com/Re%3A-One-item%2C-multiple-fields%2C-and-range-queries-p8377712.html





-Hoss



Re: Sorting in Solr

2007-05-03 Thread Scott Matthews
Just to be clear, I have multiple fields per document that are coming 
back in the queried XML. Let's say it's name, id, date, description.  I 
want to sort dynamically on fields, but for my test case on description.  
Are you suggesting that there be one field defined per document, or you 
can only sort on one field per request?  I'm not sure I understand this 
explanation.


Chris Hostetter wrote:

: having issues with it.  Some fields work some do not and my results seem
: to suggest that it doesn't work when there are any non-alphaNumeric
: values in the fields.  Can someone out there either confirm this or let
: me know what I may be doing wrong?  is it a matter of using a different
: analyzer setting or filter factory than the default setting for text.

Sorting requires that there be a single Term/Token per doc ... most
Analyzers do not have this behavior, so you need to use copyField to
create a String version of the field that you use for sorting.

the example schema in the trunk shows this using the name and nameSort
fields ... in the 1.1 release there is a comment about the manu_exact
field.

I've added this as a FAQ.



-Hoss


  




Re: Sorting in Solr

2007-05-03 Thread Chris Hostetter

: Just to be clear, I have multiple fields per document that Are coming
: back in the queried XML. Let's say it's name, id, date, description.  I
: want to sort dynamically on fields but for my test case on Description.
: Are you suggesting that there be one field defined per document, or you
: can only sort on one field per request?  I'm not sure I understand this
: explanation.

if you want to sort on a field called "description" then there must be at
most one indexed term per document for that field.  if you also want to
sort on a field called "date" there must also be at most one indexed value
for that field per document.  for numeric or date type fields, ensuring
that there is only one indexed value per document is a simple matter of
making sure the field is defined as multiValued="false" in your schema,
but for textish fields it's not as simple ... you may send only one
... per doc for that field name but if you are using a
non trivial analyzer you'll wind up with more than one indexed term.

so you define name and description to be whatever type you want with
whatever analyzer you want, and then you use copyField to create a second
version of each called nameSort and descriptionSort which use the StrField
fieldtype ... now you can sort on either of those, or both at the same
time (ie: "nameSort asc, descriptionSort desc")
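
As a rough client-side illustration, the sort spec above can be appended to the query after a semicolon, which is the Solr 1.1 style mentioned at the start of this thread. The helper name and the sample query "ipod" are assumptions, and the sketch presumes the nameSort and descriptionSort fields already exist in the schema.

```python
from urllib.parse import urlencode


def sort_query(user_query, sort_fields):
    """Build a URL-encoded q parameter of the form 'query;field dir, field dir'.

    sort_fields is a list of (field_name, direction) pairs; the field names
    are expected to be the single-term copies (e.g. nameSort), not the
    analyzed originals.
    """
    for _, direction in sort_fields:
        if direction not in ("asc", "desc"):
            raise ValueError("direction must be 'asc' or 'desc'")
    sort_spec = ", ".join(
        "%s %s" % (field, direction) for field, direction in sort_fields)
    return urlencode({"q": "%s;%s" % (user_query, sort_spec)})


params = sort_query("ipod", [("nameSort", "asc"), ("descriptionSort", "desc")])
```

The resulting string can be appended to the select URL of whatever Solr instance is being queried.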




-Hoss



Re: Wondering about results from PhraseQuer

2007-05-03 Thread Chris Hostetter
: the scenario, understand this that user runs a search for title which has
: pretty common terms such as "how do I update" {all of the words appears
: 1000s of times in indexes } and they want to search "prison" the last term
: appears not more than 1 or 2 times across the indexes. Now I have the
: problem, if I try to run phrase query on this I get zero results and if I

if the word "prison" doesn't appear anywhere near the words "how do i"
then a phrase search on "how do i prison" isn't going to find any
documents.  perhaps you should search on...

+"how do i" +prison

..which will only return docs that match the phrase "how do i" and also
contain the word prison.
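
A small client-side sketch of composing that kind of query from a phrase plus extra required terms. The function name is made up, and the sketch does no escaping of query syntax characters, so it is only suitable for plain words.

```python
def require_phrase_and_terms(phrase, *terms):
    """Return a query where the phrase and every extra term are required.

    Each clause gets a leading '+', which marks it as mandatory.
    """
    clauses = ['+"%s"' % phrase]
    clauses.extend("+" + term for term in terms)
    return " ".join(clauses)


query = require_phrase_and_terms("how do i", "prison")
```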

: 0.0 = fieldWeight(subject_t:"how do i prison" in 9268), product of:
:   0.0 = tf(phraseFreq=0.0)
:   18.508762 = idf(subject_t: how=2225 do=3359 i=4918 prison=4)
:   0.5 = fieldNorm(field=subject_t, doc=9268)

this was my point before ... that phrase does not appear in the
document (hence the tf is zero)



-Hoss



Re: Sorting in Solr

2007-05-03 Thread Scott Matthews
Great, thanks.  I was hoping the solution you were suggesting was along 
those lines.





  




Re: Phrase Query fetch no results - verify

2007-05-03 Thread Chris Hostetter

: Now the expected result using phrase query after giving valid slop, should
: return back document with exact match first and then remaining documents
: that match any or few terms in the phrase. Here is the example of terms with
: # of occurrences in index is as follows

there is no single query structure that does what you describe, however i
suggest you take a look at the dismax request handler ... configured
properly for the fields you care about it will not only score documents
that match on many terms better than documents that match on few terms,
but it will also give a "phrasequery" boost to documents that match on the
whole query as a phrase (with whatever amount of slop you want).

note that dismax is primarily designed to query multiple fields, but in
its simplest form you can still use its query parsing goodness on a
single field, something like...

qt=dismax&qf=txtField&pf=txtField&ps=100&mm=0&q=How+do+I+look+for+supply

http://lucene.apache.org/solr/api/org/apache/solr/request/DisMaxRequestHandler.html
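
For anyone building that parameter string from code, a minimal sketch follows. The function name and its defaults are assumptions that simply mirror the example values above (ps=100, mm=0); "txtField" is the placeholder field name from the example.

```python
from urllib.parse import urlencode


def dismax_params(user_query, field, phrase_slop=100, min_match=0):
    """Build single-field dismax parameters.

    qf is the field searched for individual terms, pf is the field used
    for the whole-query phrase boost, ps is the phrase slop and mm is the
    minimum number of terms that should match.
    """
    return urlencode([
        ("qt", "dismax"),
        ("qf", field),
        ("pf", field),
        ("ps", phrase_slop),
        ("mm", min_match),
        ("q", user_query),
    ])


params = dismax_params("How do I look for supply", "txtField")
```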

-Hoss



solr.py - set boosts?

2007-05-03 Thread Jack L
I've been using solr.py to post and search. It works well.
Is it possible to specify doc boost and field boost with it?

Jack

Erik> There is a solr.py in the Solr clients directory:
Erik> http://svn.apache.org/repos/asf/lucene/solr/trunk/client/python/solr.py
Erik> It's got some utility methods for generating 's.

Mike> It is not documented very well, but you can pass in a multi-map to the
Mike> solr.py client:

Mike> .add(field_one=['one', 'two', 'three'], field_two='value', ...)




Re: facet.sort does not work in python output

2007-05-03 Thread Jack L
The Python output uses nested dictionaries for facet counts.
I read online that Python dictionaries do not preserve order.
So when the string is eval()'d, the sorted order is lost in the
generated Python object. Is it a good idea to use a list to wrap
around the dictionary? This is only needed for the fields sorted
by counts.

-- 
Best regards,
Jack

Wednesday, May 2, 2007, 6:09:50 PM, you wrote:


> When facet.sort is used, the facet fields are sorted by the count
> in the reply string when using python output. However, after calling
> eval(), the sort order seems to be lost. Not sure if anyone has come
> up with a way to avoid this problem.

> Using the JSON output with a JSON parser for Python should work but
> I haven't tested it yet.




Solr consult in or near Florida?

2007-05-03 Thread Avi Rappoport, SearchTools.com

Hi all,

I'm really impressed by Solr and one of my potential consulting 
customers is considering installing it, using the faceted metadata 
browse results for their photo site.  They're running Windows and MS 
SQL, and would want someone to help them install and configure Solr 
for them, in Florida.  If you're interested, please contact me 
offlist and let me see what I can pitch to them.


Thanks,

Avi Rappoport
Enterprise Search Consultant

--
Complete Guide to Search Engines for Web Sites and Intranets
   


Re: facet.sort does not work in python output

2007-05-03 Thread Mike Klaas

On 5/3/07, Jack L <[EMAIL PROTECTED]> wrote:

The Python output uses nested dictionaries for facet counts.
I read it online that Python dictionaries do not preserve order.
So when a string is eval()'d, the sorted order is lost in the
generated Python object. Is it a good idea to use list to wrap
around the dictionary? This is only needed for the fields, sorted
by counts.


This might be fixed in the future, but for now, either re-sort on the
client side (a one- or zero-liner), or specify json.nl=arrarr (which
affects the whole python response structure... probably not
recommended).
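
The client-side re-sort can look roughly like this. The facet data is invented for illustration; the only assumption about the response is that each facet field maps values to counts in a dictionary, as described above.

```python
# Hypothetical facet counts as they might look after eval() of the response.
facet_fields = {
    "cat": {"memory": 3, "electronics": 14, "hard drive": 2, "music": 7},
}


def sorted_facets(counts):
    """Return (value, count) pairs ordered by count, highest first."""
    return sorted(counts.items(), key=lambda item: item[1], reverse=True)


by_count = sorted_facets(facet_fields["cat"])
```

Returning a list of pairs keeps the order explicit regardless of how the dictionary itself is ordered.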

There is some past discussion on the list if you search the archives.

-Mike


Re: solr.py - set boosts?

2007-05-03 Thread Mike Klaas

On 5/3/07, Jack L <[EMAIL PROTECTED]> wrote:

I've been using solr.py to post and search. It works well.
Is it possible to specify doc boost and field boost with it?


Not currently, but there is an improved client in the works which you
can try here:

http://issues.apache.org/jira/browse/SOLR-216
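
In the meantime, one workaround is to build the update XML by hand, since the XML update format accepts a boost attribute on both the doc and field elements. The sketch below is not part of solr.py; the function name, its arguments, and the sample document are made up, and posting the string to the update handler is left out.

```python
from xml.sax.saxutils import escape, quoteattr


def add_doc_xml(fields, doc_boost=None, field_boosts=None):
    """Return an <add> message for one document with optional boosts.

    fields maps field names to single values; field_boosts maps field
    names to boost factors. Hypothetical helper, not the solr.py API.
    """
    field_boosts = field_boosts or {}
    doc_attr = ""
    if doc_boost is not None:
        doc_attr = " boost=%s" % quoteattr(str(doc_boost))
    parts = ["<add><doc%s>" % doc_attr]
    for name, value in fields.items():
        boost_attr = ""
        if name in field_boosts:
            boost_attr = " boost=%s" % quoteattr(str(field_boosts[name]))
        parts.append("<field name=%s%s>%s</field>" % (
            quoteattr(name), boost_attr, escape(str(value))))
    parts.append("</doc></add>")
    return "".join(parts)


xml = add_doc_xml(
    {"id": "42", "title": "Fuel & tanks"},
    doc_boost=2.5,
    field_boosts={"title": 3.0},
)
```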

-Mike


Re: Look ahead queries

2007-05-03 Thread Mike Klaas

On 5/3/07, Ge, Yao (Y.) <[EMAIL PROTECTED]> wrote:

I am planning to develop look ahead queries with Solr so that as user
type query terms a list of related terms is shown in a popup window
(similar to Google suggest). It will be a little AJAX type calls to Solr
with wildcards. So if user types "fuel", a look ahead query will be sent
to solr in form of "fuel *". User will end-up seeing relevant terms like
"fuel consumption", "fuel leaks", "fuel tank" etc showing up. In this


Neither Solr nor Lucene supports queries of this form.  You could accomplish
something similar by iterating over the termdocs of the documents
returned from a regular "fuel" term query, though.


case, I will likely to limit queries to certain fields only and some
post processing is required to get a final list of suggestion. Let me
know if someone has already done this and there are better ways or
suggestions to accomplish this. I figured solr's caching will make this
type of application more efficient than a straight Lucene integration.


I'm not sure if Solr's caching will help you in this regard (then
again, I don't know exactly what you're planning on doing).

Typically, the feature you are talking about is implemented by
analyzing query logs, which are a much more relevant corpus than the
raw documents in this context.  I suggest focusing your efforts in
that direction (possibly checking to see if someone has done this
with lucene already...)
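
A toy sketch of the query-log idea, assuming the log is available as a plain list of past query strings. The sample log and the function names are invented; a real implementation would need to handle log size, freshness, and filtering.

```python
from collections import Counter


def build_suggester(query_log):
    """Count past queries and return a prefix-lookup function."""
    counts = Counter(q.strip().lower() for q in query_log if q.strip())

    def suggest(prefix, limit=5):
        prefix = prefix.strip().lower()
        matches = [(query, n) for query, n in counts.items()
                   if query.startswith(prefix) and query != prefix]
        # Most frequent first; ties broken alphabetically.
        matches.sort(key=lambda item: (-item[1], item[0]))
        return [query for query, _ in matches[:limit]]

    return suggest


query_log = [
    "fuel tank", "fuel leaks", "fuel tank", "Fuel consumption",
    "brake pads", "fuel tank", "fuel leaks",
]
suggest = build_suggester(query_log)
suggestions = suggest("fuel")
```

Ranking by how often real users typed each query is what makes the log a more relevant corpus than the raw documents for this purpose.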

cheers,
-Mike


Re: facet.sort does not work in python output

2007-05-03 Thread Erik Hatcher

We re-sort it in solr-ruby:

  def field_facets(field)
    facets = []
    values = @data['facet_counts']['facet_fields'][field]
    Solr::Util.paired_array_each(values) do |key, value|
      facets << FacetValue.new(key, value)
    end

    facets
  end
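
A rough Python counterpart, assuming (from the helper's name) that the facet values arrive as a flat array alternating between value and count. The function name and sample data are made up for illustration.

```python
def paired(values):
    """Turn [key, count, key, count, ...] into [(key, count), ...] in order."""
    if len(values) % 2 != 0:
        raise ValueError("expected an even number of elements")
    return list(zip(values[0::2], values[1::2]))


flat = ["electronics", 14, "music", 7, "memory", 3]
pairs = paired(flat)
```

Because the flat array is a list, the order the server produced is kept as-is.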



On May 3, 2007, at 8:10 PM, Mike Klaas wrote:


On 5/3/07, Jack L <[EMAIL PROTECTED]> wrote:

The Python output uses nested dictionaries for facet counts.
I read it online that Python dictionaries do not preserve order.
So when a string is eval()'d, the sorted order is lost in the
generated Python object. Is it a good idea to use list to wrap
around the dictionary? This is only needed for the fields, sorted
by counts.


This might be fixed in the future, but for now, either resort on the
client-side (a one- or zero-liner), or specify json.nl=arrarr (which
affects the whole python response structure... probably not
recommended).

There is some past discussion on the list if you search the archives.

-Mike