Item Search Database

2007-03-28 Thread Maarten . De . Vilder
hi,

i have a performance question...

we need to implement a feature called 'Item Search Database', which 
basically means we have to limit the documents a user can search ...

example :
Item1 is in database1
item2 is in database2
item3 is in database1 and database2
and the client can only see the items in database1

we currently solve this by making a new solrcolumn for each 
searchdatabase... so it looks like this :
ITEMNAME  DB1    DB2
--------  -----  -----
Item1     true   false
Item2     false  true
Item3     true   true

and we limit the result of a search by putting "db1:true" in the 
querystring

but i have been reading about another method :
we could also use just one solrcolumn and put the names of the databases in 
it...
like so :
ITEMNAME  DB
--------  -------
Item1     DB1
Item2     DB2
Item3     DB1 DB2

and limit the results by putting 'db:db1' in the querystring
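
For comparison, here is a minimal sketch of how the two restrictions could be appended to a search request. The endpoint URL, the "itemname" query and the helper name search_url are assumptions made for illustration; only the "db1:true" and "db:db1" clauses come from the message above.

```python
from urllib.parse import urlencode

# Hypothetical Solr endpoint; adjust host, port and path to your install.
SOLR_SELECT = "http://localhost:8983/solr/select"


def search_url(user_query, db_restriction):
    """Build a select URL that ANDs the user's query with a database restriction."""
    query = f"({user_query}) AND {db_restriction}"
    return SOLR_SELECT + "?" + urlencode({"q": query})


# Option 1: one boolean field per search database.
option1 = search_url("itemname:item*", "db1:true")

# Option 2: a single multi-valued field holding the database names.
option2 = search_url("itemname:item*", "db:db1")

print(option1)
print(option2)
```

In both cases the restriction is a single term lookup on one field, so the two requests differ only in which field and term are named.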

and now for my question :
which of these options will be more performant ?

my guess is the first option will be the most performant since the indexes 
will be better constructed
but i would really like a professional opinion on this ...

as i said, we are currently using the first option on 300.000 testrecords 
and it is really performant.
some SearchDatabases have only 12 records in them and it takes less than 1ms 
to get those 12 records back... so i'm guessing Solr is not searching the 
full 300.000 records and i am kind of afraid that with the second option 
Solr will have to search more records/indexes to get the same result...

well, hope you understand my question and thanks in advance !
- Maarten

PS: thank you to everybody on this list for the help and thank you to all 
of the Solr/Lucene developers, great stuff !!

Auto index update

2007-03-28 Thread netaji . k
Hello,

Can anybody suggest the best method to implement automatic index
updates in Solr from a MySQL database?

thanks and regards
aditya



Fw: Download solr-tools rpm

2007-03-28 Thread Suresh Kannan
Hi,

I need to configure master / slave servers, so I checked the wiki help 
documents and found that I need to install the solr-tools rpm. But I was not 
able to download the files. Could someone please help me with the solr-tools rpm?

Suresh Kannan

failing post-optimize command execution

2007-03-28 Thread galo

Hi,

I've configured my solrconfig.xml to execute a snapshoot after an 
optimize is made but I keep getting the following exception in the 
tomcat logs:


SEVERE: java.io.IOException: Cannot run program "snapshooter" (in 
directory "/home/solr/solr/bin"): java.io.IOException: error=2, No such 
file or directory


I'm certain the path and filename are correct. Has anybody else had 
problems with this?


Cheers,

galo


Re: failing post-optimize command execution

2007-03-28 Thread Traut

What about the access rights on the snapshooter file and on the directories
in the path /home/solr/solr/bin?
Maybe this is the root of the problem?
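
A quick way to check this is a small script like the sketch below. The helper name diagnose_executable is made up for illustration; the directory and file name are the ones from the error message. The result reflects the permissions of whoever runs the script, so it would need to be run as the same user that Tomcat runs as.

```python
import os


def diagnose_executable(directory, name):
    """Report whether a script exists and is executable by the current user."""
    path = os.path.join(directory, name)
    return {
        "path": path,
        "dir_exists": os.path.isdir(directory),
        "file_exists": os.path.isfile(path),
        # os.access checks against the real uid/gid of the running process.
        "executable": os.access(path, os.X_OK),
    }


# Directory and script name taken from the exception text.
print(diagnose_executable("/home/solr/solr/bin", "snapshooter"))
```

Note that error=2 is ENOENT ("No such file or directory"), so a False value for "file_exists" or "dir_exists" would point at the path rather than at the permission bits.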

On 3/28/07, galo <[EMAIL PROTECTED]> wrote:


Hi,

I've configured my solrconfig.xml to execute a snapshoot after an
optimize is made but I keep getting the following exception in the
tomcat logs:

SEVERE: java.io.IOException: Cannot run program "snapshooter" (in
directory "/home/solr/solr/bin"): java.io.IOException: error=2, No such
file or directory

I'm certain the path and filename is correct.. does anybody have
problems with this?

Cheers,

galo





--
Best regards,
Traut


Solr finding doc by one field but not by another

2007-03-28 Thread Theodan

Hi everyone.

Can anyone explain how this might happen?  I query by the "ID" field and get
the following result:

=
0
16
ID:ee483237-399c-4b17-ad73-000cc54fd3e1
COSMEO US
ee483237-399c-4b17-ad73-000cc54fd3e1
en-US
Social Studies American History
Historical Periods
Expansion and Reform 1801-1861 Territorial Expansion
EncyclopediaArticles
2005
Pony Express was a mail service operating between
Saint Joseph, Mo., and Sacramento, Calif., inaugurated on April 3, 1860,
under the direction of the Central Overland California and Pike's Peak
Express Co.
True
Pony Express
pony express
=

Then I query by the "title" field from the result above (so I know the
document is in the index and has been committed), and I get zero results:

=
0
0
title:"Pony Express"
=

"ID" is not the only field that I can find the doc by.  Searching for
"Type:encyclopediaarticles" finds it too.  Also, "title" is not the only
field that misses the doc.  A search by "vocabulary" misses it too.  I
haven't tried all the fields yet to see exhaustively which ones find it and
which ones don't.  I can do that if it would help.

For what it's worth, I started with an existing Lucene index and modified
Solr's schema.xml so that I could just use the Lucene index in Solr.  That
Lucene index had about 230K docs.  I then used your "post.jar" to post
another 10K docs to the index after starting up the server.  Those 10K docs
only had 7 of the 30 fields that the original 230K docs had.  Could that be
the problem?  I am noticing that the docs that I'm having problems with are
from the original 230K-doc index, not from my subsequent 10K-doc post.  The
10K docs seem to be findable by any of their 7 fields.

Here are my config files:
http://www.nabble.com/file/7488/schema.xml schema.xml 
http://www.nabble.com/file/7489/solrconfig.xml solrconfig.xml 

Any help is greatly appreciated.

Thanks,
-Dan
-- 
View this message in context: 
http://www.nabble.com/Solr-finding-doc-by-one-field-but-not-by-another-tf3481287.html#a9716918
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Solr finding doc by one field but not by another

2007-03-28 Thread Mike Klaas

On 3/28/07, Theodan <[EMAIL PROTECTED]> wrote:


For what it's worth, I started with an existing Lucene index and modified
Solr's schema.xml so that I could just use the Lucene index in Solr.  That
Lucene index had about 230K docs.  I then used your "post.jar" to post
another 10K docs to the index after starting up the server.  Those 10K docs
only had 7 of the 30 fields that the original 230K docs had.  Could that be
the problem?  I am noticing that the docs that I'm having problems with are
from the original 230K-doc index, not from my subsequent 10K-doc post.  The
10K docs seem to be findable by any of their 7 fields.


This is almost certainly due to a mismatch between the index- and
query-time analysis of the fields.  For instance, your schema defines
the title field to be "string" (unanalyzed), but it is likely that
some tokenization (perhaps via StandardAnalyzer) occurred in the
original index.
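
To illustrate the kind of mismatch described here, the sketch below uses two toy analyzers. They are simplified stand-ins written for this example, not Solr or Lucene classes, and a real phrase query also checks term positions, which this ignores.

```python
def tokenizing_analyzer(text):
    """Rough stand-in for a tokenizing analyzer: split on whitespace, lowercase."""
    return [token.lower() for token in text.split()]


def string_analyzer(text):
    """Stand-in for an unanalyzed "string" field: the whole value is one term."""
    return [text]


def matches(indexed_terms, query_terms):
    """A query matches only if every query term exists verbatim in the index."""
    return all(term in indexed_terms for term in query_terms)


# Terms written by the original (tokenizing) indexing process.
indexed = set(tokenizing_analyzer("Pony Express"))

# Query analyzed as an unanalyzed string: looks for the single term "Pony Express".
mismatch = matches(indexed, string_analyzer("Pony Express"))

# Query analyzed the same way as the index: looks for "pony" and "express".
consistent = matches(indexed, tokenizing_analyzer("Pony Express"))

print(mismatch, consistent)
```

The document is only found when the query goes through the same analysis that produced the indexed terms.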

-Mike


Re: Document boost not as expected...

2007-03-28 Thread escher2k

Chris,
   Earlier I was trying to modify the Similarity computation to make it
field dependent (we are trying to change tf based on the field). Now, I have
reverted the custom computation so that the default Similarity is used. For
testing, I boosted a single field in one doc. 


Y
...


This is what I see in the explain -
2.5 = (MATCH) sum of:
  2.5 = (MATCH) fieldWeight(show_all_flag:Y in 17), product of:
    1.0 = tf(termFreq(show_all_flag:Y)=1)
    1.0 = idf(docFreq=36239)
    2.5 = fieldNorm(field=show_all_flag, doc=17)

Again, I fail to understand where it is doing a multiplication by 1.25
(score (2.5) = field_boost (2.0) * 1.25 ??).

Thanks.


Chris Hostetter wrote:
> 
> 
> Ditto everything Mike said, but i'm also curious what Similarity changes
> you made ... without knowing what that code looks like, all bets are off
> in terms of anyone being able to help you understand the scores you are
> seeing.
> 
> : I am not quite sure how the score changed from 1.33 to 1.25. I am not quite
> : sure how this might have happened - I have modified the custom similarity
> : but I don't quite have an explanation of how the score changed.
> 
> 
> -Hoss
> 
> 
> 

-- 
View this message in context: 
http://www.nabble.com/Document-boost-not-as-expected...-tf3476653.html#a9718403
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Document boost not as expected...

2007-03-28 Thread Mike Klaas

On 3/28/07, escher2k <[EMAIL PROTECTED]> wrote:


Again, I fail to understand where it is doing a multiplication by 1.25
(score (2.5) = field_boost (2.0) * 1.25 ??).


As I said above, lengthNorm is also multiplied in.  This will depend
on your custom Similarity and what value(s) you have in the field.

-Mike


Controlling read/write access for replicated indexes

2007-03-28 Thread Jeff Rodenburg

I'm curious what mechanisms everyone is using to control read/write access
for distributed replicated indexes.  We're moving to a replication
environment very soon, and our client applications (quite a few) all have
configuration pointers to the URLs for solr instances.  As a precaution, I
don't want errant configuration values to inadvertently send write requests
to read servers, as an example.  As an aside, we're running solr under
tomcat 5.5.x which has its own control aspects as well.

Any best practices, i.e. something that's not a maintenance headache later,
from those who have done this would be greatly appreciated.

thanks,
j.r.


Re: Document boost not as expected...

2007-03-28 Thread escher2k

Mike,
   I am not doing anything custom for this test. I am assuming that the
Default Similarity is used.
Surprisingly, if I remove the document level boost (set to 1.0) and just
have a field level boost, the result
seems to be correct.


Mike Klaas wrote:
> 
> On 3/28/07, escher2k <[EMAIL PROTECTED]> wrote:
> 
>> Again, I fail to understand where it is doing a multiplication by 1.25
>> (score (2.5) = field_boost (2.0) * 1.25 ??).
> 
> As I said above, lengthNorm is also multiplied in.  This will depend
> on your custom Similarity and what value(s) you have in the field.
> 
> -Mike
> 
> 




Best approach for indexing and querying against a multivalue name field like directors or actors?

2007-03-28 Thread Daniel Einspanjer

I'm rather new to Solr and somewhat rusty on what little I learned on
Lucene a few years back.

I've got some documents I want to index that have multiple name fields
such as directors or actors. I'm wanting to index them such that
querying for "Jane Doe" would have a higher score for "Jane M. Doe"
than for "John Doe", but I need to make sure that "Jane Doe" wouldn't
match a document with two directors, "Jane Smith" and "John Doe" at
all.

If anyone has done something like this and could suggest some of the
solr filters that might be useful to me, I'd greatly appreciate it.

Daniel


Re: Best approach for indexing and querying against a multivalue name field like directors or actors?

2007-03-28 Thread Daniel Einspanjer

I'm sorry, I said something confusing there.
Let me try that last case again.

If you have three documents with a multivalue field named director
(represented here by ; separator)
1. "Jane M. Doe"
2. "Jane Smith"; "John Doe"
3. "John Doe"

And the user searched for director:"Jane Doe", I would ideally like 1.
to have the highest score and 2 and 3 to have nearly equal scores.
The experiments I've done so far have given 2. a score higher than 3.
because the terms Jane and Doe were found in document 2. even though
they were in separate instances of the multivalue field.

I hope this makes understanding my question better rather than worse. :)

Thanks,
Daniel

On 3/28/07, Daniel Einspanjer <[EMAIL PROTECTED]> wrote:

 but I need to make sure that "Jane Doe" wouldn't
match a document with two directors, "Jane Smith" and "John Doe" at
all.


Re: Document boost not as expected...

2007-03-28 Thread Mike Klaas

On 3/28/07, escher2k <[EMAIL PROTECTED]> wrote:


Mike,
   I am not doing anything custom for this test. I am assuming that the
Default Similarity is used.
Surprisingly, if I remove the document level boost (set to 1.0) and just
have a field level boost, the result
seems to be correct.


Another detail that I forgot to mention is that fieldNorms are encoded
into one-byte floats, so you can experience severe rounding errors.
The possible values are:

0   0.0
1   5.820766E-10
2   6.9849193E-10
3   8.1490725E-10
4   9.313226E-10
5   1.1641532E-9
6   1.3969839E-9
7   1.6298145E-9
8   1.8626451E-9
9   2.3283064E-9
10  2.7939677E-9
11  3.259629E-9
12  3.7252903E-9
13  4.656613E-9
14  5.5879354E-9
15  6.519258E-9
16  7.4505806E-9
17  9.313226E-9
18  1.1175871E-8
19  1.3038516E-8
20  1.4901161E-8
21  1.8626451E-8
22  2.2351742E-8
23  2.6077032E-8
24  2.9802322E-8
25  3.7252903E-8
26  4.4703484E-8
27  5.2154064E-8
28  5.9604645E-8
29  7.4505806E-8
30  8.940697E-8
31  1.0430813E-7
32  1.1920929E-7
33  1.4901161E-7
34  1.7881393E-7
35  2.0861626E-7
36  2.3841858E-7
37  2.9802322E-7
38  3.5762787E-7
39  4.172325E-7
40  4.7683716E-7
41  5.9604645E-7
42  7.1525574E-7
43  8.34465E-7
44  9.536743E-7
45  1.1920929E-6
46  1.4305115E-6
47  1.66893E-6
48  1.9073486E-6
49  2.3841858E-6
50  2.861023E-6
51  3.33786E-6
52  3.8146973E-6
53  4.7683716E-6
54  5.722046E-6
55  6.67572E-6
56  7.6293945E-6
57  9.536743E-6
58  1.1444092E-5
59  1.335144E-5
60  1.5258789E-5
61  1.9073486E-5
62  2.2888184E-5
63  2.670288E-5
64  3.0517578E-5
65  3.8146973E-5
66  4.5776367E-5
67  5.340576E-5
68  6.1035156E-5
69  7.6293945E-5
70  9.1552734E-5
71  1.0681152E-4
72  1.2207031E-4
73  1.5258789E-4
74  1.8310547E-4
75  2.1362305E-4
76  2.4414062E-4
77  3.0517578E-4
78  3.6621094E-4
79  4.272461E-4
80  4.8828125E-4
81  6.1035156E-4
82  7.324219E-4
83  8.544922E-4
84  9.765625E-4
85  0.0012207031
86  0.0014648438
87  0.0017089844
88  0.001953125
89  0.0024414062
90  0.0029296875
91  0.0034179688
92  0.00390625
93  0.0048828125
94  0.005859375
95  0.0068359375
96  0.0078125
97  0.009765625
98  0.01171875
99  0.013671875
100 0.015625
101 0.01953125
102 0.0234375
103 0.02734375
104 0.03125
105 0.0390625
106 0.046875
107 0.0546875
108 0.0625
109 0.078125
110 0.09375
111 0.109375
112 0.125
113 0.15625
114 0.1875
115 0.21875
116 0.25
117 0.3125
118 0.375
119 0.4375
120 0.5
121 0.625
122 0.75
123 0.875
124 1.0
125 1.25
126 1.5
127 1.75
128 2.0
129 2.5
130 3.0
131 3.5
132 4.0
133 5.0
134 6.0
135 7.0
136 8.0
137 10.0
138 12.0
139 14.0
140 16.0
141 20.0
142 24.0
143 28.0
144 32.0
145 40.0
146 48.0
147 56.0
148 64.0
149 80.0
150 96.0
151 112.0
152 128.0
153 160.0
154 192.0
155 224.0
156 256.0
157 320.0
158 384.0
159 448.0
160 512.0
161 640.0
162 768.0
163 896.0
164 1024.0
165 1280.0
166 1536.0
167 1792.0
168 2048.0
169 2560.0
170 3072.0
171 3584.0
172 4096.0
173 5120.0
174 6144.0
175 7168.0
176 8192.0
177 10240.0
178 12288.0
179 14336.0
180 16384.0
181 20480.0
182 24576.0
183 28672.0
184 32768.0
185 40960.0
186 49152.0
187 57344.0
188 65536.0
189 81920.0
190 98304.0
191 114688.0
192 131072.0
193 163840.0
194 196608.0
195 229376.0
196 262144.0
197 327680.0
198 393216.0
199 458752.0
200 524288.0
201 655360.0
202 786432.0
203 917504.0
204 1048576.0
205 1310720.0
206 1572864.0
207 1835008.0
208 2097152.0
209 2621440.0
210 3145728.0
211 3670016.0
212 4194304.0
213 5242880.0
214 6291456.0
215 7340032.0
216 8388608.0
217 1.048576E7
218 1.2582912E7
219 1.4680064E7
220 1.6777216E7
221 2.097152E7
222 2.5165824E7
223 2.9360128E7
224 3.3554432E7
225 4.194304E7
226 5.0331648E7
227 5.8720256E7
228 6.7108864E7
229 8.388608E7
230 1.00663296E8
231 1.17440512E8
232 1.34217728E8
233 1.6777216E8
234 2.01326592E8
235 2.34881024E8
236 2.68435456E8
237 3.3554432E8
238 4.02653184E8
239 4.69762048E8
240 5.3687091E8
241 6.7108864E8
242 8.0530637E8
243 9.395241E8
244 1.07374182E9
245 1.34217728E9
246 1.61061274E9
247 1.87904819E9
248 2.14748365E9
249 2.68435456E9
250 3.22122547E9
251 3.75809638E9
25
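
The values in this table can be reproduced with a short script. The sketch below is a Python port of the one-byte float scheme as I understand it from Lucene's SmallFloat class (3 mantissa bits, zero exponent 15); the Python function names and the sample inputs 1.33 and 2.66 are assumptions chosen for illustration, not values confirmed in this thread.

```python
import struct


def byte315_to_float(b):
    """Decode a norm byte (0..255) into the float it represents."""
    if b == 0:
        return 0.0
    bits = ((b & 0xFF) << 21) + ((63 - 15) << 24)
    return struct.unpack(">f", struct.pack(">I", bits))[0]


def float_to_byte315(f):
    """Encode a float into one byte; excess mantissa bits are truncated."""
    bits = struct.unpack(">i", struct.pack(">f", f))[0]  # signed 32-bit view
    small = bits >> 21
    zero_point = (63 - 15) << 3
    if small <= zero_point:
        return 0 if bits <= 0 else 1   # underflow: zero or smallest positive
    if small >= zero_point + 0x100:
        return 255                     # overflow: largest representable value
    return small - zero_point


for value in (1.0, 1.33, 2.0, 2.66):
    code = float_to_byte315(value)
    print(value, "->", code, "->", byte315_to_float(code))
```

With this encoding a norm of 1.33 is stored as 1.25 and a norm of 2.66 is stored as 2.5, because only three mantissa bits survive and the rest are dropped rather than rounded.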

Re: Fw: Download solr-tools rpm

2007-03-28 Thread Chris Hostetter

: I need to configure master / slave servers. Hence i check at wiki help
: documents. I found that i need to install solr-tools rpm. But i could
: not able to download the files. Please some help me with solr-tools rpm.

Any references to a "solr-tools rpm" on the wiki are outdated and leftover
from when i ported those wiki pages from CNET ... Apache Solr doesn't
distribute anything as an RPM, you should be able to find all of those
scripts in the Solr release tgz bundles.

-Hoss



Re: Best approach for indexing and querying against a multivalue name field like directors or actors?

2007-03-28 Thread Chris Hostetter

you'll want to look into the positionIncrementGap attribute that can be
specified when defining an Analyzer for your field type ... it defines the
"logical" gap between tokens in a multi-value field, so if you use a
whitespace tokenizer and add the names "Jane Smith" and "John Doe" you'll get
the tokens "Jane", "Smith", ... "John", "Doe" with a big gap between Smith
and John .. so now you can do phrase queries and as long as the slop on
your phrase queries is less than the gap you used you don't have to worry
about false matches on "Jane Doe"
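
The effect of the gap can be sketched with a toy model of token positions. This is a simplified simulation written for illustration, not Lucene code: the functions index_positions and phrase_matches are hypothetical, the gap and slop values are arbitrary, and slop is modelled simply as the number of tokens allowed between the two words in order.

```python
def index_positions(values, gap):
    """Assign a position to every token of a multi-valued field.

    Each new value starts `gap` extra positions after the previous one.
    """
    positions = {}
    pos = -1
    for i, value in enumerate(values):
        if i > 0:
            pos += gap
        for token in value.lower().split():
            pos += 1
            positions.setdefault(token, []).append(pos)
    return positions


def phrase_matches(positions, first, second, slop):
    """True if `second` follows `first` with at most `slop` tokens in between."""
    return any(
        0 <= q - p - 1 <= slop
        for p in positions.get(first, [])
        for q in positions.get(second, [])
    )


one_director = ["Jane M. Doe"]
two_directors = ["Jane Smith", "John Doe"]

# Slop of 2 lets "jane doe" match across a middle initial.
print(phrase_matches(index_positions(one_director, 100), "jane", "doe", 2))
# Without a gap, the same slop also matches across the two separate names.
print(phrase_matches(index_positions(two_directors, 0), "jane", "doe", 2))
# With a gap larger than the slop, the false match disappears.
print(phrase_matches(index_positions(two_directors, 100), "jane", "doe", 2))
```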



: Date: Wed, 28 Mar 2007 17:28:47 -0400
: From: Daniel Einspanjer <[EMAIL PROTECTED]>
: Reply-To: solr-user@lucene.apache.org
: To: solr-user@lucene.apache.org
: Subject: Best approach for indexing and querying against a multivalue
: name field like directors or actors?
:
: I'm rather new to Solr and somewhat rusty on what little I learned on
: Lucene a few years back.
:
: I've got some documents I want to index that have multiple name fields
: such as directors or actors. I'm wanting to index them such that
: querying for "Jane Doe" would have a higher score for "Jane M. Doe"
: than for "John Doe", but I need to make sure that "Jane Doe" wouldn't
: match a document with two directors, "Jane Smith" and "John Doe" at
: all.
:
: If anyone has done something like this and could suggest some of the
: solr filters that might be useful to me, I'd greatly appreciate it.
:
: Daniel
:



-Hoss



Re: maximum index size

2007-03-28 Thread Otis Gospodnetic
Hi Mike,

I'm curious about what you said there:  "People have constructed (lucene) 
indices with over a billion
documents.".  Are you referring to somebody specific?  I've never heard of 
anyone creating a single Lucene index that large, but I'd love to know who did 
that.

Thanks,
Otis

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Simpy -- http://www.simpy.com/  -  Tag  -  Search  -  Share

----- Original Message -----
From: Mike Klaas <[EMAIL PROTECTED]>
To: solr-user@lucene.apache.org
Sent: Tuesday, March 27, 2007 6:20:40 PM
Subject: Re: maximum index size

On 3/27/07, Kevin Osborn <[EMAIL PROTECTED]> wrote:
> I know there are a bunch of variables here (RAM, number of fields, hits, 
> etc.), but I am trying to get a sense of how big of an index in terms of 
> number of documents Solr can reasonably handle. I have heard of indexes of 3-4 
> million documents running fine. But, I have no idea what a reasonable upper 
> limit might be.

People have constructed (lucene) indices with over a billion
documents.  But if "reasonable" means something like "<1s query time
for a medium-complexity query on non-astronomical hardware", I
wouldn't go much higher than the figure you quote.

> I have a large number of documents and about 200-300 customers would have 
> access to varying subsets of those documents. So, one possible strategy is to 
> have everything in a large index, but duplicate the documents for each 
> customer that has access to that document. But, that would really make the 
> total number of documents huge. So, I am trying to get a sense of how big is 
> too big. Each document will probably have about 30 fields. Most of them will 
> be strings, but there will be some text, ints, and floats.

If you are going to store a document for each customer then some field
must indicate to which customer the document instance belongs.  In
that case, why not index a single copy of each document, with a field
containing a list of customers having access?
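
As a rough sketch of that idea, the toy index below stores each document once with a multi-valued "customers" field and intersects the search hits with the set of documents a customer may see. The document contents, customer names and function names are invented for illustration; in Solr the same restriction would be a term query on the access field.

```python
from collections import defaultdict

# One copy of each document, with the customers allowed to see it.
documents = {
    1: {"text": "red widget", "customers": ["acme"]},
    2: {"text": "blue widget", "customers": ["globex"]},
    3: {"text": "green widget", "customers": ["acme", "globex"]},
}

# Inverted index on the access field: customer -> ids of visible documents.
access_index = defaultdict(set)
for doc_id, doc in documents.items():
    for customer in doc["customers"]:
        access_index[customer].add(doc_id)


def search(term, customer):
    """Return ids of documents containing `term` that `customer` may see."""
    hits = {d for d, doc in documents.items() if term in doc["text"].split()}
    return sorted(hits & access_index.get(customer, set()))


print(search("widget", "acme"))
print(search("blue", "acme"))
```

The index holds three documents rather than four, and the saving grows with the number of customers that share each document.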

-Mike





Snippets of indexed text

2007-03-28 Thread Pierre-Yves LANDRON

Hello everybody !

I am wondering if there is a way to get some relevant snippets (searched terms 
in context) of indexed text with a solr response to a query, instead of 
just the entire indexed field ? ( more widely, what are the possibilities to 
let solr format the answer (highlight terms, etc.) ? )


Thanks,
Kind regards,
P-Y Landron





Re: Snippets of indexed text

2007-03-28 Thread Thierry Collogne

It is possible. You need to pass highlighting parameters. Look here :

 http://wiki.apache.org/solr/HighlightingParameters

Hope this helps.
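
For example, a request with highlighting switched on could be built like this. It is only a sketch: the endpoint URL, the field name "text" and the query are placeholders, while hl, hl.fl, hl.snippets and hl.fragsize are parameters described on the wiki page above.

```python
from urllib.parse import urlencode

# Hypothetical Solr endpoint; adjust host, port and path to your install.
SOLR_SELECT = "http://localhost:8983/solr/select"

params = {
    "q": "text:express",   # placeholder query on a placeholder field
    "hl": "true",          # turn highlighting on
    "hl.fl": "text",       # field(s) to generate snippets for
    "hl.snippets": 2,      # maximum snippets per field
    "hl.fragsize": 100,    # approximate snippet size in characters
}

highlight_url = SOLR_SELECT + "?" + urlencode(params)
print(highlight_url)
```

The field being highlighted generally has to be stored, since the snippets are cut from the stored text.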

On 29/03/07, Pierre-Yves LANDRON <[EMAIL PROTECTED]> wrote:


Hello everybody !

I am wondering if there is a way to get some relevant snippets (searched terms
in context) of indexed text with a solr response to a query, instead of
just the entire indexed field ? ( more widely, what are the possibilities to
let solr format the answer (highlight terms, etc.) ? )

Thanks,
Kind regards,
P-Y Landron





index problem

2007-03-28 Thread James liu

i use freebsd6, tomcat 6 (without install) + jdk1.5_07 + php5 + mssql

i debugged my program and the data is ok before it is sent to solr for indexing

and the index process is ok. no errors.

but i find the indexed content is not what i want. it has been changed.

in tomcat6's server.xml, i added URIEncoding="UTF-8"

the data is sent to solr for indexing by curl (with utf-8)


does anyone know how to fix it?


--
regards
jl