Item Search Database
Hi, I have a performance question. We need to implement a feature called 'Item Search Database', which basically means we have to limit the documents a user can search. For example: Item1 is in database1, Item2 is in database2, Item3 is in both database1 and database2, and a given client can only see the items in database1.

We currently solve this by adding a new Solr field for each search database, so it looks like this:

  ITEMNAME  DB1    DB2
  --------  -----  -----
  Item1     true   false
  Item2     false  true
  Item3     true   true

and we limit the result of a search by putting "db1:true" in the query string.

But I have been reading about another method: we could also use just one Solr field and put the names of the databases in it, like so:

  ITEMNAME  DB
  --------  --------
  Item1     DB1
  Item2     DB2
  Item3     DB1 DB2

and limit the results by putting "db:DB1" in the query string.

Now for my question: which of these options will be more performant? My guess is the first option, since the indexes will be better constructed, but I would really like a professional opinion on this. As I said, we are currently using the first option on 300,000 test records and it is really performant. Some search databases have only 12 records in them, and it takes less than 1 ms to get those 12 records back, so I'm guessing Solr is not scanning the full 300,000 records. I am kind of afraid that with the second option Solr will have to search more records/index entries to get the same result.

Well, I hope you understand my question, and thanks in advance!

- Maarten

PS: thank you to everybody on this list for the help, and thank you to all of the Solr/Lucene developers - great stuff!!
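For reference, a rough sketch of how the two layouts might look in schema.xml and on the query string (the field names and types below are only guesses, not taken from the real schema):

  <!-- option 1: one boolean field per search database -->
  <field name="db1" type="boolean" indexed="true" stored="false"/>
  <field name="db2" type="boolean" indexed="true" stored="false"/>
  <!-- query: q=searchterms&fq=db1:true -->

  <!-- option 2: a single multi-valued field holding the database names -->
  <field name="db" type="string" indexed="true" stored="false" multiValued="true"/>
  <!-- query: q=searchterms&fq=db:DB1 -->

Either way, if the database restriction is sent as a filter query (fq) rather than folded into q, Solr can cache that filter separately, so after the first query the restriction should cost roughly the same under both layouts.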
Auto index update
Hello, Can anybody suggest the best method to implement automatic index updates in Solr from a MySQL database? thanks and regards aditya
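One low-tech sketch of how this is often done: a cron job that pulls recently changed rows out of MySQL, wraps them in Solr's XML update format, and posts them with curl. Everything below (table, columns, host, the 5-minute window) is made up and untested, and real data would need proper XML escaping:

  # select rows changed in the last 5 minutes, tab-separated
  mysql -B -e "SELECT id, title FROM items WHERE last_modified > NOW() - INTERVAL 5 MINUTE" mydb \
    | awk -F'\t' 'NR > 1 { printf "<doc><field name=\"id\">%s</field><field name=\"title\">%s</field></doc>\n", $1, $2 }' \
    > docs.xml
  # wrap in <add>, post, then commit
  (echo "<add>"; cat docs.xml; echo "</add>") > add.xml
  curl http://localhost:8983/solr/update --data-binary @add.xml -H 'Content-type:text/xml; charset=utf-8'
  curl http://localhost:8983/solr/update --data-binary '<commit/>' -H 'Content-type:text/xml; charset=utf-8'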
Fw: Download solr-tools rpm
Hi, I need to configure master/slave servers, so I checked the wiki help documents. I found that I need to install the solr-tools rpm, but I was not able to download the files. Please can someone help me with the solr-tools rpm? Suresh Kannan
failing post-optimize command execution
Hi, I've configured my solrconfig.xml to execute a snapshoot after an optimize is made, but I keep getting the following exception in the Tomcat logs:

  SEVERE: java.io.IOException: Cannot run program "snapshooter" (in directory "/home/solr/solr/bin"): java.io.IOException: error=2, No such file or directory

I'm certain the path and filename are correct... has anybody else run into this? Cheers, galo
Re: failing post-optimize command execution
What about access rights on the snapshooter file and on the directories in the path /home/solr/solr/bin? Maybe this is the root of the problem?

On 3/28/07, galo <[EMAIL PROTECTED]> wrote:
> Hi, I've configured my solrconfig.xml to execute a snapshoot after an optimize is made, but I keep getting the following exception in the Tomcat logs:
> SEVERE: java.io.IOException: Cannot run program "snapshooter" (in directory "/home/solr/solr/bin"): java.io.IOException: error=2, No such file or directory
> I'm certain the path and filename are correct... has anybody else run into this? Cheers, galo

--
Best regards, Traut
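For what it's worth, error=2 from Runtime.exec usually means the executable itself could not be found: the dir value is only the working directory, it is not searched for the program, so a bare "snapshooter" has to be on the PATH of the servlet container. A guess at a listener config that sidesteps that, using an absolute path for exe (and snapshooter must of course be executable by the Tomcat user, as suggested above):

  <listener event="postOptimize" class="solr.RunExecutableListener">
    <str name="exe">/home/solr/solr/bin/snapshooter</str>
    <str name="dir">/home/solr/solr/bin</str>
    <bool name="wait">true</bool>
  </listener>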
Solr finding doc by one field but not by another
Hi everyone. Can anyone explain how this might happen? I query by the "ID" field and get the following result:

= 0 16 ID:ee483237-399c-4b17-ad73-000cc54fd3e1 COSMEO US ee483237-399c-4b17-ad73-000cc54fd3e1 en-US Social Studies American History Historical Periods Expansion and Reform 1801-1861 Territorial Expansion EncyclopediaArticles 2005 Pony Express was a mail service operating between Saint Joseph, Mo., and Sacramento, Calif., inaugurated on April 3, 1860, under the direction of the Central Overland California and Pike's Peak Express Co. True Pony Express pony express =

Then I query by the "title" field from the result above (so I know the document is in the index and has been committed), and I get zero results:

= 0 0 title:"Pony Express" =

"ID" is not the only field that I can find the doc by; searching for "Type:encyclopediaarticles" finds it too. Also, "title" is not the only field that misses the doc; a search by "vocabulary" misses it too. I haven't tried all the fields yet to see exhaustively which ones find it and which ones don't. I can do that if it would help.

For what it's worth, I started with an existing Lucene index and modified Solr's schema.xml so that I could just use the Lucene index in Solr. That Lucene index had about 230K docs. I then used your "post.jar" to post another 10K docs to the index after starting up the server. Those 10K docs only had 7 of the 30 fields that the original 230K docs had. Could that be the problem? I am noticing that the docs that I'm having problems with are from the original 230K-doc index, not from my subsequent 10K-doc post. The 10K docs seem to be findable by any of their 7 fields.

Here are my config files:
http://www.nabble.com/file/7488/schema.xml schema.xml
http://www.nabble.com/file/7489/solrconfig.xml solrconfig.xml

Any help is greatly appreciated. Thanks, -Dan
--
View this message in context: http://www.nabble.com/Solr-finding-doc-by-one-field-but-not-by-another-tf3481287.html#a9716918 Sent from the Solr - User mailing list archive at Nabble.com.
Re: Solr finding doc by one field but not by another
On 3/28/07, Theodan <[EMAIL PROTECTED]> wrote: For what it's worth, I started with an existing Lucene index and modified Solr's schema.xml so that I could just use the Lucene index in Solr. That Lucene index had about 230K docs. I then used your "post.jar" to post another 10K docs to the index after starting up the server. Those 10K docs only had 7 of the 30 fields that the original 230K docs had. Could that be the problem? I am noticing that the docs that I'm having problems with are from the original 230K-doc index, not from my subsequent 10K-doc post. The 10K docs seem to be findable by any of their 7 fields. This is almost certainly due to a mismatch between the index- and query-time analysis of the fields. For instance, your schema defines the title field to be "string" (unanalyzed), but it is likely that some tokenization (perhaps via StandardAnalyzer) occurred in the original index. -Mike
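To make the mismatch concrete (the type definitions below are hypothetical, not taken from Dan's schema.xml): if the old index was built with StandardAnalyzer, the title was indexed as the lower-cased tokens "pony" and "express". A schema entry like

  <field name="title" type="string" indexed="true" stored="true"/>

leaves the query unanalyzed, so title:"Pony Express" looks for the single term "Pony Express" and misses. Declaring the field with an analyzer that matches how the index was built, for example

  <fieldType name="text_std" class="solr.TextField">
    <analyzer class="org.apache.lucene.analysis.standard.StandardAnalyzer"/>
  </fieldType>
  <field name="title" type="text_std" indexed="true" stored="true"/>

should make index- and query-time analysis line up again. The 10K docs posted through Solr were analyzed per the current schema, which would explain why they behave consistently.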
Re: Document boost not as expected...
Chris, Earlier I was trying to modify the Similarity computation to make it field dependent (we are trying to change tf based on the field). Now I have reverted the custom computation so that the default Similarity is used. For testing, I boosted a single field (show_all_flag:Y) in one doc. This is what I see in the explain output:

  2.5 = (MATCH) sum of:
    2.5 = (MATCH) fieldWeight(show_all_flag:Y in 17), product of:
      1.0 = tf(termFreq(show_all_flag:Y)=1)
      1.0 = idf(docFreq=36239)
      2.5 = fieldNorm(field=show_all_flag, doc=17)

Again, I fail to understand where it is doing a multiplication by 1.25 (score (2.5) = field_boost (2.0) * 1.25 ??). Thanks.

Chris Hostetter wrote:
>
> Ditto everything Mike said, but i'm also curious what Similarity changes
> you made ... without knowing what that code looks like, all bets are off
> in terms of anyone being able to help you understand the scores you are
> seeing.
>
> : I am not quite sure how the score changed from 1.33 to 1.25. I am not quite
> : sure how this might have happened - I have modified the custom similarity
> : but I don't quite have an explanation of how the score changed.
>
> -Hoss
>
--
View this message in context: http://www.nabble.com/Document-boost-not-as-expected...-tf3476653.html#a9718403 Sent from the Solr - User mailing list archive at Nabble.com.
Re: Document boost not as expected...
On 3/28/07, escher2k <[EMAIL PROTECTED]> wrote: Again, I fail to understand where it is doing a multiplication by 1.25 (score (2.5) = field_boost (2.0) * 1.25 ??). As I said above, lengthNorm is also multiplied in. This will depend on your custom Similarity and on what value(s) you have in the field. -Mike
Controlling read/write access for replicated indexes
I'm curious what mechanisms everyone is using to control read/write access for distributed replicated indexes. We're moving to a replication environment very soon, and our client applications (quite a few) all have configuration pointers to the URLs for solr instances. As a precaution, I don't want errant configuration values to inadvertently send write requests to read servers, as an example. As an aside, we're running solr under tomcat 5.5.x which has its own control aspects as well. Any best practices, i.e. something that's not a maintenance headache later, from those who have done this would be greatly appreciated. thanks, j.r.
Re: Document boost not as expected...
Mike, I am not doing anything custom for this test. I am assuming that the default Similarity is used. Surprisingly, if I remove the document-level boost (set it to 1.0) and just have a field-level boost, the result seems to be correct. Mike Klaas wrote: > > On 3/28/07, escher2k <[EMAIL PROTECTED]> wrote: > >> Again, I fail to understand where it is doing a multiplication by 1.25 >> (score (2.5) = field_boost (2.0) * 1.25 ??). > > As I said above, lengthNorm is also multiplied in. This will depend > on your custom Similarity and on what value(s) you have in the field. > > -Mike > > -- View this message in context: http://www.nabble.com/Document-boost-not-as-expected...-tf3476653.html#a9722264 Sent from the Solr - User mailing list archive at Nabble.com.
Best approach for indexing and querying against a multivalue name field like directors or actors?
I'm rather new to Solr and somewhat rusty on what little I learned about Lucene a few years back. I've got some documents I want to index that have multiple name fields, such as directors or actors. I want to index them such that querying for "Jane Doe" would score higher for "Jane M. Doe" than for "John Doe", but I need to make sure that "Jane Doe" wouldn't match a document with two directors, "Jane Smith" and "John Doe", at all. If anyone has done something like this and could suggest some of the Solr filters that might be useful to me, I'd greatly appreciate it. Daniel
Re: Best approach for indexing and querying against a multivalue name field like directors or actors?
I'm sorry, I said something confusing there. Let me try that last case again. If you have three documents with a multivalue field named director (represented here by a ";" separator):

1. "Jane M. Doe"
2. "Jane Smith"; "John Doe"
3. "John Doe"

and the user searched for director:"Jane Doe", I would ideally like 1 to have the highest score and 2 and 3 to have nearly equal scores. The experiments I've done so far have given 2 a score higher than 3, because the terms Jane and Doe were found in document 2 even though they were in separate instances of the multivalue field. I hope this makes understanding my question better rather than worse. :)

Thanks, Daniel

On 3/28/07, Daniel Einspanjer <[EMAIL PROTECTED]> wrote: but I need to make sure that "Jane Doe" wouldn't match a document with two directors, "Jane Smith" and "John Doe" at all.
Re: Document boost not as expected...
On 3/28/07, escher2k <[EMAIL PROTECTED]> wrote: Mike, I am not doing anything custom for this test. I am assuming that the Default Similarity is used. Surprisingly, if I remove the document level boost (set to 1.0) and just have a field level boost, the result seems to be correct. Another detail that I forgot to mention is that fieldNorms are encoded into one-byte floats, so you can experience severe rounding errors. The possible values are: 0 0.0 1 5.820766E-10 2 6.9849193E-10 3 8.1490725E-10 4 9.313226E-10 5 1.1641532E-9 6 1.3969839E-9 7 1.6298145E-9 8 1.8626451E-9 9 2.3283064E-9 10 2.7939677E-9 11 3.259629E-9 12 3.7252903E-9 13 4.656613E-9 14 5.5879354E-9 15 6.519258E-9 16 7.4505806E-9 17 9.313226E-9 18 1.1175871E-8 19 1.3038516E-8 20 1.4901161E-8 21 1.8626451E-8 22 2.2351742E-8 23 2.6077032E-8 24 2.9802322E-8 25 3.7252903E-8 26 4.4703484E-8 27 5.2154064E-8 28 5.9604645E-8 29 7.4505806E-8 30 8.940697E-8 31 1.0430813E-7 32 1.1920929E-7 33 1.4901161E-7 34 1.7881393E-7 35 2.0861626E-7 36 2.3841858E-7 37 2.9802322E-7 38 3.5762787E-7 39 4.172325E-7 40 4.7683716E-7 41 5.9604645E-7 42 7.1525574E-7 43 8.34465E-7 44 9.536743E-7 45 1.1920929E-6 46 1.4305115E-6 47 1.66893E-6 48 1.9073486E-6 49 2.3841858E-6 50 2.861023E-6 51 3.33786E-6 52 3.8146973E-6 53 4.7683716E-6 54 5.722046E-6 55 6.67572E-6 56 7.6293945E-6 57 9.536743E-6 58 1.1444092E-5 59 1.335144E-5 60 1.5258789E-5 61 1.9073486E-5 62 2.2888184E-5 63 2.670288E-5 64 3.0517578E-5 65 3.8146973E-5 66 4.5776367E-5 67 5.340576E-5 68 6.1035156E-5 69 7.6293945E-5 70 9.1552734E-5 71 1.0681152E-4 72 1.2207031E-4 73 1.5258789E-4 74 1.8310547E-4 75 2.1362305E-4 76 2.4414062E-4 77 3.0517578E-4 78 3.6621094E-4 79 4.272461E-4 80 4.8828125E-4 81 6.1035156E-4 82 7.324219E-4 83 8.544922E-4 84 9.765625E-4 85 0.0012207031 86 0.0014648438 87 0.0017089844 88 0.001953125 89 0.0024414062 90 0.0029296875 91 0.0034179688 92 0.00390625 93 0.0048828125 94 0.005859375 95 0.0068359375 96 0.0078125 97 0.009765625 98 0.01171875 99 0.013671875 100 0.015625 101 0.01953125 102 0.0234375 103 0.02734375 104 0.03125 105 0.0390625 106 0.046875 107 0.0546875 108 0.0625 109 0.078125 110 0.09375 111 0.109375 112 0.125 113 0.15625 114 0.1875 115 0.21875 116 0.25 117 0.3125 118 0.375 119 0.4375 120 0.5 121 0.625 122 0.75 123 0.875 124 1.0 125 1.25 126 1.5 127 1.75 128 2.0 129 2.5 130 3.0 131 3.5 132 4.0 133 5.0 134 6.0 135 7.0 136 8.0 137 10.0 138 12.0 139 14.0 140 16.0 141 20.0 142 24.0 143 28.0 144 32.0 145 40.0 146 48.0 147 56.0 148 64.0 149 80.0 150 96.0 151 112.0 152 128.0 153 160.0 154 192.0 155 224.0 156 256.0 157 320.0 158 384.0 159 448.0 160 512.0 161 640.0 162 768.0 163 896.0 164 1024.0 165 1280.0 166 1536.0 167 1792.0 168 2048.0 169 2560.0 170 3072.0 171 3584.0 172 4096.0 173 5120.0 174 6144.0 175 7168.0 176 8192.0 177 10240.0 178 12288.0 179 14336.0 180 16384.0 181 20480.0 182 24576.0 183 28672.0 184 32768.0 185 40960.0 186 49152.0 187 57344.0 188 65536.0 189 81920.0 190 98304.0 191 114688.0 192 131072.0 193 163840.0 194 196608.0 195 229376.0 196 262144.0 197 327680.0 198 393216.0 199 458752.0 200 524288.0 201 655360.0 202 786432.0 203 917504.0 204 1048576.0 205 1310720.0 206 1572864.0 207 1835008.0 208 2097152.0 209 2621440.0 210 3145728.0 211 3670016.0 212 4194304.0 213 5242880.0 214 6291456.0 215 7340032.0 216 8388608.0 217 1.048576E7 218 1.2582912E7 219 1.4680064E7 220 1.6777216E7 221 2.097152E7 222 2.5165824E7 223 2.9360128E7 224 3.3554432E7 225 4.194304E7 226 5.0331648E7 227 5.8720256E7 228 6.7108864E7 229 8.388608E7 230 1.00663296E8 231 1.17440512E8 232 1.34217728E8 
233 1.6777216E8 234 2.01326592E8 235 2.34881024E8 236 2.68435456E8 237 3.3554432E8 238 4.02653184E8 239 4.69762048E8 240 5.3687091E8 241 6.7108864E8 242 8.0530637E8 243 9.395241E8 244 1.07374182E9 245 1.34217728E9 246 1.61061274E9 247 1.87904819E9 248 2.14748365E9 249 2.68435456E9 250 3.22122547E9 251 3.75809638E9 25
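To tie this back to the numbers in the explain output (assuming the document-level boost in the test was 1.25, which the original post doesn't state, and that the field holds a single term so lengthNorm = 1.0):

  fieldNorm = field boost * doc boost * lengthNorm
            = 2.0 * 1.25 * 1.0
            = 2.5    (exactly representable, entry 129 above)

A product like 1.33, on the other hand, has no exact one-byte representation; as far as I recall the encoding truncates rather than rounds, so it would come back as 1.25 (entry 125), which would also explain the earlier 1.33 -> 1.25 change.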
Re: Fw: Download solr-tools rpm
: I need to configure master/slave servers, so I checked the wiki help : documents. I found that I need to install the solr-tools rpm, but I was : not able to download the files. Please can someone help me with the solr-tools rpm. Any references to a "solr-tools rpm" on the wiki are outdated, leftover from when I ported those wiki pages from CNET ... Apache Solr doesn't distribute anything as an RPM; you should be able to find all of those scripts in the Solr release tgz bundles. -Hoss
Re: Best approach for indexing and querying against a multivalue name field like directors or actors?
You'll want to look into the positionIncrementGap attribute that can be specified when defining an Analyzer for your field type ... it defines the "logical" gap between tokens in a multi-valued field. So if you use a whitespace tokenizer and add the names "Jane Smith" and "John Doe", you'll get the tokens "Jane", "Smith", ... "John", "Doe" with a big gap between Smith and John .. so now you can do phrase queries, and as long as the slop on your phrase queries is less than the gap you used, you don't have to worry about false matches on "Jane Doe".

: Date: Wed, 28 Mar 2007 17:28:47 -0400
: From: Daniel Einspanjer <[EMAIL PROTECTED]>
: Reply-To: solr-user@lucene.apache.org
: To: solr-user@lucene.apache.org
: Subject: Best approach for indexing and querying against a multivalue
: name field like directors or actors?
:
: I'm rather new to Solr and somewhat rusty on what little I learned about
: Lucene a few years back.
:
: I've got some documents I want to index that have multiple name fields
: such as directors or actors. I want to index them such that
: querying for "Jane Doe" would score higher for "Jane M. Doe"
: than for "John Doe", but I need to make sure that "Jane Doe" wouldn't
: match a document with two directors, "Jane Smith" and "John Doe", at
: all.
:
: If anyone has done something like this and could suggest some of the
: Solr filters that might be useful to me, I'd greatly appreciate it.
:
: Daniel

-Hoss
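Following up on that, a rough sketch of what it might look like in schema.xml (the type and field names are just examples), and the kind of sloppy phrase query Daniel could use:

  <fieldType name="text_name" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>
  <field name="director" type="text_name" indexed="true" stored="true" multiValued="true"/>

  query: director:"Jane Doe"~5

With a gap of 100 between values, a slop of 5 still lets "Jane Doe" match "Jane M. Doe" (those terms are only two positions apart), but it cannot jump from "Jane" in one value to "Doe" in the next, since those end up more than 100 positions apart.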
Re: maximum index size
Hi Mike,

I'm curious about what you said there: "People have constructed (lucene) indices with over a billion documents." Are you referring to somebody specific? I've never heard of anyone creating a single Lucene index that large, but I'd love to know who did that.

Thanks,
Otis
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Simpy -- http://www.simpy.com/ - Tag - Search - Share

- Original Message
From: Mike Klaas <[EMAIL PROTECTED]>
To: solr-user@lucene.apache.org
Sent: Tuesday, March 27, 2007 6:20:40 PM
Subject: Re: maximum index size

On 3/27/07, Kevin Osborn <[EMAIL PROTECTED]> wrote:
> I know there are a bunch of variables here (RAM, number of fields, hits,
> etc.), but I am trying to get a sense of how big of an index in terms of
> number of documents Solr can reasonably handle. I have heard indexes of 3-4
> million documents running fine. But, I have no idea what a reasonable upper
> limit might be.

People have constructed (lucene) indices with over a billion documents. But if "reasonable" means something like "<1s query time for a medium-complexity query on non-astronomical hardware", I wouldn't go much higher than the figure you quote.

> I have a large number of documents and about 200-300 customers would have
> access to varying subsets of those documents. So, one possible strategy is to
> have everything in a large index, but duplicate the documents for each
> customer that has access to that document. But, that would really make the
> total number of documents huge. So, I am trying to get a sense of how big is
> too big. Each document will probably have about 30 fields. Most of them will
> be strings, but there will be some text, ints, and floats.

If you are going to store a document for each customer then some field must indicate to which customer the document instance belongs. In that case, why not index a single copy of each document, with a field containing a list of customers having access?

-Mike
Snippets of indexed text
Hello everybody! I'm wondering if there is a way to get some relevant snippets (the searched terms in context) of indexed text back with a Solr response to a query, instead of just the entire indexed field? (More widely, what are the possibilities for letting Solr format the answer - highlight terms, etc.?) Thanks, kind regards, P-Y Landron
Re: Snippets of indexed text
It is possible. You need to pass highlighting parameters. Look here: http://wiki.apache.org/solr/HighlightingParameters Hope this helps.

On 29/03/07, Pierre-Yves LANDRON <[EMAIL PROTECTED]> wrote: Hello everybody! I'm wondering if there is a way to get some relevant snippets (the searched terms in context) of indexed text back with a Solr response to a query, instead of just the entire indexed field? (More widely, what are the possibilities for letting Solr format the answer - highlight terms, etc.?) Thanks, kind regards, P-Y Landron
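For example (untested, and the field name is just a placeholder), a request along these lines should bring back highlighted snippets alongside the normal results:

  http://localhost:8983/solr/select?q=content:express&hl=true&hl.fl=content&hl.snippets=3&hl.fragsize=100

The field listed in hl.fl has to be stored; hl.snippets and hl.fragsize control how many fragments come back and roughly how long they are, and matched terms are wrapped in <em> tags by default.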
index problem
I use FreeBSD 6, Tomcat 6 (unpacked, not installed as a service) + JDK 1.5.0_07 + PHP 5 + MSSQL. I have debugged my program and the data is correct before I send the update to build the index, and the indexing process completes with no errors. But the indexed content is not what I expect - it has been changed. In Tomcat 6's server.xml I added URIEncoding="UTF-8". The data is sent to Solr for indexing by curl (as UTF-8). Does anyone know how to fix this?

--
regards
jl
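One thing worth ruling out (a guess, since the exact curl command isn't shown): URIEncoding in server.xml only affects the request URL, not the POST body, so if curl posts the update without an explicit charset the body may be read as Latin-1 somewhere along the way and the stored text ends up mangled. Declaring UTF-8 on the update request explicitly would look something like:

  curl http://localhost:8983/solr/update --data-binary @data.xml -H 'Content-type:text/xml; charset=utf-8'
  curl http://localhost:8983/solr/update --data-binary '<commit/>' -H 'Content-type:text/xml; charset=utf-8'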