Re: joins and filter queries affecting scoring
Have you tried using the join in the fq instead of the q? Like this (assuming user_id_i is a field in the post document type and self_id_i a field in the user document type): q=posts_text:"hello"&fq={!join from=self_id_i to=user_id_i}is_active_boolean:true In this example the fq produces a docset that contains all user documents that are active. This docset is used as a filter during the execution of the main query (q param), so it only returns posts that contain the text hello for active users. Martijn On 28 October 2011 01:57, Jason Toy wrote: > Does anyone have any idea on this issue? > > On Tue, Oct 25, 2011 at 11:40 AM, Jason Toy wrote: > >> Hi Yonik, >> >> Without a Join I would normally query user docs with: >> q=data_text:"test"&fq=is_active_boolean:true >> >> With joining users with posts, I get no no results: >> q={!join from=self_id_i >> to=user_id_i}data_text:"test"&fq=is_active_boolean:true&fq=posts_text:"hello" >> >> >> >> I am able to use this query, but it gives me the results in an order that I >> dont want(nor do I understand its order): >> q={!join from=self_id_i to=user_id_i}data_text:"test" AND >> is_active_boolean:true&fq=posts_text:"hello" >> >> I want the order to be the same as I would get from my original >> "q=data_text:"test"&fq=is_active_boolean:true", but with the ability to join >> with the Posts docs. >> >> >> >> >> >> On Tue, Oct 25, 2011 at 11:30 AM, Yonik Seeley > > wrote: >> >>> Can you give an example of the request (URL) you are sending to Solr? >>> >>> -Yonik >>> http://www.lucidimagination.com >>> >>> >>> >>> On Mon, Oct 24, 2011 at 3:31 PM, Jason Toy wrote: >>> > I have 2 types of docs, users and posts. >>> > I want to view all the docs that belong to certain users by joining >>> posts >>> > and users together. I have to filter the users with a filter query of >>> > "is_active_boolean:true" so that the score is not effected,but since I >>> do a >>> > join, I have to move the filter query to the query parameter so that I >>> can >>> > get the filter applied. The problem is that since the is_active_boolean >>> is >>> > moved to the query, the score is affected which returns back an order >>> that I >>> > don't want. >>> > If I leave the is_active_boolean:true in the fq paramater, I get no >>> > results back. >>> > >>> > My question is how can I apply a filter query to users so that the score >>> is >>> > not effected? >>> > >>> >> >> >> >> -- >> - sent from my mobile >> >> >> > > > -- > - sent from my mobile > -- Met vriendelijke groet, Martijn van Groningen
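A minimal sketch of how this suggestion could be applied to the original query from the quoted messages (field names are taken from Jason's earlier mails; this is an untested assumption, not a confirmed answer from the thread): keep the scored query against user documents in q, keep the is_active_boolean filter in fq, and express the join against posts as a second fq so that neither filter contributes to the score:

    q=data_text:"test"
    &fq=is_active_boolean:true
    &fq={!join from=user_id_i to=self_id_i}posts_text:"hello"

Here the join fq runs posts_text:"hello" against post documents, collects their user_id_i values, and keeps only user documents whose self_id_i matches. The result set is restricted to active users who have a matching post, while the ranking still comes from data_text:"test" alone.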
Always return total number of documents
Currently I'm making 2 calls to Solr to be able to state "matched 20 out of 200 documents". Is there no way to return the total number of docs as part of a search? -- IntelCompute Web Design & Local Online Marketing http://www.intelcompute.com
Re: Always return total number of documents
On 28.10.2011 11:16, Robert Brown wrote: > Is there no way to return the total number of docs as part of a search? No, there isn't. Usually this information is of absolutely no value to the end user. A workaround would be to add some field to the schema that has the same value for every document, and use this for faceting. Greetings, Kuli
Re: Always return total number of documents
Cheers Kuli, This is actually of huge importance to our customers, to see how many documents we store. The faceting option sounds a bit messy; maybe we'll have to stick with 2 queries. --- IntelCompute Web Design & Local Online Marketing http://www.intelcompute.com On Fri, 28 Oct 2011 11:43:11 +0200, Michael Kuhlmann wrote: > On 28.10.2011 11:16, Robert Brown wrote: >> Is there no way to return the total number of docs as part of a search? > > No, there isn't. Usually this information is of absolutely no value to the > end user. > > A workaround would be to add some field to the schema that has the same > value for every document, and use this for faceting. > > Greetings, > Kuli
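If the two-call approach has to stay, the second call can at least be kept very cheap: a match-all query with rows=0 only computes numFound and fetches no documents. A minimal SolrJ sketch (class name, URL and the data_text field are made up for illustration; the thread itself doesn't show any code):

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

    public class DocCounts {
        public static void main(String[] args) throws Exception {
            SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

            // First call: the actual search
            SolrQuery search = new SolrQuery("data_text:test");
            long matched = server.query(search).getResults().getNumFound();

            // Second call: match-all query, rows=0, so only numFound is computed
            SolrQuery countAll = new SolrQuery("*:*");
            countAll.setRows(0);
            long total = server.query(countAll).getResults().getNumFound();

            System.out.println("matched " + matched + " out of " + total + " documents");
        }
    }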
Re: Too many values for UnInvertedField faceting on field autocompleteField
On Wednesday, 26.10.2011, at 08:02 -0400, Yonik Seeley wrote: > You can also try adding facet.method=enum directly to your request Added query.set("facet.method", "enum"); to my Solr query at code level and now it works. Don't know why the handler stuff gets ignored or overridden, but it's OK for my use case to specify it at query level. thx Torsten
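One likely reason the request-handler setting appeared to be ignored (an assumption, since the solrconfig.xml in question isn't shown): parameters placed in a handler's "defaults" list are overridden by parameters sent with the request, such as the one set in code above, while parameters in the "invariants" list always win. A sketch:

    <requestHandler name="/select" class="solr.SearchHandler">
      <!-- values under "defaults" can be overridden by request parameters;
           values under "invariants" cannot, so facet.method=enum would always apply here -->
      <lst name="invariants">
        <str name="facet.method">enum</str>
      </lst>
    </requestHandler>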
Solr 3.4 group.truncate does not work with facet queries
Hi, I'm using Grouping with group.truncate=true, The following simple facet query: facet.query=Monitor_id:[38 TO 40] Doesn't give the same number as the nGroups result (with grouping.ngroups=true) for the equivalent filter query: fq=Monitor_id:[38 TO 40] I thought they should be the same - from the Wiki page: 'group.truncate: If true, facet counts are based on the most relevant document of each group matching the query.' What am I doing wrong? If I turn off group.truncate then the counts are the same, as I'd expect - but unfortunately I'm only interested in the grouped results. - I have also asked this question on StackOverflow, here: http://stackoverflow.com/questions/7905756/solr-3-4-group-truncate-does-not-work-with-facet-queries Thanks! -- Ian i...@isfluent.com +44 (0)1223 257903
Re: changing omitNorms on an already built index
On Fri, Oct 28, 2011 at 12:20 AM, Robert Muir wrote: > On Thu, Oct 27, 2011 at 6:00 PM, Simon Willnauer > wrote: >> we are not actively removing norms. if you set omitNorms=true and >> index documents they won't have norms for this field. Yet, other >> segment still have norms until they get merged with a segment that has >> no norms for that field ie. omits norms. omitNorms is anti-viral so >> once you set it to true it will be true for other segment eventually. >> If you optimize you index you should see that norms go away. >> > > This is only true in trunk (4.x!) > https://issues.apache.org/jira/browse/LUCENE-2846 ah right, I thought this was ported - nevermind! thanks robert simon > > -- > lucidimagination.com >
Re: Solr 3.4 group.truncate does not work with facet queries
Hi Ian, I think this is a bug. After looking into the code the facet.query feature doesn't take into account the group.truncate option. This needs to be fixed. You can open a new issue in Jira if you want to. Martijn On 28 October 2011 12:09, Ian Grainger wrote: > Hi, I'm using Grouping with group.truncate=true, The following simple facet > query: > > facet.query=Monitor_id:[38 TO 40] > > Doesn't give the same number as the nGroups result (with > grouping.ngroups=true) for the equivalent filter query: > > fq=Monitor_id:[38 TO 40] > > I thought they should be the same - from the Wiki page: 'group.truncate: If > true, facet counts are based on the most relevant document of each group > matching the query.' > > What am I doing wrong? > > If I turn off group.truncate then the counts are the same, as I'd expect - > but unfortunately I'm only interested in the grouped results. > > - I have also asked this question on StackOverflow, here: > http://stackoverflow.com/questions/7905756/solr-3-4-group-truncate-does-not-work-with-facet-queries > > Thanks! > > -- > Ian > > i...@isfluent.com > +44 (0)1223 257903 > -- Met vriendelijke groet, Martijn van Groningen
Re: Solr 3.4 group.truncate does not work with facet queries
Thanks, Marijn. I have logged the bug here: https://issues.apache.org/jira/browse/SOLR-2863 Is there any chance of a workaround for this issue before the bug is fixed? If you want to answer the question on StackOverflow: http://stackoverflow.com/questions/7905756/solr-3-4-group-truncate-does-not-work-with-facet-queries I'll accept your answer. On Fri, Oct 28, 2011 at 12:14 PM, Martijn v Groningen < martijn.v.gronin...@gmail.com> wrote: > Hi Ian, > > I think this is a bug. After looking into the code the facet.query > feature doesn't take into account the group.truncate option. > This needs to be fixed. You can open a new issue in Jira if you want to. > > Martijn > > On 28 October 2011 12:09, Ian Grainger wrote: > > Hi, I'm using Grouping with group.truncate=true, The following simple > facet > > query: > > > > facet.query=Monitor_id:[38 TO 40] > > > > Doesn't give the same number as the nGroups result (with > > grouping.ngroups=true) for the equivalent filter query: > > > > fq=Monitor_id:[38 TO 40] > > > > I thought they should be the same - from the Wiki page: 'group.truncate: > If > > true, facet counts are based on the most relevant document of each group > > matching the query.' > > > > What am I doing wrong? > > > > If I turn off group.truncate then the counts are the same, as I'd expect > - > > but unfortunately I'm only interested in the grouped results. > > > > - I have also asked this question on StackOverflow, here: > > > http://stackoverflow.com/questions/7905756/solr-3-4-group-truncate-does-not-work-with-facet-queries > > > > Thanks! > > > > -- > > Ian > > > > i...@isfluent.com > > +44 (0)1223 257903 > > > > > > -- > Met vriendelijke groet, > > Martijn van Groningen > -- Ian i...@isfluent.com +44 (0)1223 257903
Solr Profiling
Hi, My Solr becomes very slow or hangs up at times. We have done almost everything possible, like giving 16GB of memory to the JVM and sharding, but these help only for a while. I want to profile the server and see what's going wrong. How can I profile Solr remotely? Regards, Rohit
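One common way to profile a remote Solr instance (not the only one, and the exact flags depend on the JVM and the profiler): expose JMX on the Solr JVM and attach a tool such as VisualVM or JConsole to it. A sketch of the startup flags, assuming the stock Jetty start.jar and an arbitrary port:

    java -Dcom.sun.management.jmxremote \
         -Dcom.sun.management.jmxremote.port=9010 \
         -Dcom.sun.management.jmxremote.ssl=false \
         -Dcom.sun.management.jmxremote.authenticate=false \
         -jar start.jar

With authentication and SSL disabled this should only be done on a trusted network. Solr can additionally expose its own MBeans (cache statistics and so on) if <jmx/> is enabled in solrconfig.xml.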
Re: solr break up word
Hi Erick, I'll try without the type="index" on analyzer tag and then I'll re-index some files. Thanks for you answer. On Thu, Oct 27, 2011 at 6:54 PM, Erick Erickson wrote: > Hmmm, I'm not sure what happens when you specify > (without type="index" and > . I have no clue which one > is used. > > Look at the admin/analysis page to understand how things are > broken up. > > Did you re-index after you added the ngram filter? > > You'll get better help if you include example queries with > &debugQuery=on appended, it'll give us a lot more to > work with. > > Best > Erick > > On Wed, Oct 26, 2011 at 4:14 PM, Boris Quiroz wrote: >> Hi, >> >> I've solr running on a CentOS server working OK, but sometimes my >> application needs to index some parts of a word. For example, if I search >> 'dislike' word fine but if I search 'disl' it returns zero. Also, if I >> search 'disl*' returns some values (the same if I search for 'dislike') but >> if I search 'dislike*' it returns zero too. >> >> So, I've two questions: >> >> 1. How exactly the asterisk works as a wildcard? >> >> 2. What can I do to index properly parts of a word? I added this lines to my >> schema.xml: >> >> >> >> >> >> >> > maxGramSize="15"/> >> >> >> >> >> >> >> >> >> >> But I can't get it to work. Is OK what I did or I'm wrong? >> >> Thanks. >> >> -- >> Boris Quiroz >> boris.qui...@menco.it >> >> > -- Boris Quiroz boris.qui...@menco.it
Re: Collection Distribution vs Replication in Solr
So I have to ask my question again. Is there any reason not to use Replication in Solr and use Collection Distribution? Thanks On Thu, Oct 27, 2011 at 5:33 PM, Alireza Salimi wrote: > I can't see those benchmarks, can you? > > On Thu, Oct 27, 2011 at 5:20 PM, Marc Sturlese wrote: > >> Replication is easier to manage and a bit faster. See the performance >> numbers: http://wiki.apache.org/solr/SolrReplication >> >> -- >> View this message in context: >> http://lucene.472066.n3.nabble.com/Collection-Distribution-vs-Replication-in-Solr-tp3458724p3459178.html >> Sent from the Solr - User mailing list archive at Nabble.com. >> > > > > -- > Alireza Salimi > Java EE Developer > > > -- Alireza Salimi Java EE Developer
Re: Faceting on multiple fields, with multiple where clauses
Thank you Erik, Now I understand the difference between Q and QF. Unfortunately, there is 1 unsolved problem left (didn't find the answer yesterday evening). I added grouping on this query, because I want to show a group of trips with the same code only once. (A trip has multiple departure days, and I just want to show 1 trip, while in the detail screen I'll show all the available trips (departure dates).) When I don't filter by country, I receive all countries with their correct count. When I do filter by country, the count of my countries isn't grouped anymore. When I get the number of trips/month, I just get numbers for the next 2 months and no numbers for the other months (the trip should appear here in each month, because they have departures in each). Can you help me again? I'll appreciate it very much :) http://localhost:8080/solr/select?facet=true&facet.date={!ex=SD}StartDate&f.StartDate.facet.date.start=2011-10-01T00:00:00Z&f.StartDate.facet.date.end=2012-09-30T00:00:00Z&f.StartDate.facet.date.gap=%2B1MONTH&facet.field={!ex=CC}CountryCode&rows=0&version=2.2&q=*:*&group=true&group.field=RoundtripgroupCode&group.truncate=true These parts of the query are added when a selection is made: &fq={!tag=CC}CountryCode:CR &fq={!tag=SD}StartDate:[2011-10-01T00:00:00Z TO 2011-10-31T00:00:00Z] -- View this message in context: http://lucene.472066.n3.nabble.com/Faceting-on-multiple-fields-with-multiple-where-clauses-tp3457432p3460934.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: bbox issue
Oops, didn't mean for this conversation to leave the mailing lists. OK, so your lat and lon types were being stored as text but not indexed (hence no search matches). A dynamic field of "*" does tend to hide bugs/problems ;-) > So should I have another for _latLon? Would it look like: > Yep. It shouldn't be stored though (unless you just want to verify for debugging). -Yonik http://www.lucidimagination.com On Fri, Oct 28, 2011 at 9:35 AM, Christopher Gross wrote: > Hi Yonik. > > I never made a dynamicField definition for _latLon ... I was following > the examples on http://wiki.apache.org/solr/SpatialSearchDev, so I > just added the field type definition, then the field in the list of > fields. I wasn't aware that I had to do anything else. The only > dynamic that I have is: > multiValued="true"/> > > So should I have another for _latLon? Would it look like: > > > -- Chris > > > > On Fri, Oct 28, 2011 at 9:27 AM, Yonik Seeley > wrote: >> On Fri, Oct 28, 2011 at 8:42 AM, Christopher Gross wrote: >>> Hi Yonik. >>> >>> I'm having more of a problem now... >>> I made the following lines in my schema.xml (in the appropriate places): >>> >>> >> subFieldSuffix="_latLon"/> >>> >>> >> required="false"/> >>> >>> I have data (did a q=*:*, found one with a point): >>> 48.306074,14.286293 >>> >>> 48.306074 >>> >>> >>> 14.286293 >>> >>> >>> I've tried to do a bbox: >>> q=*:*&fq=point:[30.0,10.0%20TO%2050.0,20.0] >>> q=*:*&fq={!bbox}&sfield=point&pt=48,14&d=50 >>> >>> And neither of those seem to find the point... >> >> Hmmm, what's the dynamicField definition for _latLon? Is it indexed? >> If you add debugQuery=true, you should be able to see the underlying >> range queries for your explicit range query. >> >> -Yonik >> http://www.lucidimagination.com >> >
Re: bbox issue
Ah! That all makes sense. The example on the SpacialSearchDev page should have that bit added in! I'm back in business now, thanks Yonik! -- Chris On Fri, Oct 28, 2011 at 9:40 AM, Yonik Seeley wrote: > Oops, didn't mean for this conversation to leave the mailing lists. > > OK, so your lat and lon types were being stored as text but not > indexed (hence no search matches). > A dynamic field of "*" does tend to hide bugs/problems ;-) > >> So should I have another for _latLon? Would it look like: >> > > Yep. It shouldn't be stored though (unless you just want to verify > for debugging). > > -Yonik > http://www.lucidimagination.com > > > > On Fri, Oct 28, 2011 at 9:35 AM, Christopher Gross wrote: >> Hi Yonik. >> >> I never made a dynamicField definition for _latLon ... I was following >> the examples on http://wiki.apache.org/solr/SpatialSearchDev, so I >> just added the field type definition, then the field in the list of >> fields. I wasn't aware that I had to do anything else. The only >> dynamic that I have is: >> > multiValued="true"/> >> >> So should I have another for _latLon? Would it look like: >> >> >> -- Chris >> >> >> >> On Fri, Oct 28, 2011 at 9:27 AM, Yonik Seeley >> wrote: >>> On Fri, Oct 28, 2011 at 8:42 AM, Christopher Gross >>> wrote: Hi Yonik. I'm having more of a problem now... I made the following lines in my schema.xml (in the appropriate places): >>> subFieldSuffix="_latLon"/> >>> required="false"/> I have data (did a q=*:*, found one with a point): 48.306074,14.286293 48.306074 14.286293 I've tried to do a bbox: q=*:*&fq=point:[30.0,10.0%20TO%2050.0,20.0] q=*:*&fq={!bbox}&sfield=point&pt=48,14&d=50 And neither of those seem to find the point... >>> >>> Hmmm, what's the dynamicField definition for _latLon? Is it indexed? >>> If you add debugQuery=true, you should be able to see the underlying >>> range queries for your explicit range query. >>> >>> -Yonik >>> http://www.lucidimagination.com >>> >> >
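For reference, the dynamic field definition being discussed would look roughly like the following; the actual XML did not survive in the archived messages, so the type name and attributes are an assumption based on the stock example schema: the sub-fields must be indexed so the bbox range queries have something to match, and they need not be stored, as Yonik suggests.

    <fieldType name="tdouble" class="solr.TrieDoubleField" precisionStep="8" omitNorms="true" positionIncrementGap="0"/>
    <dynamicField name="*_latLon" type="tdouble" indexed="true" stored="false"/>

The suffix in the dynamicField name has to match the subFieldSuffix declared on the LatLonType field type ("_latLon" in this thread).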
Re: solr break up word
Hi, I solved the issue. I added to my schema.xml the following lines: ... ... Then, I re-index and everything is working great :-) Thanks for your help. On Fri, Oct 28, 2011 at 10:08 AM, Boris Quiroz wrote: > Hi Erick, > > I'll try without the type="index" on analyzer tag and then I'll > re-index some files. > > Thanks for you answer. > > On Thu, Oct 27, 2011 at 6:54 PM, Erick Erickson > wrote: >> Hmmm, I'm not sure what happens when you specify >> (without type="index" and >> . I have no clue which one >> is used. >> >> Look at the admin/analysis page to understand how things are >> broken up. >> >> Did you re-index after you added the ngram filter? >> >> You'll get better help if you include example queries with >> &debugQuery=on appended, it'll give us a lot more to >> work with. >> >> Best >> Erick >> >> On Wed, Oct 26, 2011 at 4:14 PM, Boris Quiroz wrote: >>> Hi, >>> >>> I've solr running on a CentOS server working OK, but sometimes my >>> application needs to index some parts of a word. For example, if I search >>> 'dislike' word fine but if I search 'disl' it returns zero. Also, if I >>> search 'disl*' returns some values (the same if I search for 'dislike') but >>> if I search 'dislike*' it returns zero too. >>> >>> So, I've two questions: >>> >>> 1. How exactly the asterisk works as a wildcard? >>> >>> 2. What can I do to index properly parts of a word? I added this lines to >>> my schema.xml: >>> >>> >>> >>> >>> >>> >>> >> maxGramSize="15"/> >>> >>> >>> >>> >>> >>> >>> >>> >>> >>> But I can't get it to work. Is OK what I did or I'm wrong? >>> >>> Thanks. >>> >>> -- >>> Boris Quiroz >>> boris.qui...@menco.it >>> >>> >> > > > > -- > Boris Quiroz > boris.qui...@menco.it > -- Boris Quiroz boris.qui...@menco.it
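The schema lines themselves did not survive in the archive, but an analyzer along these lines is the usual way to get substring matches like 'disl' without wildcards (a sketch, not necessarily the exact configuration Boris used; the maxGramSize value matches the fragment visible in the quoted message):

    <fieldType name="text_ngram" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="15"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>

With n-grams applied only at index time and a plain query-time analyzer, a query for 'disl' matches the indexed grams of 'dislike' directly, no wildcard needed. The field using this type has to be re-indexed after the change, as Erick pointed out.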
Updating a document multi-value field (no dup values) without needing it to be already committed
Sorry for the lengthy text, it's a bit difficult to explain: We are using Solr to index some user info like username, email (among other things). I'm also trying to use facets for search, so for example, I added a multi-value field to user called "organizations" where I would store the name of the organizations that user works for. So I can use that field for faceted search and be able to filter a user search query result by the organizations this user works for. So now, the issue I have is my code does something like: 1) Add user documents to Solr 2) When a user is assigned an organization membership (role), update the user doc to set the organizations field Now I have the following issue with step 2: If I just do an addField("organizations", "BigCorp") on the user doc, it will add that value regardless of whether organizations already has that value ("BigCorp") or not, but I want each org name to appear only once. So the only way I found to get that behavior is to query the user document, get the values of "organizations" and only add the new value if it's not already in there - if !userDoc.getValues("organizations").contains(value) {... add the value to the doc and save it ...}- Now that works well, but only if I commit all the time (between steps 1 & 2 at least), because the document query will not work unless it has been committed already. Obviously in theory it's best not to commit all the time performance-wise, and impractical since I process those inserts in batches. *So I guess the main issue would be:* * Is there a way to update a multi-value field, without allowing duplicates, that would not require querying the doc to manually prevent duplicates ? * Maybe some better way to do this ? Thanks.
Re: Updating a document multi-value field (no dup values) without needing it to be already committed
A related question is: is there a way to update a doc to remove a specific value from a multi-value field (in my case, remove a role)? I manage to do that by querying the doc and reading all the other values "manually" then saving, but that has the same issues and is inefficient.
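A rough SolrJ sketch of the read-modify-write approach described above (class and field names are illustrative, and the uniqueKey is assumed to be "id"; it still requires the document to already be committed and visible before the update, which is exactly the limitation being asked about):

    import java.util.LinkedHashSet;
    import java.util.Set;
    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.common.SolrDocument;
    import org.apache.solr.common.SolrInputDocument;

    public class OrgUpdater {
        /** Re-saves the user document so that "organizations" contains org at most once. */
        public static void addOrganization(SolrServer server, String userId, String org) throws Exception {
            SolrDocument existing = server.query(new SolrQuery("id:" + userId)).getResults().get(0);

            SolrInputDocument doc = new SolrInputDocument();
            for (String field : existing.getFieldNames()) {
                if (field.equals("organizations")) continue;  // handled below
                for (Object value : existing.getFieldValues(field)) {
                    doc.addField(field, value);
                }
            }

            Set<Object> orgs = new LinkedHashSet<Object>();   // a Set silently drops the duplicate
            if (existing.getFieldValues("organizations") != null) {
                orgs.addAll(existing.getFieldValues("organizations"));
            }
            orgs.add(org);
            for (Object value : orgs) {
                doc.addField("organizations", value);
            }

            server.add(doc);  // overwrites the stored document; visible after the next commit
        }
    }

Note that only stored fields survive this round-trip, and stored copyField targets get copied again on the re-add, as mentioned elsewhere on this list.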
Recover index
Hello all, When moving a SOLR index to another instance I lost the files: segments.gen segments_xk I have the .cfs file complete. What are my options to recover the data? Any idea that I can test? Thank you. Frederico Azeiteiro
Re: Query/Delete performance difference between straight HTTP and SolrJ
On 10/27/2011 5:56 AM, Michael Sokolov wrote: From everything you've said, it certainly sounds like a low-level I/O problem in the client, not a server slowdown of any sort. Maybe Perl is using the same connection over and over (keep-alive) and Java is not. I really don't know. One thing I've heard is that StreamingUpdateSolrServer (I think that's what it's called) can give better throughput for large request batches. If you're not using that, you may be having problems w/closing and re-opening connections? I turned off the perl build system and had the Java program take over full build duties for both index chains. It's been designed so one copy of the program can keep any number of index chains up to date simultaneously. On the most recently hourly run, the servers without virtualization took 50 seconds, the servers with virtualization and more memory took only 16 seconds, so it looks like this problem has nothing to do with SolrJ, it's due to the 1000 clause queries actually taking a long time to execute. The 16 second runtime is still longer than the last run by the perl program (12 seconds), but I am also executing an index rebuild in the build cores on those servers, so I'm not overly concerned by that. At this point there isn't any way for me to know whether the speedup with the old server builds is due to the extra memory (OS disk cache) or due to some quirk of virtualization. I'm really hoping it's due to the extra memory, because I really don't want to go back to a virtualized environment. I'll be able to figure it out after I eliminate my current bug and complete the migration. Thank you very much to everyone who offered assistance. It helped me make sure my testing was as unbiased as I could achieve. Shawn
form-data post to ExtractingRequestHandler with utf-8 characters not handled
I'm trying to post a PDF along with a whole bunch of metadata fields to the ExtractingRequestHandler as multipart/form-data. It works fine except for the utf-8 character handling. Here is what my post looks like (abridged): POST /solr/update/extract HTTP/1.1 TE: deflate,gzip;q=0.3 Connection: TE, close Host: localhost:8983 Content-Length: 21418 Content-Type: multipart/form-data; boundary=wyAjGU0yDXmvWK8IWqY50a67Z2lsu2yU1UpEiPDX --wyAjGU0yDXmvWK8IWqY50a67Z2lsu2yU1UpEiPDX Content-Disposition: form-data; name=literal.title smart >>‘<< quote --wyAjGU0yDXmvWK8IWqY50a67Z2lsu2yU1UpEiPDX Content-Disposition: form-data; name="myfile"; filename="text.pdf.1174588823" Content-Type: application/pdf Content-Transfer-Encoding: binary ...binary pdf data I've verified on the network that the quote character, a LEFT SINGLE QUOTATION MARK (U+2018), is going across the wire as the utf-8 bytes "e2 80 98" which is correct. However, when I search for the document in Solr, it's coming back as the byte sequence "c3 a2 c2 80 c2 98" which I'm guessing is it being double-utf8-encoded. The multipart/form-data is MIME, which is supposed to be 7-bit, so I've tried encoding any non-ascii fields as quoted-printable Content-Disposition: form-data; name=literal.title Content-Transfer-Encoding: quoted-printable smart >>=E2=80=98<< quote= as well as base64 Content-Disposition: form-data; name=literal.title Content-Transfer-Encoding: base64 c21hcnQgPj7igJg8PCBxdW90ZSBmb29iYXI= but what Solr puts in its index is just that value; it's not decoding either the quoted-printable or the base64. I've tried encoding the utf-8 values as HTML entities, but then Solr doesn't unescape them either, and any accented characters are stored as the HTML entities, not as the unicode characters. Can anybody give me any pointers as to where I might be going wrong, where to look for solutions, or any different/better ways to handle this? Thanks! -- View this message in context: http://lucene.472066.n3.nabble.com/form-data-post-to-ExtractingRequestHandler-with-utf-8-characters-not-handled-tp3461731p3461731.html Sent from the Solr - User mailing list archive at Nabble.com.
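One thing worth checking (an assumption about the cause, not a confirmed fix from the thread): "e2 80 98" coming back as "c3 a2 c2 80 c2 98" is exactly what UTF-8 bytes look like after being decoded as ISO-8859-1 and re-encoded as UTF-8, and multipart text parts that don't declare a charset are typically decoded as Latin-1 by the servlet container. Declaring the charset on each literal part may be enough, e.g.:

    --wyAjGU0yDXmvWK8IWqY50a67Z2lsu2yU1UpEiPDX
    Content-Disposition: form-data; name=literal.title
    Content-Type: text/plain; charset=UTF-8

    smart >>‘<< quote

Alternatively, the literal.* values can be passed as URL query parameters on /update/extract (percent-encoded UTF-8), keeping only the file itself in the multipart body.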
Re: Partial updates?
An ability to update would be extremely useful for us. Different parts of records sometimes come from different databases, and being able to update after creation of the Solr index would be very helpful. I've made some processes that read a record and add a new field to it. The most awkward thing is when there's been a CopyField: when the record is read and re-saved, the copied field causes CopyField to be invoked again. -- View this message in context: http://lucene.472066.n3.nabble.com/Partial-updates-tp502570p3461740.html Sent from the Solr - User mailing list archive at Nabble.com.
large scale indexing issues / single threaded bottleneck
Hi everyone, I'm looking for some help with Solr indexing issues on a large scale. We are indexing a few terabytes/month on a sizeable Solr cluster (8 masters / serving writes, 16 slaves / serving reads). After a certain amount of tuning we got to the point where a single Solr instance can handle an index size of 100GB without much issue, but after that we are starting to observe noticeable delays on index flush and they are getting larger. See the attached picture for details, it's done for a single JVM on a single machine. We are posting data in 8 threads using javabin format and doing a commit every 5K documents, merge factor 20, and ram buffer size about 384MB. From the picture it can be seen that a single-threaded index flushing code kicks in on every commit and blocks all other indexing threads. The hardware is decent (12 physical / 24 virtual cores per machine) and it is mostly idle when the index is flushing. Very little CPU utilization and disk I/O (<5%), with the exception of a single CPU core which actually does the index flush (95% CPU, 5% I/O wait). My questions are: 1) will the Solr changes from the real-time branch help to resolve these issues? I was reading http://blog.mikemccandless.com/2011/05/265-indexing-speedup-with-lucenes.html and it looks like we have exactly the same problem 2) what would be the best way to port these (and only these) changes to 3.4.0? I tried to dig into the branching and revisions, but got lost quickly. Tried something like "svn diff […]realtime_search@r953476 […]realtime_search@r1097767", but I'm not sure if it's even possible to merge these into 3.4.0 3) what would you recommend for production 24/7 use? 3.4.0? 4) is there a workaround that can be used? also, I listed the stack trace below Thank you! Roman P.S. This single "index flushing" thread spends 99% of all the time in "org.apache.lucene.index.BufferedDeletesStream.applyDeletes", and then the merge seems to go quickly. I looked it up and it looks like the intent here is deleting old commit points (we are keeping only 1 non-optimized commit point per config). Not sure why it is taking that long.
pool-2-thread-1 [RUNNABLE] CPU time: 3:31
java.nio.Bits.copyToByteArray(long, Object, long, long)
java.nio.DirectByteBuffer.get(byte[], int, int)
org.apache.lucene.store.MMapDirectory$MMapIndexInput.readBytes(byte[], int, int)
org.apache.lucene.index.TermBuffer.read(IndexInput, FieldInfos)
org.apache.lucene.index.SegmentTermEnum.next()
org.apache.lucene.index.TermInfosReader.<init>(Directory, String, FieldInfos, int, int)
org.apache.lucene.index.SegmentCoreReaders.<init>(SegmentReader, Directory, SegmentInfo, int, int)
org.apache.lucene.index.SegmentReader.get(boolean, Directory, SegmentInfo, int, boolean, int)
org.apache.lucene.index.IndexWriter$ReaderPool.get(SegmentInfo, boolean, int, int)
org.apache.lucene.index.IndexWriter$ReaderPool.get(SegmentInfo, boolean)
org.apache.lucene.index.BufferedDeletesStream.applyDeletes(IndexWriter$ReaderPool, List)
org.apache.lucene.index.IndexWriter.doFlush(boolean)
org.apache.lucene.index.IndexWriter.flush(boolean, boolean)
org.apache.lucene.index.IndexWriter.closeInternal(boolean)
org.apache.lucene.index.IndexWriter.close(boolean)
org.apache.lucene.index.IndexWriter.close()
org.apache.solr.update.SolrIndexWriter.close()
org.apache.solr.update.DirectUpdateHandler2.closeWriter()
org.apache.solr.update.DirectUpdateHandler2.commit(CommitUpdateCommand)
org.apache.solr.update.DirectUpdateHandler2$CommitTracker.run()
java.util.concurrent.Executors$RunnableAdapter.call()
java.util.concurrent.FutureTask$Sync.innerRun()
java.util.concurrent.FutureTask.run()
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$101(ScheduledThreadPoolExecutor$ScheduledFutureTask)
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run()
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor$Worker)
java.util.concurrent.ThreadPoolExecutor$Worker.run()
java.lang.Thread.run()
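For reference, the indexing settings described in the message above map onto solrconfig.xml roughly as follows (a sketch of a Solr 3.x configuration, not the poster's actual file; the 5K-document commits are issued by the client in this setup, but autoCommit could express the same policy server-side):

    <indexDefaults>
      <ramBufferSizeMB>384</ramBufferSizeMB>
      <mergeFactor>20</mergeFactor>
    </indexDefaults>

    <updateHandler class="solr.DirectUpdateHandler2">
      <autoCommit>
        <maxDocs>5000</maxDocs>
      </autoCommit>
    </updateHandler>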
Re: large scale indexing issues / single threaded bottleneck
I'm wondering if this is relevant: https://issues.apache.org/jira/browse/LUCENE-2680 - Improve how IndexWriter flushes deletes against existing segments Roman On Fri, Oct 28, 2011 at 11:38 AM, Roman Alekseenkov wrote: > Hi everyone, > > I'm looking for some help with Solr indexing issues on a large scale. > > We are indexing few terabytes/month on a sizeable Solr cluster (8 > masters / serving writes, 16 slaves / serving reads). After certain > amount of tuning we got to the point where a single Solr instance can > handle index size of 100GB without much issues, but after that we are > starting to observe noticeable delays on index flush and they are > getting larger. See the attached picture for details, it's done for a > single JVM on a single machine. > > We are posting data in 8 threads using javabin format and doing commit > every 5K documents, merge factor 20, and ram buffer size about 384MB. > From the picture it can be seen that a single-threaded index flushing > code kicks in on every commit and blocks all other indexing threads. > The hardware is decent (12 physical / 24 virtual cores per machine) > and it is mostly idle when the index is flushing. Very little CPU > utilization and disk I/O (<5%), with the exception of a single CPU > core which actually does index flush (95% CPU, 5% I/O wait). > > My questions are: > > 1) will Solr changes from real-time branch help to resolve these > issues? I was reading > http://blog.mikemccandless.com/2011/05/265-indexing-speedup-with-lucenes.html > and it looks like we have exactly the same problem > > 2) what would be the best way to port these (and only these) changes > to 3.4.0? I tried to dig into the branching and revisions, but got > lost quickly. Tried something like "svn diff > […]realtime_search@r953476 […]realtime_search@r1097767", but I'm not > sure if it's even possible to merge these into 3.4.0 > > 3) what would you recommend for production 24/7 use? 3.4.0? > > 4) is there a workaround that can be used? also, I listed the stack trace > below > > Thank you! > Roman > > P.S. This single "index flushing" thread spends 99% of all the time in > "org.apache.lucene.index.BufferedDeletesStream.applyDeletes", and then > the merge seems to go quickly. I looked it up and it looks like the > intent here is deleting old commit points (we are keeping only 1 > non-optimized commit point per config). Not sure why is it taking that > long. 
> > pool-2-thread-1 [RUNNABLE] CPU time: 3:31 > java.nio.Bits.copyToByteArray(long, Object, long, long) > java.nio.DirectByteBuffer.get(byte[], int, int) > org.apache.lucene.store.MMapDirectory$MMapIndexInput.readBytes(byte[], int, > int) > org.apache.lucene.index.TermBuffer.read(IndexInput, FieldInfos) > org.apache.lucene.index.SegmentTermEnum.next() > org.apache.lucene.index.TermInfosReader.(Directory, String, > FieldInfos, int, int) > org.apache.lucene.index.SegmentCoreReaders.(SegmentReader, > Directory, SegmentInfo, int, int) > org.apache.lucene.index.SegmentReader.get(boolean, Directory, > SegmentInfo, int, boolean, int) > org.apache.lucene.index.IndexWriter$ReaderPool.get(SegmentInfo, > boolean, int, int) > org.apache.lucene.index.IndexWriter$ReaderPool.get(SegmentInfo, boolean) > org.apache.lucene.index.BufferedDeletesStream.applyDeletes(IndexWriter$ReaderPool, > List) > org.apache.lucene.index.IndexWriter.doFlush(boolean) > org.apache.lucene.index.IndexWriter.flush(boolean, boolean) > org.apache.lucene.index.IndexWriter.closeInternal(boolean) > org.apache.lucene.index.IndexWriter.close(boolean) > org.apache.lucene.index.IndexWriter.close() > org.apache.solr.update.SolrIndexWriter.close() > org.apache.solr.update.DirectUpdateHandler2.closeWriter() > org.apache.solr.update.DirectUpdateHandler2.commit(CommitUpdateCommand) > org.apache.solr.update.DirectUpdateHandler2$CommitTracker.run() > java.util.concurrent.Executors$RunnableAdapter.call() > java.util.concurrent.FutureTask$Sync.innerRun() > java.util.concurrent.FutureTask.run() > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$101(ScheduledThreadPoolExecutor$ScheduledFutureTask) > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run() > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor$Worker) > java.util.concurrent.ThreadPoolExecutor$Worker.run() > java.lang.Thread.run() >
Re: large scale indexing issues / single threaded bottleneck
Hey Roman, On Fri, Oct 28, 2011 at 8:38 PM, Roman Alekseenkov wrote: > Hi everyone, > > I'm looking for some help with Solr indexing issues on a large scale. > > We are indexing few terabytes/month on a sizeable Solr cluster (8 > masters / serving writes, 16 slaves / serving reads). After certain > amount of tuning we got to the point where a single Solr instance can > handle index size of 100GB without much issues, but after that we are > starting to observe noticeable delays on index flush and they are > getting larger. See the attached picture for details, it's done for a > single JVM on a single machine. > > We are posting data in 8 threads using javabin format and doing commit > every 5K documents, merge factor 20, and ram buffer size about 384MB. > From the picture it can be seen that a single-threaded index flushing > code kicks in on every commit and blocks all other indexing threads. > The hardware is decent (12 physical / 24 virtual cores per machine) > and it is mostly idle when the index is flushing. Very little CPU > utilization and disk I/O (<5%), with the exception of a single CPU > core which actually does index flush (95% CPU, 5% I/O wait). > > My questions are: > > 1) will Solr changes from real-time branch help to resolve these > issues? I was reading > http://blog.mikemccandless.com/2011/05/265-indexing-speedup-with-lucenes.html > and it looks like we have exactly the same problem did you also read http://bit.ly/ujLw6v - here I try to explain the major difference between Lucene 3.x and 4.0 and why 3.x has these long idle times. In Lucene 3.x a full flush / commit is a single threaded process, as you observed there is only one thread making progress. In Lucene 4 there is still a single thread executing the commit but other threads are not blocked anymore. Depending on how fast the thread can flush other threads might help flushing segments for that commit concurrently or simply index into new documents writers. So basically 4.0 won't have this problem anymore. The realtime branch you talk about is already merged into 4.0 trunk. > > 2) what would be the best way to port these (and only these) changes > to 3.4.0? I tried to dig into the branching and revisions, but got > lost quickly. Tried something like "svn diff > […]realtime_search@r953476 […]realtime_search@r1097767", but I'm not > sure if it's even possible to merge these into 3.4.0 Possible yes! Worth the trouble, I would say no! DocumentsWriterPerThread (DWPT) is a very big change and I don't think we should backport this into our stable branch. However, this feature is very stable in 4.0 though. > > 3) what would you recommend for production 24/7 use? 3.4.0? I think 3.4 is a safe bet! I personally tend to use trunk in production too the only problem is that this is basically a moving target and introduces extra overhead on your side to watch changes and index format modification which could basically prevent you from simple upgrades > > 4) is there a workaround that can be used? also, I listed the stack trace > below > > Thank you! > Roman > > P.S. This single "index flushing" thread spends 99% of all the time in > "org.apache.lucene.index.BufferedDeletesStream.applyDeletes", and then > the merge seems to go quickly. I looked it up and it looks like the > intent here is deleting old commit points (we are keeping only 1 > non-optimized commit point per config). Not sure why is it taking that > long. in 3.x there is no way to apply deletes without doing a flush (afaik). 
In 3.x a flush means single threaded again - similar to commit just without syncing files to disk and writing a new segments file. In 4.0 you have way more control over this via IndexWriterConfig#setMaxBufferedDeleteTerms which are also applied without blocking other threads. In trunk we hijack indexing threads to do all that work concurrently so you get better cpu utilization and due to concurrent flushing better and usually continuous IO utilization. hope that helps. simon > > pool-2-thread-1 [RUNNABLE] CPU time: 3:31 > java.nio.Bits.copyToByteArray(long, Object, long, long) > java.nio.DirectByteBuffer.get(byte[], int, int) > org.apache.lucene.store.MMapDirectory$MMapIndexInput.readBytes(byte[], int, > int) > org.apache.lucene.index.TermBuffer.read(IndexInput, FieldInfos) > org.apache.lucene.index.SegmentTermEnum.next() > org.apache.lucene.index.TermInfosReader.(Directory, String, > FieldInfos, int, int) > org.apache.lucene.index.SegmentCoreReaders.(SegmentReader, > Directory, SegmentInfo, int, int) > org.apache.lucene.index.SegmentReader.get(boolean, Directory, > SegmentInfo, int, boolean, int) > org.apache.lucene.index.IndexWriter$ReaderPool.get(SegmentInfo, > boolean, int, int) > org.apache.lucene.index.IndexWriter$ReaderPool.get(SegmentInfo, boolean) > org.apache.lucene.index.BufferedDeletesStream.applyDeletes(IndexWriter$ReaderPool, > List) > org.apache.lucene.index.IndexWriter.doFlush(boolean) > org.apache.lucene.index.IndexWriter.flush(boolean, b
Re: large scale indexing issues / single threaded bottleneck
On Fri, Oct 28, 2011 at 9:17 PM, Simon Willnauer wrote: > Hey Roman, > > On Fri, Oct 28, 2011 at 8:38 PM, Roman Alekseenkov > wrote: >> Hi everyone, >> >> I'm looking for some help with Solr indexing issues on a large scale. >> >> We are indexing few terabytes/month on a sizeable Solr cluster (8 >> masters / serving writes, 16 slaves / serving reads). After certain >> amount of tuning we got to the point where a single Solr instance can >> handle index size of 100GB without much issues, but after that we are >> starting to observe noticeable delays on index flush and they are >> getting larger. See the attached picture for details, it's done for a >> single JVM on a single machine. >> >> We are posting data in 8 threads using javabin format and doing commit >> every 5K documents, merge factor 20, and ram buffer size about 384MB. >> From the picture it can be seen that a single-threaded index flushing >> code kicks in on every commit and blocks all other indexing threads. >> The hardware is decent (12 physical / 24 virtual cores per machine) >> and it is mostly idle when the index is flushing. Very little CPU >> utilization and disk I/O (<5%), with the exception of a single CPU >> core which actually does index flush (95% CPU, 5% I/O wait). >> >> My questions are: >> >> 1) will Solr changes from real-time branch help to resolve these >> issues? I was reading >> http://blog.mikemccandless.com/2011/05/265-indexing-speedup-with-lucenes.html >> and it looks like we have exactly the same problem > > did you also read http://bit.ly/ujLw6v - here I try to explain the > major difference between Lucene 3.x and 4.0 and why 3.x has these long > idle times. In Lucene 3.x a full flush / commit is a single threaded > process, as you observed there is only one thread making progress. In > Lucene 4 there is still a single thread executing the commit but other > threads are not blocked anymore. Depending on how fast the thread can > flush other threads might help flushing segments for that commit > concurrently or simply index into new documents writers. So basically > 4.0 won't have this problem anymore. The realtime branch you talk > about is already merged into 4.0 trunk. > >> >> 2) what would be the best way to port these (and only these) changes >> to 3.4.0? I tried to dig into the branching and revisions, but got >> lost quickly. Tried something like "svn diff >> […]realtime_search@r953476 […]realtime_search@r1097767", but I'm not >> sure if it's even possible to merge these into 3.4.0 > > Possible yes! Worth the trouble, I would say no! > DocumentsWriterPerThread (DWPT) is a very big change and I don't think > we should backport this into our stable branch. However, this feature > is very stable in 4.0 though. >> >> 3) what would you recommend for production 24/7 use? 3.4.0? > > I think 3.4 is a safe bet! I personally tend to use trunk in > production too the only problem is that this is basically a moving > target and introduces extra overhead on your side to watch changes and > index format modification which could basically prevent you from > simple upgrades > >> >> 4) is there a workaround that can be used? also, I listed the stack trace >> below >> >> Thank you! >> Roman >> >> P.S. This single "index flushing" thread spends 99% of all the time in >> "org.apache.lucene.index.BufferedDeletesStream.applyDeletes", and then >> the merge seems to go quickly. 
I looked it up and it looks like the >> intent here is deleting old commit points (we are keeping only 1 >> non-optimized commit point per config). Not sure why is it taking that >> long. > > in 3.x there is no way to apply deletes without doing a flush (afaik). > In 3.x a flush means single threaded again - similar to commit just > without syncing files to disk and writing a new segments file. In 4.0 > you have way more control over this via > IndexWriterConfig#setMaxBufferedDeleteTerms which are also applied > without blocking other threads. In trunk we hijack indexing threads to > do all that work concurrently so you get better cpu utilization and > due to concurrent flushing better and usually continuous IO > utilization. > > hope that helps. > > simon >> >> pool-2-thread-1 [RUNNABLE] CPU time: 3:31 >> java.nio.Bits.copyToByteArray(long, Object, long, long) >> java.nio.DirectByteBuffer.get(byte[], int, int) >> org.apache.lucene.store.MMapDirectory$MMapIndexInput.readBytes(byte[], int, >> int) >> org.apache.lucene.index.TermBuffer.read(IndexInput, FieldInfos) >> org.apache.lucene.index.SegmentTermEnum.next() >> org.apache.lucene.index.TermInfosReader.(Directory, String, >> FieldInfos, int, int) >> org.apache.lucene.index.SegmentCoreReaders.(SegmentReader, >> Directory, SegmentInfo, int, int) >> org.apache.lucene.index.SegmentReader.get(boolean, Directory, >> SegmentInfo, int, boolean, int) >> org.apache.lucene.index.IndexWriter$ReaderPool.get(SegmentInfo, >> boolean, int, int) >> org.apache.lucene.index.IndexWriter$ReaderPool.get(SegmentInfo, boolean) >> or
Re: large scale indexing issues / single threaded bottleneck
> We should maybe try to fix this in 3.x too? +1 I suggested it should be backported a while back. Or that Lucene 4.x should be released. I'm not sure what is holding up Lucene 4.x at this point, bulk postings is only needed useful for PFOR. On Fri, Oct 28, 2011 at 3:27 PM, Simon Willnauer wrote: > On Fri, Oct 28, 2011 at 9:17 PM, Simon Willnauer > wrote: >> Hey Roman, >> >> On Fri, Oct 28, 2011 at 8:38 PM, Roman Alekseenkov >> wrote: >>> Hi everyone, >>> >>> I'm looking for some help with Solr indexing issues on a large scale. >>> >>> We are indexing few terabytes/month on a sizeable Solr cluster (8 >>> masters / serving writes, 16 slaves / serving reads). After certain >>> amount of tuning we got to the point where a single Solr instance can >>> handle index size of 100GB without much issues, but after that we are >>> starting to observe noticeable delays on index flush and they are >>> getting larger. See the attached picture for details, it's done for a >>> single JVM on a single machine. >>> >>> We are posting data in 8 threads using javabin format and doing commit >>> every 5K documents, merge factor 20, and ram buffer size about 384MB. >>> From the picture it can be seen that a single-threaded index flushing >>> code kicks in on every commit and blocks all other indexing threads. >>> The hardware is decent (12 physical / 24 virtual cores per machine) >>> and it is mostly idle when the index is flushing. Very little CPU >>> utilization and disk I/O (<5%), with the exception of a single CPU >>> core which actually does index flush (95% CPU, 5% I/O wait). >>> >>> My questions are: >>> >>> 1) will Solr changes from real-time branch help to resolve these >>> issues? I was reading >>> http://blog.mikemccandless.com/2011/05/265-indexing-speedup-with-lucenes.html >>> and it looks like we have exactly the same problem >> >> did you also read http://bit.ly/ujLw6v - here I try to explain the >> major difference between Lucene 3.x and 4.0 and why 3.x has these long >> idle times. In Lucene 3.x a full flush / commit is a single threaded >> process, as you observed there is only one thread making progress. In >> Lucene 4 there is still a single thread executing the commit but other >> threads are not blocked anymore. Depending on how fast the thread can >> flush other threads might help flushing segments for that commit >> concurrently or simply index into new documents writers. So basically >> 4.0 won't have this problem anymore. The realtime branch you talk >> about is already merged into 4.0 trunk. >> >>> >>> 2) what would be the best way to port these (and only these) changes >>> to 3.4.0? I tried to dig into the branching and revisions, but got >>> lost quickly. Tried something like "svn diff >>> […]realtime_search@r953476 […]realtime_search@r1097767", but I'm not >>> sure if it's even possible to merge these into 3.4.0 >> >> Possible yes! Worth the trouble, I would say no! >> DocumentsWriterPerThread (DWPT) is a very big change and I don't think >> we should backport this into our stable branch. However, this feature >> is very stable in 4.0 though. >>> >>> 3) what would you recommend for production 24/7 use? 3.4.0? >> >> I think 3.4 is a safe bet! I personally tend to use trunk in >> production too the only problem is that this is basically a moving >> target and introduces extra overhead on your side to watch changes and >> index format modification which could basically prevent you from >> simple upgrades >> >>> >>> 4) is there a workaround that can be used? 
also, I listed the stack trace >>> below >>> >>> Thank you! >>> Roman >>> >>> P.S. This single "index flushing" thread spends 99% of all the time in >>> "org.apache.lucene.index.BufferedDeletesStream.applyDeletes", and then >>> the merge seems to go quickly. I looked it up and it looks like the >>> intent here is deleting old commit points (we are keeping only 1 >>> non-optimized commit point per config). Not sure why is it taking that >>> long. >> >> in 3.x there is no way to apply deletes without doing a flush (afaik). >> In 3.x a flush means single threaded again - similar to commit just >> without syncing files to disk and writing a new segments file. In 4.0 >> you have way more control over this via >> IndexWriterConfig#setMaxBufferedDeleteTerms which are also applied >> without blocking other threads. In trunk we hijack indexing threads to >> do all that work concurrently so you get better cpu utilization and >> due to concurrent flushing better and usually continuous IO >> utilization. >> >> hope that helps. >> >> simon >>> >>> pool-2-thread-1 [RUNNABLE] CPU time: 3:31 >>> java.nio.Bits.copyToByteArray(long, Object, long, long) >>> java.nio.DirectByteBuffer.get(byte[], int, int) >>> org.apache.lucene.store.MMapDirectory$MMapIndexInput.readBytes(byte[], int, >>> int) >>> org.apache.lucene.index.TermBuffer.read(IndexInput, FieldInfos) >>> org.apache.lucene.index.SegmentTermEnum.next() >>> org.apache.lucene.index.TermInfosReader.(Directo
RE: Partial updates?
I would love to see this too. Most of our data comes from a relational database, but there are some files on the file system related to our products that may need to be indexed. The files have different change control / life cycle, so I can't be sure that our application will know when this data changes, so a recurring background re-index job would be helpful. Having to go to the database to get 99% of the data (which didn't change anyway) to send along with the 1% from the file system is a big limitation. This also prevents the use of DIH. Brandon Ramirez | Office: 585.214.5413 | Fax: 585.295.4848 Software Engineer II | Element K | www.elementk.com -Original Message- From: mlevy [mailto:ml...@ushmm.org] Sent: Friday, October 28, 2011 2:21 PM To: solr-user@lucene.apache.org Subject: Re: Partial updates? An ability to update would be extremely useful for us. Different parts of records sometimes come from different databases, and being able to update after creation of the Solr index would be extremely useful. I've made some processes that reads a record and adds a new field to it. The most awkward thing is when there's been a CopyField, when the record is read and re-saved, the copied field causes CopyField to be invoked again. -- View this message in context: http://lucene.472066.n3.nabble.com/Partial-updates-tp502570p3461740.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: large scale indexing issues / single threaded bottleneck
On Fri, Oct 28, 2011 at 5:03 PM, Jason Rutherglen wrote: > +1 I suggested it should be backported a while back. Or that Lucene > 4.x should be released. I'm not sure what is holding up Lucene 4.x at > this point, bulk postings is only needed useful for PFOR. This is not true, most modern index compression schemes, not just PFOR-delta read more than one integer at a time. Thats why its important not only to abstract away the encoding of the index, but to also ensure that the enumeration apis aren't biased towards one-at-a-time vInt. Otherwise we have "flexible indexing" where "flexible" means "slower if you do anything but the default". -- lucidimagination.com
edismax/boost: certain documents should be last
(I am using solr 3.4 and edismax.) In my index, I have a multivalued field named "genre". One of the values this field can have is "Citation". I would like documents that have a genre field of Citation to always be at the bottom of the search results. I've been experimenting, but I can't seem to figure out the syntax of the search I need. Here is the search that seems most logical to me (newlines added here for readability): q=%2bcontent%3Anotes+genre%3ACitation^0.01 &start=0 &rows=3 &fl=genre+title &version=2.2 &defType=edismax I get the same results whether I include "genre%3ACitation^0.01" or not. Just to see if my names were correct, I put a minus sign before "genre" and it did, in fact, stop returning all the documents containing Citation. What am I doing wrong? Here are the results from the above query: 0 1 genre title 0 +content:notes genre:Citation^0.01 3 2.2 edismax CitationFiction Notes on novelists With some other notes Citation Novel notes Citation Knock about notes
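For what it's worth, an optional clause like genre:Citation^0.01 can only add a small amount of score to Citation documents; it never pushes them down, and with a 0.01 boost the effect is too small to change the order, which matches the observation above. One common technique (an untested sketch based on general edismax behaviour, not an answer from this thread) is to boost everything that is not a Citation instead, via edismax's bq parameter:

    q=+content:notes
    &defType=edismax
    &bq=(*:* -genre:Citation)^10
    &fl=genre title

With a large enough boost on the bq clause, non-Citation matches should sort above Citation ones, while the ordering within each group still follows the main query's relevance.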
I don't get why this says non-match
It looks to me like everything matches down the line but top level says otherQuery is a non-match... I don't get it? - - 0 77 - SyncMaster *,score on on 0 +syncmaster -SyncMaster standard standard 41 2.2 + - +syncmaster -SyncMaster +syncmaster -SyncMaster +moreWords:syncmaster -MultiPhraseQuery(moreWords:"sync (master syncmaster)") +moreWords:syncmaster -moreWords:"sync (master syncmaster)" SyncMaster - 0.0 = (NON-MATCH) Failure to meet condition(s) of required/prohibited clause(s) 1.4043131 = (MATCH) fieldWeight(moreWords:syncmaster in 46710), product of: 1.4142135 = tf(termFreq(moreWords:syncmaster)=2) 9.078851 = idf(docFreq=41, maxDocs=135472) 0.109375 = fieldNorm(field=moreWords, doc=46710) 0.0 = match on prohibited clause (moreWords:"sync (master syncmaster)") 9.393997 = (MATCH) weight(moreWords:"sync (master syncmaster)" in 46710), product of: 2.5863855 = queryWeight(moreWords:"sync (master syncmaster)"), product of: 23.481407 = idf(moreWords:"sync (master syncmaster)") 0.1101461 = queryNorm 3.6320949 = (MATCH) fieldWeight(moreWords:"sync (master syncmaster)" in 46710), product of: 1.4142135 = tf(phraseFreq=2.0) 23.481407 = idf(moreWords:"sync (master syncmaster)") 0.109375 = fieldNorm(field=moreWords, doc=46710)
Re: large scale indexing issues / single threaded bottleneck
> Otherwise we have "flexible indexing" where "flexible" means "slower > if you do anything but the default". The other encodings should exist as modules since they are pluggable. 4.0 can ship with the existing codec. 4.1 with additional codecs and the bulk postings at a later time. Otherwise it will be 6 months before 4.0 ships, that's too long. Also it is an amusing contradiction that your argument flies in the face of Lucid shipping 4.x today without said functionality. On Fri, Oct 28, 2011 at 5:09 PM, Robert Muir wrote: > On Fri, Oct 28, 2011 at 5:03 PM, Jason Rutherglen > wrote: > >> +1 I suggested it should be backported a while back. Or that Lucene >> 4.x should be released. I'm not sure what is holding up Lucene 4.x at >> this point, bulk postings is only needed useful for PFOR. > > This is not true, most modern index compression schemes, not just > PFOR-delta read more than one integer at a time. > > Thats why its important not only to abstract away the encoding of the > index, but to also ensure that the enumeration apis aren't biased > towards one-at-a-time vInt. > > Otherwise we have "flexible indexing" where "flexible" means "slower > if you do anything but the default". > > -- > lucidimagination.com >
Re: large scale indexing issues / single threaded bottleneck
On Fri, Oct 28, 2011 at 8:10 PM, Jason Rutherglen wrote: >> Otherwise we have "flexible indexing" where "flexible" means "slower >> if you do anything but the default". > > The other encodings should exist as modules since they are pluggable. > 4.0 can ship with the existing codec. 4.1 with additional codecs and > the bulk postings at a later time. you don't know what you are talking about: go look at the source code. the whole problem is that encodings aren't pluggable. > > Otherwise it will be 6 months before 4.0 ships, that's too long. sucks for you. > > Also it is an amusing contradiction that your argument flies in the > face of Lucid shipping 4.x today without said functionality. > No it doesn't. trunk is open source. you can use it, too, if you want. -- lucidimagination.com
Re: large scale indexing issues / single threaded bottleneck
> abstract away the encoding of the index Robert, this is what you wrote. "Abstract away the encoding of the index" means pluggable, otherwise it's not abstract and / or it's a flawed design. Sounds like it's the latter.
Re: URL Redirect
Finotti Simone yoox.com> writes: > > Hello, > > I have been assigned the task to migrate from Endeca to Solr. > > The former engine allowed me to set keyword triggers that, when matched exactly, caused the web client to > redirect to a specified URL. > > Does that feature exist in Solr? If so, where can I get some info? > > Thank you Hi, I am also looking at migrating from Endeca to Solr, but at first look it seems extremely tedious to me... please pass on any tips on how to approach the problem.