Re: DIH with several Cores
Unfortunately, what you are asking for is not possible: the DIH needs to be configured separately for each core. I have a similar situation with my Solr application. I am solving it by creating a custom index feeder that is aware of all of the cores and which documents to send to which cores.

-- View this message in context: http://lucene.472066.n3.nabble.com/DIH-wiht-several-Cores-tp1767883p1769794.html
Sent from the Solr - User mailing list archive at Nabble.com.
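A custom feeder like the one described can be as simple as a routing function in front of each core's /update handler. A minimal sketch; the base URL, core names, and the type-based routing rule here are all hypothetical, not taken from the post above:

```python
# Sketch of a multi-core "index feeder": route each document to the core
# that should index it, then build that core's /update URL.  The base
# URL, core names, and routing rule are hypothetical.

SOLR_BASE = "http://localhost:8983/solr"
CORES = {"products": "core-products", "articles": "core-articles"}

def route(doc):
    """Pick the target core from the document's type field."""
    return CORES[doc["type"]]

def update_url(doc):
    """Build the /update URL for the core this document belongs to."""
    return f"{SOLR_BASE}/{route(doc)}/update"

# A real feeder would POST an <add><doc>...</doc></add> payload to
# update_url(doc), e.g. with urllib.request; that part is omitted so
# the sketch stays offline.
```

Each core keeps its own schema and config; the feeder simply takes over the "which documents go where" decision that DIH cannot make across cores.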
How does DIH multithreading work?
I understand that the thread count is specified on root entities only. Does it spawn multiple threads per root entity? Or multiple threads per descendant entity? Can someone give an example of how you would make a database query in an entity with 4 threads that would select 1 row per thread?

Thanks,
Mark
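For context, the attribute in question sits on the root entity in DIH's data-config.xml. A hedged sketch of where it goes; the data source, table, and column names are hypothetical, and as far as I know the threads attribute was only honored on root entities in Solr 3.x and was removed in later releases:

```xml
<!-- Hypothetical data-config.xml showing where "threads" is declared.
     Driver, tables, and columns are made up for illustration. -->
<dataConfig>
  <dataSource driver="org.postgresql.Driver"
              url="jdbc:postgresql://localhost/db" user="solr"/>
  <document>
    <entity name="item" threads="4"
            query="SELECT id, name FROM item">
      <entity name="feature"
              query="SELECT description FROM feature WHERE item_id='${item.id}'"/>
    </entity>
  </document>
</dataConfig>
```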
Re: How does DIH multithreading work?
Anyone know how it works?
Core/shard preference
I have a small core performing deltas quickly (core00), and a large core performing deltas slowly (core01), both on the same set of documents. The delta core is cleaned nightly. As you can imagine, at times there are two versions of a document, one in each core. When I execute a query that matches such a document, sometimes it comes from the delta core and sometimes from the large core. It almost seems random. Here is my query:

http://porsche:8181/worldip5/core00/select?shards=porsche:8181/worldip5/core00/,porsche:8181/worldip5/core01/&start=0&rows=20&q=hazard+gas+countrycode:JP

When the delta documents from core00 are returned as desired, the access logs show:

10.36.34.150 - - [19/Oct/2009:15:22:37 -0700] POST /worldip5/core00/select HTTP/1.1 200 293 1
10.36.34.150 - - [19/Oct/2009:15:22:37 -0700] POST /worldip5/core01/select HTTP/1.1 200 506 1
10.36.34.150 - - [19/Oct/2009:15:22:37 -0700] POST /worldip5/core00/select HTTP/1.1 200 1151 1
10.36.34.150 - - [19/Oct/2009:15:22:37 -0700] POST /worldip5/core01/select HTTP/1.1 200 2597 1
10.36.34.151 - - [19/Oct/2009:15:22:37 -0700] GET /worldip5/core00/select?shards=porsche:8181/worldip5/core00/,porsche:8181/worldip5/core01/&start=0&rows=20&q=hazard+gas+countrycode:JP HTTP/1.1 200 11881 9

When the documents are returned from core01, the access logs show:

10.36.34.150 - - [19/Oct/2009:15:22:37 -0700] POST /worldip5/core00/select HTTP/1.1 200 289 1
10.36.34.150 - - [19/Oct/2009:15:22:37 -0700] POST /worldip5/core01/select HTTP/1.1 200 506 1
10.36.34.150 - - [19/Oct/2009:15:22:37 -0700] POST /worldip5/core01/select HTTP/1.1 200 3390 1
10.36.34.151 - - [19/Oct/2009:15:22:37 -0700] GET /worldip5/core00/select?shards=porsche:8181/worldip5/core00/,porsche:8181/worldip5/core01/&start=0&rows=20&q=hazard+gas+countrycode:JP HTTP/1.1 200 11873 9

Any ideas on why there is a difference in the requests made? Is there a way I can tell Solr to prefer the documents in core00?
Mark
Re: Core/shard preference
Thank you guys for your responses. That is what I suspected, that it was going with the first instance of the document that it sees.

I tried setting up Solr in Eclipse and ran into a couple of issues blocking it from compiling. I also did some reading, but none of the write-ups were very comprehensive. Are there any good write-ups that you know of with instructions on setting up Solr in Eclipse?

Thanks again,
Mark

Yonik Seeley-2 wrote:
>
> Although shards should be disjoint, Solr "tolerates" duplication
> (won't return duplicates in the main results list, but doesn't make
> any effort to correct facet counts, etc).
>
> Currently, whichever shard responds first wins.
> The relevant code is around line 420 in QueryComponent.java:
>
>   String prevShard = uniqueDoc.put(id, srsp.getShard());
>   if (prevShard != null) {
>     // duplicate detected
>     numFound--;
>
>     // For now, just always use the first encountered since we can't currently
>     // remove the previous one added to the priority queue.  If we switched
>     // to the Java5 PriorityQueue, this would be easier.
>     continue;
>     // make which duplicate is used deterministic based on shard
>     // if (prevShard.compareTo(srsp.shard) >= 0) {
>     //   TODO: remove previous from priority queue
>     //   continue;
>     // }
>   }
>
> So it's certainly possible to make it deterministic, we just haven't
> done it yet.
>
> -Yonik
> http://www.lucidimagination.com
>
>
> On Mon, Oct 19, 2009 at 7:30 PM, Lance Norskog wrote:
>> Distributed Search is designed only for disjoint cores.
>>
>> The document list from each core is returned sorted by the relevance
>> score. The distributed searcher merges these sorted lists. Solr does
>> not implement "distributed IDF", which essentially means distributed
>> coordinated scoring. All scoring happens inside each core, relative to
>> that core's contents. The resulting score numbers are not coordinated
>> with each other, and you will get random results.
>>
>> There is no way to say "use this core's results" because the searches
>> are not compared all at once. Only the page of results fetched is
>> compared, so there's no way to suppress a result in the second page if
>> it was already found in the first.
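The commented-out block in Yonik's snippet hints at what a shard-preference tie-break would look like. A toy sketch of that idea outside Solr; the shard names, preference list, and response shapes are made up for illustration and are not Solr internals:

```python
# Toy merge of per-shard hit lists: when the same document id arrives
# from two shards, keep the hit from the preferred shard instead of
# whichever shard responded first.  Shard names are hypothetical.

PREFERENCE = ["core00", "core01"]  # earlier in the list wins

def merge(shard_responses):
    """shard_responses: iterable of (shard_name, [doc dicts with 'id']).
    Return, for each doc id, the name of the shard whose copy wins."""
    best = {}  # doc id -> (winning shard, doc)
    for shard, docs in shard_responses:
        for doc in docs:
            doc_id = doc["id"]
            prev = best.get(doc_id)
            if prev is None or PREFERENCE.index(shard) < PREFERENCE.index(prev[0]):
                best[doc_id] = (shard, doc)
    return {doc_id: shard for doc_id, (shard, _) in best.items()}
```

With this rule, core00's copy of a duplicated document always wins, even if core01 responded first, which is the deterministic behavior Mark is asking for.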
RE: Solr under tomcat - UTF-8 issue
I was originally using POST for the same reason, however I discovered that Tomcat can easily be configured to accept any length of URI. All it requires is specifying the maxHttpHeaderSize attribute on your default Connector in server.xml. I set my value to 1MB, which is certainly excessive, but it ensures I will never hit the limit. As the other chap mentioned, I now have the benefits of caching and, most importantly, proper web logs!

I also have a similar situation where I constrain the search results based on the user's role. I have only two roles to support, so my case is very simple, but I could imagine having a multivalued "role" field that you could perform facet queries on.

Mark

Glock, Thomas wrote:
>
> Thanks -
>
> I agree. However my application requires results be trimmed to users
> based on roles. The roles are repeating values on the documents. Users
> have many different role combinations, as do documents.
> I recognize this is going to hamper caching - but using a GET will tend to
> limit the size of search phrases when combined with the boolean role
> clause. And I am concerned with hitting URL limits.
>
> At any rate I solved it thanks to Yonik's recommendation.
>
> My Flex client HTTPService by default only sets the content-type request
> header to "application/x-www-form-urlencoded"; what it needed to do for
> Tomcat is set the content-type request header to
> "application/x-www-form-urlencoded; charset=UTF-8".
>
> If you have any suggestions regarding limiting results based on user and
> document role permutations - I'm all ears. I've been to the Search Summit
> in NYC and no vendor could even seem to grasp the concept.
>
> The problem case statement is this - I have users globally who need to
> search for content tailored to them. Users searching for 'Holiday' don't
> get any value from 1 documents having the word holiday. What they need
> are documents authored for that population.
> The documents have the associated role information as metadata, and
> therefore users will get only the documents they have access to and that
> are relevant to them. That's the plan anyway!
>
> By chance I stumbled on Solr a month or so ago and I think it's awesome. I
> got the book two days ago too - fantastic!
>
> Thanks again,
> Tom
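The server.xml change Mark describes would look roughly like this. This is a sketch, not his actual config: the port and protocol are stock Tomcat defaults, 1048576 bytes is the 1MB value he mentions, and URIEncoding is added because it is the usual Tomcat-side fix for the UTF-8 GET issue this thread started with:

```xml
<!-- Hypothetical Connector in Tomcat's server.xml: raise
     maxHttpHeaderSize so long GET query strings are accepted, and
     decode URIs as UTF-8. -->
<Connector port="8080" protocol="HTTP/1.1"
           connectionTimeout="20000"
           maxHttpHeaderSize="1048576"
           URIEncoding="UTF-8" />
```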
Delete, commit, optimize doesn't reduce index file size
I have an index that used to have ~38M docs at 17.2GB. I deleted all but 13K docs using a delete by query, commit and then optimize. A "*:*" query now returns 13K docs. The problem is that the files on disk are still 17.1GB in size. I expected the optimize to shrink the files. Is there a way I can shrink them now that the index only has 13K docs?

Mark
Re: Delete, commit, optimize doesn't reduce index file size
Yonik Seeley-2 wrote:
>
> On Tue, Dec 29, 2009 at 1:23 PM, markwaddle wrote:
>> I have an index that used to have ~38M docs at 17.2GB. I deleted all but
>> 13K docs using a delete by query, commit and then optimize. A "*:*" query
>> now returns 13K docs. The problem is that the files on disk are still
>> 17.1GB in size. I expected the optimize to shrink the files. Is there a
>> way I can shrink them now that the index only has 13K docs?
>
> Are you on Windows?
> The IndexWriter can't delete files in use by the current IndexReader
> (like it can in UNIX) when the commit is done.
> If you make further changes to the index and do a commit, you should
> see the space go down.
>
> -Yonik
> http://www.lucidimagination.com

I am on Windows. Would a DataImportHandler delta-import with 1 or more changes be a sufficient change to allow the files to be deleted?

Mark
Re: Delete, commit, optimize doesn't reduce index file size
Yonik Seeley-2 wrote:
>
> If you make further changes to the index and do a commit, you should
> see the space go down.

It worked. I added a bogus document using /update and then performed a commit, and now the files are down to 6MB.

http://.../core00/update?stream.body=%3Cadd%3E%3Cdoc%3E%3Cfield%20name=%22id%22%3E0%3C/field%3E%3C/doc%3E%3C/add%3E
http://.../core00/update?stream.body=%3Ccommit/%3E

Thanks!
Mark
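For anyone squinting at those URLs: the stream.body values are just URL-encoded XML. Decoding them with the Python standard library shows the actual payloads sent to /update:

```python
# Decode the percent-encoded stream.body values from the URLs above.
from urllib.parse import unquote

add_body = unquote(
    "%3Cadd%3E%3Cdoc%3E%3Cfield%20name=%22id%22%3E0%3C/field%3E%3C/doc%3E%3C/add%3E"
)
commit_body = unquote("%3Ccommit/%3E")

print(add_body)     # <add><doc><field name="id">0</field></doc></add>
print(commit_body)  # <commit/>
```

So the fix was simply to add a throwaway document with id 0 and then issue a commit, which lets the IndexWriter finally delete the old files.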
Unexpected boolean query behavior
Here is my query:

(virt* AND "machine fingerprinting") OR (virt* AND encryption) OR (virt* AND anonymous) OR (virt* AND analytic*) AND owned:true

It can be broken down to:

(A) OR (B) OR (C) OR (D) AND E

A, B, C and D are themselves AND boolean clauses. The E clause at the end is not behaving the way I would expect. No matter how I order the A, B, C and D clauses, it always returns the equivalent of ((D) AND E).

When I add additional parentheses it behaves the way I expect, like:

((A) OR (B) OR (C) OR (D)) AND E
or
(A) OR (B) OR (C) OR ((D) AND E)

Can anyone explain why it behaves the way it does without the parentheses? Is there something I am missing in the way it processes boolean clauses?

Thanks,
Mark
Re: Unexpected boolean query behavior
That is a reasonable question. The problem here is that my users have already created numerous queries just like this one, using ANDs and ORs. My users are very technical, and they have been using the results of these queries for months now to perform analysis that drives business decisions. I need an explanation for why this is happening so I can not only train them on how to use it more effectively, but also restore their trust in the search application.

Does anyone understand this behavior? Or can you recommend a place for me to look?

Otis Gospodnetic wrote:
>
> Mark,
>
> Does it help if you rewrite your query using +/- syntax ("required",
> "prohibited"), or nothing for "should"? Because that's what happens under
> the hood (terms are required, prohibited, or should occur).
>
>
> Otis
> --
> Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch
>
>
>
> ----- Original Message -----
>> From: markwaddle
>> To: solr-user@lucene.apache.org
>> Sent: Thu, January 14, 2010 2:39:21 PM
>> Subject: Unexpected boolean query behavior
>>
>> Here is my query:
>> (virt* AND "machine fingerprinting") OR (virt* AND encryption) OR (virt*
>> AND anonymous) OR (virt* AND analytic*) AND owned:true
>>
>> It can be broken down to:
>> (A) OR (B) OR (C) OR (D) AND E
>>
>> A, B, C and D are themselves AND boolean clauses.
>>
>> The E clause at the end is not behaving the way I would expect. No matter
>> how I order the A, B, C and D clauses, it always returns the equivalent of
>> ((D) AND E).
>>
>> When I add additional parentheses it behaves the way I expect. Like:
>> ((A) OR (B) OR (C) OR (D)) AND E
>> or
>> (A) OR (B) OR (C) OR ((D) AND E)
>>
>> Can anyone explain why it behaves the way it does without the parentheses?
>> Is there something I am missing in the way it processes boolean clauses?
>>
>> Thanks,
>> Mark
Re: Unexpected boolean query behavior
That explains my exact problem, thank you! May I ask how you found that wiki posting?

Otis Gospodnetic wrote:
>
> Hi Mark,
>
> Does this help?
> http://wiki.apache.org/lucene-java/BooleanQuerySyntax
>
> Otis
> --
> Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch
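For readers hitting the same surprise: the BooleanQuerySyntax wiki page linked above describes how the classic Lucene query parser does not build a precedence tree; it walks the clauses left to right, and each AND/OR only adjusts the required/optional flag of the clauses next to it. A toy model of that rule (a deliberate simplification, not the real parser code):

```python
# Toy model of the classic Lucene query parser's flat AND/OR handling:
# clauses default to SHOULD (optional); AND marks both neighbouring
# clauses MUST (required); OR leaves the default alone.

def occurs(tokens):
    """tokens: clauses and operators in order, e.g.
    ["A", "OR", "B", "OR", "C", "OR", "D", "AND", "E"].
    Returns {clause: "MUST" or "SHOULD"}."""
    flags = {t: "SHOULD" for t in tokens if t not in ("AND", "OR")}
    for i, tok in enumerate(tokens):
        if tok == "AND":
            flags[tokens[i - 1]] = "MUST"
            flags[tokens[i + 1]] = "MUST"
    return flags

print(occurs(["A", "OR", "B", "OR", "C", "OR", "D", "AND", "E"]))
# {'A': 'SHOULD', 'B': 'SHOULD', 'C': 'SHOULD', 'D': 'MUST', 'E': 'MUST'}
```

Run on the query from this thread, A, B and C come out SHOULD while D and E come out MUST. Since MUST clauses alone decide which documents match, the result set is exactly the documents matching D AND E, with A, B and C only affecting ranking: the "((D) AND E)" behavior Mark observed. Explicit parentheses avoid the issue because each group is parsed as its own sub-query.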