Re: Problem in faceting
Change the default operator from "OR" to "AND" by using q.op or in the schema.

-
Thanx:
Grijesh
http://lucidimagination.com
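A sketch of both options mentioned above (host, port and query are placeholders): per request,

  http://localhost:8983/solr/select?q=water+treatment+plant&q.op=AND

or globally in schema.xml:

  <solrQueryParser defaultOperator="AND"/>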
Re: Facet Query
No, facet.query and the fq parameter work with any type of query. When you search with facet.query=city:mumbai it will return a facet count for that query (e.g. 3). facet.query is for faceting against a particular query. If you want the result set itself restricted to that query then you have to use fq=city:mumbai.

-
Thanx:
Grijesh
http://lucidimagination.com
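For illustration, the two side by side (host and field names as in the example):

  http://localhost:8983/solr/select?q=*:*&facet=true&facet.query=city:mumbai   (full result set, plus a count for city:mumbai)
  http://localhost:8983/solr/select?q=*:*&fq=city:mumbai                       (result set restricted to city:mumbai)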
Re: Problem in faceting
But I want the results exactly as the above query is returning them; there is no problem with the results it returns.

Problem detail: I have implemented search for my company, where the user can type any query into the search box. Now when a user searches for "water treatment plant", the results come back according to the above query, matching documents that contain "water" or "treatment" or "plant" or "water treatment plant". All these results are correct and fulfill my requirements. Along with these results I am faceting over cities for display. Currently all cities are displayed if they belong to a record matching any of the words "water", "treatment", "plant", or the phrase "water treatment plant". But now my requirement is to keep the result set as it is, yet facet only over those cities for which the complete text "water treatment plant" matches. Is it possible with a single query to Solr? Please suggest. Thanks a lot for your response.
Solr faceting on score
Hi friends, is it possible to do faceting over score? I want the facet results that have a higher score. Please suggest.
Re: Problem in faceting
Try Solr's new LocalParams; maybe that will help with your requirement. http://wiki.apache.org/solr/LocalParams

-
Thanx:
Grijesh
http://lucidimagination.com
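A sketch of what LocalParams allow in this context (untested): the default operator can be overridden just for a facet query while the main query stays as-is, e.g.

  q=water treatment plant&facet=true&facet.query={!q.op=AND}water treatment plant

Note this yields a single count of documents matching all three terms; on its own it does not break that count down per city.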
RE: Problem in faceting
Using a facet query like facet.query=+water +treatment +plant ... should give a count of 0 to documents not having all three terms. This could do the trick, if I understand how this parameter works.
RE: Problem in faceting
facet.query=+water +treatment +plant will not return the city facet that the poster needs. It will only give the count of documents matching the query facet.query=+water +treatment +plant.

-
Thanx:
Grijesh
http://lucidimagination.com
Re: SOLR 1.4 and Lucene 3.0.3 index problem
thanks Dominique

I am on windows... how do I do this on a windows 7 machine... I have netbeans and I have SVN and ant plugins

regards

Mambe Churchill Nanje
237 33011349, AfroVisioN Founder, President, CEO
http://www.afrovisiongroup.com | http://mambenanje.blogspot.com
skypeID: mambenanje
www.twitter.com/mambenanje

On Fri, Feb 4, 2011 at 8:10 AM, Dominique Bejean wrote:
> Hi,
>
> I would not try to change the lucene version in Solr 1.4.1 from 2.9.x to 3.0.x.
>
> As Koji said, the best solution is to get the branch 3.x or the trunk and build it. You need svn and ant.
>
> 1. Create a working directory
>
> $ mkdir ~/solr
>
> 2. Get the source
>
> $ cd ~/solr
> $ svn co http://svn.apache.org/repos/asf/lucene/dev/trunk
> or
> $ svn co http://svn.apache.org/repos/asf/lucene/dev/branches/branch_3x
>
> 3. Build
>
> $ cd ~/solr/modules
> $ ant compile
> $ cd ~/solr/lucene
> $ ant dist
> $ cd ~/solr/modules
> $ ant dist
>
> Dominique
>
> On 02/02/11 12:47, Churchill Nanje Mambe wrote:
>> thanks guys
>> I will try the trunk
>>
>> as for unpacking the war and changing the lucene... I am not an expert and this may get complicated for me, maybe over time when I am comfortable
>>
>> On Wed, Feb 2, 2011 at 8:03 AM, Grijesh wrote:
>>> You can extract the solr.war using java's jar -xvf solr.war command
>>>
>>> change the lucene-2.9.jar to your lucene-3.0.3.jar in the WEB-INF/lib directory
>>>
>>> then use jar -cvf solr.war * to pack the war again
>>>
>>> deploy that war; hope that works
>>>
>>> -
>>> Thanx:
>>> Grijesh
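For the Windows 7 question, the same steps from a command prompt might look roughly like this (untested sketch; it assumes command-line svn and ant are installed and on PATH rather than only as IDE plugins):

  cd %USERPROFILE%
  mkdir solr
  cd solr
  svn co http://svn.apache.org/repos/asf/lucene/dev/branches/branch_3x
  cd branch_3x\solr
  ant dist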
Re: DataImportHandler usage with RDF database
Hi Lewis,

> I am very interested in DataImportHandler. I have data stored in an RDF db and wish to use this data to boost query results via Solr. I wish to keep this data stored in db as I have a web app which directly maintains this db. Is it possible to use a DataImportHandler to read RDF data from db in memory

I don't think DIH can read from a triple store today. It can read from a RDBMS, RSS/Atom feeds, URLs, mail servers, maybe others... Maybe what you should be looking at is ManifoldCF instead, although I don't think it can fetch data from triple stores today either.

> without sending an index commit to Solr. As far as I can see DataImportHandler currently supports full and delta imports which mean I would be indexing.

I don't follow what you mean by this and how it relates to the first part.

> So far I have yet to find a requestHandler which is able to read then store data in memory, then use this data elsewhere prior to returning documents via queryResponseWriter.

I think you are talking about a custom SearchComponent that reads some data from somewhere (e.g. your triple store) and then uses it at search time for something. This sounds doable, although you didn't provide details. For example, we (Sematext) have implemented custom SearchComponents for e-commerce customers where frequently-changing information about product availability was fetched from external stores and applied to search results.

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/
RE: Problem in faceting
Yes, I see I didn't understand that facet.query parameter. Have you considered submitting two queries? One for results with q.op=OR, one for faceting with q.op=AND?

-----Original Message-----
From: Grijesh [mailto:pintu.grij...@gmail.com]
Sent: Friday, February 4, 2011 10:42 AM
To: solr-user@lucene.apache.org
Subject: RE: Problem in faceting

facet.query=+water +treatment +plant will not return the city facet that the poster needs. It will only give the count of documents matching the query facet.query=+water +treatment +plant.

-
Thanx:
Grijesh
http://lucidimagination.com
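A sketch of that two-query approach for this example (untested; the city field name is assumed):

  # query 1 - the OR results shown to the user
  q=water treatment plant&q.op=OR&rows=10

  # query 2 - facet counts only, computed over AND matches (rows=0 skips document retrieval)
  q=water treatment plant&q.op=AND&rows=0&facet=true&facet.field=city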
Re: geodist and spatial search
Hi Grant,

Thanks for the tip. This seems to work:

  q=*:*
  fq={!func}geodist()
  sfield=store
  pt=49.45031,11.077721
  fq={!bbox}
  d=40
  fl=store
  sort=geodist() asc

On Thu, Feb 3, 2011 at 7:46 PM, Grant Ingersoll wrote:
> Use a filter query? See the {!geofilt} stuff on the wiki page. That gives you your filter to restrict down your result set, then you can sort by exact distance to get your sort of just those docs that make it through the filter.
>
> On Feb 3, 2011, at 10:24 AM, Eric Grobler wrote:
>> Hi Erick,
>>
>> Thanks, I saw that example, but I am trying to sort by distance AND specify the max distance in 1 query.
>>
>> The reason is:
>> running bbox on 2 million documents with a 20km distance takes only 200ms.
>> Sorting 2 million documents by distance takes over 1.5 seconds!
>>
>> So it will be much faster for solr to first filter the 20km documents and then to sort them.
>>
>> Regards
>> Ericz
>>
>> On Thu, Feb 3, 2011 at 1:27 PM, Erick Erickson wrote:
>>> Further down that very page ...
>>>
>>> Here's an example of sorting by distance ascending:
>>>
>>> ...&q=*:*&sfield=store&pt=45.15,-93.85&sort=geodist() asc
>>>
>>> The key is just the &sort=geodist(); I'm pretty sure that's independent of the bbox, but I could be wrong.
>>>
>>> Best
>>> Erick
>>>
>>> On Wed, Feb 2, 2011 at 11:18 AM, Eric Grobler wrote:
>>>> Hi
>>>>
>>>> In http://wiki.apache.org/solr/SpatialSearch there is an example of a bbox filter and a geodist function.
>>>>
>>>> Is it possible to do a bbox filter and sort by distance - combine the two?
>>>>
>>>> Thanks
>>>> Ericz
>
> --
> Grant Ingersoll
> http://www.lucidimagination.com/
>
> Search the Lucene ecosystem docs using Solr/Lucene:
> http://www.lucidimagination.com/search
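The same parameters collapsed into a single request URL, for reference (host, port and field names as in the example above; the {!func}geodist() filter is omitted here, keeping just the bbox filter and the distance sort):

  http://localhost:8983/solr/select?q=*:*&sfield=store&pt=49.45031,11.077721&fq={!bbox}&d=40&fl=store&sort=geodist()+asc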
Re: What is the best protocol for data transfer rate HTTP or RMI?
Gustavo, I haven't used RMI in 5 years, but last time I used it I remember it being problematic - this is in the context of Lucene-based search involving some 40 different shards/servers, high query rates, and some 2 billion documents, if I remember correctly. I remember us wanting to get away from RMI to something simpler, less problematic, more HTTP-like. Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ - Original Message > From: Gustavo Maia > To: solr-user@lucene.apache.org > Sent: Thu, February 3, 2011 1:05:16 PM > Subject: What is the best protocol for data transfer rate HTTP or RMI? > > Hello, > > > > I am doing a comparative study between Lucene and Solr and wish to obtain > more concrete data on the data transfer using the lucene RemoteSearch that > uses RMI and data transfer of SOLR that uses the HTTP protocol. > > > > > Gustavo Maia >
Re: value for maxFieldLength
Lewis,

A large maxFieldLength may not necessarily result in OOM - it depends on the -Xmx you are using, the number of documents being processed concurrently, and such. So the first thing I'd look at would be my machine's RAM, then the -Xmx I can afford, then based on that set maxFieldLength.

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/

----- Original Message -----
> From: "McGibbney, Lewis John"
> To: "solr-user@lucene.apache.org"
> Sent: Wed, February 2, 2011 10:20:58 AM
> Subject: value for maxFieldLength
>
> Hello list,
>
> I am aware that setting the value of maxFieldLength in solrconfig.xml too high may/will result in out-of-mem errors. I wish to provide content extraction on a number of pdf documents which are large; by large I mean 8-11MB (occasionally more), and I am also not sure how many terms reside in each field when it is indexed. My question is therefore what is a sensible number to set this value to in order to include the majority/all terms within documents of this size.
>
> Thank you
>
> Lewis
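If the goal is simply to index every term and let memory be the limit, a common approach (memory permitting, per the -Xmx discussion above) is to set it to Integer.MAX_VALUE in solrconfig.xml:

  <maxFieldLength>2147483647</maxFieldLength>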
Re: Using terms and N-gram
Hi,

The main difference is that CommonGrams will take 2 adjacent words and put them together, while the NGram* stuff will take a single word and chop it up into sequences of one or more characters/letters.

If you are stuck with auto-complete stuff, consider http://sematext.com/products/autocomplete/index.html

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/

----- Original Message -----
> From: openvictor Open
> To: solr-user@lucene.apache.org
> Sent: Thu, February 3, 2011 10:15:47 AM
> Subject: Re: Using terms and N-gram
>
> Thank you, I will do that and hopefully it will be handy!
>
> But can someone explain to me the difference between CommonGramsFilterFactory and NGramFilterFactory? (Maybe the solution is there)
>
> Thank you all,
> best regards
>
> 2011/2/3 Grijesh
>> Use analysis.jsp to see what is happening at index time and query time with your input data. You can use highlighting to see if a match is found.
>>
>> -
>> Thanx:
>> Grijesh
>> http://lucidimagination.com
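To make that concrete, a rough sketch of what each filter emits (token order may vary by version; "the" is assumed to be configured as a common word):

  CommonGramsFilter on "the quick fox":                    the, the_quick, quick, fox
  NGramFilter on "quick" (minGramSize=2, maxGramSize=3):   qu, ui, ic, ck, qui, uic, ick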
Re: phrase, individual term, prefix, fuzzy and stemming search
Hi,

I'll admit I didn't read your email closely, but the first part makes me think that ngrams, which I don't think you mentioned, might be handy for you here, allowing for misspellings without the implementation complexity.

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/

----- Original Message -----
> From: cyang2010
> To: solr-user@lucene.apache.org
> Sent: Mon, January 31, 2011 5:22:19 PM
> Subject: phrase, individual term, prefix, fuzzy and stemming search
>
> My current project has the requirement to support search when a user inputs any number of terms across a few index fields (movie title, actor, director).
>
> In order to maximize results, I plan to support all those searches listed in the subject: phrase, individual term, prefix, fuzzy and stemming. Of course, relevance score in the right order is also important.
>
> I have considered using the dismax query. However, it does not support prefix queries. I am not sure if it supports fuzzy queries; my guess is it does not.
>
> Therefore, I still need to use the standard query. For example, if someone searches "deim moer" (typo for demi moore), I compare the phrase and terms with each searchable field (title, actor, director):
>
> title_display: "deim moer"~30 OR actors: "deim moer"~30 OR directors: "deim moer"~30
>
> title_display: deim OR actors: deim OR directors: deim
>
> title_display: deim* OR actors: deim* OR directors: deim*
>
> title_display: deim~0.6 OR actors: deim~0.6 OR directors: deim~0.6
>
> title_display: moer OR actors: moer OR directors: moer
>
> title_display: moer* OR actors: moer* OR directors: moer*
>
> title_display: moer~0.6 OR actors: moer~0.6 OR directors: moer~0.6
>
> The solr relevance score is the sum over all those ORs. In that way, I can make sure relevance scores are in order. For example, the exact match ("deim moer") will match the phrase, term, prefix and fuzzy queries all at the same time. Therefore, it will score higher than input text that only matches a term, or a prefix, or a fuzzy query. At the same time, I can apply a boost to a particular search field if the requirement needs it.
>
> Does it sound right to you? Is there a better way to achieve the same thing? My concern is that my query is not going to perform, since it tries to do too much. But isn't that what people want (maximized results) when they just type in a few search words?
>
> Another question: can I combine the results of two queries together? For example, first I query for phrase and term matches, next I query for prefix matches. Can I just append the results for the prefix match to those for the phrase/term match? I thought the two queries have different queryNorms, therefore the scores are not comparable and cannot be combined. Is that correct?
>
> Thanks. Love to hear what your thoughts are.
Re: Solr Indexing Performance
Hi,

2 GB for ramBufferSizeMB is probably too much and not needed, but you could increase it from the default 32 MB to something like 128 MB or even 512 MB, if you really have that much data where that would make a difference (you mention only 49 PDF files). I'd leave mergeFactor at 10 for now.

The slowness (if there is slowness - how long is it taking?) could be from:
* slow DB
* suboptimal SQL
* PDF content extraction
* indexing itself
* ...

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/

----- Original Message -----
> From: Tomás Fernández Löbbe
> To: solr-user@lucene.apache.org
> Sent: Mon, January 31, 2011 10:13:32 AM
> Subject: Re: Solr Indexing Performance
>
> Well, I would say that the best way to be sure is to benchmark different configurations. As far as I know, such a big RAM buffer size is usually not recommended; the default is 32 MB and you probably won't get any improvements using more than 128 MB. The same with the mergeFactor: I know that a larger merge factor is better for indexing, but 50 sounds like a lot. Anyway, as I said before, the best thing to do is benchmark different configurations and see which one works better for you.
>
> Have you tried assigning less memory to the JVM? That would leave more memory available to the OS.
>
> Tomás
>
> On Sun, Jan 30, 2011 at 1:54 AM, Darx Oman wrote:
>> Hi guys
>>
>> I'm running a solr instance (trunk) in my dev. server to test my configuration. I'm doing a DIH full import to index 49 PDF files with their corresponding database records. Both the PDF files and the database are local to the server.
>>
>> Server:
>> - Windows 2008 R2
>> - MS SQL Server 2008 R2
>> - 16 core processor
>> - 16 GB ram
>>
>> Tomcat (7.0.5):
>> - set JAVA_OPTS=%JAVA_OPTS% -Xms1024M -Xmx8192M
>>
>> Solrconfig (main index configuration): ramBufferSizeMB = 2048, mergeFactor = 50
>>
>> DIH configuration: 2 data sources defined (jdbcDataSource and BinFileDataSource); one main entity with 3 sub-entities. Total schema fields are 8, three of which are text type and multivalued.
>>
>> My DIH import status messages:
>> - Total Requests made to DataSource = 99
>> - Total Rows Fetched = 2124
>> - Total Documents Processed = 49
>> - Time Taken = 0:2:3:880
>>
>> Is this time reasonable or can it be improved?
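For reference, the adjustment being suggested would look like this in solrconfig.xml's index settings (values illustrative, not prescriptive):

  <ramBufferSizeMB>128</ramBufferSizeMB>
  <mergeFactor>10</mergeFactor>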
Re: Detect Out of Memory Errors
Hi,

There are external tools one can use to watch Java processes, listen for errors, and restart processes if they die - monit, daemontools, and some Java-specific ones.

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/

----- Original Message -----
> From: saureen
> To: solr-user@lucene.apache.org
> Sent: Thu, January 27, 2011 9:41:56 AM
> Subject: Detect Out of Memory Errors
>
> Hi,
>
> is there a way by which I could detect out of memory errors in solr, so that I could implement some functionality such as restarting tomcat or alerting me via email whenever such an error is detected?
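The JVM itself can also help: HotSpot JVMs accept a hook that runs an arbitrary command when an OutOfMemoryError is thrown. A sketch (the script path is a placeholder; check your JVM version's documentation for support):

  java -Xmx1024m -XX:OnOutOfMemoryError="/path/to/restart-tomcat.sh" ...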
Re: Performance optimization of Proximity/Wildcard searches
Salman,

I only skimmed your email, but wanted to say that this part sounds a little suspicious:

> Our warm up script currently executes all distinct queries in our logs having count > 5. It was run yesterday (with all the indexing update every

It sounds like this will make warmup take a long time, assuming you have more than a handful of distinct queries in your logs.

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/

----- Original Message -----
> From: Salman Akram
> To: solr-user@lucene.apache.org; t...@statsbiblioteket.dk
> Sent: Tue, January 25, 2011 6:32:48 AM
> Subject: Re: Performance optimization of Proximity/Wildcard searches
>
> By warmed index do you only mean warming the SOLR cache or the OS cache? As I said, our index is updated every hour so I am not sure how much the SOLR cache would help, but the OS cache should still be helpful, right?
>
> I haven't compared the results with a proper script, but from manual testing here are some of the observations.
>
> 'Recent' queries which are in cache of course return immediately (only if they are exactly the same - even if they took 3-4 mins the first time). I will need to test how many recent queries stay in cache, but still this would work only for very common queries. Users can run different queries and I want at least those to be at an 'acceptable' level (5-10 secs) even if not very fast.
>
> Our warm up script currently executes all distinct queries in our logs having count > 5. It was run yesterday (with all the indexing updates every hour after that) and today when I executed some of the same queries again their time seemed a little less (around 15-20%); I am not sure if this means anything. However, their time is still not acceptable.
>
> What do you think is the best way to compare results? First run all the warm up queries and then execute the same ones randomly and compare?
>
> We are using a Windows server; would it make a big difference if we move to Linux? Our load is not high but some queries are really complex.
>
> Also, I was hoping to move to SSD last, after trying out all software options. Is it an agreed fact that on large indexes (which don't fit in RAM) proximity/wildcard/phrase queries (on common words) will be slow, and that they can only be improved by cache warm-up and better hardware? Otherwise, with an index of around 150GB, will such queries take more than a minute?
>
> If that's the case, I know this question is very subjective, but if a single query takes 2 min on SAS 10K RPM, what would its approximate time be on a good SSD (everything else the same)?
>
> Thanks!
>
> On Tue, Jan 25, 2011 at 3:44 PM, Toke Eskildsen wrote:
>> On Tue, 2011-01-25 at 10:20 +0100, Salman Akram wrote:
>>> Cache warming is a good option too but the index gets updated every hour, so I'm not sure how much that would help.
>>
>> What is the time difference between queries with a warmed index and a cold one? If the warmed index performs satisfactorily, then one answer is to upgrade your underlying storage. As always for IO-caused performance problems in Lucene/Solr-land, SSD is the answer.
>
> --
> Regards,
>
> Salman Akram
Re: Performance optimization of Proximity/Wildcard searches
Hi,

> Sharding is an option too but that too comes with limitations so want to keep that as a last resort but I think there must be other things coz 150GB is not too big for one drive/server with 32GB Ram.

Hmm, what makes you think 32 GB is enough for your 150 GB index? It depends on the queries and the distribution of matching documents, for example. What's yours like?

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/

----- Original Message -----
> From: Salman Akram
> To: solr-user@lucene.apache.org
> Sent: Tue, January 25, 2011 4:20:34 AM
> Subject: Performance optimization of Proximity/Wildcard searches
>
> Hi,
>
> I am facing performance issues with three types of queries (and their combinations). Some of the queries take more than 2-3 mins. Index size is around 150GB.
>
> - Wildcard
> - Proximity
> - Phrases (with common words)
>
> I know CommonGrams and stop words are a good way to resolve such issues, but they don't fulfill our functional requirements (CommonGrams seem to have issues with phrase proximity, stop words have issues with exact match, etc).
>
> Sharding is an option too but that too comes with limitations, so I want to keep that as a last resort; I think there must be other things, because 150GB is not too big for one drive/server with 32GB RAM.
>
> Cache warming is a good option too but the index gets updated every hour, so I'm not sure how much that would help.
>
> What are the other main tips that can help in performance optimization of the above queries?
>
> Thanks
>
> --
> Regards,
>
> Salman Akram
Re: Highlighting with/without Term Vectors
Salman, It also depends on the size of your documents. Re-analyzing 20 fields of 500 bytes each will be a lot faster than re-analyzing 20 fields with 50 KB each. Otis Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/ - Original Message > From: Grant Ingersoll > To: solr-user@lucene.apache.org > Sent: Wed, January 26, 2011 10:44:09 AM > Subject: Re: Highlighting with/without Term Vectors > > > On Jan 24, 2011, at 2:42 PM, Salman Akram wrote: > > > Hi, > > > > Does anyone have any benchmarks how much highlighting speeds up with Term > > Vectors (compared to without it)? e.g. if highlighting on 20 documents take > > 1 sec with Term Vectors any idea how long it will take without them? > > > > I need to know since the index used for highlighting has a TVF file of > > around 450GB (approx 65% of total index size) so I am trying to see whether > > the decreasing the index size by dropping TVF would be more helpful for > > performance (less RAM, should be good for I/O too I guess) or keeping it is > > still better? > > > > I know the best way is try it out but indexing takes a very long time so > > trying to see whether its even worthy or not. > > > Try testing on a smaller set. In general, you are saving the process of >re-analyzing the content, so, to some extent it is going to be dependent on >how >fast your analyzer chain is. At the size you are at, I don't know if storing >TVs is worth it.
Re: Solr for finding similar word between two documents
Rohan,

You can really do that with Lucene's tokenizers, to get individual tokens/words, and a HashMap whose keys are the words/tokens from the first document. You can then tokenize the second doc and check each of its words against the HashMap.

Our Key Phrase Extractor ( http://sematext.com/products/key-phrase-extractor/index.html ) includes similar functionality that works with 2 corpora (or 2 pieces of text or 2 language models) and gets you the "overlap". I think it also takes into consideration term frequencies, which can be handy.

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/

----- Original Message -----
> From: rohan rai
> To: solr-user@lucene.apache.org
> Sent: Thu, February 3, 2011 2:35:39 PM
> Subject: Re: Solr for finding similar word between two documents
>
> Let's say I have a document (file) which is large and contains words inside it, and the 2nd document is also a text file. The problem is to find all those words in the 2nd document which are present in the first document, when both of the files are large.
>
> Regards
> Rohan
>
> On Fri, Feb 4, 2011 at 1:01 AM, openvictor Open wrote:
>> Rohan: what you want to do can be done with quite little effort if your document has a limited size (up to some MB) with common and basic structures like a HashMap.
>>
>> Do you have any additional information on your problem so that we can give you more useful input?
>>
>> 2011/2/3 Gora Mohanty
>>> On Thu, Feb 3, 2011 at 11:32 PM, rohan rai wrote:
>>>> Is there a way to use solr and get similar words between two documents (files).
>>> [...]
>>>
>>> This is *way* too vague to make any sense out of. Could you elaborate, as I could have sworn that what you seem to want is the essential function of a search engine.
>>>
>>> Regards,
>>> Gora
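A minimal sketch of that approach (untested; Lucene 2.9/3.0-era API; the analyzer choice and field name are arbitrary, and a HashSet stands in for the HashMap since only key membership is needed):

  import java.io.FileReader;
  import java.io.IOException;
  import java.io.Reader;
  import java.util.HashSet;
  import java.util.Set;
  import org.apache.lucene.analysis.Analyzer;
  import org.apache.lucene.analysis.TokenStream;
  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.analysis.tokenattributes.TermAttribute;
  import org.apache.lucene.util.Version;

  public class CommonWords {
    // Collect the distinct tokens of one document into a set.
    static Set<String> tokens(Analyzer a, Reader r) throws IOException {
      Set<String> set = new HashSet<String>();
      TokenStream ts = a.tokenStream("text", r);
      TermAttribute term = ts.addAttribute(TermAttribute.class);
      while (ts.incrementToken()) set.add(term.term());
      ts.close();
      return set;
    }

    public static void main(String[] args) throws IOException {
      Analyzer a = new StandardAnalyzer(Version.LUCENE_29);
      Set<String> first = tokens(a, new FileReader(args[0]));
      // Stream the second document token by token instead of holding it all in memory.
      TokenStream ts = a.tokenStream("text", new FileReader(args[1]));
      TermAttribute term = ts.addAttribute(TermAttribute.class);
      Set<String> common = new HashSet<String>();
      while (ts.incrementToken())
        if (first.contains(term.term())) common.add(term.term());
      ts.close();
      System.out.println(common);
    }
  }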
Re: Highlighting with/without Term Vectors
Basically Term Vectors are only on one main field, i.e. Contents. The average size of each document would be a few KBs, but there are around 130 million documents, so what do you suggest now?

On Fri, Feb 4, 2011 at 5:24 PM, Otis Gospodnetic wrote:
> Salman,
>
> It also depends on the size of your documents. Re-analyzing 20 fields of 500 bytes each will be a lot faster than re-analyzing 20 fields with 50 KB each.
>
> Otis
>
> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> Lucene ecosystem search :: http://search-lucene.com/
>
> ----- Original Message -----
>> From: Grant Ingersoll
>> To: solr-user@lucene.apache.org
>> Sent: Wed, January 26, 2011 10:44:09 AM
>> Subject: Re: Highlighting with/without Term Vectors
>>
>> On Jan 24, 2011, at 2:42 PM, Salman Akram wrote:
>>> Hi,
>>>
>>> Does anyone have any benchmarks for how much highlighting speeds up with Term Vectors (compared to without them)? e.g. if highlighting 20 documents takes 1 sec with Term Vectors, any idea how long it will take without them?
>>>
>>> I need to know since the index used for highlighting has a TVF file of around 450GB (approx 65% of total index size), so I am trying to see whether decreasing the index size by dropping TVF would be more helpful for performance (less RAM, should be good for I/O too I guess) or whether keeping it is still better?
>>>
>>> I know the best way is to try it out, but indexing takes a very long time, so I'm trying to see whether it's even worth it or not.
>>
>> Try testing on a smaller set. In general, you are saving the process of re-analyzing the content, so, to some extent it is going to be dependent on how fast your analyzer chain is. At the size you are at, I don't know if storing TVs is worth it.

--
Regards,

Salman Akram
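For reference, term vectors are a per-field schema.xml setting, so dropping the 450GB TVF data would mean reindexing the main field without them - roughly (field name and type taken from this thread; attribute names as in the stock schema):

  <field name="Contents" type="text" indexed="true" stored="true"
         termVectors="false" termPositions="false" termOffsets="false"/>

Highlighting then falls back to re-analyzing the stored text, which is the cost being weighed above.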
Re: Problem in faceting
Sending two separate queries is an approach, but I think it may affect the performance of Solr, because for every new search there will be two queries to Solr; for this reason I was thinking of doing it with a single query. I am going to implement it with two queries now, but if anything useful is found in the future then please suggest it. Thanks for the suggestion.

--
Thanks and Regards
Bagesh Sharma
Re: Facet Query
yes it works fine ... thanks
RE: Index Not Matching
Hello Grijesh,

The URL below returns a 404 with the following error:

The requested resource (/select/) is not available.

-----Original Message-----
From: Grijesh [mailto:pintu.grij...@gmail.com]
Sent: Friday, February 04, 2011 12:17 AM
To: solr-user@lucene.apache.org
Subject: RE: Index Not Matching

http://localhost:8080/select/?q=*:* will return all records from solr

-
Thanx:
Grijesh
http://lucidimagination.com
Re: Index Not Matching
try http://localhost:8080/solr/select?q=*:* or, while using solr's default port, http://localhost:8983/solr/select?q=*:*

On Fri, Feb 4, 2011 at 2:50 PM, Esclusa, Will wrote:
> Hello Grijesh,
>
> The URL below returns a 404 with the following error:
>
> The requested resource (/select/) is not available.
>
> -----Original Message-----
> From: Grijesh [mailto:pintu.grij...@gmail.com]
> Sent: Friday, February 04, 2011 12:17 AM
> To: solr-user@lucene.apache.org
> Subject: RE: Index Not Matching
>
> http://localhost:8080/select/?q=*:* will return all records from solr
>
> -
> Thanx:
> Grijesh
> http://lucidimagination.com
Re: Use Parallel Search
Hello,

I am not using Nutch. Let me explain more about how I use Lucene. Lucene has a remote search class with which a server machine publishes its index:

  RemoteSearchable remote = new RemoteSearchable(parallelSearcher);
  Naming.rebind("//" + LocalIP + "/" + artPortMap.getNick(), remote);

On the client it is only necessary to do a lookup based on the IP of the machine. On each machine we use the parallel search class, which allows me to search in parallel using different processors and different hds. So with 6 hds and a machine with more than 6 processors, the search is perfectly parallel. Here is how the client obtains a reference to the server machine:

  Searchable ts = (Searchable) Naming.lookup("//" + ip + ":" + port + "/" + name);

All my documents are in XML format; I have a pre-processing step that converts HTML, DOC and PDF documents to XML. The searches did not use facets, because with Lucene that is not possible. That's one reason I'm studying Solr - today I have the need to use facets :). I use queries with sorting, filtering and multiple fields.

With this architecture I have an index of 18 fragments scattered over the 18 hds of three machines, each index fragment with a size of 10GB, which gives me a 180GB total index size. But I'm afraid because the index will be multiplied by 10, going from 180GB to 1800GB. Is Apache Solr better suited for this new index size, or can I continue using Lucene and just add more machines?

2011/2/4 Ganesh
> I am having a similar kind of problem. I need to scale out. Could you explain how you have done distributed indexing and search using Lucene?
>
> Regards
> Ganesh
>
> ----- Original Message -----
> From: "Gustavo Maia"
> To:
> Sent: Thursday, February 03, 2011 11:36 PM
> Subject: Use Parallel Search
>
>> Hello,
>>
>> Let me give a brief description of my scenario. Today I am only using Lucene 2.9.3. I have an index of 30 million documents distributed over three machines, each machine with 6 hds (15k rpm). The server queries the search index using the remote search class, and each machine searches using the parallel search (searching simultaneously across its 6 hds). So a search uses the three machines and 18 hds, returning a very good response time.
>>
>> Today I am studying Solr and am interested in knowing more about distributed search and the use of parallel search on the same machine. What would be the best scenario using Solr that is better than what I already have today with Lucene alone?
>> Note: do I need to have 6 Solr instances installed on each machine, one for each hd? Or is there some other way for me to use the 6 hds without having 6 instances of Solr server?
>>
>> Another question: does Solr have some limit on index size per hard drive? It would be interesting not to let the index get too big, because the bigger the index, the longer the search.
>>
>> Thanks for everything.
>>
>> Gustavo Maia
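For comparison, Solr's built-in distributed search spreads one query over several cores/instances via the shards parameter, and (unlike the raw Lucene setup above) distributed faceting works over it too. A sketch, e.g. one core per disk (hosts and core names are placeholders):

  http://host1:8983/solr/core1/select?q=foo&shards=host1:8983/solr/core1,host1:8983/solr/core2,host2:8983/solr/core1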
Re: What is the best protocol for data transfer rate HTTP or RMI?
Hi Otis,

You have many documents - 2 billion. Could you explain to me how this is set up on your side? Mine is defined as follows, but using Lucene. I have 3 machines, and each machine has 6 hds. Each hd holds an index fragment of 10GB. So I have 3 search servers. Each server uses Lucene's parallel search over its 6 hds and publishes that searcher using the remote search class. My client connects these three machines using RMI. Everything is done using Lucene and the classes it provides.

Please explain how you did the distribution of the index. How many hds do you use per machine? What is the maximum index size you use per hd? Are you using Solr or Lucene? How many Solr server instances do you have on each machine?

Sorry for so many questions.

Gustavo Maia

2011/2/4 Otis Gospodnetic
> Gustavo,
>
> I haven't used RMI in 5 years, but last time I used it I remember it being problematic - this is in the context of Lucene-based search involving some 40 different shards/servers, high query rates, and some 2 billion documents, if I remember correctly. I remember us wanting to get away from RMI to something simpler, less problematic, more HTTP-like.
>
> Otis
>
> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> Lucene ecosystem search :: http://search-lucene.com/
> [...]
Re: What is the best protocol for data transfer rate HTTP or RMI?
Hi Guys, It depends on what properties you're trying to maximize. I've done several studies of this over the years: http://sunset.usc.edu/~mattmann/pubs/MSST2006.pdf http://sunset.usc.edu/~mattmann/pubs/IWICSS07.pdf http://sunset.usc.edu/~mattmann/pubs/icse-shark08.pdf And if you're really bored, and have time, this one: http://sunset.usc.edu/~mattmann/Dissertation.pdf It would be nice to see how Lucene/Solr as an application that induces distribution scenarios affects the underlying data transfer, similar to the approaches described in the above papers. HTH! Cheers, Chris On Feb 4, 2011, at 3:32 AM, Otis Gospodnetic wrote: > Gustavo, > > I haven't used RMI in 5 years, but last time I used it I remember it being > problematic - this is in the context of Lucene-based search involving some 40 > different shards/servers, high query rates, and some 2 billion documents, if > I > remember correctly. I remember us wanting to get away from RMI to something > simpler, less problematic, more HTTP-like. > > Otis > > Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch > Lucene ecosystem search :: http://search-lucene.com/ > > > > - Original Message >> From: Gustavo Maia >> To: solr-user@lucene.apache.org >> Sent: Thu, February 3, 2011 1:05:16 PM >> Subject: What is the best protocol for data transfer rate HTTP or RMI? >> >> Hello, >> >> >> >> I am doing a comparative study between Lucene and Solr and wish to obtain >> more concrete data on the data transfer using the lucene RemoteSearch that >> uses RMI and data transfer of SOLR that uses the HTTP protocol. >> >> >> >> >> Gustavo Maia >> ++ Chris Mattmann, Ph.D. Senior Computer Scientist NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 171-266B, Mailstop: 171-246 Email: chris.a.mattm...@nasa.gov WWW: http://sunset.usc.edu/~mattmann/ ++ Adjunct Assistant Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA ++
prices
Using solr 1.4.

I have a price in my schema. Currently it's a tfloat. Somewhere along the way from php, json, solr, and back, extra zeroes are getting truncated along with the decimal point for even dollar amounts.

So I have two questions, neither of which seemed to be findable with google.

A/ Any way to keep both zeroes going into a float field? (In the analyzer, with XML output, the values are shown with 1 zero)
B/ Can strings be used in range queries like a float and work well for prices?

Dennis Gearon

Signature Warning
It is always a good idea to learn from your own mistakes. It is usually a better idea to learn from others' mistakes, so you do not have to make them yourself. from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036'

EARTH has a Right To Life, otherwise we all die.
RE: DataImportHandler usage with RDF database
Hi Otis... thanks for your thoughts.

> I don't think DIH can read from a triple store today. It can read from a RDBMS, RSS/Atom feeds, URLs, mail servers, maybe others... Maybe what you should be looking at is ManifoldCF instead, although I don't think it can fetch data from triple stores today either.

Ok, well, a way I can work around this (for the time being) is to pull data from URLs instead.

>> without sending an index commit to Solr. As far as I can see DataImportHandler currently supports full and delta imports which mean I would be indexing.
>
> I don't follow what you mean by this and how it relates to the first part.

Well, as you mentioned below, I'm talking about a custom SearchComponent that reads some data from somewhere (a URL for the time being) and then uses it at search time for something. I have no need to index this data; I merely require it at search time.

>> So far I have yet to find a requestHandler which is able to read then store data in memory, then use this data elsewhere prior to returning documents via queryResponseWriter.
>
> I think you are talking about a custom SearchComponent that reads some data from somewhere (e.g. your triple store) and then uses it at search time for something. This sounds doable, although you didn't provide details. For example, we (Sematext) have implemented custom SearchComponents for e-commerce customers where frequently-changing information about product availability was fetched from external stores and applied to search results.

I have web based files and the idea is to specify the URLs to the SearchComponent, which can then use the data within them at search time. Did your plug-in adhere to the general requestHandler design? Can you provide any resource from which I can get started with this?

thank you
Lewis
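For getting started, a skeletal custom SearchComponent might look like the sketch below (class and package names are made up; API as in Solr 1.4/3.x, where components are chained into a SearchHandler rather than replacing it):

  import java.io.IOException;
  import org.apache.solr.handler.component.ResponseBuilder;
  import org.apache.solr.handler.component.SearchComponent;

  public class ExternalDataComponent extends SearchComponent {
    // Data fetched from your URLs; refresh it however/whenever you need.
    private volatile Object externalData;

    @Override
    public void prepare(ResponseBuilder rb) throws IOException {
      // Runs before the query executes, e.g. rewrite rb.getQuery() using externalData.
    }

    @Override
    public void process(ResponseBuilder rb) throws IOException {
      // Runs after the query executes, e.g. decorate the response.
      rb.rsp.add("externalData", "...");
    }

    // SolrInfoMBean boilerplate.
    @Override public String getDescription() { return "uses external URL data at search time"; }
    @Override public String getSource() { return ""; }
    @Override public String getSourceId() { return ""; }
    @Override public String getVersion() { return ""; }
  }

It would then be registered in solrconfig.xml and appended to a normal handler, e.g.:

  <searchComponent name="externalData" class="com.example.ExternalDataComponent"/>
  <requestHandler name="/withdata" class="solr.SearchHandler">
    <arr name="last-components"><str>externalData</str></arr>
  </requestHandler>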
Re: prices
On Fri, Feb 4, 2011 at 12:56 PM, Dennis Gearon wrote:
> Using solr 1.4.
>
> I have a price in my schema. Currently it's a tfloat. Somewhere along the way from php, json, solr, and back, extra zeroes are getting truncated along with the decimal point for even dollar amounts.
>
> So I have two questions, neither of which seemed to be findable with google.
>
> A/ Any way to keep both zeroes going into a float field? (In the analyzer, with XML output, the values are shown with 1 zero)
> B/ Can strings be used in range queries like a float and work well for prices?

You could do a copyField into a stored string field and use the tfloat (or tint and store cents) for range queries, searching, etc, and the string field just for display.

-Yonik
http://lucidimagination.com
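A sketch of that schema arrangement (field names are placeholders):

  <field name="price" type="tfloat" indexed="true" stored="false"/>
  <field name="price_display" type="string" indexed="false" stored="true"/>
  <copyField source="price" dest="price_display"/>

copyField copies the raw incoming value, so "12.00" survives verbatim in the string field for display, while the tfloat field still serves sorting and range queries.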
Re: HTTP ERROR 400 undefined field: *
Sorry for the lack of details. It's all clear in my head.. :)

We checked out the head revision from the 3.x branch a few weeks ago (https://svn.apache.org/repos/asf/lucene/dev/branches/branch_3x/). We picked up r1058326. We upgraded from a previous checkout (r960098). I am using our customized schema.xml and the solrconfig.xml from the old revision with the new checkout. After upgrading I just copied the data folders from each core into the new checkout (hoping I wouldn't have to re-index the content, as this takes days). Everything seems to work fine, except that now I can't get the score to return. The stack trace is attached.

I also saw this warning in the logs, not sure exactly what it's talking about:

Feb 3, 2011 8:14:10 PM org.apache.solr.core.Config getLuceneVersion
WARNING: the luceneMatchVersion is not specified, defaulting to LUCENE_24 emulation. You should at some point declare and reindex to at least 3.0, because 2.4 emulation is deprecated and will be removed in 4.0. This parameter will be mandatory in 4.0.

Here is my request handler; the actual fields here are different than what is in mine, but I'm a little uncomfortable publishing how our company's search service works to the world:

  <lst name="defaults">
    <str name="echoParams">explicit</str>
    <str name="defType">edismax</str>
    <bool name="tv">true</bool>
    <str name="qf">field_a^2 field_b^2 field_c^4 field_d^10</str>
    <float name="tie">0.1</float>
  </lst>
  <arr name="last-components">
    <str>tvComponent</str>
  </arr>

Anyway, hopefully this is enough info; let me know if you need more.

Jed.

On 02/03/2011 10:29 PM, Chris Hostetter wrote:

: I was working on a checkout of the 3.x branch from about 6 months ago.
: Everything was working pretty well, but we decided that we should update and
: get what was at the head. However after upgrading, I am now getting this

FWIW: please be specific. "head" of what? the 3x branch? or trunk? what revision in svn does that correspond to? (the "svnversion" command will tell you)

: HTTP ERROR 400 undefined field: *
:
: If I clear the fl parameter (default is set to *, score) then it works fine
: with one big problem, no score data. If I try and set fl=score I get the same
: error except it says undefined field: score?!
:
: This works great in the older version, what changed? I've googled for about
: an hour now and I can't seem to find anything.

i can't reproduce this using either trunk (r1067044) or 3x (r1067045)

all of these queries work just fine...

http://localhost:8983/solr/select/?q=*
http://localhost:8983/solr/select/?q=solr&fl=*,score
http://localhost:8983/solr/select/?q=solr&fl=score
http://localhost:8983/solr/select/?q=solr

...you'll have to provide us with a *lot* more details to help understand why you might be getting an error (like: what your configs look like, what the request looks like, what the full stack trace of your error is in the logs, etc...)
-Hoss

Feb 3, 2011 8:16:58 PM org.apache.solr.core.SolrCore execute
INFO: [music] webapp=/solr path=/select params={explainOther=&fl=*,score&indent=on&start=0&q=test&hl.fl=&qt=standard&wt=standard&fq=&version=2.2&rows=10} hits=2201 status=400 QTime=143
Feb 3, 2011 8:17:00 PM org.apache.solr.core.SolrCore execute
INFO: [rovi] webapp=/solr path=/replication params={command=indexversion&wt=javabin} status=0 QTime=0
Feb 3, 2011 8:17:00 PM org.apache.solr.core.SolrCore execute
INFO: [rovi] webapp=/solr path=/replication params={command=filelist&wt=javabin&indexversion=1277332208072} status=0 QTime=0
Feb 3, 2011 8:17:00 PM org.apache.solr.core.SolrCore execute
INFO: [rovi] webapp=/solr path=/replication params={command=indexversion&wt=javabin} status=0 QTime=0
Feb 3, 2011 8:17:09 PM org.apache.solr.common.SolrException log
SEVERE: org.apache.solr.common.SolrException: undefined field: score
 at org.apache.solr.handler.component.TermVectorComponent.process(TermVectorComponent.java:142)
 at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:194)
 at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
 at org.apache.solr.core.SolrCore.execute(SolrCore.java:1357)
 at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:341)
 at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:244)
 at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
 at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
 at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
 at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
 at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
 at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
 at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
 at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:1
RE: prices
Your prices are just dollars and cents? For actual queries, you might consider an int type rather than a float type. Multiply by a hundred to put the value in the index, then multiply your values in queries by a hundred before putting them in the query. Same for range faceting; just divide by 100 before displaying anything you get back. Fixed-precision values like prices aren't really floats and don't really need floats, and floats sometimes do weird things, as you've noticed.

Alternately, if your problem is simply that you want to display "2.0" as "2.00" rather than "2" or "2.0", that is something for you to take care of in your PHP app that does the display. PHP will have some function for formatting numbers and saying with what precision you want to display them. There is no way to keep two trailing zeroes 'in' a float field, because "2.0" or "2." is the same value as "2.00", so they've all got the same internal representation in the float field. There is no way I know of to tell Solr what precision to render floats with in its responses.

________________________________________
From: ysee...@gmail.com [ysee...@gmail.com] On Behalf Of Yonik Seeley [yo...@lucidimagination.com]
Sent: Friday, February 04, 2011 1:49 PM
To: solr-user@lucene.apache.org
Subject: Re: prices

On Fri, Feb 4, 2011 at 12:56 PM, Dennis Gearon wrote:
> Using solr 1.4.
> [...]

You could do a copyField into a stored string field and use the tfloat (or tint and store cents) for range queries, searching, etc, and the string field just for display.

-Yonik
http://lucidimagination.com
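For illustration, the scaling round trip with an int "cents" field might look like this (field name assumed):

  indexing:  12.34 dollars          ->  price_cents:1234
  querying:  10.00 to 19.99         ->  price_cents:[1000 TO 1999]
  display:   1234 / 100, formatted  ->  "12.34"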
Re: Using terms and N-gram
Hi Otis,

That's good, I finally made it. As for Sematext, I am afraid that I am too poor to consider this solution :) (I am doing that for fun). Thank you anyway!

2011/2/4 Otis Gospodnetic
> Hi,
>
> The main difference is that CommonGrams will take 2 adjacent words and put them together, while the NGram* stuff will take a single word and chop it up into sequences of one or more characters/letters.
>
> If you are stuck with auto-complete stuff, consider http://sematext.com/products/autocomplete/index.html
>
> Otis
>
> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> Lucene ecosystem search :: http://search-lucene.com/
> [...]
Re: prices
That's a good idea, Yonik. So fields that aren't stored don't get displayed, and the float field in the schema never gets seen by the user. Good, I like it.

Dennis Gearon

----- Original Message -----
From: Yonik Seeley
To: solr-user@lucene.apache.org
Sent: Fri, February 4, 2011 10:49:42 AM
Subject: Re: prices

On Fri, Feb 4, 2011 at 12:56 PM, Dennis Gearon wrote:
> Using solr 1.4.
> [...]

You could do a copyField into a stored string field and use the tfloat (or tint and store cents) for range queries, searching, etc, and the string field just for display.

-Yonik
http://lucidimagination.com
Re: Performance optimization of Proximity/Wildcard searches
I know, so we are not really using it for regular warm-ups (in any case the index is updated on an hourly basis). I just tried it a few times to compare results. The issue is I am not even sure warming up is useful with such regular updates.

On Fri, Feb 4, 2011 at 5:16 PM, Otis Gospodnetic wrote:
> Salman,
>
> I only skimmed your email, but wanted to say that this part sounds a little suspicious:
>
>> Our warm up script currently executes all distinct queries in our logs having count > 5. It was run yesterday (with all the indexing update every
>
> It sounds like this will make warmup take a long time, assuming you have more than a handful of distinct queries in your logs.
>
> Otis
>
> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> Lucene ecosystem search :: http://search-lucene.com/
> [...]

--
Regards,

Salman Akram
As always for IO-caused > performance > > > problem in Lucene/Solr-land, SSD is the answer. > > > > > > > > > > > > -- > > Regards, > > > > Salman Akram > > > -- Regards, Salman Akram
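For readers hitting this thread later: a hand-rolled warm-up script like the one above can also be expressed with Solr's built-in autowarming, so the queries replay automatically against each new searcher after a commit. A minimal sketch for solrconfig.xml - the query strings are placeholders, not taken from this thread:

  <listener event="newSearcher" class="solr.QuerySenderListener">
    <arr name="queries">
      <!-- a few representative heavy queries from the logs -->
      <lst><str name="q">some common phrase query</str></lst>
      <lst><str name="q">another frequent query</str></lst>
    </arr>
  </listener>

A "firstSearcher" listener of the same shape covers the cold-start case; with hourly commits, keep the list short, or warming may still be running when the next commit lands.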
Re: Performance optimization of Proximity/Wildcard searches
Well, I assume many people out there have indexes larger than 100GB, and normally you won't have more than 32GB or 64GB of RAM!

As I mentioned, the queries are mostly phrase, proximity, wildcard, and combinations of these.

What exactly do you mean by the distribution of documents? In this index our documents are no more than a few hundred KB on average (file system size) and there are around 14 million documents; 80% of the index size is taken up by the position file. I am not sure if this is what you asked?

On Fri, Feb 4, 2011 at 5:19 PM, Otis Gospodnetic wrote:
> Hi,
>
> > Sharding is an option too, but that too comes with limitations, so I want to keep it as a last resort; I think there must be other things, because 150GB is not too big for one drive/server with 32GB RAM.
>
> Hmm, what makes you think 32 GB is enough for your 150 GB index?
> It depends on queries and the distribution of matching documents, for example.
> What's yours like?
>
> Otis
>
> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> Lucene ecosystem search :: http://search-lucene.com/
>
> - Original Message
> > From: Salman Akram
> > To: solr-user@lucene.apache.org
> > Sent: Tue, January 25, 2011 4:20:34 AM
> > Subject: Performance optimization of Proximity/Wildcard searches
> >
> > Hi,
> >
> > I am facing performance issues with three types of queries (and their combinations). Some of the queries take more than 2-3 mins. Index size is around 150GB.
> >
> > - Wildcard
> > - Proximity
> > - Phrases (with common words)
> >
> > I know CommonGrams and stop words are a good way to resolve such issues, but they don't fulfill our functional requirements (CommonGrams seems to have issues with phrase proximity, stop words have issues with exact match, etc).
> >
> > Sharding is an option too, but that too comes with limitations, so I want to keep it as a last resort; I think there must be other things, because 150GB is not too big for one drive/server with 32GB RAM.
> >
> > Cache warming is a good option too, but the index gets updated every hour, so not sure how much that would help.
> >
> > What are the other main tips that can help in performance optimization of the above queries?
> >
> > Thanks

--
Regards,

Salman Akram
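For reference, the CommonGrams approach mentioned in the quoted post is configured in schema.xml roughly as follows - a sketch only, with illustrative field-type and words-file names:

  <fieldType name="text_cg" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <!-- index pairs such as "the_plant" alongside the single terms -->
      <filter class="solr.CommonGramsFilterFactory" words="commonwords.txt" ignoreCase="true"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <!-- rewrite phrase queries over common words to use the indexed pairs -->
      <filter class="solr.CommonGramsQueryFilterFactory" words="commonwords.txt" ignoreCase="true"/>
    </analyzer>
  </fieldType>

The point is that phrase queries containing very frequent terms no longer walk those terms' huge position lists, at the cost of a larger index - which is exactly the trade-off (and the phrase-proximity caveat) debated above.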
NullPointerException on queries to new 3rd core
I just moved to a multicore Solr instance a few weeks ago, and it's been working great. I'm trying to add a 3rd core, but I can't query against it. I'm running 1.4.1 (and tried 1.4.0) with the spatial search plugin. This is the section in solr.xml. I've removed the index dir and completely rebuilt all three cores from scratch. I can query the old ones, but any query against the new one gives me this error:

HTTP ERROR: 500
null
java.lang.NullPointerException
    at org.apache.solr.request.XMLWriter.writePrim(XMLWriter.java:761)
    at org.apache.solr.request.XMLWriter.writeStr(XMLWriter.java:619)
    at org.apache.solr.schema.TextField.write(TextField.java:45)
    at org.apache.solr.schema.SchemaField.write(SchemaField.java:108)
    at org.apache.solr.request.XMLWriter.writeDoc(XMLWriter.java:311)
    at org.apache.solr.request.XMLWriter$3.writeDocs(XMLWriter.java:483)
    at org.apache.solr.request.XMLWriter.writeDocuments(XMLWriter.java:420)
    at org.apache.solr.request.XMLWriter.writeDocList(XMLWriter.java:457)
    at org.apache.solr.request.XMLWriter.writeVal(XMLWriter.java:520)
    at org.apache.solr.request.XMLWriter.writeResponse(XMLWriter.java:130)
    at org.apache.solr.request.XMLResponseWriter.write(XMLResponseWriter.java:34)
    at org.apache.solr.servlet.SolrDispatchFilter.writeResponse(SolrDispatchFilter.java:325)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:254)
    at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1089)
    at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:365)
    at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
    at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
    at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:712)
    at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:405)
    at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:211)
    at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
    at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:139)
    at org.mortbay.jetty.Server.handle(Server.java:285)
    at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:502)
    at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:821)
    at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:513)
    at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:208)
    at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:378)
    at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:226)
    at org.mortbay.thread.BoundedThreadPool$PoolThread.run(BoundedThreadPool.java:442)

I'm not finding any reason why this should be happening.
Re: phrase, individual term, prefix, fuzzy and stemming search
You mentioned that dismax does not support wildcards, but edismax does. Not sure if dismax would have solved your other problems, or whether you just had to shift gears because of the wildcard issue, but you might want to have a look at edismax.

-Jay
http://www.lucidimagination.com

On Mon, Jan 31, 2011 at 2:22 PM, cyang2010 wrote:
> My current project has the requirement to support search when a user inputs any number of terms across a few index fields (movie title, actor, director).
>
> In order to maximize results, I plan to support all the searches listed in the subject: phrase, individual term, prefix, fuzzy and stemming. Of course, relevance scores in the right order are also important.
>
> I have considered using the dismax query. However, it does not support prefix queries. I am not sure if it supports fuzzy queries; my guess is it does not.
>
> Therefore, I still need to use the standard query. For example, if someone searches "deim moer" (typo for demi moore), I compare the phrase and terms against each searchable field (title, actor, director):
>
> title_display: "deim moer"~30 actors: "deim moer"~30 directors: "deim moer"~30 <-- OR
>
> title_display: deim <-- OR
> actors: deim
> directors: deim
>
> title_display: deim* <-- OR
> actors: deim*
> directors: deim*
>
> title_display: deim~0.6 <-- OR
> actors: deim~0.6
> directors: deim~0.6
>
> title_display: moer <-- OR
> actors: moer
> directors: moer
>
> title_display: moer* <-- OR
> actors: moer*
> directors: moer*
>
> title_display: moer~0.6 <-- OR
> actors: moer~0.6
> directors: moer~0.6
>
> The Solr relevance score is the sum over all those ORs. That way, I can make sure the relevance scores are in order. For example, an exact match ("deim moer") will match the phrase, term, prefix and fuzzy queries all at the same time, so it will score higher than input text that only matches a term, prefix or fuzzy query. At the same time, I can apply a boost to a particular search field if the requirement calls for it.
>
> Does this sound right to you? Are there better ways to achieve the same thing? My concern is that my query is not going to perform well, since it tries to do too much. But isn't that what people want (maximized results) when they just type in a few search words?
>
> Another question: can I combine the results of two queries? For example, first I query for phrase and term matches, next I query for prefix matches. Can I just append the results of the prefix match to those of the phrase/term match? I thought the two queries would have different queryNorms, so the scores would not be comparable and could not be combined. Is that correct?
>
> Thanks, I'd love to hear your thoughts.
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/phrase-inidividual-term-prefix-fuzzy-and-stemming-search-tp239p239.html
> Sent from the Solr - User mailing list archive at Nabble.com.
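To make the suggestion concrete, an edismax request that spreads the user's raw terms across the three fields from the quoted post could look like this - a sketch, with an arbitrary boost on the title field:

  http://localhost:8983/solr/select?defType=edismax&q=deim+moer&qf=title_display^2.0+actors+directors

edismax parses the free-text q against every field in qf, supports wildcards, and handles per-field boosts, which removes much of the hand-built OR machinery above; fuzzy clauses would still need to be added by hand.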
WordDelimiterFilterFactory
If I use WordDelimiterFilterFactory during indexing and at query time, will a search for "cls500" find "cls 500" and "cls500x"? If so, will it find and score exact matches higher? If not, how do you get exact matches to display first?
Re: WordDelimiterFilterFactory
You can always try something like this out in the analysis.jsp page, accessible from the Solr Admin home. Check out that page and see how it allows you to enter text to represent what was indexed, and text for a query. You can then see if there are matches. Very handy for seeing how the various filters in a field type act on text. Make sure to check "verbose output" for both index and query.

For this specific issue, yes, a query for "cls500" will match both of those examples. To get the exact match to score higher:

- create a text field (or a custom type that uses the WordDelimiterFilterFactory) (let's name the field "foo")
- create a string field (let's name it "foo_string")
- create a "copyField" with the source being "foo" and the dest being "foo_string"
- use dismax (or edismax) to search both of those fields (spelled out in the schema sketch below)

http://localhost:8983/solr/select/?q=cls500&defType=edismax&qf=foo+foo_string

This should score the string field higher, but you could also add a boost to it to make sure:

http://localhost:8983/solr/select/?q=cls500&defType=edismax&qf=foo+foo_string^4.0

-Jay
http://lucidimagination.com

On Fri, Feb 4, 2011 at 4:25 PM, John kim wrote:
> If I use WordDelimiterFilterFactory during indexing and at query time, will a search for "cls500" find "cls 500" and "cls500x"? If so, will it find and score exact matches higher? If not, how do you get exact matches to display first?
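Spelled out in schema.xml, the field setup from those steps might look like this - a sketch, where the "text" type is assumed to include WordDelimiterFilterFactory, as in the example schema:

  <field name="foo" type="text" indexed="true" stored="true"/>
  <field name="foo_string" type="string" indexed="true" stored="false"/>
  <copyField source="foo" dest="foo_string"/>

The string field is not tokenized, so it only matches the exact original value ("cls500"); that is what lets dismax/edismax rank exact matches above the word-delimited ones.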
Re: What is the best protocol for data transfer rate HTTP or RMI?
Hi Gustavo,

I think none of the answers I could give you would be valuable to you now, because they would be from circa 2007 or 2008. We didn't use Solr, just Lucene.

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/

- Original Message
> From: Gustavo Maia
> To: solr-user@lucene.apache.org
> Sent: Fri, February 4, 2011 10:15:09 AM
> Subject: Re: What is the best protocol for data transfer rate HTTP or RMI?
>
> Hi Otis,
>
> You have many documents - 2 billion. Could you explain how that setup of yours is arranged?
>
> Mine is defined as follows, using Lucene. I have 3 machines, each machine with 6 HDs. Each HD holds an index fragment of 10GB, so I have 3 search servers. Each server uses the Lucene ParallelSearch class over its 6 HDs and publishes the searcher using the RemoteSearch class. My client connects to these three machines using RMI. Everything is done with Lucene, using the classes it provides.
>
> Please explain how you distributed the index. How many HDs do you use per machine? What is the maximum index size you put on one HD? Are you using SOLR or Lucene? How many SOLR server instances do you have on each machine?
>
> Sorry for so many questions.
>
> Gustavo Maia
>
> 2011/2/4 Otis Gospodnetic
> > Gustavo,
> >
> > I haven't used RMI in 5 years, but last time I used it I remember it being problematic - this is in the context of Lucene-based search involving some 40 different shards/servers, high query rates, and some 2 billion documents, if I remember correctly. I remember us wanting to get away from RMI to something simpler, less problematic, more HTTP-like.
> >
> > Otis
> >
> > Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> > Lucene ecosystem search :: http://search-lucene.com/
> >
> > - Original Message
> > > From: Gustavo Maia
> > > To: solr-user@lucene.apache.org
> > > Sent: Thu, February 3, 2011 1:05:16 PM
> > > Subject: What is the best protocol for data transfer rate HTTP or RMI?
> > >
> > > Hello,
> > >
> > > I am doing a comparative study between Lucene and Solr and wish to obtain more concrete data on data transfer using the Lucene RemoteSearch, which uses RMI, versus SOLR's data transfer, which uses the HTTP protocol.
> > >
> > > Gustavo Maia
Re: Highlighting with/without Term Vectors
Hi Salman,

Ah, so in the end you *did* have TV enabled on one of your fields! :)
(I think this was a problem we were trying to solve a few weeks ago here)

How many docs you have in the index doesn't matter here - only the N docs/fields that you need to display on a page with N results need to be reanalyzed for highlighting purposes. So follow Grant's advice: make a small index without TV, and compare highlighting speed with and without TV.

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/

- Original Message
> From: Salman Akram
> To: solr-user@lucene.apache.org
> Sent: Fri, February 4, 2011 8:03:06 AM
> Subject: Re: Highlighting with/without Term Vectors
>
> Basically Term Vectors are only on one main field, i.e. Contents. The average size of each document would be a few KB, but there are around 130 million documents, so what do you suggest now?
>
> On Fri, Feb 4, 2011 at 5:24 PM, Otis Gospodnetic wrote:
> > Salman,
> >
> > It also depends on the size of your documents. Re-analyzing 20 fields of 500 bytes each will be a lot faster than re-analyzing 20 fields with 50 KB each.
> >
> > Otis
> >
> > Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> > Lucene ecosystem search :: http://search-lucene.com/
> >
> > - Original Message
> > > From: Grant Ingersoll
> > > To: solr-user@lucene.apache.org
> > > Sent: Wed, January 26, 2011 10:44:09 AM
> > > Subject: Re: Highlighting with/without Term Vectors
> > >
> > > On Jan 24, 2011, at 2:42 PM, Salman Akram wrote:
> > > > Hi,
> > > >
> > > > Does anyone have any benchmarks for how much highlighting speeds up with Term Vectors (compared to without them)? E.g. if highlighting on 20 documents takes 1 sec with Term Vectors, any idea how long it will take without them?
> > > >
> > > > I need to know since the index used for highlighting has a TVF file of around 450GB (approx 65% of total index size), so I am trying to see whether decreasing the index size by dropping the TVF would be more helpful for performance (less RAM, should be good for I/O too I guess), or whether keeping it is still better.
> > > >
> > > > I know the best way is to try it out, but indexing takes a very long time, so I'm trying to see whether it's even worth it or not.
> > >
> > > Try testing on a smaller set. In general, you are saving the process of re-analyzing the content, so, to some extent it is going to be dependent on how fast your analyzer chain is. At the size you are at, I don't know if storing TVs is worth it.
>
> --
> Regards,
>
> Salman Akram
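For context, term vectors are a per-field schema setting, so the trade-off above is decided in schema.xml; a sketch, with an illustrative field name:

  <!-- with the three termVector* attributes, highlighting reads the stored vectors (the .tvf file) -->
  <field name="contents" type="text" indexed="true" stored="true"
         termVectors="true" termPositions="true" termOffsets="true"/>

Dropping those three attributes and reindexing is what removes the .tvf data; the highlighter then falls back to re-analyzing the stored text of each returned document, which is exactly the cost Grant and Otis are weighing.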
Re: Performance optimization of Proximity/Wildcard searches
Salman,

Warming up may be useful if your caches are getting decent hit ratios. Plus, you are warming up the OS cache when you warm up.

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/

- Original Message
> From: Salman Akram
> To: solr-user@lucene.apache.org
> Sent: Fri, February 4, 2011 3:33:41 PM
> Subject: Re: Performance optimization of Proximity/Wildcard searches
>
> I know, so we are not really using it for regular warm-ups (in any case the index is updated on an hourly basis). I just tried it a few times to compare results. The issue is I am not even sure whether warming up is useful with such frequent updates.
>
> [...]
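Whether those hit ratios justify warming can be checked directly: Solr's admin stats page reports them per cache. For example (host and port assumed):

  http://localhost:8983/solr/admin/stats.jsp

Under the CACHE section, look at the lookups, hitratio, and evictions entries for queryResultCache, filterCache, and documentCache. With an index reopened every hour, a persistently low hitratio suggests query-level warming buys little, while the OS-cache effect Otis mentions remains.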
Re: Performance optimization of Proximity/Wildcard searches
Heh, I'm not sure if this is valid thinking. :)

By *matching* doc distribution I meant: what proportion of your millions of documents actually ever get matched, and then how many of those make it to the UI. If you have 1000 queries in a day and they all end up matching only 3 of your docs, the system will need less RAM than a system where 1000 queries match 5 million different docs.

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/

- Original Message
> From: Salman Akram
> To: solr-user@lucene.apache.org
> Sent: Fri, February 4, 2011 3:38:55 PM
> Subject: Re: Performance optimization of Proximity/Wildcard searches
>
> Well, I assume many people out there have indexes larger than 100GB, and normally you won't have more than 32GB or 64GB of RAM!
>
> [...]
Re: geodist and spacial search
Why not just:

q=*:*
fq={!bbox}
sfield=store
pt=49.45031,11.077721
d=40
fl=store
sort=geodist() asc

http://localhost:8983/solr/select?q=*:*&sfield=store&pt=49.45031,11.077721&d=40&fq={!bbox}&sort=geodist%28%29%20asc

That will sort, and filter up to 40km. No need for the fq={!func}geodist() sfield=store pt=49.45031,11.077721

Bill

On 2/4/11 4:30 AM, "Eric Grobler" wrote:
> Hi Grant,
>
> Thanks for the tip. This seems to work:
>
> q=*:*
> fq={!func}geodist()
> sfield=store
> pt=49.45031,11.077721
>
> fq={!bbox}
> sfield=store
> pt=49.45031,11.077721
> d=40
>
> fl=store
> sort=geodist() asc
>
> On Thu, Feb 3, 2011 at 7:46 PM, Grant Ingersoll wrote:
> > Use a filter query? See the {!geofilt} stuff on the wiki page. That gives you your filter to restrict down your result set, then you can sort by exact distance to get your sort of just those docs that make it through the filter.
> >
> > On Feb 3, 2011, at 10:24 AM, Eric Grobler wrote:
> > > Hi Erick,
> > >
> > > Thanks, I saw that example, but I am trying to sort by distance AND specify the max distance in 1 query.
> > >
> > > The reason is: running bbox on 2 million documents with a 20km distance takes only 200ms. Sorting 2 million documents by distance takes over 1.5 seconds!
> > >
> > > So it will be much faster for solr to first filter the 20km documents and then to sort them.
> > >
> > > Regards
> > > Ericz
> > >
> > > On Thu, Feb 3, 2011 at 1:27 PM, Erick Erickson wrote:
> > > > Further down that very page ...
> > > >
> > > > Here's an example of sorting by distance ascending:
> > > >
> > > > ...&q=*:*&sfield=store&pt=45.15,-93.85&sort=geodist() asc
> > > > http://localhost:8983/solr/select?wt=json&indent=true&fl=name,store&q=*:*&sfield=store&pt=45.15,-93.85&sort=geodist()%20asc
> > > >
> > > > The key is just the &sort=geodist(), I'm pretty sure that's independent of the bbox, but I could be wrong.
> > > >
> > > > Best
> > > > Erick
> > > >
> > > > On Wed, Feb 2, 2011 at 11:18 AM, Eric Grobler wrote:
> > > > > Hi
> > > > >
> > > > > In http://wiki.apache.org/solr/SpatialSearch there is an example of a bbox filter and a geodist function.
> > > > >
> > > > > Is it possible to do a bbox filter and sort by distance - combine the two?
> > > > >
> > > > > Thanks
> > > > > Ericz
>
> > --
> > Grant Ingersoll
> > http://www.lucidimagination.com/
> >
> > Search the Lucene ecosystem docs using Solr/Lucene:
> > http://www.lucidimagination.com/search
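The wiki page cited in this thread also documents {!geofilt}, which filters on true distance rather than the bounding box. A variant of the query above using it (same field and point values) might be:

  q=*:*
  fq={!geofilt}
  sfield=store
  pt=49.45031,11.077721
  d=40
  sort=geodist() asc

bbox is the cheaper filter because it only checks the enclosing box, so it can admit corner points slightly beyond d; geofilt computes the real distance for an exact radius at somewhat higher cost.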
UIMA Error
Hi guys, I'm trying to use the UIMA contrib, but I got the following error ...

INFO: [] webapp=/solr path=/select params={clean=false&commit=true&command=status&qt=/dataimport} status=0 QTime=0
05/02/2011 10:54:53 ص org.apache.solr.uima.processor.UIMAUpdateRequestProcessor processText
INFO: Analazying text
05/02/2011 10:54:53 ص org.apache.solr.uima.processor.ae.OverridingParamsAEProvider getAE
INFO: setting cat_apikey : 0449a72fe7ec5cb3497f14e77f338c86f2fe
05/02/2011 10:54:53 ص org.apache.solr.uima.processor.ae.OverridingParamsAEProvider getAE
INFO: setting keyword_apikey : 0449a72fe7ec5cb3497f14e77f338c86f2fe
05/02/2011 10:54:53 ص org.apache.solr.uima.processor.ae.OverridingParamsAEProvider getAE
INFO: setting concept_apikey : 0449a72fe7ec5cb3497f14e77f338c86f2fe
05/02/2011 10:54:53 ص org.apache.solr.uima.processor.ae.OverridingParamsAEProvider getAE
INFO: setting entities_apikey : 0449a72fe7ec5cb3497f14e77f338c86f2fe
05/02/2011 10:54:53 ص org.apache.solr.uima.processor.ae.OverridingParamsAEProvider getAE
INFO: setting lang_apikey : 0449a72fe7ec5cb3497f14e77f338c86f2fe
05/02/2011 10:54:53 ص org.apache.solr.uima.processor.ae.OverridingParamsAEProvider getAE
INFO: setting oc_licenseID : g6h9zamsdtwhb93nc247ecrs
05/02/2011 10:54:53 ص WhitespaceTokenizer initialize
INFO: "Whitespace tokenizer successfully initialized"
05/02/2011 10:54:56 ص org.apache.solr.core.SolrCore execute
INFO: [] webapp=/solr path=/select params={clean=false&commit=true&command=status&qt=/dataimport} status=0 QTime=0
05/02/2011 10:54:57 ص WhitespaceTokenizer typeSystemInit
INFO: "Whitespace tokenizer typesystem initialized"
05/02/2011 10:54:57 ص WhitespaceTokenizer process
INFO: "Whitespace tokenizer starts processing"
05/02/2011 10:54:57 ص WhitespaceTokenizer process
INFO: "Whitespace tokenizer finished processing"
05/02/2011 10:54:57 ص org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl callAnalysisComponentProcess(405)
SEVERE: Exception occurred
org.apache.uima.analysis_engine.AnalysisEngineProcessException
    at org.apache.uima.annotator.calais.OpenCalaisAnnotator.process(OpenCalaisAnnotator.java:206)
    at org.apache.uima.analysis_component.CasAnnotator_ImplBase.process(CasAnnotator_ImplBase.java:56)
    at org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl.callAnalysisComponentProcess(PrimitiveAnalysisEngine_impl.java:377)
    at org.apache.uima.analysis_engine.impl.PrimitiveAnalysisEngine_impl.processAndOutputNewCASes(PrimitiveAnalysisEngine_impl.java:295)
    at org.apache.uima.analysis_engine.asb.impl.ASB_impl$AggregateCasIterator.processUntilNextOutputCas(ASB_impl.java:567)
    at org.apache.uima.analysis_engine.asb.impl.ASB_impl$AggregateCasIterator.<init>(ASB_impl.java:409)
    at org.apache.uima.analysis_engine.asb.impl.ASB_impl.process(ASB_impl.java:342)
    at org.apache.uima.analysis_engine.impl.AggregateAnalysisEngine_impl.processAndOutputNewCASes(AggregateAnalysisEngine_impl.java:267)
    at org.apache.uima.analysis_engine.impl.AnalysisEngineImplBase.process(AnalysisEngineImplBase.java:267)
    at org.apache.uima.analysis_engine.impl.AnalysisEngineImplBase.process(AnalysisEngineImplBase.java:280)
    at org.apache.solr.uima.processor.UIMAUpdateRequestProcessor.processText(UIMAUpdateRequestProcessor.java:122)
    at org.apache.solr.uima.processor.UIMAUpdateRequestProcessor.processAdd(UIMAUpdateRequestProcessor.java:69)
    at org.apache.solr.handler.dataimport.SolrWriter.upload(SolrWriter.java:75)
    at org.apache.solr.handler.dataimport.DataImportHandler$1.upload(DataImportHandler.java:291)
    at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:626)
    at org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:266)
    at org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:185)
    at org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:335)
    at org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:393)
    at org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:374)
Caused by: java.net.UnknownHostException: api.opencalais.com
    at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:177)
    at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:366)
    at java.net.Socket.connect(Socket.java:529)
    at java.net.Socket.connect(Socket.java:478)
    at sun.net.NetworkClient.doConnect(NetworkClient.java:163)
    at sun.net.www.http.HttpClient.openServer(HttpClient.java:394)
    at sun.net.www.http.HttpClient.openServer(HttpClient.java:529)
    at sun.net.www.http.HttpClient.<init>(HttpClient.java:233)
    at sun.net.www.http.HttpClient.New(HttpClient.java:306)
    at sun.net.www.http.HttpClient.New(HttpClient.java:323)
    at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:975)
    at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:916)
    at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:841)
    at sun.net.www.protocol.http.HttpURLConnection.getOutputStr