field type definition
Hello,

If I define a field like this in the schema, is it correct? <http://sites.google.com/a/impelsys.com/search/phrase-match#>

Here I am not differentiating between the query analyzer and the index analyzer, and I am assuming that the same analyzer will be used at both index time and query time. Is this correct?

Regards
Revas
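For reference, a minimal sketch of the kind of definition being asked about (the actual field type from the linked page is not shown here, so the type name and filter chain below are illustrative assumptions): when a <fieldType> declares a single <analyzer> element with no type attribute, Solr does apply it to both indexing and querying.

  <!-- A single <analyzer> (no type="index" / type="query") is used for
       both indexing and querying. -->
  <fieldType name="phrase_match" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>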
Help on this parsed query
I have the text analyzer defined as follows. When I search on the field named "simple" (of the above field type) for the term peRsonal, I expect it to be parsed as

simple:personal simple:pe simple:rsonal

Instead, the parsed query string says:

simple:peRsonal
simple:peRsonal
MultiPhraseQuery(simple:"(person pe) rsonal")
simple:"(person pe) rsonal"

What is this MultiPhraseQuery, and why is this a phrase query instead of a simple term query?

Regards
Revas
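A likely explanation (the actual analyzer definition was not included in the message, so the chain below is an assumption that would reproduce the observed parse): a WordDelimiterFilterFactory with splitOnCaseChange splits peRsonal into pe + Rsonal, catenateWords adds the rejoined token peRsonal at the same position as pe, and the stemmer reduces personal to person. When one input term expands to several tokens with some of them stacked at the same position, the query parser builds a MultiPhraseQuery over the positions instead of a single TermQuery.

  <!-- Assumed analyzer chain that would yield (person pe) rsonal. -->
  <fieldType name="simple_text" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <!-- "peRsonal" -> "pe" + "Rsonal" (case change), plus the
           catenated original "peRsonal" stacked on "pe" -->
      <filter class="solr.WordDelimiterFilterFactory"
              splitOnCaseChange="1" catenateWords="1"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <!-- stems the catenated "personal" down to "person" -->
      <filter class="solr.EnglishPorterFilterFactory"/>
    </analyzer>
  </fieldType>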
Number of webapps
Hi

I am sure this question has been asked many times and there have been several generic answers, but I am looking for specific answers.

I have a single server, whose configuration I give below, and it is the only server we have at present. The requirement is that every time we create a new website, we create two Solr instances for it: one for content search and one for product search. Both have faceting requirements.

There are about 25 fields in the product schema and about 20 in the content schema. We do not store the content on the server; the content is only indexed.

We currently have 10 websites, which means we have 20 webapps running on this server, each with about 1000 documents and an index of approximately 50 MB. The index size of each is expected to grow continuously as more products are added.

We recently got the following error on creation of a new webapp:

SEVERE: Caught exception (java.lang.OutOfMemoryError: PermGen space) executing org.apache.tomcat.util.net.leaderfollowerworkerthr...@1c2534f, terminating thread
Feb 24, 2009 6:22:16 AM org.apache.tomcat.util.threads.ThreadPool$ControlRunnable run
SEVERE: Caught exception (java.lang.OutOfMemoryError: PermGen space) executing org.apache.tomcat.util.net.leaderfollowerworkerthr...@1c2534f, terminating thread

What would this mean? Given the above, how many such webapps can we have on this server?

*Server config*

OS: Red Hat Enterprise Linux ES 4 - 64 Bit
# Processor: Dual AMD Opteron Dual Core 270 2.0 GHz
# 4GB DDR RAM
# Hard Drive: 73GB SCSI
# Hard Drive: 73GB SCSI

thanks
Re: Number of webapps
Thanks, will try that. I also have the war file for each Solr instance in the home directory of the instance; could that be the problem?

If I were to have a common war file for n instances, would there be any issue?

regards
revas

On 2/25/09, Michael Della Bitta wrote:
>
> It's possible I don't know enough about Solr's internals and there's a
> better solution than this, and it's surprising me that you're running
> out of PermGen space before you're running out of heap, but maybe
> you've already increased the general heap size without tweaking
> PermGen, and loading all the classes involved in loading 20 contexts
> is putting you over. In any case, you might try adding the following
> option to CATALINA_OPTS: -XX:MaxPermSize=256m. If you don't know where
> to put something like that, you might try adding the following line to
> $TOMCAT_HOME/bin/startup.sh:
>
> export CATALINA_OPTS="-XX:MaxPermSize=256m ${CATALINA_OPTS}"
>
> If that value (256) doesn't alleviate the problem, you might try increasing
> it.
>
> Hope that helps,
>
> Michael Della Bitta
>
> [...]
Solr and Zend Lucene
Hi,

I have a requirement where I need to search offline. We are thinking of doing this by storing the index terms in a DB.

Is there a way of accessing the index tokens in Solr 1.3?

The other way is to use Zend_Lucene to read the index files of Solr, as Zend Lucene has methods for doing this. But Zend Lucene is not able to open the Solr index files; the error is "unsupported format".

The final option is to reindex using Zend Lucene and read the index tokens, but facets are not supported by Zend Lucene.

Has anybody done something similar? Please share your thoughts or pointers.

Regards
Revas
change the lucene version
Hi,

If I need to change the Lucene version used by Solr, how can I do this?

Regards
Revas
Re: Number of webapps
HI,

How do I get the info on the current setting of MaxPermSize?

Regards
Sujahta

On 2/27/09, Alexander Ramos Jardim wrote:
>
> Another simple solution for your requirement is to use multicore. This way
> you will have only one Solr webapp loaded, with as many indexes as you need.
>
> See more at http://wiki.apache.org/solr/MultiCore
>
> 2009/2/25 Michael Della Bitta
>
> > Unfortunately, I think the way this works is the container creates a
> > ClassLoader for each context and loads the contents of the .war into
> > that, regardless of whether each context references the same .war
> > file. All those classes are stored in permanent generation space, and
> > I'm fairly sure that if you restart a context individually with the manager
> > application, a new ClassLoader for the context is created and the
> > permanent generation space the old one was consuming is simply leaked.
> >
> > Something that is crazy enough to work might be to unpack the Solr
> > .war and move all the .jar files and class files that don't contain
> > servlet API classes to .jars in $TOMCAT_HOME/lib, and then repack the
> > .war without these files. These would then be loaded by the common
> > classloader once per container, instead of once per context. You can
> > read more about this classloader business here:
> > http://tomcat.apache.org/tomcat-6.0-doc/class-loader-howto.html (might
> > need a different URL depending on the version of Tomcat you're
> > running).
> >
> > Michael
> >
> > [...]
>
> --
> Alexander Ramos Jardim
Re: Solr and Zend Lucene
We will be using SQLite for the DB. This is for a CD version where we need to provide search.

On 3/5/09, Grant Ingersoll wrote:
>
> On Mar 5, 2009, at 3:10 AM, revas wrote:
>
>> Hi,
>>
>> I have a requirement where I need to search offline. We are thinking of
>> doing this by storing the index terms in a DB.
>
> I'm not sure I follow. How is it that Solr would be offline, but your DB
> would be online? Can you explain a bit more the problem you are trying to
> solve?
>
>> Is there a way of accessing the index tokens in Solr 1.3?
>
> Not in 1.3, but trunk does. Have a look at the TermsComponent (
> http://wiki.apache.org/solr/TermsComponent). I suppose if you got things
> in a JSON or binary format, the performance might not be horrible, but it
> will depend on the # of terms in the index. Or, you could get things in
> stages, i.e. all terms between a and b, etc. It might be back-compatible
> with 1.3, but I don't know for sure.
>
> -Grant
Re: Solr and Zend Lucene
The Luke request handler returns all the tokens from the index; is this correct?

On 3/5/09, revas wrote:
>
> We will be using SQLite for the DB. This is for a CD version where we
> need to provide search.
>
> [...]
Luke request handler
Hi,

I just want to confirm my understanding of the Luke request handler. It gives us the raw Lucene index tokens on a field-by-field basis.

What should the query be to return all tokens for a field? Is there any way to return all the tokens across all fields?

Regards
Revas
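As a sketch (the host, port, and field name are placeholders; the default admin mapping is assumed), the LukeRequestHandler takes fl and numTerms parameters. Note that it returns the top N terms per field rather than streaming every term, so "all tokens" really means "as many top terms as numTerms allows":

  # Top terms for one field:
  http://localhost:8080/solr/admin/luke?fl=content&numTerms=100

  # Per-field summaries, with top terms, across all fields:
  http://localhost:8080/solr/admin/luke?numTerms=100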
multicore setup with tomcat
Hi,

I am trying to do a multicore setup. From the Solr 1.3 download I copied the following into a new directory called multicore: core0, core1, solr.xml, and solr.war, and I have defined a Tomcat context fragment for it.

http://localhost:8080/multicore/admin
http://localhost:8080/multicore/admin/core0

The above two URLs give me a resource-not-found error. The solr.xml is the default one from the download.

Please tell me what needs to be changed to make this work in Tomcat.

Regards
Sujatha
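For comparison, a minimal sketch of the two pieces involved, assuming Tomcat 5.5 and the stock Solr 1.3 multicore example (paths are placeholders): a context fragment that points solr/home at the multicore directory, and the default solr.xml listing the cores. Note also that with multicore the per-core admin pages live at /multicore/core0/admin/ (and the cores admin at /multicore/admin/cores) rather than /multicore/admin/core0, which may explain the 404s.

  <!-- $CATALINA_HOME/conf/Catalina/localhost/multicore.xml -->
  <Context docBase="/path/to/multicore/solr.war" debug="0" crossContext="true">
    <Environment name="solr/home" type="java.lang.String"
                 value="/path/to/multicore" override="true"/>
  </Context>

  <!-- /path/to/multicore/solr.xml (default from the download) -->
  <solr persistent="false">
    <cores adminPath="/admin/cores">
      <core name="core0" instanceDir="core0"/>
      <core name="core1" instanceDir="core1"/>
    </cores>
  </solr>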
Sharding question
Hi,

If I were to add a second server for sharding once the first server reaches its limit, and I then need to update a document, how can I figure out which server the document is located on?

Regards
Sujatha
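Distributed search in this era of Solr leaves document-to-shard assignment entirely to the indexer, so the common approach is deterministic hashing on the unique key: the same ID always maps to the same shard, so updates and deletes go to the right server without a lookup. A minimal Java sketch (the class and the shard URL list are hypothetical, not anything shipped with Solr):

  import java.util.List;

  public class ShardRouter {
      private final List<String> shardUrls; // e.g. http://server1:8080/solr, http://server2:8080/solr

      public ShardRouter(List<String> shardUrls) {
          this.shardUrls = shardUrls;
      }

      /** The same uniqueKey always hashes to the same shard, so updates
          and deletes can be sent to the server that indexed the document. */
      public String shardFor(String uniqueKey) {
          int bucket = (uniqueKey.hashCode() & 0x7fffffff) % shardUrls.size();
          return shardUrls.get(bucket);
      }
  }

One caveat: growing from n to n+1 shards changes the mapping, so you would either reindex, use a consistent-hashing scheme, or simply record each document's shard in a DB at index time.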
stop word search
Hi,

I have a query like this:

content:the AND user_id:5

which means: return all docs of user id 5 which have the word "the" in the content. Since "the" is a stop word, this query executes as just user_id:5 in spite of the AND clause. Whereas the expected result here is: since there are no results for "the", no results should be returned.

Am I missing anything here?

Regards
Re: stop word search
Hi Erick,

I have now commented out the query-time stopword filter and restarted the server. But now when I search for a stop word, I am getting results.

We had earlier indexed the content with the stopword filter. I don't think we need to reindex after commenting out the query-time filter, right?

This field is a text field with the default analyzer.

Please let me know if I have missed something here.

Regards
Sujatha

On 3/17/09, Erick Erickson wrote:
>
> Well, by definition, using an analyzer that removes stopwords
> *should* do this at query time. This assumes that you used
> an analyzer that removed stopwords at index and query time.
> The stopwords are not in the index.
>
> You can get the behavior you expect by using an analyzer at
> query time that does NOT remove stopwords, and one at
> indexing time that *does* remove stopwords. But I'm having a
> hard time imagining that this would result in a good user experience.
>
> I mean, anytime you had a stopword in the query where the
> stopword was required, no results would be returned. Which would
> be hard to explain to a user.
>
> What is it you're trying to accomplish?
>
> Best
> Erick
>
> [...]
Re: stop word search
Hi Erick,

I still don't get it. The scenario is like this:

Initially I indexed the content with the stopword filter at both index time and query time. That means the stop words are not in the index.

Now I have removed the stop filter only at query time, so that a query like

content:the AND id:8

will not fetch results; with the query-time stop filter this query becomes just id:8 and returns results.

Why would I have to reindex, as there should not be any stop words in the index in the first place?

Thanks for your time.

Regards

On 3/21/09, Erick Erickson wrote:
>
> Yes, you do need to reindex after removing the stopword filter
> from the configuration. When you indexed the first time using
> the stopword filter, the words were NOT indexed, so they won't
> be found now that they're getting through the query analyzer.
>
> Best
> Erick
>
> [...]
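For reference, a minimal sketch of the kind of field type being discussed (names and filters are illustrative, modeled on the stock text type): separate index- and query-time chains, with stopwords removed while indexing but passed through at query time, so that content:the matches nothing.

  <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <!-- stopwords stripped from the index -->
      <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <!-- no StopFilterFactory: "the" survives and finds no indexed term -->
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>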
caching
If I don't explicitly set any default warming query in solrconfig.xml for caching and make use of the default config file, does Solr do the caching automatically based on the query?

Thanks
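For context: the stock solrconfig.xml already declares the three main caches, so caching happens for every query even with no warming queries configured; warming queries only pre-populate the caches after a new searcher opens. Roughly (the sizes below are illustrative and vary by release):

  <filterCache      class="solr.LRUCache" size="512" initialSize="512" autowarmCount="128"/>
  <queryResultCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="128"/>
  <documentCache    class="solr.LRUCache" size="512" initialSize="512" autowarmCount="0"/>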
Facets drill down
Hi,

I typically issue a facet drill-down query as q=somequery AND facetfield:facetvalue.

Are there any issues with the above approach, as opposed to &fq=facetfield:value, in terms of memory consumption and use of the caches?

Regards
Sujatha
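To make the two forms concrete (field and values are placeholders): folding the constraint into q makes it part of the scored query and of the queryResultCache key, while fq keeps the main query stable and caches the constraint's document set in the filterCache, where it can be reused across many different q values.

  # Drill-down folded into the main query:
  /select?q=laptop AND brand:acme&facet=true&facet.field=brand

  # Drill-down as a filter query (filterCache-friendly):
  /select?q=laptop&fq=brand:acme&facet=true&facet.field=brand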
Multi-language support
Hi,

To reframe my earlier question:

Some languages have only an analyzer but no stemmer from Snowball (Porter); does the analyzer take care of stemming as well?

Some languages have only the Snowball stemmer but no analyzer.

Some have both.

Can we say, then, that Solr supports all the above languages? Will search behave the same across all the above cases?

thanks
revas
Analyzers and stemmer
Hi,

With respect to language support in Solr: we have analyzers for some languages and stemmers for certain languages. Do we say that Solr supports a particular language only if we have both an analyzer and a stemmer for it, or also when we have an analyzer but no stemmer?

Regards
Sujatha
Solr Cache Usage
Hi,

We are running several webapps under a single container, roughly 40-50. All have similar schemas. Under these circumstances, how would I calculate the cache memory allocation? The number of documents per webapp is roughly about 1000 currently, but likely to increase in future. Would it make sense to enable caching with this many apps?

For example, would the filter cache size be the number of unique facet fields in each of the webapps?

For the document cache, size = max results * max concurrent users. Would this be based on the average, or would it be multiplied across all the webapps combined, in which case we would need at least that much RAM?

Suppose we have only 2 GB RAM on a dual core; what happens then with the document cache entry the wiki typically recommends? How would we have documents from each webapp in the cache? Would this value need to be multiplied by the number of webapps?

Thanks
Regards
Revas
Compound file format
What is the drawback of using the compound file format for indexing when we have several webapps in a single container?

Regards
Sujatha
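For context, the compound file format packs each segment's files into a single .cfs file, trading some indexing CPU for far fewer open file descriptors, which matters with many indexes in one container. A sketch of the toggle in solrconfig.xml (the surrounding section is the stock indexDefaults block):

  <indexDefaults>
    <!-- true: each segment is one .cfs file; fewer open files,
         at the cost of somewhat slower indexing -->
    <useCompoundFile>true</useCompoundFile>
  </indexDefaults>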
query issue / special character and case
Hi,

When I give a query like the following, why does it become a phrase query, as shown below? The field type is the default text field in the schema.

volker-blanz
PhraseQuery(content:"volker blanz")

Also, when I have special characters in the query, as in SCHÖLKOPF, I am not able to convert the "ö" to lower case on my Unix OS; it works fine on Windows XP. And if I have a special character in my query, I would like to be able to search for it without the special character, as SCHOLKOPF. This works fine on Windows with strtr (the string-translate PHP function), but again not on the Unix OS.

Any pointers?

Regards
Revas
Re: query issue / special character and case
On Sat, Jun 6, 2009 at 11:40 AM, Shalin Shekhar Mangar <shalinman...@gmail.com> wrote:

> On Sat, May 30, 2009 at 9:48 AM, revas wrote:
>
> > Hi,
> >
> > When I give a query like the following, why does it become a phrase query,
> > as shown below? The field type is the default text field in the schema.
> >
> > volker-blanz
> > PhraseQuery(content:"volker blanz")
>
> What is the query that was sent to Solr?

The query is content:volker-blanz, and this is a default text field.

> > Also, when I have special characters in the query, as in SCHÖLKOPF, I am
> > not able to convert the "ö" to lower case on my Unix OS; it works fine on
> > Windows XP. And if I have a special character in my query, I would like to
> > search for it without the special character, as SCHOLKOPF. This works fine
> > on Windows with strtr (the string-translate PHP function), but again not
> > on the Unix OS.
>
> Hmm, not sure. If you are using Tomcat, have you enabled UTF-8?
>
> http://wiki.apache.org/solr/SolrTomcat#head-20147ee4d9dd5ca83ed264898280ab60457847c4
>
> You can try using analysis.jsp on the text field with this token and see
> how it is being analyzed. See if that gives some hints.

Yes, I am using Tomcat and have enabled UTF-8 in Tomcat.

> --
> Regards,
> Shalin Shekhar Mangar.
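For reference, the UTF-8 setting the SolrTomcat wiki page refers to is the URIEncoding attribute on the HTTP connector in server.xml (the port is whatever your install uses); without it, Tomcat decodes query-string bytes as ISO-8859-1 and characters like Ö arrive corrupted:

  <!-- $TOMCAT_HOME/conf/server.xml -->
  <Connector port="8080" protocol="HTTP/1.1" URIEncoding="UTF-8"/>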
spellcheck / too many open files
Hi,

1) Does the spell check component support all languages?

2) I have a scenario where I have about 20 webapps in a single container. We get "too many open files" at index time and while restarting Tomcat. The mergeFactor is at the default.

If I reduce the mergeFactor to 2 and optimize the index, will the open files be closed automatically, or would I have to reindex to close the open files, or how do I close the already-opened files? This is on Linux with Solr 1.3 and Tomcat 5.5.

Regards
Revas
Re: spellcheck / too many open files
But the spell check component uses the n-gram analyzer and hence should work for any language; is this correct? Also, we can refer to an external dictionary for suggestions; could this be in any language?

The open files are not because of spell check, as we have not implemented that yet. Every time we restart Solr we need to raise the ulimit, otherwise it does not work. So is there any workaround to permanently close these open files? Does optimizing the index close them?

Regards
Sujatha

On Tue, Jun 9, 2009 at 12:53 PM, Shalin Shekhar Mangar <shalinman...@gmail.com> wrote:

> On Tue, Jun 9, 2009 at 11:15 AM, revas wrote:
>
> > 1) Does the spell check component support all languages?
>
> SpellCheckComponent relies on Lucene/Solr analyzers and tokenizers. So if
> you can find an analyzer/tokenizer for your language, spell checker can
> work.
>
> > 2) I have a scenario where I have about 20 webapps in a single container.
> > We get "too many open files" at index time and while restarting Tomcat.
>
> Is that because of SpellCheckComponent?
>
> > The mergeFactor is at the default.
> >
> > If I reduce the mergeFactor to 2 and optimize the index, will the open
> > files be closed automatically, or would I have to reindex to close the
> > open files, or how do I close the already-opened files? This is on Linux
> > with Solr 1.3 and Tomcat 5.5.
>
> Lucene/Solr does not keep any file opened longer than it is necessary. But
> decreasing merge factor should help. You can also increase the open file
> limit on your system.
>
> --
> Regards,
> Shalin Shekhar Mangar.
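A sketch of the two knobs discussed, assuming a stock Solr 1.3 setup (the limit value and paths are illustrative): a lower mergeFactor keeps fewer segments, and therefore fewer files, per index, and the per-process descriptor limit can be raised for the user that runs Tomcat.

  <!-- solrconfig.xml: fewer segments per index means fewer open files -->
  <mergeFactor>2</mergeFactor>

  # Raise the descriptor limit before starting Tomcat (or persist it for
  # the tomcat user in /etc/security/limits.conf):
  ulimit -n 65536
  $TOMCAT_HOME/bin/startup.sh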
Re: spellcheck / too many open files
Thanks Shalin. When we use the external file dictionary (if there is one), then it should work fine for spell check, right? Also, is there any format for this file?

Regards
Sujatha

On Tue, Jun 9, 2009 at 3:03 PM, Shalin Shekhar Mangar <shalinman...@gmail.com> wrote:

> On Tue, Jun 9, 2009 at 2:56 PM, revas wrote:
>
> > But the spell check component uses the n-gram analyzer and hence should
> > work for any language; is this correct? Also, we can refer to an external
> > dictionary for suggestions; could this be in any language?
>
> Yes, it does use n-grams, but there's an analysis step before the n-grams
> are created. For example, if you are creating your spell check index from a
> Solr field, SpellCheckComponent uses that field's index-time analyzer. So
> you should create your language-specific fields in such a way that the
> analysis works correctly for that language.
>
> > The open files are not because of spell check, as we have not implemented
> > that yet. Every time we restart Solr we need to raise the ulimit,
> > otherwise it does not work. So is there any workaround to permanently
> > close these open files? Does optimizing the index close them?
>
> Optimization merges the segments of the index into one big segment. So it
> will reduce the number of files. However, during the merge it may create
> many more files. The old files are cleaned up by Lucene a while after the
> merge (unless you have changed the defaults in the IndexDeletionPolicy
> section in solrconfig.xml).
>
> --
> Regards,
> Shalin Shekhar Mangar.
Re: spellcheck / too many open files
Thanks

On Tue, Jun 9, 2009 at 5:14 PM, Shalin Shekhar Mangar <shalinman...@gmail.com> wrote:

> On Tue, Jun 9, 2009 at 4:32 PM, revas wrote:
>
> > Thanks Shalin. When we use the external file dictionary (if there is
> > one), then it should work fine for spell check, right? Also, is there any
> > format for this file?
>
> The external file should have one token per line. See
> http://wiki.apache.org/solr/FileBasedSpellChecker
>
> The default analyzer is WhitespaceAnalyzer. So all tokens in the file will
> be split on whitespace and the resulting tokens will be used for giving
> suggestions. If you want to change the analyzer, specify fieldType in the
> spell checker configuration and the component will use the analyzer
> configured for that field type.
>
> --
> Regards,
> Shalin Shekhar Mangar.
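A minimal configuration sketch of what Shalin describes, following the FileBasedSpellChecker wiki page (the dictionary path, index dir, and field type name are placeholders):

  <searchComponent name="spellcheck" class="solr.SpellCheckComponent">
    <lst name="spellchecker">
      <str name="name">file</str>
      <str name="classname">solr.FileBasedSpellChecker</str>
      <!-- plain text file, one token per line -->
      <str name="sourceLocation">spellings.txt</str>
      <str name="characterEncoding">UTF-8</str>
      <!-- analyzer for the tokens is taken from this field type -->
      <str name="fieldType">textSpell</str>
      <str name="spellcheckIndexDir">./spellcheckerFile</str>
    </lst>
  </searchComponent>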
Re: Customizing results
Hi Michael,

What is GNU gettext, and how can it be used in a multilanguage scenario?

Regards
Revas

On Wed, Jun 10, 2009 at 8:10 PM, Michael Ludwig wrote:

> Manepalli, Kalyan schrieb:
>
>> Hi,
>> I am trying to customize the response that I receive from Solr. In the
>> index I have multiple fields that contain the same data in different
>> languages.
>> At query time the client specifies the language. Based on this param,
>> I want to return the value, copied into a different field.
>> E.g.:
>> Lubang, Filippinerne
>> Lubang, Philippinen
>> Lubang, Philippines
>> Lubang, Filipinas
>>
>> If the user specifies the language as de_de, then I want to return the
>> result as Lubang, Philippinen
>
> If you control how the client works, you could also consider using an
> internationalization technology such as GNU Gettext for this purpose.
> May or may not make sense in your particular situation.
>
> Michael Ludwig
solr Analyzer help
Hi,

In the Solr 1.3 download, under the folder src/java/org/apache/solr/analysis, I find the following tokenizer classes for languages other than English:

1. Chinese tokenizer
2. CJK tokenizer, which is not expected to work very well with Japanese; for Chinese we already have the Chinese tokenizer

Only the above two tokenizers are there for those languages.

I also see stem filter factories for some languages, like DutchStemFilterFactory, BrazilianStemFilterFactory, GermanStemFilterFactory, etc., and plain filter factories like ChineseFilterFactory.

What does the stem filter factory do? Does it stem the words without including the SnowballPorterFilterFactory? And what do the plain filter factories do?

Where do I look for analyzers for other languages, and also for information on which languages the standard analyzers can be used for?

For example, given only all the above, for German-language analysis am I to use the standard analyzer with the German filter factory and German stemmer?

Are there more language-specific tokenizers in Lucene, and if so, what are the steps to integrate them into Solr?

Regards
Revas
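As an illustration of how these factories compose (a sketch only: the field type name is made up, and the stock StandardTokenizer is assumed, since there is no dedicated German tokenizer), a German field in Solr 1.3 could be assembled like this:

  <fieldType name="text_de" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <!-- either the dedicated German stemmer... -->
      <filter class="solr.GermanStemFilterFactory"/>
      <!-- ...or, alternatively, the generic Snowball factory:
           <filter class="solr.SnowballPorterFilterFactory" language="German"/> -->
    </analyzer>
  </fieldType>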
Migration to Solr 1.4
Hello,

I would like to know: by just copying the solr.war file over my existing Solr 1.3 installation, is the Lucene version also upgraded to the current 2.9?

I believe a reindex is not necessary; is that correct?

Is there anything else apart from this that I need to do to upgrade to the latest Lucene version?

Regards
Sujatha
Re: Migration to Solr 1.4
Thanks, Erik.

On Fri, Jan 8, 2010 at 4:34 PM, Erik Hatcher wrote:

> On Jan 8, 2010, at 4:14 AM, revas wrote:
>
>> I would like to know: by just copying the solr.war file over my existing
>> Solr 1.3 installation, is the Lucene version also upgraded to the current 2.9?
>
> Yes, Lucene 2.9 is built into solr.war, so you're automatically upgrading
> that too.
>
>> I believe a reindex is not necessary; is that correct?
>
> Correct.
>
> Though for peace of mind it isn't a bad idea to reindex. But your testing
> will tell you all is well, or not.
>
>> Is there anything else apart from this that I need to do to upgrade to the
>> latest Lucene version?
>
> I'd encourage you to compare your solrconfig.xml and schema.xml files to
> the ones that ship with Solr 1.4's example. You may want to adjust your
> configurations a bit.
>
> Erik
Overlapping onDeckSearchers=2
Hello,

We have a server with many Solr instances running (around 40-50). We are committing documents, sometimes one and sometimes around 200 documents at a time, to only one instance at a time.

When I run 2-3 commits in parallel to different instances, or to the same instance, I get this error:

PERFORMANCE WARNING: Overlapping onDeckSearchers=2

What is the best approach to solve this?

Regards
revas
Re: Overlapping onDeckSearchers=2
Thanks for the response. What happens in this scenario? Does the commit happen in this case, or does the search server hang, or does it just throw an error without committing?

Regards
Sujatha

On Mon, May 3, 2010 at 11:41 PM, Chris Hostetter wrote:

> : When i run 2 -3 commits parallely to diff instances or same instance I get
> : this error
> :
> : PERFORMANCE WARNING: Overlapping onDeckSearchers=2
> :
> : What is the Best approach to solve this
>
> http://wiki.apache.org/solr/FAQ#What_does_.22PERFORMANCE_WARNING:_Overlapping_onDeckSearchers.3DX.22_mean_in_my_logs.3F
>
> -Hoss
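For context on the usual fixes (the value below is illustrative): the warning means commits are arriving faster than searcher warm-up completes, so the remedies are to space out or batch commits, reduce autowarming, and/or cap concurrent warming searchers in the <query> section of solrconfig.xml. If the cap is exceeded, opening the new searcher fails with an error, but the documents are still indexed and become visible on the next successful commit.

  <!-- solrconfig.xml: refuse to open a new searcher while 2 are already warming -->
  <maxWarmingSearchers>2</maxWarmingSearchers>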
Re: DocValue field & commit
Thanks, Erick.

1) We are using dynamic string fields for faceting, with indexed=false and stored=false. By default docValues are enabled for primitive fields (Solr 6.6), so they are not explicitly defined in the schema. Do you think that is a wrong assumption? Also, I do not see this field listed in the field cache, but then I don't see any dynamic fields listed there at all.

2) Autowarm count is at 32 for both, and autowarm time is 25 for queryresult and 17

3) Can you elaborate on what you mean here?

On Mon, Mar 30, 2020 at 1:43 PM Erick Erickson wrote:

> Response spikes after commits are almost always something to do
> with autowarming or docValues being set to false. So here's what
> I'd look at, in order.
>
> 1> Are the fields used defined with docValues=true? They should be.
> With this much variance it sounds like you don't have that value set.
> You'll have to rebuild your entire index, first deleting all documents…
>
> You assert that they are all docValues, but the variance is so
> high that I wonder whether they _all_ are. They may very well be, but
> I've been tripped up by things I know are true that aren't too often ;)
>
> You can ensure this by setting 'uninvertible="true"' in your field type,
> see: https://issues.apache.org/jira/browse/SOLR-12962 if you're on
> 7.6 or later.
>
> 2> What are your autowarming settings for queryResultCache and/or
> filterCache? Start with a relatively small number, say 16, and look at
> your autowarm times to ensure they aren't excessive.
>
> 3> If autowarming doesn't help, consider specifying a newSearcher
> event in solrconfig.xml that exercises the facets.
>
> NOTE: <2> and <3> will mask any fields that are docValues=false that
> slipped through the cracks, so I'd double check <1> first.
>
> Best,
> Erick
>
> > On Mar 30, 2020, at 12:20 PM, sujatha arun wrote:
> >
> > A facet-heavy query which uses docValues fields for faceting returns
> > about 5k results and executes in between 10ms and 5 secs, and the 5 secs
> > time seems to coincide with after a hard commit.
> >
> > Does that have any relation? Why the fluctuation in execution time?
> >
> > Thanks,
> > Revas
Re: DocValue field & commit
Correcting some typos...

Thanks, Erick.

1) We are using dynamic string fields for faceting, with indexed=false and stored=false. By default docValues are enabled for primitive fields (Solr 6.6), so they are not explicitly defined in the schema. Do you think that is a wrong assumption? Also, I do not see this field listed in the field cache, but then I don't see any dynamic fields listed there at all.

2) Autowarm count is at 32 for both, and autowarm time is 25 for the query-result cache and 1724 for the filter cache.

3) Can you elaborate on what you mean here? We have a hard commit every 5 mins with openSearcher=false and a soft commit every 2 secs.

On Mon, Mar 30, 2020 at 4:06 PM Revas wrote:
> [...]
Re: DocValue field & commit
Thanks, Erick.

The processing time based on debugQuery, split between the query and the facets, is as follows:

query: 10ms
facets: 4900ms

Since most of the time is spent on facet processing (docValues enabled), the query and filter caches do not apply to this, correct?

- Autowarm count is at 32 for both, and autowarm time is 25 for the query-result cache and 1724 for the filter cache.
- We have a hard commit every 5 mins with openSearcher=false and a soft commit every 2 secs.
- The facets are a mix of pivot facets, range facets, and facet queries.
- When the same facet criteria produce a smaller result set, the response is much faster.

On Mon, Mar 30, 2020 at 4:47 PM Erick Erickson wrote:

> OK, sounds like docValues is set.
>
> Sure. In solrconfig.xml, there are two sections, "firstSearcher" and
> "newSearcher". These are queries (or lists of queries) that are fired as
> part of autowarming when Solr is first started (firstSearcher) or when a
> commit happens that opens a new searcher (newSearcher). These are
> hand-crafted static queries. So create one or more newSearcher sections in
> that block that exercise your faceting and they'll be fired as part of
> autowarming. That should smooth out the delay your users experience when
> commits happen.
>
> Best,
> Erick
>
> [...]
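A sketch of the newSearcher event Erick describes (the query and facet field below are placeholders for whatever your real facet-heavy request looks like); it goes inside the <query> section of solrconfig.xml:

  <listener event="newSearcher" class="solr.QuerySenderListener">
    <arr name="queries">
      <!-- fire a representative facet query so its structures are warm
           before the new searcher serves live traffic -->
      <lst>
        <str name="q">*:*</str>
        <str name="facet">true</str>
        <str name="facet.field">category_s</str>
      </lst>
    </arr>
  </listener>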
Re: DocValue field & commit
Hi Erick,

Thanks. We do have an NRT requirement in our application that updates be immediately visible, and we have constant updates. The push is for even faster visibility, but we are holding off at a 2-sec soft commit for now.

What I am not able to understand is that, as per query debugging, the facet processing time varies between a few ms and several secs. Why would there be variability in facet processing time if the facets are based on docValues, and how would a newSearcher event help?

We do have an 8-core CPU and a lot of RAM in our server, as we host multiple collections.

On Mon, Mar 30, 2020 at 7:08 PM Erick Erickson wrote:

> Oh dear. Your autowarming is almost, but not quite totally, useless given
> your 2 second soft commit interval. See:
>
> https://lucidworks.com/post/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/
>
> So autowarming is probably not a cure. When you originally said "commit" I
> was assuming that was one that opened a new searcher; if that's not true,
> autowarming isn't a cure.
>
> Do you _really_ require 2 second soft commit intervals? I would not be
> surprised if you also see "too many on deck searchers" warnings in your
> logs at times. This is one of my hot buttons; having very short soft
> commit intervals is something people do without understanding the
> tradeoffs, one of which is that your caches are probably getting a poor
> utilization rate. Often the recommendation for short intervals like this
> is to not use the caches at all.
>
> The newSearcher is a full query. Go ahead and add facets. But again, this
> probably isn't going to help much.
>
> But really, revisit your autocommit settings. Taking 1.7 seconds to
> autowarm means that you have roughly this:
> - commit
> - 1.7 seconds later, the new searcher is open for business.
> - 0.3 seconds after that a new searcher is open, which takes another 1.7
> seconds to autowarm.
>
> I doubt your hard commit is really the culprit here _unless_ you're
> running on an under-powered machine. The hard commit will trigger segment
> merging, which is CPU and I/O intensive. If you're using a machine that
> can't afford the cycles to be taken up by merging, that could account
> for what you see, but new searchers are being opened every 2 seconds
> (assuming a relatively constant indexing load).
>
> Best,
> Erick
>
> [...]
searcher
Hi

I am seeing searchers referenced in my logs as "main" and "realtime". Do they correspond to hard vs. soft commit? I do not see a correlation to that based on our commit settings.

Opening [Searcher@538abc62[xx_shard1_replica2] main]
Opening [Searcher@2e151991[xx_shard1_replica1] realtime]

Thanks
facets & docValues
We have faceting fields that have been defined as indexed=false, stored=false, and docValues=true.

However, we use a lot of subfacets via JSON facets, and facet ranges via facet.queries. We see that after every soft commit our performance worsens, and it is ideal between commits.

How is it that docValues fields are affected by a soft commit? And do we need to enable indexing if we use subfacets and facet queries, to improve performance?

Thanks
Re: facets & docValues
Hi Erick,

You are correct, we have only about 1.8M documents so far, and turning on indexing for the facet fields helped improve the timings of the facet query (which has subfacets and facet queries) a lot. So do docValues help at all for subfacets and facet queries? Our tests revealed further query-time improvement when we turned off docValues. Is that the right approach?

Currently we have only 1 shard, and we are thinking of scaling by increasing the number of shards when we see a deterioration in query time. Any suggestions?

Thanks.

On Wed, Apr 15, 2020 at 8:21 AM Erick Erickson wrote:

> In a word, "yes". I also suspect your corpus isn't very big.
>
> I think the key is the facet queries. Now, I'm talking from
> theory rather than diving into the code, but querying on
> a docValues=true, indexed=false field is really doing a
> search. And searching on a field like that is effectively
> analogous to a table scan. Even if somehow an internal
> structure would be constructed to deal with it, it would
> probably be on the heap, where you don't want it.
>
> So the test would be to take the queries out and measure
> performance, but I think that's the root issue here.
>
> Best,
> Erick
>
> [...]
Re: facets & docValues
Hi Erick,

Thanks for the explanation and advice. With facet queries, do docValues help at all? We tested two configurations:

1) indexed=true, docValues=true => all facets

2)
- indexed=true, docValues=true => only for subfacets
- indexed=true, docValues=false => facet query
- docValues=true, indexed=false => term facets

In case 1, indexing slowed considerably, but overall facet performance improved many fold.
In case 2, overall performance showed only slight improvement.

Does that mean turning on docValues even for facet queries helps improve performance? Is fetching from docValues for a facet query faster than fetching from stored fields?

Thanks

On Thu, Apr 16, 2020 at 1:50 PM Erick Erickson wrote:

> DocValues should help when faceting over fields, i.e. facet.field=blah.
>
> I would expect docValues to help with subfacets too, but I don't know
> the code well enough to say definitively one way or the other.
>
> The empirical approach would be to set "uninvertible=true" (Solr 7.6) and
> turn docValues off. What that means is that if any operation tries to
> uninvert the index on the Java heap, you'll get an exception like:
> "can not sort on a field w/o docValues unless it is indexed=true
> uninvertible=true and the type supports Uninversion:"
>
> See SOLR-12962.
>
> Speed is only one issue. The entire point of docValues is to not
> "uninvert" the field on the heap. This used to lead to very significant
> memory pressure. So when turning docValues off, you run the risk of
> reverting back to the old behavior and having unexpected memory
> consumption, not to mention slowdowns when the uninversion takes place.
>
> Also, unless your documents are very large, this is a tiny corpus. It can
> be quite hard to get realistic numbers; the signal gets lost in the noise.
>
> You should only shard when your individual query times exceed your
> requirement. Say you have a 95th-percentile requirement of 1 second
> response time. Let's further say that you can meet that requirement at 50
> queries/second, but at 75 queries/second your response time exceeds your
> requirements. Do NOT shard at this point. Add another replica instead.
> Sharding adds inevitable overhead and should only be considered when
> you can't get adequate response time even under fairly light query loads,
> as a general rule.
>
> Best,
> Erick
>
> [...]
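To make the winning configuration concrete, a sketch of the field definition from case 1 (the dynamic-field name is a placeholder): the faceted fields carry both indexed=true, so facet queries can actually search them, and docValues=true, so term faceting stays off the heap.

  <dynamicField name="*_facet_s" type="string"
                indexed="true" stored="false" docValues="true"/>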
Re: facets & docValues
Hi Joel,

No, we have not; we have a soft commit requirement of 2 secs.

On Tue, May 5, 2020 at 3:31 PM Joel Bernstein wrote:

> Have you configured static warming queries for the facets? This will warm
> the cache structures for the facet fields. You just want to make sure your
> commits are spaced far enough apart that the warming completes before a
> new searcher starts warming.
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
> [...]
Re: when to use docvalue
Erick,

Can you also explain how to optimize facet queries and range facets, as they don't use docValues and contribute to higher response time?

On Tue, May 19, 2020 at 5:55 PM Erick Erickson wrote:

> They are _absolutely_ able to be used together. Background:
>
> "In the bad old days", there was no docValues. So whenever you needed
> to facet/sort/group/use function queries, Solr (well, Lucene) had to take
> the inverted structure resulting from "indexed=true" and "uninvert" it on
> the Java heap.
>
> docValues essentially does the "uninverting" at index time and puts
> that structure in a separate file for each segment. So rather than
> uninvert the index on the heap, Lucene can just read it in from disk in
> MMapDirectory (i.e. OS) memory space.
>
> The downside is that your index will be bigger when you do both; that is,
> the size on disk will be bigger. But it'll be much faster to load, much
> faster to autowarm, and will move the structures necessary for
> faceting/sorting/etc. into OS memory, where the garbage collection is
> vastly more efficient than Java's.
>
> And frankly I don't think the increased size on disk is a downside. You'll
> have to have the memory anyway, and having it used in the OS memory space
> is so much more efficient than on Java's heap that it's a win-win IMO.
>
> Oh, and if you never sort/facet/group/use function queries, then the
> docValues structures are never even read into MMapDirectory space.
>
> So yes, freely do both.
>
> Best,
> Erick
>
> > On May 19, 2020, at 5:41 PM, matthew sporleder wrote:
> >
> > You can index AND docvalue? For some reason I thought they were
> > exclusive.
> >
> > On Tue, May 19, 2020 at 5:36 PM Erick Erickson wrote:
> >>
> >> Yes. You should also index them….
> >>
> >> Here's the way I think of it.
> >>
> >> For questions like "For term X, which docs contain that value?" you
> >> need indexed=true. This is a search.
> >>
> >> For questions like "Does doc X have value Y in field Z?" you need
> >> docValues=true.
> >>
> >> What's the difference? Well, the first one is to get the result set.
> >> The second is for, given a result set, count/sort/whatever.
> >>
> >> fq clauses are searches, so indexed=true.
> >>
> >> Sorting, faceting, grouping and function queries are "for each doc in
> >> the result set, what values does field Y contain?"
> >>
> >> Maybe that made things clear as mud, but it's the way I think of it ;)
> >>
> >> Best,
> >> Erick
> >>
> >>> On May 19, 2020, at 4:00 PM, matthew sporleder wrote:
> >>>
> >>> I have quite a few numeric / meta-data type fields in my schema and
> >>> pretty much only use them in fq=, sort=, and friends. Should I always
> >>> use docValues on these if I never plan to q=search on them? Are there
> >>> any drawbacks?
> >>>
> >>> Thanks,
> >>> Matt
Re: when to use docvalue
Thanks, Erick. It's just that when we enable both indexed=true and docValues=true, it increases the indexing time by at least 2x for a full reindex.

On Wed, May 20, 2020 at 2:30 PM Erick Erickson wrote:

> Revas:
>
> Facet queries are just queries that are constrained by the total result
> set of your primary query, so the answer to that would be the same as
> speeding up regular queries. As far as range facets are concerned, I
> believe they _do_ use docValues; after all, they have to answer the exact
> same question: for doc X in the result set, what is the value of field Y?
> The only difference is it has to bucket a bunch of them.
>
> Rahul: Please don't hijack threads, it makes it difficult to find things
> later. Start a separate e-mail thread.
>
> The answer to your question is, of course, "it depends" on a number of
> things, and changes with the query. First of all, multiValued fields don't
> qualify, because docValues are a sorted set, meaning the return is sorted
> and deduplicated. So if the input has the values b c d c d, what you'd get
> back from DV is b c d.
>
> So let's go with primitive, single-valued types. It still depends, but
> Solr does the right thing, or tries. Here's the scoop: the stored fields
> for any single doc are stored as a contiguous, compressed block. So if any
> _one_ field needs to be read from the stored data, the entire block is
> decompressed, and Solr will preferentially fetch the value from the
> decompressed data, as it's pretty certain to be at least as cheap as
> fetching from DV. However, the reverse is true if _all_ the returned
> values are single-valued DV fields. Then it's more efficient to fetch
> the DV values, as they're MMapped and won't cost the seek-and-decompress
> cycle.
>
> Unless space is a real consideration for you, I'd set both indexed and
> docValues to true…
>
> Best,
> Erick
>
> > On May 20, 2020, at 10:45 AM, Rahul Goswami wrote:
> >
> > Erick,
> > Thanks for that explanation. I have a follow-up question on that. I find
> > the scenario of stored=true and docValues=true to be tricky at times...
> > I would like to know when each of these scenarios is preferred over the
> > other two for primitive datatypes:
> >
> > 1) stored=true and docValues=false
> > 2) stored=false and docValues=true
> > 3) stored=true and docValues=true
> >
> > Thanks,
> > Rahul
> >
> > [...]
Collection Creation across DC
Hello,

Can we create a collection across data centers (with a shard replica in a different data center) for HA?

Thanks
Revas