Re: Korean Tokenizer in solr
Are you sure it's not a spelling error or something similarly odd? Solr
ships with that filter in its example schema, so you can compare what you
are doing differently with that.

Regards,
   Alex.
Personal: http://www.outerthoughts.com/ and @arafalov
Solr resources: http://www.solr-start.com/ and @solrstart
Solr popularizers community: https://www.linkedin.com/groups?gid=6713853

On Mon, Jul 14, 2014 at 1:58 PM, Poornima Jay wrote:
> I have upgraded the Solr version to 4.8.1, but after making changes in
> the schema file I am getting the error below:
>
>   Error instantiating class:
>   'org.apache.lucene.analysis.cjk.CJKBigramFilterFactory'
>
> I assume CJKBigramFilterFactory and CJKFoldingFilterFactory are
> supported in 4.8.1. Do I need to make any configuration changes to get
> this working?
>
> Please advise.
>
> Regards,
> Poornima
>
> On Thursday, 10 July 2014 2:45 PM, Alexandre Rafalovitch wrote:
>> I would suggest you read through all 12 (?) articles in this series:
>> http://discovery-grindstone.blogspot.com/2013/10/cjk-with-solr-for-libraries-part-1.html
>> It will probably lay out most of the issues for you.
>>
>> And if you are starting out, I would really suggest using the latest
>> Solr (4.9). A lot more people remember what the latest version has
>> than what was in 3.6. And, as the series above will tell you, some
>> relevant issues have been fixed in more recent Solr versions.
>>
>> Regards,
>>    Alex.
>
> On Thu, Jul 10, 2014 at 4:11 PM, Poornima Jay wrote:
>> Until now I was thinking Solr would support a Korean tokenizer out of
>> the box; I haven't used any other third-party one.
>>
>> The issue I am facing is that I need to integrate English, Chinese,
>> Japanese, and Korean language search in a single site. The fields will
>> be queried according to the language the user selects for the search.
>>
>> I tried using one CJK field type for all three languages, as below, but
>> only a few search terms work for Chinese and Japanese, and nothing
>> works for Korean:
>>
>>   positionIncrementGap="1" autoGeneratePhraseQueries="false">
>>   class="edu.stanford.lucene.analysis.CJKFoldingFilterFactory"/>
>>   id="Traditional-Simplified"/>
>>   id="Katakana-Hiragana"/>
>>   hiragana="true" katakana="true" hangul="true" outputUnigrams="true" />
>>
>> So I tried to implement an individual field type for each language, as
>> below.
>>
>> Chinese:
>>   positionIncrementGap="1000" autoGeneratePhraseQueries="false">
>>
>> Japanese:
>>   autoGeneratePhraseQueries="false">
>>   tags="stoptags_ja.txt" />
>>   words="stopwords_ja.txt" />
>>   minimumLength="4"/>
>>
>> Korean:
>>   autoGeneratePhraseQueries="false">
>>   hasCNoun="true" bigrammable="true"/>
>>   words="stopwords_kr.txt"/>
>>   hasCNoun="false" bigrammable="false"/>
>>   words="stopwords_kr.txt"/>
>>
>> I am really stuck on how to implement this. Please help me.
>>
>> Thanks,
>> Poornima
>
> On Thursday, 10 July 2014 2:22 PM, Alexandre Rafalovitch wrote:
>> I don't think Solr ships with a Korean tokenizer, does it?
>>
>> If you are using a third-party one, you need to give the full class
>> name, not just solr.Korean... And you need the library added in the
>> lib statement in solrconfig.xml (at least in Solr 4).
>>
>> Regards,
>>    Alex.
>
> On Thu, Jul 10, 2014 at 3:23 PM, Poornima Jay wrote:
>> I have defined the fieldType inside the fields section. When I checked
>> the error log I found the errors below:
>>
>>   Caused by: java.lang.ClassNotFoundException: solr.KoreanTokenizerFactory
>>
>>   SEVERE: org.apache.solr.common.SolrException: analyzer without class
>>   or tokenizer & filter list
>>
>> Do I need to add any libraries for the Korean tokenizer?
>>
>> Regards,
>> Poornima
>
> On Thursday, 10 July 2014 1:03 PM, Alexandre Rafalovitch wrote:
>> Double-check your XML file so that you don't - for example - define
>> your fieldType outside of the fields section. Or maybe you have an
>> exception earlier about some component in the type definition.
>>
>> This is not about the Korean language, it seems. It's something more
>> fundamental about the XML config.
>>
>> Regards,
>>    Alex.
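The mailing-list archive stripped the XML markup from the configs quoted
above, leaving only attribute fragments. A plausible reconstruction of the
combined CJK field type, inferred from those fragments and from the setup
described in the Discovery Grindstone series linked in the thread (the
tokenizer and the folding filter before the bigrammer are assumptions),
might look like this:

  <fieldType name="text_cjk" class="solr.TextField"
             positionIncrementGap="1" autoGeneratePhraseQueries="false">
    <analyzer>
      <!-- assumed: the series uses the ICU tokenizer for CJK scripts -->
      <tokenizer class="solr.ICUTokenizerFactory"/>
      <filter class="edu.stanford.lucene.analysis.CJKFoldingFilterFactory"/>
      <filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>
      <filter class="solr.ICUTransformFilterFactory" id="Katakana-Hiragana"/>
      <!-- assumed: case/diacritic folding before bigramming -->
      <filter class="solr.ICUFoldingFilterFactory"/>
      <filter class="solr.CJKBigramFilterFactory" han="true" hiragana="true"
              katakana="true" hangul="true" outputUnigrams="true"/>
    </analyzer>
  </fieldType>

Note that the ICU classes (analysis-extras contrib) and the Stanford
CJKFoldingFilter jar are not on Solr's default classpath, so they need lib
directives in solrconfig.xml; a missing jar could produce exactly the
"Error instantiating class" reported in this thread.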
Re: Korean Tokenizer in solr
Yes. Below is my defined fieldType:

  positionIncrementGap="100">
  han="true"/>
  generateWordParts="1" generateNumberParts="1" catenateWords="1"
  catenateNumbers="1" catenateAll="0" splitOnCaseChange="0"
  preserveOriginal="1"/>
  han="true"/>
  generateWordParts="1" generateNumberParts="1" catenateWords="0"
  catenateNumbers="0" catenateAll="0" splitOnCaseChange="0"
  preserveOriginal="1"/>

Please correct me if I am doing anything wrong here.

Regards,
Poornima

On Monday, 14 July 2014 12:33 PM, Alexandre Rafalovitch wrote:
> Are you sure it's not a spelling error or something similarly odd? Solr
> ships with that filter in its example schema, so you can compare what
> you are doing differently with that.
>
> Regards,
>    Alex.
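The hasCNoun and bigrammable attributes quoted in this thread do not
belong to any factory bundled with Solr 4.x, so they presumably come from
a third-party Korean analyzer, which is exactly why the advice above about
the full class name and the lib statement applies. A hypothetical sketch
(the jar path and class name below are placeholders, not the poster's
actual setup):

  <!-- in solrconfig.xml: load the third-party Korean analyzer jar -->
  <lib dir="/opt/solr/custom-libs/" regex=".*korean.*\.jar"/>

  <!-- in schema.xml: use the full class name, not solr.KoreanTokenizerFactory -->
  <tokenizer class="com.example.analysis.ko.KoreanTokenizerFactory"/>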
Re: Korean Tokenizer in solr
What happens if you make a new collection with the absolute minimum in it
and then add the definition? Start from something like:
https://github.com/arafalov/simplest-solr-config

Also, is there a longer exception earlier in the log? It may have more
clues.

Regards,
   Alex.
Personal: http://www.outerthoughts.com/ and @arafalov
Solr resources: http://www.solr-start.com/ and @solrstart
Solr popularizers community: https://www.linkedin.com/groups?gid=6713853

On Mon, Jul 14, 2014 at 2:15 PM, Poornima Jay wrote:
> Yes. Below is my defined fieldType:
>
>   positionIncrementGap="100">
>   han="true"/>
>   generateWordParts="1" generateNumberParts="1" catenateWords="1"
>   catenateNumbers="1" catenateAll="0" splitOnCaseChange="0"
>   preserveOriginal="1"/>
>   han="true"/>
>   generateWordParts="1" generateNumberParts="1" catenateWords="0"
>   catenateNumbers="0" catenateAll="0" splitOnCaseChange="0"
>   preserveOriginal="1"/>
>
> Please correct me if I am doing anything wrong here.
>
> Regards,
> Poornima
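From the surviving fragments, the field type appears to be a CJK bigram
setup (han="true") followed by a WordDelimiterFilter, with catenation
enabled only at index time. A plausible reconstruction (the tag names and
the tokenizer are assumptions; the filter class is taken from the "Error
instantiating class" message earlier in the thread) would be:

  <fieldType name="text_cjk" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <!-- assumed tokenizer; the original tag was stripped -->
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="org.apache.lucene.analysis.cjk.CJKBigramFilterFactory"
              han="true"/>
      <filter class="solr.WordDelimiterFilterFactory"
              generateWordParts="1" generateNumberParts="1" catenateWords="1"
              catenateNumbers="1" catenateAll="0" splitOnCaseChange="0"
              preserveOriginal="1"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="org.apache.lucene.analysis.cjk.CJKBigramFilterFactory"
              han="true"/>
      <filter class="solr.WordDelimiterFilterFactory"
              generateWordParts="1" generateNumberParts="1" catenateWords="0"
              catenateNumbers="0" catenateAll="0" splitOnCaseChange="0"
              preserveOriginal="1"/>
    </analyzer>
  </fieldType>

Referring to the factory as solr.CJKBigramFilterFactory (the usual
shorthand) rather than by the full Lucene class name is one difference
worth comparing against the stock example schema.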
Re: Korean Tokenizer in solr
When I try to index, the error below comes up:

  java.io.FileNotFoundException:
  /home/searchuser/multicore/apac_content/data/tlog/tlog.000
  (No such file or directory)

On Monday, 14 July 2014 2:07 PM, Poornima Jay wrote:
> Yes. Below is my defined fieldType. Please correct me if I am doing
> anything wrong here.
>
> Regards,
> Poornima
Re: Solr irregularly having QTime > 50000ms, stracing solr cures the problem
Thanks IJ for the link. I am not sure this can solve my problem, because I
have only one machine in play anyway.

Harald.

On 12.07.2014 20:49, IJ wrote:
> Guess what - I had the same issues as you. They were resolved
> (http://lucene.472066.n3.nabble.com/Slow-QTimes-5-seconds-for-Small-sized-Collections-td4143681.html)
> by adding an explicit host mapping entry in /etc/hosts for inter-node
> Solr communication, thereby bypassing DNS lookups.
Re: Reference numbers for major page faults per second, index size, query throughput
Hello Erick,

thanks for the reply. Indeed the CPUs are kind of idling during the load
test. They are not under 20%, but they clearly don't get far beyond 40%.
Changing the number of threads in jmeter has only minor effects on the
qps, but it increases the average latency as soon as the threads outnumber
the CPUs --- expected behavior, I would say.

I varied the number of results returned between 20 and 10 with no
remarkable changes in performance. I restricted to fl=id, and even this
increased the throughput only minimally (meanwhile the index has 16
million documents; qps went from 2.x to 3). Jmeter reported a reduction in
average transferred size from 10 kBytes to 2.5 kBytes. This is not really
the issue here, and in the end we need more than the IDs in production
anyway.

What really bugs me currently is that htop reports an IORR (supposed to be
read(2) calls) of between 100 and 200 MByte/s during the load test. This
somehow runs contrary to my understanding of why Solr uses mmapped files.
There should be no read(2) calls, and certainly not 200 MB/s :-/ And this
did not drop when I restricted to fl=id. I will try to check this with
strace to see where it is reading from. Hints appreciated.

With a bit of luck, I'll get more RAM and can compare then.

Thanks,
Harald.

On 12.07.2014 17:58, Erick Erickson wrote:
> If the stats you're reporting are during the load test, your CPU is kind
> of idling along at < 20%, which supports your theory.
>
> Just to cover all bases, when you bump the number of threads jmeter is
> firing, does it make any difference? And how many rows are you
> returning? This latter is important because to return documents, Solr
> needs to go out to disk, possibly generating your page faults (guessing
> here).
>
> One note about your index size: it's largely useless to measure the
> index on disk, if for no other reason than that the _stored_ data
> doesn't really count towards memory requirements for search. The *.fdt
> and *.fdx segment files contain the stored data, so subtract them out.
>
> Speaking of which, try just returning the id (&fl=id). That should
> reduce the disk seeks due to assembling the docs. But 4 qps for simple
> term queries seems very slow at first blush.
>
> FWIW,
> Erick
>
> On Thu, Jul 10, 2014 at 7:30 AM, Harald Kirsch wrote:
>> Hi everyone,
>>
>> currently I am taking some performance measurements on a Solr
>> installation, and I am trying to figure out whether what I see mostly
>> fits expectations. The data is as follows:
>>
>> - Solr 4.8.1
>> - 8 million documents, mostly office documents with real text content,
>>   stored
>> - index size on disk 90G
>> - full index memory-mapped into virtual memory
>> - this is on a vmware server, 4 cores, 16 GB RAM
>>
>>   PID  PR  NI  VIRT   RES  SHR   S  %CPU  %MEM  TIME+      nFLT
>>   961  20   0  93.9g  10g  6.0g  S    19  64.5  718:39.81  757k
>>
>> When I start running a jmeter query test sending requests as fast as
>> possible with a few threads, it peaks at about 4 qps with a real-world
>> query replay of mostly 1, 2, sometimes more terms.
>>
>> What I see are around 150 to 200 major page faults per second, meaning
>> that Solr is not really happy with what happens to be in memory at any
>> instant in time. My hunch is that this hints at a too-small RAM
>> footprint: much more RAM is needed to get the number of major page
>> faults down.
>>
>> Would anyone agree or disagree with this analysis? Is someone out
>> there saying "200 major page faults/second are normal, there must be
>> another problem"?
>>
>> Thanks,
>> Harald.
Re: Solr irregularly having QTime > 50000ms, stracing solr cures the problem
This problem seems to completely disappear under load. I started making
load tests despite fearing them to be useless. It turns out that there are
no more 50-second delays under load.

Harald.

On 09.07.2014 09:50, Harald Kirsch wrote:
> Good point. I will see if I can get the necessary access rights on this
> machine to run tcpdump.
>
> Thanks for the suggestion,
> Harald.

On 09.07.2014 00:32, Steve McKay wrote:
> Sure sounds like a socket bug, doesn't it? I turn to tcpdump when Solr
> starts behaving strangely in a socket-related way. Knowing exactly
> what's happening at the transport level is worth a month of guessing
> and poking.

On Jul 8, 2014, at 3:53 AM, Harald Kirsch wrote:
> Hi all,
>
> This is what happens when I run a regular wget query to log the current
> number of documents indexed:
>
>   2014-07-08:07:23:28 QTime=20    numFound="5720168"
>   2014-07-08:07:24:28 QTime=12    numFound="5721126"
>   2014-07-08:07:25:28 QTime=19    numFound="5721126"
>   2014-07-08:07:27:18 QTime=50071 numFound="5721126"
>   2014-07-08:07:29:08 QTime=50058 numFound="5724494"
>   2014-07-08:07:30:58 QTime=50033 numFound="5730710"
>   2014-07-08:07:31:58 QTime=13    numFound="5730710"
>   2014-07-08:07:33:48 QTime=50065 numFound="5734069"
>   2014-07-08:07:34:48 QTime=16    numFound="5737742"
>   2014-07-08:07:36:38 QTime=50037 numFound="5737742"
>   2014-07-08:07:37:38 QTime=12    numFound="5738190"
>   2014-07-08:07:38:38 QTime=23    numFound="5741208"
>   2014-07-08:07:40:29 QTime=50034 numFound="5742067"
>   2014-07-08:07:41:29 QTime=12    numFound="5742067"
>   2014-07-08:07:42:29 QTime=17    numFound="5742067"
>   2014-07-08:07:43:29 QTime=20    numFound="5745497"
>   2014-07-08:07:44:29 QTime=13    numFound="5745981"
>   2014-07-08:07:45:29 QTime=23    numFound="5746420"
>
> As you can see, the QTime is just over 50 seconds at irregular
> intervals. This happens independent of whether I am indexing documents
> at around 20 dps or not. First I thought about a dependence on the
> auto-commit of 5 minutes, but the 50-second hits are too irregular.
>
> Furthermore, and this is *really strange*: when hooking strace onto the
> Solr process, the 50-second QTimes disappear completely and
> consistently --- a real Heisenbug. Nevertheless, strace shows that
> there is a socket timeout of 50 seconds defined in calls like this:
>
>   [pid 1253] 09:09:37.857413 poll([{fd=96, events=POLLIN|POLLERR}], 1,
>   50000) = 1 ([{fd=96, revents=POLLIN}]) <0.000040>
>
> where fd=96 is the result of
>
>   [pid 25446] 09:09:37.855235 accept(122, {sa_family=AF_INET,
>   sin_port=htons(57236), sin_addr=inet_addr("ip address of local
>   host")}, [16]) = 96 <0.000054>
>
> and where again fd=122 is the TCP port on which Solr was started. My
> hunch is that this is communication between the cores of Solr.
>
> I tried to search the internet for such a strange connection between
> socket timeouts and strace, but could not find anything (the
> stackoverflow entry from yesterday is my own :-( ). This smells a bit
> like a race-condition/deadlock kind of thing which is broken up by
> timing differences introduced by stracing the process. Any hints
> appreciated.
>
> For completeness, here is my setup:
> - solr-4.8.1
> - cloud version running
> - 10 shards on 10 cores in one instance
> - hosted on SUSE Linux Enterprise Server 11 (x86_64), VERSION 11,
>   PATCHLEVEL 2
> - hosted on a vmware, 4 CPU cores, 16 GB RAM
> - single-digit million docs indexed, exact number does not matter
> - zero query load
>
> Harald.
Of, To, and Other Small Words
Hello all,

I am working with Solr 4.9.0 and am searching for phrases that contain
words like "of" or "to" that Solr seems to be ignoring at index time.
Here's what I tried (the archive stripped the XML from the original
message; the payload was presumably the standard add/doc envelope):

curl http://localhost/solr/update?commit=true -H "Content-Type: text/xml"
--data-binary '<add><doc><field name="id">100</field><field
name="content">blah blah blah knowledge of science blah blah
blah</field></doc></add>'

Then, using a browser:

http://localhost/solr/collection1/select?q="knowledge+of+science"&fq=id:100

I get zero hits. Search for "knowledge" or "science" and I'll get hits;
"knowledge of" or "of science" and I get zero hits. I don't want to use
proximity if I can avoid it, as this may introduce too many undesirable
results. stopwords.txt is blank, yet clearly Solr is ignoring "of" and
"to", and possibly more words that I have not discovered through testing
yet. Is there some other configuration file that contains these small
words? Is there any way to force Solr to pay attention to them and not
drop them from the phrase? Any advice is appreciated! Thanks!

-Teague
Re: Of, To, and Other Small Words
Hi Teague,

The StopFilterFactory (which I think you're using) by default uses
lang/stopwords_en.txt (which wouldn't be empty if you check). What you're
looking at is stopwords.txt. You could either empty that file out or
change the field type for your field.

On Mon, Jul 14, 2014 at 12:53 PM, Teague James wrote:
> I am working with Solr 4.9.0 and am searching for phrases that contain
> words like "of" or "to" that Solr seems to be ignoring at index time.
> Is there some other configuration file that contains these small words?
> Is there any way to force Solr to pay attention to them and not drop
> them from the phrase?

--
Anshum Gupta
http://www.anshumgupta.net
Re: Of, To, and Other Small Words
Or, if you happen to leave off the "words" attribute of the stop filter
(or misspell the attribute name), it will use the internal Lucene
hardwired list of stop words.

-- Jack Krupansky

-----Original Message-----
From: Anshum Gupta
Sent: Monday, July 14, 2014 4:03 PM
To: solr-user@lucene.apache.org
Subject: Re: Of, To, and Other Small Words

> The StopFilterFactory (which I think you're using) by default uses
> lang/stopwords_en.txt (which wouldn't be empty if you check). What
> you're looking at is stopwords.txt. You could either empty that file
> out or change the field type for your field.
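In config terms, the difference Jack describes is between these two forms
(a sketch; only the second falls back to Lucene's built-in English list):

  <!-- reads the stopword list from the named file -->
  <filter class="solr.StopFilterFactory" ignoreCase="true"
          words="stopwords.txt"/>

  <!-- no "words" attribute: uses Lucene's hardwired English stopwords,
       which include "of" and "to" -->
  <filter class="solr.StopFilterFactory" ignoreCase="true"/>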
Strategies for effective prefix queries?
I'm working on using Solr for autocompleting usernames. I'm running into a
problem with the wildcard queries (e.g. username:al*).

We are tokenizing usernames so that a username like "solr-user" will be
tokenized into "solr" and "user", and will match both "sol" and "use"
prefixes. The problem is that when we get "solr-u" as a prefix, I'm having
to split that up on the client side before I construct the query
"username:solr* username:u*". I'm basically using a regex as a poor man's
tokenizer.

Is there a better way to approach this? Is there a way to tell Solr to
tokenize a string and use the parts as prefixes?

- Hayden
RE: Of, To, and Other Small Words
Hi Anshum,

Thanks for replying and suggesting this, but the field type I am using (a
modified text_general) in my schema has the file set to 'stopwords.txt':

  positionIncrementGap="100">
  ignoreCase="true" words="stopwords.txt" />
  minGramSize="3" maxGramSize="10" />
  ignoreCase="true" words="stopwords.txt" />
  synonyms="synonyms.txt" ignoreCase="true" expand="true"/>

Just to be doubly sure, I cleared the list in stopwords_en.txt, restarted
Solr, re-indexed, and searched, with still zero results. Any other
suggestions on where I might be able to control this behavior?

-Teague

-----Original Message-----
From: Anshum Gupta [mailto:ans...@anshumgupta.net]
Sent: Monday, July 14, 2014 4:04 PM
To: solr-user@lucene.apache.org
Subject: Re: Of, To, and Other Small Words

> The StopFilterFactory (which I think you're using) by default uses
> lang/stopwords_en.txt (which wouldn't be empty if you check). What
> you're looking at is stopwords.txt. You could either empty that file
> out or change the field type for your field.
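The archive stripped the markup from that field type. A plausible
reconstruction from the surviving fragments (the tokenizer and the
lower-case filter are assumptions based on the stock text_general type;
the EdgeNGram filter is confirmed later in the thread):

  <fieldType name="text_general" class="solr.TextField"
             positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true"
              words="stopwords.txt"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.EdgeNGramFilterFactory" minGramSize="3"
              maxGramSize="10"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true"
              words="stopwords.txt"/>
      <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
              ignoreCase="true" expand="true"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

Note that with minGramSize="3", two-letter tokens such as "of" and "to"
produce no grams at all at index time, which matches the behavior reported
here and the resolution further down the thread.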
Re: Of, To, and Other Small Words
Have you tried the Admin UI's Analyze screen? It will show you what
happens to the text as it progresses through the tokenizers and filters.
No need to reindex.

Regards,
   Alex.
Personal: http://www.outerthoughts.com/ and @arafalov
Solr resources: http://www.solr-start.com/ and @solrstart
Solr popularizers community: https://www.linkedin.com/groups?gid=6713853

On Tue, Jul 15, 2014 at 8:10 AM, Teague James wrote:
> Thanks for replying and suggesting this, but the field type I am using
> (a modified text_general) in my schema has the file set to
> 'stopwords.txt'. Just to be doubly sure, I cleared the list in
> stopwords_en.txt, restarted Solr, re-indexed, and searched, with still
> zero results. Any other suggestions on where I might be able to control
> this behavior?
Re: Strategies for effective prefix queries?
Search against both fields (one split, one not split)? Keep both the
original and the tokenized form? I am doing something similar with class
name autocompletes here:
https://github.com/arafalov/Solr-Javadoc/blob/master/JavadocIndex/JavadocCollection/conf/schema.xml#L24

Regards,
   Alex.
Personal: http://www.outerthoughts.com/ and @arafalov
Solr resources: http://www.solr-start.com/ and @solrstart
Solr popularizers community: https://www.linkedin.com/groups?gid=6713853

On Tue, Jul 15, 2014 at 8:04 AM, Hayden Muhl wrote:
> I'm working on using Solr for autocompleting usernames. I'm running
> into a problem with the wildcard queries (e.g. username:al*). Is there
> a better way to approach this? Is there a way to tell Solr to tokenize
> a string and use the parts as prefixes?
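A minimal sketch of that two-field approach (all field and type names here
are hypothetical, not taken from the linked schema): the untokenized copy
serves whole-string prefixes like "solr-u*", while the split copy serves
per-part prefixes.

  <fieldType name="username_whole" class="solr.TextField">
    <analyzer>
      <!-- keeps the username as a single token for whole-string prefixes -->
      <tokenizer class="solr.KeywordTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

  <fieldType name="username_parts" class="solr.TextField">
    <analyzer>
      <!-- splits on the same delimiters the client-side regex handles -->
      <tokenizer class="solr.PatternTokenizerFactory" pattern="[-_.]+"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

  <field name="username"       type="username_whole" indexed="true" stored="true"/>
  <field name="username_split" type="username_parts" indexed="true" stored="false"/>
  <copyField source="username" dest="username_split"/>

A prefix such as "solr-u" then matches directly against the whole-string
field (username:solr\-u*), with username_split still covering per-part
prefixes like use*, so no client-side splitting of the whole string is
needed.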
RE: Of, To, and Other Small Words
Jack,

Thanks for replying and for the suggestion. I replied to another
suggestion with my field type, and I do have the "words" attribute set.
There's nothing in the stopwords.txt. I even cleaned out stopwords_en.txt
just to be certain. Any other suggestions on how to control this behavior?

-Teague

-----Original Message-----
From: Jack Krupansky [mailto:j...@basetechnology.com]
Sent: Monday, July 14, 2014 4:26 PM
To: solr-user@lucene.apache.org
Subject: Re: Of, To, and Other Small Words

> Or, if you happen to leave off the "words" attribute of the stop filter
> (or misspell the attribute name), it will use the internal Lucene
> hardwired list of stop words.
>
> -- Jack Krupansky
RE: Of, To, and Other Small Words
Alex,

Thanks! Great suggestion. I figured out that it was the
EdgeNGramFilterFactory. Taking that out of the mix did it.

-Teague

-----Original Message-----
From: Alexandre Rafalovitch [mailto:arafa...@gmail.com]
Sent: Monday, July 14, 2014 9:14 PM
To: solr-user
Subject: Re: Of, To, and Other Small Words

> Have you tried the Admin UI's Analyze screen? It will show you what
> happens to the text as it progresses through the tokenizers and
> filters. No need to reindex.
>
> Regards,
>    Alex.
Re: Of, To, and Other Small Words
You could try experimenting with CommonGramsFilterFactory and
CommonGramsQueryFilter (slightly different). There are actually a lot of
cool analyzers bundled with Solr. You can find the full list on my site
at: http://www.solr-start.com/info/analyzers

Regards,
   Alex.
Personal: http://www.outerthoughts.com/ and @arafalov
Solr resources: http://www.solr-start.com/ and @solrstart
Solr popularizers community: https://www.linkedin.com/groups?gid=6713853

On Tue, Jul 15, 2014 at 8:42 AM, Teague James wrote:
> Thanks! Great suggestion. I figured out that it was the
> EdgeNGramFilterFactory. Taking that out of the mix did it.
>
> -Teague
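For phrase searches that must keep words like "of" and "to", the
CommonGrams pair works roughly like this: at index time common words are
glued to their neighbors ("knowledge of science" also yields
"knowledge_of" and "of_science"), and the query-side variant produces just
those grams so the phrase still matches. A sketch, assuming the same
stopwords.txt from earlier in the thread:

  <fieldType name="text_commongrams" class="solr.TextField"
             positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <!-- emits word pairs such as "knowledge_of" and "of_science"
           alongside the plain tokens -->
      <filter class="solr.CommonGramsFilterFactory" words="stopwords.txt"
              ignoreCase="true"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <!-- query-side variant keeps only the grams where a common word
           is involved, so phrases with stopwords match precisely -->
      <filter class="solr.CommonGramsQueryFilterFactory" words="stopwords.txt"
              ignoreCase="true"/>
    </analyzer>
  </fieldType>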
Re: External File Field eating memory
Hey Kamal,

What config changes have you made to replicate the external files, and how
have you disabled core reloading?

On Wed, Jul 9, 2014 at 11:30 AM, Kamal Kishore Aggarwal
<kkroyal@gmail.com> wrote:
> Hi All,
>
> It was found that the external file, which was getting replicated every
> 10 minutes, was reloading the core as well. This was increasing the
> query time.
>
> Thanks,
> Kamal Kishore
>
> On Thu, Jul 3, 2014 at 12:48 PM, Kamal Kishore Aggarwal wrote:
>> With the replication configuration below, the eff file is getting
>> replicated to core/conf/data/external_eff_views (a new "data" dir is
>> being created in the conf dir), but it is not getting replicated to
>> core/data/external_eff_views on the slave.
>>
>> Please help.
>>
>> On Thu, Jul 3, 2014 at 12:21 PM, Kamal Kishore Aggarwal wrote:
>>> Thanks for your guidance, Alexandre Rafalovitch.
>>>
>>> I am looking into this seriously.
>>>
>>> Another question: I am facing an error in the replication of the eff
>>> file. This is the master replication configuration in
>>> core/conf/solrconfig.xml:
>>>
>>>   commit
>>>   startup
>>>   ../data/external_eff_views
>>>
>>> The eff file is present at the core/data/external_eff_views location.
>>>
>>> On Thu, Jul 3, 2014 at 11:50 AM, Shalin Shekhar Mangar wrote:
>>>> This might be related:
>>>> https://issues.apache.org/jira/browse/SOLR-3514
>>>>
>>>> On Sat, Jun 28, 2014 at 5:34 PM, Kamal Kishore Aggarwal wrote:
>>>>> Hi Team,
>>>>>
>>>>> I have recently implemented EFF in Solr. There are about 1.5 lakh
>>>>> (unsorted) values in the external file. Since this implementation,
>>>>> the server has become slow, and the Solr query time has also
>>>>> increased.
>>>>>
>>>>> Can anybody confirm whether these issues are because of this
>>>>> implementation? Is it memory that EFF eats up?
>>>>>
>>>>> Regards,
>>>>> Kamal Kishore
>>>>
>>>> --
>>>> Regards,
>>>> Shalin Shekhar Mangar.

--
Thanks & Regards,
Apoorva
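The archive stripped the XML around the three values quoted above
(commit, startup, ../data/external_eff_views). Given the standard
replication-handler syntax, the master config was presumably something
along these lines (element placement is an assumption):

  <requestHandler name="/replication" class="solr.ReplicationHandler">
    <lst name="master">
      <str name="replicateAfter">commit</str>
      <str name="replicateAfter">startup</str>
      <!-- confFiles are replicated into the slave's conf/ directory,
           which may explain why the file lands in core/conf/data/...
           instead of core/data/... -->
      <str name="confFiles">../data/external_eff_views</str>
    </lst>
  </requestHandler>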