maxWarmingSearchers and memory leak
We have maxWarmingSearchers set to 2 and the field value cache set to an initial size of 64. A heap dump showed that our caches consume 70% of the heap; looking into the dump, we saw that the fieldValueCache has 6 occurrences of org.apache.solr.util.ConcurrentLRUCache. With maxWarmingSearchers=2 we would expect to see only 3 (maybe 4 before GC has run). What can it be? We use Solr 4.10.1.
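For reference, here is a sketch of the relevant solrconfig.xml settings (initialSize is ours; the other attribute values are placeholders from the stock config, not our actual numbers). Note that solr.FastLRUCache is backed by ConcurrentLRUCache, which is why that class shows up in the heap dump:

    <maxWarmingSearchers>2</maxWarmingSearchers>
    <fieldValueCache class="solr.FastLRUCache"
                     size="512"
                     initialSize="64"
                     autowarmCount="0"/>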
Re: Replicas fail immediately in new collection
SOLR-9739 changed the writeStr method to accept a CharSequence instead of a String in 6.4, so my guess is that your classpath has a newer (6.4+) solrj version but an older solr-core jar that cannot find this new method.

On Sat, Feb 18, 2017 at 5:16 AM, Walter Underwood wrote:
> Any idea why I would be getting this on a brand new, empty collection on
> the first update?
>
> HTTP ERROR 500
> Problem accessing /solr/tutors_shard1_replica9/update. Reason:
> Server Error. Caused by: java.lang.NoSuchMethodError:
> org.apache.solr.update.TransactionLog$LogCodec.writeStr(Ljava/lang/String;)V
> at org.apache.solr.update.TransactionLog.writeCommit(TransactionLog.java:457)
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/ (my blog)

--
Regards,
Shalin Shekhar Mangar.
Re: Replicas fail immediately in new collection
I finally figured this out yesterday. Because the jar files have the version in the file name, I had a mix of jars from different versions. Depending on the load order, Solr could get into a situation where it was calling something that didn’t exist. That was mysterious.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/ (my blog)

> On Feb 23, 2017, at 6:55 AM, Shalin Shekhar Mangar wrote:
> [quoted text snipped]
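A quick way to spot a version mix like this (paths assume our install layout; adjust to yours):

    find /apps/solr6 \( -name 'solr-*.jar' -o -name 'lucene-*.jar' \) | sort

Any jar whose file name shows a different version number than the rest is a candidate for the stale one.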
Re: Interval Facets with JSON
Hi Deniz,

Interval Facets are currently not supported with JSON Facets, as Tom said. Could you create a Jira issue?

On Fri, Feb 10, 2017 at 6:16 AM, Tom Evans wrote:
> On Wed, Feb 8, 2017 at 11:26 PM, deniz wrote:
> > Tom Evans-2 wrote
> >> I don't think there is such a thing as an interval JSON facet.
> >> Whereabouts in the documentation are you seeing an "interval" as JSON
> >> facet type?
> >>
> >> You want a range facet surely?
> >>
> >> One thing with range facets is that the gap is fixed size. You can
> >> actually do your example however:
> >>
> >> json.facet={height_facet:{type:range, gap:20, start:160, end:190,
> >> hardend:true, field:height}}
> >>
> >> If you do require arbitrary bucket sizes, you will need to do it by
> >> specifying query facets instead, I believe.
> >>
> >> Cheers
> >>
> >> Tom
> >
> > nothing other than
> > https://cwiki.apache.org/confluence/display/solr/Faceting#Faceting-IntervalFaceting
> > for documentation on intervals... i am ok with range queries as well but
> > intervals would fit better because of different sizes...
>
> That documentation is not for JSON facets though. You can't pick and
> choose features from the old facet system and use them in JSON facets
> unless they are mentioned in the JSON facet documentation:
>
> https://cwiki.apache.org/confluence/display/solr/JSON+Request+API
>
> and (not official documentation)
>
> http://yonik.com/json-facet-api/
>
> Cheers
>
> Tom
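For completeness, the classic (non-JSON) interval facet request from the page deniz linked would look roughly like this (a sketch — the field name and bucket edges are borrowed from Tom's range example, not from deniz's actual schema):

    q=*:*&facet=true&facet.interval=height
    &f.height.facet.interval.set=[160,180)
    &f.height.facet.interval.set=[180,190]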
SOLRCloud on 6.4 on Ubuntu
I'm trying to find a good beginner level guide to setting up SolrCloud NOT using the example configs that are provided with SOLR.

Here are my goals (and the steps I have done so far!):

1. Use an external ZooKeeper server
   a. wget http://apache.claz.org/zookeeper/zookeeper-3.3.6/zookeeper-3.3.6.tar.gz
   b. uncompress into /apps folder (Our company uses this type of standard folder, so I'm following suit here)
   c. Copy zoo_sample.cfg to zoo.cfg
   d. Update data folder to: /apps/zookeeperData
   e. bin/zkServer.sh start

2. Install SOLR on both nodes
   a. wget http://www.us.apache.org/dist/lucene/solr/6.4.1/solr-6.4.1.tgz
   b. tar xzf solr-6.4.1.tgz solr-6.4.1/bin/install_solr_service.sh --strip-components=2
   c. ./install_solr_service.sh solr-6.4.1.tgz
   d. Update solr.in.sh to include the ZKHome variable set to my ZK server's IP on port 2181

Now it seems if I start SOLR manually with bin/solr start -c -p 8080 -z <ZK IP>:2181 then it will actually load, but if I let it auto start, I get an HTTP 500 error on the Admin UI for SOLR.

I also can't seem to figure out what I need to upload into Zookeeper as far as configuration files go. I created a test collection on the instance when I got it up one time... but it has yet to start properly again for me.

Are there any GOOD tutorials out there? I have read most of the documentation I can get my hands on thus far from Apache, and blogs and such, but the light bulb still has not lit up for me yet and I feel like a n00b ;-)

My company is currently running SOLR in the old master/slave config and I'm trying to set up a SolrCloud so that we can toy with it in a Dev/QA environment and see what it's capable of. We're currently running 4 separate master/slave SOLR server pairs in production to spread out the load a bit, but I'd rather see us migrate towards a cluster/cloud scenario to gain some computing power here!

Any help is GREATLY appreciated!

Scott
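For reference, here is the kind of line I mean in /etc/default/solr.in.sh (IP redacted; the variable name is taken from the shipped template — if what I actually set differs from this, that could be my problem):

    ZK_HOST="<ZK IP>:2181"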
Phrase field matches not counting towards minimum match
Ok, let me explain what I am trying to do first, since there may be a better approach. Recently I have been trying to increase Solr's matching precision by requiring that all of the words in a field match before allowing a match on that field. I am using edismax as my query parser, and since it tokenizes on whitespace, there's no way to ensure that for a query like q=foo bar against a text field indexed with "foo bar", the terms foo and bar don't match individually but the phrase "foo bar" does. I feel like I'm not explaining this very well, but basically what I want to do has already been done by Lucidworks: https://lucidworks.com/2014/07/02/automatic-phrase-tokenization-improving-lucene-search-precision-by-more-precise-linguistic-analysis/

However, their solution requires a pluggable query parser which is not an extension of edismax. Now, I haven't done a deep comparison, but I'm assuming I would lose access to all of edismax's parameters if I used their pluggable query parser. So instead I tried to replicate this functionality using edismax's pf2 and pf3 parameters. It all works beautifully the way I have it set up, except that phrase field matches don't count towards my mm count.

Now I will go into detail about how I have my index set up for this specific example. I am using Solr's default text field to index a field named manufacturer2. Here are the relevant parameters of my search:

    q=livex lighting 8193
    qf=productid manufacturer_stop
    pf2=manufacturer2
    mm=3<-1 5<-2 6<90%

I am stopping the word "lighting" in my manufacturer_stop field using stopwords, so only livex matches in the manufacturer_stop field. However, "livex lighting" matches in the manufacturer2 field via phrase field matching through the pf2 parameter. So my matches are the following:

    MATCH livex in manufacturer_stop field
    MATCH 8193 in productid field
    MATCH "livex lighting" in manufacturer2 field as a phrase field match

So I have three matches. However, the phrase field match doesn't seem to be counting towards my mm requirement that, with 3 tokens in the query, all 3 must match. If I change my mm to require only 2 tokens to match, I get the expected result. But I want my phrase field match to count towards my mm requirement, since lighting is matching in my phrase field.

Any assistance would be appreciated, or if someone could suggest a better approach, that would also be appreciated.
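One way to see what is going on (a sketch; defType=edismax is assumed here since that's the parser in use): append debug=query to the request above and inspect the parsedquery output. Edismax emits the pf2 phrase as a separate optional boost clause outside the mm-governed main clause built from qf, which is consistent with the behavior described:

    ...&qf=productid manufacturer_stop&pf2=manufacturer2&mm=3<-1 5<-2 6<90%&debug=query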
Re: SOLRCloud on 6.4 on Ubuntu
I don't know which of these you read, so it is a bit of a grab bag. And I haven't reviewed some of them in depth. But hopefully, there is a nugget of gold somewhere in there for you:

https://github.com/LucidWorks/solr-scale-tk
https://www.slideshare.net/thelabdude/apache-con-managingsolrcloudinthecloud
https://systemsarchitect.net/2013/04/06/painless-guide-to-solr-cloud-configuration/
https://github.com/bloomreach/solrcloud-haft
http://www.francelabs.com/blog/tutorial-solrcloud-5-amazon-ec2/ (oldish)
https://github.com/freedev/solrcloud-zookeeper-docker
https://sematext.com/blog/2016/12/13/solr-master-slave-solrcloud-migration/
http://dlightdaily.com/2016/11/30/solr-cloud-installation-zookeeper/
https://sbdevel.wordpress.com/2016/11/30/70tb-16b-docs-4-machines-1-solrcloud/ (just to drool, but it may also be useful)

Hope it helps,
   Alex.

http://www.solr-start.com/ - Resources for Solr users, new and experienced

On 23 February 2017 at 16:12, Pouliot, Scott wrote:
> [quoted text snipped]
Subscription to group
Hi, I want to be part of the solr-user group. Can you add me?
Re: SOLRCloud on 6.4 on Ubuntu
Getting configs up to (and down from) ZooKeeper is done either with zkcli or bin/solr. Personally I find the latter easier, if only because it's in a single place. Try

    bin/solr zk -help

and you'll see a bunch of options. Once you do upload the config, you must reload the collection for it to "take".

Best,
Erick

On Thu, Feb 23, 2017 at 1:51 PM, Alexandre Rafalovitch wrote:
> [quoted text snipped]
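To make that concrete, the upload and reload look roughly like this (a sketch — the config set name, config directory, and collection name are placeholders; the port matches the 8080 used earlier in the thread):

    bin/solr zk upconfig -z <ZK IP>:2181 -n myconfig -d /path/to/configset/conf
    curl "http://localhost:8080/solr/admin/collections?action=RELOAD&name=test"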
Re: Question about best way to architect a Solr application with many data sources
Alfresco has spent ten+ years building a content management system that follows this basic design:

1) Original bytes (PDF, Word doc, image file) are stored in a filesystem-based content store.
2) Meta-data is stored in a relational database, normalized.
3) Content is transformed to text, meta-data is de-normalized, and both are sent to Solr for indexing.
4) Solr keeps a copy of the de-normalized, pre-analyzed content on disk next to the indexes for re-indexing and other purposes.
5) Solr analyzes and indexes the content.

This all happens automatically when the content is added to Alfresco. ACLs are also stored along with documents and passed to Solr to support document-level access control during search.

Joel Bernstein
http://joelsolr.blogspot.com/

On Wed, Feb 22, 2017 at 3:01 PM, Tim Casey wrote:
> I would possibly extend this a bit further. There is the source, then the
> 'normalized' version of the data, then the indexed version. Sometimes you
> realize you missed something in the normalized view and you have to go
> back to the actual source. How likely that is grows with the number of
> sources for data. I would expect the "DB" version of the data would be
> the normalized view. It is also possible the DB holds the raw bytes of
> the source, which are then transformed into a normalized view. Indexing
> always happens from the normalized view. In this scheme, there is
> frequently a way to mark what failed normalization so you can go back and
> recapture the data for a re-index.
>
> Also, if you are dealing with timely data, being able to reindex helps
> remove stale information from the search index. In the pipeline of
> captured source -> normalized -> analyzed -> information, where analyzed
> is indexed here, what you do with the data over a year or more becomes
> part of the thinking.
>
> On Tue, Feb 21, 2017 at 8:24 PM, Walter Underwood wrote:
>> Reindexing is exactly why you want the Single Source of Truth to be in a
>> repository outside of Solr.
>>
>> For our slowly-changing data sets, we have an intermediate JSONL batch.
>> That is created from the source repositories and saved in Amazon S3.
>> Then we load it into Solr nightly. That allows us to reload whenever we
>> need to, like loading prod data in test or moving search to a different
>> Amazon region.
>>
>> wunder
>> Walter Underwood
>> wun...@wunderwood.org
>> http://observer.wunderwood.org/ (my blog)
>>
>>> On Feb 21, 2017, at 7:34 PM, Erick Erickson wrote:
>>>
>>> Dave:
>>>
>>> Oh, I agree that a DB is a perfectly valid place to store the data, and
>>> you're absolutely right that it allows better interaction than flat
>>> files; you can ask questions of an RDBMS that you can't easily ask the
>>> disk ;). Storing to disk is an alternative if you're unwilling to deal
>>> with a DB is all.
>>>
>>> But the main point is you'll change your schema sometime and have to
>>> re-index. Having the data you're indexing stored locally in whatever
>>> form will allow much faster turn-around rather than re-crawling. Of
>>> course it'll result in out-of-date data, so you'll have to refresh
>>> somehow sometime.
>>>
>>> Erick
>>>
>>>> On Tue, Feb 21, 2017 at 6:07 PM, Dave wrote:
>>>> Ha, I think I went to one of your training seminars in NYC maybe 4
>>>> years ago, Erick. I'm going to have to respectfully disagree about the
>>>> rdbms.
>>>> It's such a well-known data format that you could hire a high school
>>>> programmer to help with the db end if you knew how to flatten it to
>>>> solr. Besides, it's easy to visualize and interact with the data before
>>>> it goes to solr. A JSON/NoSQL format would work just as well, but I
>>>> really think a database has its place in a scenario like this.
>>>>
>>>>> On Feb 21, 2017, at 8:20 PM, Erick Erickson wrote:
>>>>>
>>>>> I'll add that I _guarantee_ you'll want to re-index the data as you
>>>>> change your schema and the like. You'll be able to do that much more
>>>>> quickly if the data is stored locally somehow.
>>>>>
>>>>> A RDBMS is not necessary, however. You could simply store the data on
>>>>> disk in some format you could re-read and send to Solr.
>>>>>
>>>>> Best,
>>>>> Erick
>>>>>
>>>>>> On Tue, Feb 21, 2017 at 5:17 PM, Dave wrote:
>>>>>> B is a better option long term. Solr is meant for retrieving flat
>>>>>> data, fast, not hierarchical. That's what a database is for, and
>>>>>> trust me, you would rather have a real database on the end point.
>>>>>> Each tool has a purpose; solr can never replace a relational
>>>>>> database, and a relational database could not replace solr. Start
>>>>>> with the slow model (database) for control/display and enhance with
>>>>>> the fast model (solr) for retrieval/search.
>>>>>>
>>>>>>> On Feb 21, 2017, at 7:57 PM, Robert Hume wrote:
>>>>>>> To learn how to properly use Solr, I'm building a little
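As a concrete sketch of the nightly reload Walter describes above (host, collection, and file names are placeholders):

    curl -X POST -H 'Content-Type: application/json' \
      --data-binary @nightly-batch.jsonl \
      'http://localhost:8983/solr/mycollection/update/json/docs?commit=true'

The /update/json/docs handler accepts a stream of JSON objects, so a JSONL batch can be posted as-is.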
Setting Solr data dir isn't really working (6.3.0)
I did this in the solrconfig.xml for both collections (tutors and questions).

    <dataDir>/solr/data</dataDir>

I deleted the old collection indexes, reloaded, restarted, and created a new collection for “tutors". And I see this on the disk.

[wunder@new-solr-c02.test3]# ls -l /solr/data
total 36
drwxr-xr-x 2 bin bin 20480 Feb 23 17:40 index
drwxr-xr-x 2 bin bin  4096 Feb 23 15:57 snapshot_metadata
drwxr-xr-x 2 bin bin  4096 Feb 23 15:57 suggest_subject_names_fuzzy
drwxr-xr-x 2 bin bin  4096 Feb 23 15:57 suggest_subject_names_infix
drwxr-xr-x 2 bin bin  4096 Feb 23 17:40 tlog
[wunder@new-solr-c02.test3]# ls -l /apps/solr6/server/solr
total 12
drwxr-xr-x 5 bin bin   93 Jul 14  2016 configsets
-rw-r--r-- 1 bin bin 3037 Jul 14  2016 README.txt
-rw-r--r-- 1 bin bin 2117 Aug 31 20:13 solr.xml
drwxr-xr-x 2 bin bin   28 Feb 23 15:57 tutors_shard1_replica5
-rw-r--r-- 1 bin bin  501 Jul 14  2016 zoo.cfg
[wunder@new-solr-c02.test3]#

Seems pretty broken to me.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/ (my blog)
Re: Setting Solr data dir isn't really working (6.3.0)
Not quite sure what your complaint is. Is it that you've got the index directory under /solr/data and not under, say, /solr/data/tutors? Or that /apps/solr6/server/solr/tutors_shard1_replica5 exists at all?

And what's in tutors_shard1_replica5 anyway? Just the core.properties file?

Erick

On Thu, Feb 23, 2017 at 5:41 PM, Walter Underwood wrote:
> [quoted text snipped]
Re: Setting Solr data dir isn't really working (6.3.0)
The bug is that the dataDir is /solr/data and the index data is in /apps/solr6/server/solr. Except for the suggest data. No index data should be outside the dataDir, right?

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/ (my blog)

> On Feb 23, 2017, at 6:11 PM, Erick Erickson wrote:
> [quoted text snipped]
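For what it's worth, a per-core data dir along these lines (a sketch — ${solr.core.name} is an implicit core property substituted by Solr at core load) would at least keep the two collections from sharing one index directory:

    <dataDir>/solr/data/${solr.core.name}</dataDir>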
Index Segments not Merging
We have Solr with the index stored in HDFS. We are running MapReduce jobs to build the index, using the MapReduceIndexerTool from Cloudera with the go-live option to merge into our live index. We are seeing an issue where the number of segments in the index never decreases. It continues to grow until we manually do an optimize. We are using the following solr config for merge policy: *101016* If we add documents into Solr without using MapReduce, the segments merge properly as expected. Any ideas on why we see this behavior? Does the go-live index merge prevent the segments from merging afterwards? Thanks, Jordan
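For reference, a merge policy block of the kind quoted above would normally look something like this (a sketch — the element names are the standard TieredMergePolicy ones; mapping the surviving values onto them is a guess):

    <mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
      <int name="maxMergeAtOnce">10</int>
      <int name="segmentsPerTier">10</int>
    </mergePolicy>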
Re: Arabic words search in solr
Hi Mohan,

I indexed your 9 examples as simple documents after mapping the dynamic field "*_ar" to the "text_ar" field type:

    [{"id":"1", "name_ar":"المؤسسة التجارية العمانية"},
     {"id":"2", "name_ar":"شركة التأمين الأهلية ش.م.ع.م"},
     {"id":"3", "name_ar":"شرطة عمان السلطانية - قيادة شرطة محافظة شمال الشرقية - - مركز شرطة إبراء"},
     {"id":"4", "name_ar":"شركة ظفار للتأمين ش.م.ع.ع"},
     {"id":"5", "name_ar":"طوارئ المستشفيات - طوارئ مستشفى صحار"},
     {"id":"6", "name_ar":"شرطة عمان السلطانية - قيادة شرطة محافظة الداخلية - - مركز شرطة إزكي"},
     {"id":"7", "name_ar":"المؤسسة التجارية العمانية"},
     {"id":"8", "name_ar":"وزارة الصحة - المديرية العامة للخدمات الصحية محافظة الداخلية - - مستشفى إزكي (البدالة) - الطوارئ"},
     {"id":"9", "name_ar":"أسعار المكالمات الدولية - مونتسرات - - مونتسرات"}]

Then when I search from the Admin UI for "name_ar:شرطة ازكي" (the query in one of your screenshots with numFound=0) I get the following results:

    {
      "responseHeader": {
        "status": 0,
        "QTime": 1,
        "params": {
          "indent": "true",
          "q": "name_ar:شرطة ازكي",
          "_": "1487912340325",
          "wt": "json"
        }
      },
      "response": {
        "numFound": 2,
        "start": 0,
        "docs": [
          {
            "id": "6",
            "name_ar": ["شرطة عمان السلطانية - قيادة شرطة محافظة الداخلية - - مركز شرطة إزكي"],
            "_version_": 1560170434794619000
          },
          {
            "id": "3",
            "name_ar": ["شرطة عمان السلطانية - قيادة شرطة محافظة شمال الشرقية - - مركز شرطة إبراء"],
            "_version_": 1560170434793570300
          }
        ]
      }
    }

So I cannot reproduce the failures you're seeing. In fact, I tried all 9 of the queries you listed as not working, and all of them matched at least one of the above 9 documents, except for case 5 (which I give details for below). Are you absolutely sure that you reindexed your data with the ICUFF (ICUFoldingFilter) in place?

The one query that didn't return any matches for me is "name_ar:طوارى صحار". Here's why:

    Indexed original: طوارئ صحار
    Indexed analyzed: طواري صحار
    Query original:   طوارى صحار
    Query analyzed:   طوار صحار

In the analyzed indexed form, the "ئ" (yeh with hamza above) is left intact by ArabicNormalizationFilter and ArabicStemFilter, and then the ICUFoldingFilter converts it to "ي" (yeh without the hamza). In the analyzed query, ArabicNormalizationFilter converts "طوارى" to "طواري" (alef maksura -> yeh), which ArabicStemFilter then converts to "طوار" by removing the trailing yeh.

I don't know what the correct thing to do is to make alef maksura and yeh match each other, but one possibility is adding a char filter that converts all alefs maksura into yehs with hamza, like this:
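A sketch of one such char filter (whether this is exactly what the author had in mind is an assumption — solr.PatternReplaceCharFilterFactory, placed before the tokenizer in both the index and query analyzers, mapping alef maksura to yeh with hamza):

    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="ى" replacement="ئ"/>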