Re: Faceting Word Count

Emir Arnautović Mon, 06 Nov 2017 04:16:04 -0800

Hi Wael,
You are faceting on an analyzed field. This forces Solr to uninvert the field,
that is, to build the fieldValueCache, on the first facet request after every
commit. That is expensive in both time and memory (you can check in the admin
console stats how much memory it took).
What you need to do is create a multivalued string field (not text), run the
analysis steps on the client side, and index the resulting terms in that field.
This will allow you to enable docValues on that field and avoid building the
fieldValueCache.
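
A minimal sketch of the client-side analysis step, in Python. It only
approximates the posted analyzer chain (accent folding, whitespace
tokenizing, stopword removal, lowercasing) and does not reproduce the
WordDelimiterFilter behaviour. The stopword set, the punctuation stripping,
and the field name words_ss are assumptions for illustration; the target
field is assumed to be a multivalued string field with docValues="true".

```python
import unicodedata

# Hypothetical stopword set; in practice load the same stopwords.txt
# that the Solr analyzer uses.
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in"}

# Punctuation trimmed from token edges (an assumption, not part of the
# original analyzer chain).
EDGE_PUNCTUATION = ".,;:!?\"'()[]{}"


def fold_accents(text):
    """Strip combining marks; a rough stand-in for the accent-mapping char filter."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def analyze(text):
    """Return the distinct lowercase terms of text, in first-seen order."""
    terms = []
    seen = set()
    for raw in fold_accents(text).split():
        token = raw.strip(EDGE_PUNCTUATION).lower()
        if not token or token in STOPWORDS or token in seen:
            continue
        seen.add(token)
        terms.append(token)
    return terms


def to_solr_doc(doc_id, text):
    """Build the document to send to Solr; words_ss is a hypothetical field name."""
    return {"id": doc_id, "words_ss": analyze(text)}
```

Dropping repeated terms within a document should not change the facet
counts, since field faceting counts matching documents rather than term
occurrences, and it keeps the stored values smaller.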


HTH,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 6 Nov 2017, at 13:06, Wael Kader <[email protected]> wrote:
> 
> Hi,
> 
> I am using a custom field. Below is the field definition.
> I am using this because I don't want stemming.
> 
> 
>    <fieldType name="text_no_stem2" class="solr.TextField"
>               positionIncrementGap="100">
>      <analyzer type="index">
>        <charFilter class="solr.MappingCharFilterFactory"
>                    mapping="mapping-ISOLatin1Accent.txt"/>
>        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> 
>        <filter class="solr.StopFilterFactory"
>                ignoreCase="true"
>                words="stopwords.txt"
>                enablePositionIncrements="true"
>                />
>        <filter class="solr.WordDelimiterFilterFactory"
>                protected="protwords.txt"
>                generateWordParts="0"
>                generateNumberParts="1"
>                catenateWords="1"
>                catenateNumbers="1"
>                catenateAll="0"
>                splitOnCaseChange="1"
>                preserveOriginal="1"/>
>        <filter class="solr.LowerCaseFilterFactory"/>
> 
>        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>      </analyzer>
>      <analyzer type="query">
>        <charFilter class="solr.MappingCharFilterFactory"
>                    mapping="mapping-ISOLatin1Accent.txt"/>
>        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
>                ignoreCase="true" expand="true"/>
>        <filter class="solr.StopFilterFactory"
>                ignoreCase="true"
>                words="stopwords.txt"
>                enablePositionIncrements="true"
>                />
>        <!-- ORIGINAL generateNumberParts="1" -->
>        <filter class="solr.WordDelimiterFilterFactory"
>                protected="protwords.txt"
>                generateWordParts="0"
>                catenateWords="0"
>                catenateNumbers="0"
>                catenateAll="0"
>                splitOnCaseChange="1"
>                preserveOriginal="1"/>
>        <filter class="solr.LowerCaseFilterFactory"/>
>        <!-- ORIGINAL filter class="solr.SnowballPorterFilterFactory"
>             language="English" protected="protwords.txt"/-->
>        <!-- Webel: switch off Porter-stemmer algorithm to enforce whole
>             word match -->
>        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>      </analyzer>
>    </fieldType>
> 
> 
> Regards,
> Wael
> 
> On Mon, Nov 6, 2017 at 10:29 AM, Emir Arnautović <
> [email protected]> wrote:
> 
>> Hi Wael,
>> Can you provide your field definition and a sample query?
>> 
>> Thanks,
>> Emir
>> --
>> Monitoring - Log Management - Alerting - Anomaly Detection
>> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>> 
>> 
>> 
>>> On 6 Nov 2017, at 08:30, Wael Kader <[email protected]> wrote:
>>> 
>>> Hello,
>>> 
>>> I have an index with around 100 million documents.
>>> I have a multivalued field in which I store big chunks of text data. The
>>> server has around 20 GB of RAM and 4 CPUs.
>>> 
>>> I was faceting on that field to build a word cloud, and it took around 1
>>> second to return results when the index held 5-10 million documents.
>>> Now that I have more data, it takes minutes to get the results (that is,
>>> if it returns at all and Solr doesn't crash). What is the best way to make
>>> this work? Or is it simply not scalable with my current schema and design
>>> for news articles?
>>> 
>>> I am looking for the best solution. Maybe I should create another index
>>> to split the data while inserting it, or maybe it would perform better if
>>> I changed some settings in solrconfig.xml or added some RAM.
>>> 
>>> --
>>> Regards,
>>> Wael
>> 
>> 
> 
> 
> -- 
> Regards,
> Wael
