Re: Faceting Word Count

Emir Arnautović Wed, 08 Nov 2017 01:07:46 -0800

Hi Wael,
You can try JSON faceting. It is not just a different request/response
format; it uses a different implementation as well. In any case, you will
have to index your documents differently in order to use docValues.
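
A rough sketch of how the two pieces could fit together, under several
assumptions: the field names `words`, `user_id` and `created_at` are
hypothetical, `words` is assumed to be a multivalued string field with
docValues enabled, and the tokenizer below is only a simplified stand-in for
the analysis chain (whitespace split, lowercasing, stopword removal; no accent
mapping or word-delimiter handling). The request body uses the JSON Request
API keys (`query`, `filter`, `limit`, `facet`) and would typically be POSTed
to the collection's `/query` handler.

```python
import json

# Hypothetical stopword list; the real stopwords.txt is not shown in this thread.
STOPWORDS = {"the", "a", "an", "and", "of"}


def tokenize_for_facet(text, stopwords=STOPWORDS):
    """Simplified client-side analysis: split on whitespace, lowercase,
    drop stopwords and keep each token once, in first-seen order."""
    tokens = (t.lower() for t in text.split())
    kept = [t for t in tokens if t not in stopwords]
    return list(dict.fromkeys(kept))


def build_facet_request(user, start, end, limit=100):
    """Build a JSON facet request body for the top words of one user
    within a date range. All field names here are assumptions."""
    return {
        "query": "*:*",
        "filter": [
            f"user_id:{user}",
            f"created_at:[{start} TO {end}]",
        ],
        "limit": 0,  # only the facet buckets are needed, not the documents
        "facet": {
            "words": {"type": "terms", "field": "words", "limit": limit},
        },
    }


if __name__ == "__main__":
    doc = {"id": "1", "words": tokenize_for_facet("The quick brown fox and the quick dog")}
    print(json.dumps(doc))
    body = build_facet_request("wael", "2017-11-01T00:00:00Z", "2017-11-07T00:00:00Z")
    print(json.dumps(body, indent=2))
```

Because each token is stored once per document, the facet counts are the
number of matching documents containing the word, not total occurrences.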


HTH
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 7 Nov 2017, at 09:26, Wael Kader <[email protected]> wrote:
> 
> Hi,
> 
> The whole index has 100M documents, but when I add the criteria, it filters
> the data down to maybe 10k rows at most.
> The facet isn't working now that the total number of records in the index
> is 100M, but it was working at 5M.
> 
> I have social media & RSS data in the index and I am trying to get the word
> count for a specific user on specific date intervals.
> 
> Regards,
> Wael
> 
> On Mon, Nov 6, 2017 at 3:42 PM, Erick Erickson <[email protected]>
> wrote:
> 
>> _Why_ do you want to get the word counts? Faceting on all of the
>> tokens for 100M docs isn't something Solr is ordinarily used for. As
>> Emir says it'll take a huge amount of memory. You can use one of the
>> function queries (termfreq IIRC) that will give you the count of any
>> individual term you have and will be very fast.
>> 
>> But getting all of the word counts in the index is probably not
>> something I'd use Solr for.
>> 
>> This may be an XY problem: you're asking how to do something specific
>> (X) without explaining what the problem you're trying to solve is (Y).
>> Perhaps there's another way to accomplish (Y) if we knew more about
>> what it is.
>> 
>> Best,
>> Erick
>> 
>> 
>> 
>> On Mon, Nov 6, 2017 at 4:15 AM, Emir Arnautović
>> <[email protected]> wrote:
>>> Hi Wael,
>>> You are faceting on an analyzed field. This results in the field being
>>> uninverted (the fieldValueCache being built) on the first call after every
>>> commit. This is both time- and memory-consuming (you can check in the admin
>>> console, under stats, how much memory it took).
>>> What you need to do is create a multivalued string field (not text),
>>> parse the values (do the analysis steps) on the client side, and store them
>>> like that. This will allow you to enable docValues on that field and avoid
>>> building the fieldValueCache.
>>> 
>>> HTH,
>>> Emir
>>> 
>>> 
>>> 
>>>> On 6 Nov 2017, at 13:06, Wael Kader <[email protected]> wrote:
>>>> 
>>>> Hi,
>>>> 
>>>> I am using a custom field. Below is the field definition.
>>>> I am using this because I don't want stemming.
>>>> 
>>>> 
>>>>   <fieldType name="text_no_stem2" class="solr.TextField"
>>>>              positionIncrementGap="100">
>>>>     <analyzer type="index">
>>>>       <charFilter class="solr.MappingCharFilterFactory"
>>>>                   mapping="mapping-ISOLatin1Accent.txt"/>
>>>>       <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>>> 
>>>>       <filter class="solr.StopFilterFactory"
>>>>               ignoreCase="true"
>>>>               words="stopwords.txt"
>>>>               enablePositionIncrements="true"
>>>>               />
>>>>       <filter class="solr.WordDelimiterFilterFactory"
>>>>               protected="protwords.txt"
>>>>               generateWordParts="0"
>>>>               generateNumberParts="1"
>>>>               catenateWords="1"
>>>>               catenateNumbers="1"
>>>>               catenateAll="0"
>>>>               splitOnCaseChange="1"
>>>>               preserveOriginal="1"/>
>>>>       <filter class="solr.LowerCaseFilterFactory"/>
>>>> 
>>>>       <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>>>>     </analyzer>
>>>>     <analyzer type="query">
>>>>       <charFilter class="solr.MappingCharFilterFactory"
>>>>                   mapping="mapping-ISOLatin1Accent.txt"/>
>>>>       <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>>>       <filter class="solr.SynonymFilterFactory"
>>>>               synonyms="synonyms.txt"
>>>>               ignoreCase="true"
>>>>               expand="true"/>
>>>>       <filter class="solr.StopFilterFactory"
>>>>               ignoreCase="true"
>>>>               words="stopwords.txt"
>>>>               enablePositionIncrements="true"
>>>>               />
>>>> <!--ORIGINAL                generateNumberParts="1"-->
>>>>       <filter class="solr.WordDelimiterFilterFactory"
>>>>               protected="protwords.txt"
>>>>               generateWordParts="0"
>>>>               catenateWords="0"
>>>>               catenateNumbers="0"
>>>>               catenateAll="0"
>>>>               splitOnCaseChange="1"
>>>>               preserveOriginal="1"/>
>>>>       <filter class="solr.LowerCaseFilterFactory"/>
>>>>       <!-- ORIGINAL filter class="solr.SnowballPorterFilterFactory"
>>>> language="English" protected="protwords.txt"/-->
>>>>       <!-- Webel: switch off Porter-stemmer algorithm to enforce whole
>>>> word match -->
>>>>       <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>>>>     </analyzer>
>>>>   </fieldType>
>>>> 
>>>> 
>>>> Regards,
>>>> Wael
>>>> 
>>>> On Mon, Nov 6, 2017 at 10:29 AM, Emir Arnautović <
>>>> [email protected]> wrote:
>>>> 
>>>>> Hi Wael,
>>>>> Can you provide your field definition and a sample query?
>>>>> 
>>>>> Thanks,
>>>>> Emir
>>>>> 
>>>>> 
>>>>> 
>>>>>> On 6 Nov 2017, at 08:30, Wael Kader <[email protected]> wrote:
>>>>>> 
>>>>>> Hello,
>>>>>> 
>>>>>> I have an index with around 100 million documents.
>>>>>> I have a multivalued field in which I am saving big chunks of text
>>>>>> data. The server has around 20 GB of RAM and 4 CPUs.
>>>>>> 
>>>>>> I was faceting on it to get a word cloud, and it took around 1 second
>>>>>> to retrieve when the data was 5-10 million documents.
>>>>>> Now I have more data and it's taking minutes to get the results (that
>>>>>> is, if it gets them and Solr doesn't crash). What's the best way to
>>>>>> make it run? Or maybe it's not scalable on my current schema and
>>>>>> design with news articles.
>>>>>> 
>>>>>> I am looking for the best solution for this. Maybe I should create
>>>>>> another index to split the data while inserting it, or maybe it would
>>>>>> perform better if I change some settings in solrconfig or add some RAM.
>>>>>> 
>>>>>> --
>>>>>> Regards,
>>>>>> Wael
>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> --
>>>> Regards,
>>>> Wael
>>> 
>> 
> 
> 
> 
> -- 
> Regards,
> Wael
