Re: Faceting Word Count

Wael Kader Tue, 07 Nov 2017 00:27:07 -0800

Hi,

The whole index has 100M documents, but once I add the filter criteria the
result set is narrowed to maybe 10k rows at most.
Faceting fails now that the index holds 100M records in total, although it
worked when the index held 5M.


I have social media & RSS data in the index and I am trying to get the word
count for a specific user on specific date intervals.
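
A minimal sketch of this kind of request, assuming hypothetical field names
(`user_id`, `created_at`, `content`) and placeholder values; none of these
names come from the actual schema. It only builds the query string: a filter
on user and date range, with a facet on the text field.

```python
from urllib.parse import urlencode


def build_word_count_params(user, start, end, limit=100):
    """Build Solr select parameters for a per-user, per-interval word facet.

    The field names user_id, created_at and content are hypothetical.
    """
    return [
        ("q", "*:*"),
        ("fq", f'user_id:"{user}"'),               # restrict to one user
        ("fq", f"created_at:[{start} TO {end}]"),  # restrict to a date range
        ("rows", "0"),                             # only facet counts are needed
        ("facet", "true"),
        ("facet.field", "content"),                # analyzed text field
        ("facet.limit", str(limit)),
        ("facet.mincount", "1"),
    ]


params = build_word_count_params(
    "some_user", "2017-10-01T00:00:00Z", "2017-11-01T00:00:00Z"
)
query_string = urlencode(params)
```

The resulting `query_string` would be appended to the collection's select
URL. The filters shrink the result set, but the facet still runs against the
same analyzed field.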

Regards,
Wael

On Mon, Nov 6, 2017 at 3:42 PM, Erick Erickson <erickerick...@gmail.com>
wrote:

> _Why_ do you want to get the word counts? Faceting on all of the
> tokens for 100M docs isn't something Solr is ordinarily used for. As
> Emir says it'll take a huge amount of memory. You can use one of the
> function queries (termfreq IIRC) that will give you the count of any
> individual term you have and will be very fast.
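
A rough sketch of how the termfreq function query mentioned above might be
requested, assuming a hypothetical field named `content` and a short list of
known terms. It builds an `fl` value that returns `termfreq(field,term)` as an
aliased pseudo-field per term. The terms are assumed to be already in their
indexed form (for example lowercased).

```python
def termfreq_field_list(field, terms):
    """Return an fl value with one termfreq() pseudo-field per term.

    The field name is an assumption; terms must match the indexed tokens.
    """
    parts = ["id"]
    for i, term in enumerate(terms):
        if "'" in term or "\\" in term:
            # Keep the sketch simple: refuse terms that would need escaping.
            raise ValueError(f"unsupported character in term: {term!r}")
        parts.append(f"tf_{i}:termfreq({field},'{term}')")
    return ",".join(parts)


fl = termfreq_field_list("content", ["solr", "lucene"])
```

termfreq is evaluated per document, so the counts for a user and date range
would still have to be summed over the matching documents on the client side.
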
>
> But getting all of the word counts in the index is probably not
> something I'd use Solr for.
>
> This may be an XY problem: you're asking how to do something specific
> (X) without explaining what the problem you're trying to solve is (Y).
> Perhaps there's another way to accomplish (Y) if we knew more about
> what it is.
>
> Best,
> Erick
>
>
>
> On Mon, Nov 6, 2017 at 4:15 AM, Emir Arnautović
> <emir.arnauto...@sematext.com> wrote:
> > Hi Wael,
> > You are faceting on an analyzed field. This results in the field being
> > uninverted (the fieldValueCache being built) on the first call after
> > every commit. This is both time and memory consuming (you can check in
> > the admin console stats how much memory it took).
> > What you need to do is create a multivalued string field (not text),
> > parse the values (do the analysis steps) on the client side, and store
> > them like that. This will allow you to enable docValues on that field
> > and avoid building the fieldValueCache.
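
A minimal sketch of the client-side step described above, under stated
assumptions: the stopword set is a placeholder, the punctuation stripping is a
crude stand-in for the analyzer chain, and `content_words` is a hypothetical
multivalued string field with docValues enabled. It is only an approximation
of the text_no_stem2 analysis, not a port of it.

```python
STOPWORDS = frozenset({"a", "an", "and", "the", "of", "to"})  # placeholder list


def analyze(text, stopwords=STOPWORDS):
    """Split on whitespace, lowercase, drop stopwords and repeated tokens."""
    seen = set()
    tokens = []
    for raw in text.split():
        token = raw.lower().strip(".,;:!?\"'()[]")
        if not token or token in stopwords or token in seen:
            continue
        seen.add(token)
        tokens.append(token)
    return tokens


doc = {"id": "example-1", "content": "The quick fox and the QUICK dog."}
# Hypothetical multivalued string field to facet on instead of the text field.
doc["content_words"] = analyze(doc["content"])
```

Facet counts are per document, so dropping repeated tokens within one
document does not change them.
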
> >
> > HTH,
> > Emir
> > --
> > Monitoring - Log Management - Alerting - Anomaly Detection
> > Solr & Elasticsearch Consulting Support Training - http://sematext.com/
> >
> >
> >
> >> On 6 Nov 2017, at 13:06, Wael Kader <w...@softech-lb.com> wrote:
> >>
> >> Hi,
> >>
> >> I am using a custom field. Below is the field definition.
> >> I am using this because I don't want stemming.
> >>
> >>
> >>    <fieldType name="text_no_stem2" class="solr.TextField"
> >> positionIncrementGap="100">
> >>      <analyzer type="index">
> >>        <charFilter class="solr.MappingCharFilterFactory"
> >> mapping="mapping-ISOLatin1Accent.txt"/>
> >>        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> >>
> >>        <filter class="solr.StopFilterFactory"
> >>                ignoreCase="true"
> >>                words="stopwords.txt"
> >>                enablePositionIncrements="true"
> >>                />
> >>        <filter class="solr.WordDelimiterFilterFactory"
> >>                protected="protwords.txt"
> >>                generateWordParts="0"
> >>                generateNumberParts="1"
> >>                catenateWords="1"
> >>                catenateNumbers="1"
> >>                catenateAll="0"
> >>                splitOnCaseChange="1"
> >>                preserveOriginal="1"/>
> >>        <filter class="solr.LowerCaseFilterFactory"/>
> >>
> >>        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
> >>      </analyzer>
> >>      <analyzer type="query">
> >>        <charFilter class="solr.MappingCharFilterFactory"
> >> mapping="mapping-ISOLatin1Accent.txt"/>
> >>        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> >>        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
> >>                ignoreCase="true" expand="true"/>
> >>        <filter class="solr.StopFilterFactory"
> >>                ignoreCase="true"
> >>                words="stopwords.txt"
> >>                enablePositionIncrements="true"
> >>                />
> >> <!--ORIGINAL                generateNumberParts="1"-->
> >>        <filter class="solr.WordDelimiterFilterFactory"
> >>                protected="protwords.txt"
> >>                generateWordParts="0"
> >>                catenateWords="0"
> >>                catenateNumbers="0"
> >>                catenateAll="0"
> >>                splitOnCaseChange="1"
> >>                preserveOriginal="1"/>
> >>        <filter class="solr.LowerCaseFilterFactory"/>
> >>        <!-- ORIGINAL filter class="solr.SnowballPorterFilterFactory"
> >> language="English" protected="protwords.txt"/-->
> >>        <!-- Webel: switch off Porter-stemmer algorithm to enforce whole
> >> word match -->
> >>        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
> >>      </analyzer>
> >>    </fieldType>
> >>
> >>
> >> Regards,
> >> Wael
> >>
> >> On Mon, Nov 6, 2017 at 10:29 AM, Emir Arnautović <
> >> emir.arnauto...@sematext.com> wrote:
> >>
> >>> Hi Wael,
> >>> Can you provide your field definition and a sample query?
> >>>
> >>> Thanks,
> >>> Emir
> >>> --
> >>> Monitoring - Log Management - Alerting - Anomaly Detection
> >>> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
> >>>
> >>>
> >>>
> >>>> On 6 Nov 2017, at 08:30, Wael Kader <w...@softech-lb.com> wrote:
> >>>>
> >>>> Hello,
> >>>>
> >>>> I have an index with around 100 million documents.
> >>>> I have a multivalued field that I am saving big chunks of text data
> >>>> in. The machine has around 20 GB of RAM and 4 CPUs.
> >>>>
> >>>> I was faceting on it to get a word cloud, and it took around 1 second
> >>>> to retrieve when the data was 5-10 million documents.
> >>>> Now I have more data and it's taking minutes to get the results (that
> >>>> is, if it gets them and Solr doesn't crash). What's the best way to
> >>>> make it run, or is it perhaps not scalable on my current schema and
> >>>> design with news articles?
> >>>>
> >>>> I am looking for the best solution for this. Maybe create another
> >>>> index to split the data while inserting it, or maybe it would perform
> >>>> better if I changed some settings in SolrConfig or added some RAM.
> >>>>
> >>>> --
> >>>> Regards,
> >>>> Wael
> >>>
> >>>
> >>
> >>
> >> --
> >> Regards,
> >> Wael
> >
>



-- 
Regards,
Wael
