Hi,

I want to know the best option for getting a word cloud in SOLR. Is it saving the data as multivalued, using vectors, or JSON faceting (which didn't work for me)? Terms doesn't work because I can't provide any criteria.
I don't mind changing the design, but I need to know the best feasible way that won't cause problems in the long run. I want to be able to get the word frequency based on criteria. Facets are taking around 1 minute to return data now.

Regards,
Wael

On Wed, Nov 8, 2017 at 11:06 AM, Emir Arnautović <emir.arnauto...@sematext.com> wrote:

> Hi Wael,
> You can try out JSON faceting - it’s not just about the request/response
> format, it uses a different implementation as well. In any case you will
> have to index documents differently in order to be able to use docValues.
>
> HTH
> Emir
> --
> Monitoring - Log Management - Alerting - Anomaly Detection
> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>
> > On 7 Nov 2017, at 09:26, Wael Kader <w...@softech-lb.com> wrote:
> >
> > Hi,
> >
> > The whole index has 100M documents, but when I add the criteria, it
> > filters the data down to maybe 10k rows at most.
> > The facet isn't working when the total number of records in the index is
> > 100M, but it was working at 5M.
> >
> > I have social media & RSS data in the index and I am trying to get the
> > word count for a specific user on specific date intervals.
> >
> > Regards,
> > Wael
> >
> > On Mon, Nov 6, 2017 at 3:42 PM, Erick Erickson <erickerick...@gmail.com>
> > wrote:
> >
> >> _Why_ do you want to get the word counts? Faceting on all of the
> >> tokens for 100M docs isn't something Solr is ordinarily used for. As
> >> Emir says it'll take a huge amount of memory. You can use one of the
> >> function queries (termfreq IIRC) that will give you the count of any
> >> individual term you have and will be very fast.
> >>
> >> But getting all of the word counts in the index is probably not
> >> something I'd use Solr for.
> >>
> >> This may be an XY problem, you're asking how to do something specific
> >> (X) without explaining what the problem you're trying to solve is (Y).
> >> Perhaps there's another way to accomplish (Y) if we knew more about
> >> what it is.
> >>
> >> Best,
> >> Erick
> >>
> >> On Mon, Nov 6, 2017 at 4:15 AM, Emir Arnautović
> >> <emir.arnauto...@sematext.com> wrote:
> >>> Hi Wael,
> >>> You are faceting on an analyzed field. This results in the field being
> >>> uninverted - fieldValueCache being built - on the first call after
> >>> every commit. This is both time and memory consuming (you can check in
> >>> the admin console stats how much memory it took).
> >>> What you need to do is create a multivalued string field (not text),
> >>> parse the values (do the analysis steps) on the client side, and store
> >>> them like that. This will allow you to enable docValues on that field
> >>> and avoid building fieldValueCache.
> >>>
> >>> HTH,
> >>> Emir
> >>>
> >>>> On 6 Nov 2017, at 13:06, Wael Kader <w...@softech-lb.com> wrote:
> >>>>
> >>>> Hi,
> >>>>
> >>>> I am using a custom field. Below is the field definition.
> >>>> I am using this because I don't want stemming.
> >>>>
> >>>> <fieldType name="text_no_stem2" class="solr.TextField"
> >>>>            positionIncrementGap="100">
> >>>>   <analyzer type="index">
> >>>>     <charFilter class="solr.MappingCharFilterFactory"
> >>>>                 mapping="mapping-ISOLatin1Accent.txt"/>
> >>>>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> >>>>
> >>>>     <filter class="solr.StopFilterFactory"
> >>>>             ignoreCase="true"
> >>>>             words="stopwords.txt"
> >>>>             enablePositionIncrements="true"
> >>>>             />
> >>>>     <filter class="solr.WordDelimiterFilterFactory"
> >>>>             protected="protwords.txt"
> >>>>             generateWordParts="0"
> >>>>             generateNumberParts="1"
> >>>>             catenateWords="1"
> >>>>             catenateNumbers="1"
> >>>>             catenateAll="0"
> >>>>             splitOnCaseChange="1"
> >>>>             preserveOriginal="1"/>
> >>>>     <filter class="solr.LowerCaseFilterFactory"/>
> >>>>
> >>>>     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
> >>>>   </analyzer>
> >>>>   <analyzer type="query">
> >>>>     <charFilter class="solr.MappingCharFilterFactory"
> >>>>                 mapping="mapping-ISOLatin1Accent.txt"/>
> >>>>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> >>>>     <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
> >>>>             ignoreCase="true" expand="true"/>
> >>>>     <filter class="solr.StopFilterFactory"
> >>>>             ignoreCase="true"
> >>>>             words="stopwords.txt"
> >>>>             enablePositionIncrements="true"
> >>>>             />
> >>>>     <!--ORIGINAL generateNumberParts="1"-->
> >>>>     <filter class="solr.WordDelimiterFilterFactory"
> >>>>             protected="protwords.txt"
> >>>>             generateWordParts="0"
> >>>>             catenateWords="0"
> >>>>             catenateNumbers="0"
> >>>>             catenateAll="0"
> >>>>             splitOnCaseChange="1"
> >>>>             preserveOriginal="1"/>
> >>>>     <filter class="solr.LowerCaseFilterFactory"/>
> >>>>     <!-- ORIGINAL filter class="solr.SnowballPorterFilterFactory"
> >>>>          language="English" protected="protwords.txt"/-->
> >>>>     <!-- Webel: switch off Porter-stemmer algorithm to enforce whole
> >>>>          word match -->
> >>>>     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
> >>>>   </analyzer>
> >>>> </fieldType>
> >>>>
> >>>>
> >>>> Regards,
> >>>> Wael
> >>>>
> >>>> On Mon, Nov 6, 2017 at 10:29 AM, Emir Arnautović <
> >>>> emir.arnauto...@sematext.com> wrote:
> >>>>
> >>>>> Hi Wael,
> >>>>> Can you provide your field definition and a sample query?
> >>>>>
> >>>>> Thanks,
> >>>>> Emir
> >>>>>
> >>>>>> On 6 Nov 2017, at 08:30, Wael Kader <w...@softech-lb.com> wrote:
> >>>>>>
> >>>>>> Hello,
> >>>>>>
> >>>>>> I have an index with around 100 million documents.
> >>>>>> I have a multivalued column that I am saving big chunks of text
> >>>>>> data in. It has around 20 GB of RAM and 4 CPUs.
> >>>>>>
> >>>>>> I was doing faceting on it to get a word cloud, and it was taking
> >>>>>> around 1 second to retrieve when the data was 5-10 million
> >>>>>> documents. Now I have more data and it's taking minutes to get the
> >>>>>> results (that is, if it gets them and SOLR doesn't crash). What's
> >>>>>> the best way to make it run, or maybe it's not scalable to make it
> >>>>>> run on my current schema and design with news articles?
> >>>>>>
> >>>>>> I am looking to find the best solution for this. Maybe create
> >>>>>> another index to split the data while inserting it, or maybe if I
> >>>>>> change some settings in SolrConfig or add some RAM, it would
> >>>>>> perform better.
> >>>>>>
> >>>>>> --
> >>>>>> Regards,
> >>>>>> Wael
> >>>>>
> >>>>
> >>>> --
> >>>> Regards,
> >>>> Wael
> >>>
> >>
> >
> > --
> > Regards,
> > Wael
> >
--
Regards,
Wael
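
For reference, the approach suggested in the thread (do the analysis on the client, store the tokens in a multivalued string field with docValues, then run a JSON terms facet with the user and date criteria as filters) could be sketched roughly as below. This is only a sketch under assumptions, not anyone's actual setup: the field names `user`, `created_at` and `words` are hypothetical, the stop word list is a placeholder for the stopwords.txt that is not shown in the thread, and the analysis only approximates the posted chain (whitespace split, lowercase, stop words), leaving out the accent mapping and word delimiter steps. The schema would need `words` defined as a multivalued string field with docValues="true".

```python
import json

# Placeholder stop words; the real stopwords.txt is not shown in the thread.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in"}


def analyze(text, stopwords=STOPWORDS):
    """Split on whitespace, lowercase, drop stop words, and keep only the
    first occurrence of each token (order preserved)."""
    seen = set()
    tokens = []
    for raw in text.split():
        token = raw.lower()
        if token in stopwords or token in seen:
            continue
        seen.add(token)
        tokens.append(token)
    return tokens


def build_doc(doc_id, user, created_at, text):
    """Build a document whose hypothetical `words` field holds the
    client-side tokens, ready to be sent to Solr as JSON."""
    return {
        "id": doc_id,
        "user": user,
        "created_at": created_at,
        "words": analyze(text),
    }


def build_facet_request(user, start, end, top_n=100):
    """Build a JSON Request API body: filter by user and date range,
    return no rows, and ask for the top_n terms of the `words` field."""
    return {
        "query": "*:*",
        "filter": [f'user:"{user}"', f"created_at:[{start} TO {end}]"],
        "limit": 0,
        "facet": {
            "words": {"type": "terms", "field": "words", "limit": top_n}
        },
    }


if __name__ == "__main__":
    doc = build_doc("1", "wael", "2017-11-06T08:30:00Z", "The Quick brown fox")
    print(json.dumps(doc, indent=2))
    request = build_facet_request(
        "wael", "2017-11-01T00:00:00Z", "2017-11-08T00:00:00Z"
    )
    print(json.dumps(request, indent=2))
```

The request body would be POSTed to the collection's /query handler. Note that with this layout the facet counts are, as far as I can tell, the number of matching documents that contain each word rather than the total number of occurrences, which may or may not be what a word cloud needs.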