Hi Ahmet,

Please let me know if I am not clear.

Thanks
Rajesh

CEB India Private Limited. Registration No: U741040HR2004PTC035324. Registered office: 6th Floor, Tower B, DLF Building No.10 DLF Cyber City, Gurgaon, Haryana-122002, India.

This e-mail and/or its attachments are intended only for the use of the addressee(s) and may contain confidential and legally privileged information belonging to CEB and/or its subsidiaries, including SHL. If you have received this e-mail in error, please notify the sender and immediately, destroy all copies of this email and its attachments. The publication, copying, in whole or in part, or use or dissemination in any other way of this e-mail and attachments by anyone other than the intended person(s) is prohibited.

-----Original Message-----
From: G, Rajesh [mailto:r...@cebglobal.com]
Sent: Friday, May 6, 2016 1:08 PM
To: Ahmet Arslan <iori...@yahoo.com>; solr-user@lucene.apache.org
Subject: RE: Facet ignoring repeated word

Hi Ahmet,

Sorry, it is a word cloud:
https://www.google.co.uk/webhp?sourceid=chrome-instant&ion=1&espv=2&ie=UTF-8#newwindow=1&q=word+cloud

We have comments from a survey, and we want to build a word cloud from the comments field. For example, for question 1 the comments are:

Comment 1. Projects, technology, features, performance
Comment 2. Too many projects and technology, not enough people to run projects

I want to run a query for question 1 that will produce the result below:

projects: 3
technology: 2
features: 1
performance: 1
Too: 1
Many: 1
Enough: 1
People: 1
Run: 1
....

Facet produces the result but ignores repeated words in a document [the projects count will be 2 instead of 3].
projects: 2
technology: 2
features: 1
performance: 1
Too: 1
Many: 1
Enough: 1
People: 1
Run: 1

TermVectorComponent produces the counts as expected, but they are not grouped by word; instead they are grouped by document id.

<lst name="1">
  <str name="uniqueKey">1</str>
  <lst name="comments">
    <lst name="projects">
      <int name="tf">1</int>
    </lst>
  </lst>
</lst>
<lst name="2">
  <str name="uniqueKey">2</str>
  <lst name="comments">
    <lst name="projects">
      <int name="tf">2</int>
    </lst>
  </lst>
</lst>

I wanted to know if it is possible to produce a result that is grouped by word and also does not ignore repeated words in a document. If it is not possible, then I have to write a script that takes the above result from Solr, groups it by word, and sums the counts.

Thanks
Rajesh

-----Original Message-----
From: Ahmet Arslan [mailto:iori...@yahoo.com]
Sent: Friday, May 6, 2016 12:39 PM
To: G, Rajesh <r...@cebglobal.com>; solr-user@lucene.apache.org
Subject: Re: Facet ignoring repeated word

Hi Rajesh,

Can you please explain what you mean by "tag cloud"? How is it related to a query? Please explain your requirements.

Ahmet

On Friday, May 6, 2016 8:44 AM, "G," <r...@cebglobal.com> wrote:

Hi,

Can you please help?
If there is a solution then it will be easy; otherwise I have to create a Python script that processes the results from TermVectorComponent and groups them by word across documents to find the word count. The Python script will accept the exported Solr result as input.

Thanks
Rajesh

-----Original Message-----
From: G, Rajesh [mailto:r...@cebglobal.com]
Sent: Thursday, May 5, 2016 4:29 PM
To: Ahmet Arslan <iori...@yahoo.com>; solr-user@lucene.apache.org; erickerick...@gmail.com
Subject: RE: Facet ignoring repeated word

Hi,

TermVectorComponent works. I am able to find the repeating words within the same document, which facet was not able to do. The problem I see is that TermVectorComponent produces its result per document (see the example below), and I have to combine the counts myself, i.e. the count of the word "my" is 6 across the list of documents. Can you please suggest a solution to group the count by word across documents?
Basically we want to build a word cloud from the Solr result.

<lst name="1675">
  <str name="uniqueKey">1675</str>
  <lst name="comments">
    <lst name="my">
      <int name="tf">4</int>
    </lst>
  </lst>
</lst>
<lst name="1781">
  <str name="uniqueKey">1675</str>
  <lst name="comments">
    <lst name="my">
      <int name="tf">2</int>
    </lst>
  </lst>
</lst>

http://localhost:8182/solr/dev/tvrh?q=*:*&tv=true&tv.fl=comments&tv.tf=true&fl=comments&rows=1000

Hi Erick,

I need the count of repeated words to build the word cloud.

Thanks
Rajesh

-----Original Message-----
From: Ahmet Arslan [mailto:iori...@yahoo.com]
Sent: Tuesday, May 3, 2016 6:19 AM
To: solr-user@lucene.apache.org; G, Rajesh <r...@cebglobal.com>
Subject: Re: Facet ignoring repeated word

Hi,

StatsComponent does not respect the query parameter. However, you can feed a function query (e.g., termfreq) to it. Instead, consider using TermVectors or MLT's interesting terms.
https://cwiki.apache.org/confluence/display/solr/The+Term+Vector+Component
https://cwiki.apache.org/confluence/display/solr/MoreLikeThis

Ahmet

On Monday, May 2, 2016 9:31 AM, "G, Rajesh" <r...@cebglobal.com> wrote:

Hi Erick/Ahmet,

Thanks for your suggestion. Can we use a query with TermsComponent? I need the word count of comments for a single question id, not for all of them. When I include the query q=questionid=123, I still see the count across all documents:

http://localhost:8182/solr/dev/terms?terms.fl=comments&terms=true&terms.limit=1000&q=questionid=123

StatsComponent does not support text fields:

Field type textcloud_en{class=org.apache.solr.schema.TextField,analyzer=org.apache.solr.analysis.TokenizerChain,args={positionIncrementGap=100, class=solr.TextField}} is not currently supported

<fieldType name="textcloud_en" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" words="lang/stopwords_en.txt" ignoreCase="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

Thanks
Rajesh

-----Original Message-----
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: Friday, April 29, 2016 9:16 PM
To: solr-user <solr-user@lucene.apache.org>; Ahmet Arslan <iori...@yahoo.com>
Subject: Re: Facet ignoring repeated word

That's the way faceting is designed to work. It counts the _documents_ that a term appears in that satisfy your query; if a word appears multiple times in a doc, it'll only count it once.

For the general use-case it'd be unsettling for a user to see a facet count of 500, then click on it and discover that the number of docs in the corpus was really 345 or something.

Ahmet's hints might help, but I'd really ask if counting words multiple times really satisfies the use case.

Best,
Erick

On Fri, Apr 29, 2016 at 7:10 AM, Ahmet Arslan <iori...@yahoo.com.invalid> wrote:
> Hi,
>
> Depending on your requirements, StatsComponent, TermsComponent, or
> LukeRequestHandler can also be used.
>
> https://cwiki.apache.org/confluence/display/solr/The+Terms+Component
> https://wiki.apache.org/solr/LukeRequestHandler
> https://cwiki.apache.org/confluence/display/solr/The+Stats+Component
>
> Ahmet
>
> On Friday, April 29, 2016 11:56 AM, "G, Rajesh" <r...@cebglobal.com> wrote:
> Hi,
>
> I am trying to implement a word cloud
> <https://www.google.co.uk/imgres?imgurl=https%3A%2F%2Fwww.whitehouse.gov%2Fsites%2Fdefault%2Ffiles%2Fother%2Fsotu_wordle.png&imgrefurl=https%3A%2F%2Fwww.whitehouse.gov%2Fblog%2F2011%2F01%2F26%2Fstate-union-word-cloud-jobs-america-people-new&docid=eZ_HvQpd9FRBKM&tbnid=qyIc-elv6z-0iM%3A&w=895&h=406&bih=643&biw=1366&ved=0ahUKEwie_8XjurPMAhXLaRQKHWiFDFAQMwgyKAAwAA&iact=mrc&uact=8>
> using Solr. The problem I have is that a Solr facet query ignores repeated
> words in a document, e.g.
>
> I have indexed the text:
> It seems that the harder I work, the more work I get for the same
> compensation and reward.
> The more work I take on gets absorbed into my
> "normal" workload and I'm not recognized for working harder than my peers,
> which makes me not want to work to my potential. I am very underwhelmed by
> the evaluation process and bonus structure. I don't believe the current
> structure rewards strong performers. I am confident that the company could
> not hire someone with my talent to replace me if I left, but I don't think
> the company realizes that.
>
> The indexed content has the word "my" and its count is 3, but when I run the
> query
> http://localhost:8182/solr/dev/select?facet=true&facet.field=comments&rows=0&indent=on&q=questionid:3956&wt=json
> the count of the word "my" is 1 and not 3. Can you please help?
>
> Also, please suggest if there is a better way to implement a word cloud in
> Solr other than using facets.
>
> "facet_fields":{
>   "comments":[
>     "absorbed",1,
>     "am",1,
>     "believe",1,
>     "bonus",1,
>     "company",1,
>     "compensation",1,
>     "confident",1,
>     "could",1,
>     "current",1,
>     "don't",1,
>     "evaluation",1,
>     "get",1,
>     "gets",1,
>     "harder",1,
>     "hire",1,
>     "i",1,
>     "i'm",1,
>     "left",1,
>     "makes",1,
>     "me",1,
>     "more",1,
>     "my",1,
>     "normal",1,
>     "peers",1,
>     "performers",1,
>     "potential",1,
>     "process",1,
>     "realizes",1,
>     "recognized",1,
>     "replace",1,
>     "reward",1,
>     "rewards",1,
>     "same",1,
>     "seems",1,
>     "someone",1,
>     "strong",1,
>     "structure",1,
>     "take",1,
>     "talent",1,
>     "than",1,
>     "think",1,
>     "underwhelmed",1,
>     "very",1,
>     "want",1,
>     "which",1,
>     "work",1,
>     "working",1,
>     "workload",1]
> }
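The thread ends with the fallback of a Python script that regroups the TermVectorComponent output by word. A minimal sketch of that idea follows, assuming the response was requested as XML with tv.tf=true and that each document entry nests <lst name="comments"> / <lst name="TERM"> / <int name="tf"> as in the snippets quoted above. The function name aggregate_term_frequencies, the SAMPLE string, and the surrounding <response>/<lst name="termVectors"> envelope are illustrative assumptions, not something posted in the thread.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def aggregate_term_frequencies(xml_text, field="comments"):
    """Sum the per-document tf values into one total per term.

    Hypothetical helper: it only relies on the nesting shown in the
    thread (<lst name=FIELD> -> <lst name=TERM> -> <int name="tf">).
    """
    root = ET.fromstring(xml_text)
    totals = Counter()
    # Visit every <lst> and keep only the ones named after the field.
    for field_node in root.iter("lst"):
        if field_node.get("name") != field:
            continue
        # Direct <lst> children of the field node are the terms.
        for term_node in field_node.findall("lst"):
            tf_node = term_node.find("int[@name='tf']")
            if tf_node is not None and tf_node.text:
                totals[term_node.get("name")] += int(tf_node.text)
    return totals


# Sample built from the two survey comments in the thread; the outer
# envelope is an assumption about the shape of the exported response.
SAMPLE = """<response>
<lst name="termVectors">
  <lst name="1">
    <str name="uniqueKey">1</str>
    <lst name="comments">
      <lst name="projects"><int name="tf">1</int></lst>
      <lst name="technology"><int name="tf">1</int></lst>
    </lst>
  </lst>
  <lst name="2">
    <str name="uniqueKey">2</str>
    <lst name="comments">
      <lst name="projects"><int name="tf">2</int></lst>
      <lst name="technology"><int name="tf">1</int></lst>
    </lst>
  </lst>
</lst>
</response>"""

if __name__ == "__main__":
    for term, count in aggregate_term_frequencies(SAMPLE).most_common():
        print(f"{term}: {count}")
```

Run on the sample, this prints projects: 3 and technology: 2, which is the word-level total that faceting does not give because it counts documents rather than occurrences. For a real export the rows parameter would need to cover every matching document, since the script can only sum what the response contains.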