RE: Facet ignoring repeated word

G, Rajesh Mon, 09 May 2016 01:56:32 -0700

Hi Ahmet,

Please let me know if anything in my message below is unclear.

Thanks
Rajesh

CEB India Private Limited. Registration No: U741040HR2004PTC035324. Registered 
office: 6th Floor, Tower B, DLF Building No.10 DLF Cyber City, Gurgaon, 
Haryana-122002, India.

This e-mail and/or its attachments are intended only for the use of the 
addressee(s) and may contain confidential and legally privileged information 
belonging to CEB and/or its subsidiaries, including SHL. If you have received 
this e-mail in error, please notify the sender and immediately, destroy all 
copies of this email and its attachments. The publication, copying, in whole or 
in part, or use or dissemination in any other way of this e-mail and 
attachments by anyone other than the intended person(s) is prohibited.

-----Original Message-----
From: G, Rajesh [mailto:r...@cebglobal.com]
Sent: Friday, May 6, 2016 1:08 PM
To: Ahmet Arslan <iori...@yahoo.com>; solr-user@lucene.apache.org
Subject: RE: Facet ignoring repeated word

Hi Ahmet,

Sorry, I meant "word cloud":
https://urldefense.proofpoint.com/v2/url?u=https-3A__www.google.co.uk_webhp-3Fsourceid-3Dchrome-2Dinstant-26ion-3D1-26espv-3D2-26ie-3DUTF-2D8-23newwindow-3D1-26q-3Dword-2Bcloud&d=CwIGaQ&c=zzHkMf6HMoOvCB4yTPe0Gg&r=05YCVYE-IrDXcnbr1V8J9Q&m=k-w03YA11ltRmGgXa55Yx2gs1Jk1QowoFIE32lm9QMU&s=X_BPC_BR1vgdcijmmd50zYBOnIP97BfPfS2H7MxC9V4&e=

We have comments from a survey. We want to build a word cloud from the 
"comments" field.

E.g., for question 1 the comments are:

    Comment 1. Projects, technology, features, performance

    Comment 2. Too many projects and technology, not enough people to run 
    projects

I want to run a query for question 1 that will produce the result below:

projects: 3

technology:2

features:1

performance:1

Too:1

Many:1

Enough:1

People:1

Run:1

....

Faceting produces this result but counts a word only once per document, so the 
count for "projects" is 2 instead of 3:

projects: 2

technology:2

features:1

performance:1

Too:1

Many:1

Enough:1

People:1

Run:1

TermVectorComponent produces the expected counts, but they are grouped by 
document id rather than by word:

<lst name="1">

<str name="uniqueKey">1</str>

        <lst name="comments">

                <lst name="projects">

                        <int name="tf">1</int>

                </lst>

        </lst>

</lst>

<lst name="2">

<str name="uniqueKey">2</str>

        <lst name="comments">

                <lst name="projects">

                        <int name="tf">2</int>

                </lst>

        </lst>

</lst>

I wanted to know if it is possible to produce a result that is grouped by word 
and also does not ignore repeated words in a document. If it is not possible, 
then I will have to write a script that takes the above result from Solr, 
groups it by word, and sums the counts.
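
A minimal sketch of such a post-processing script, assuming the TermVectorComponent response was saved as XML with the nesting shown above (document, then field, then term, then tf). The function name sum_term_frequencies and the outer <lst name="termVectors"> wrapper in the sample are illustrative assumptions, not taken from the output quoted here.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Sample shaped like the TermVectorComponent output above. The outer
# <lst name="termVectors"> wrapper is an assumption so the snippet parses.
SAMPLE = """
<lst name="termVectors">
  <lst name="1">
    <str name="uniqueKey">1</str>
    <lst name="comments">
      <lst name="projects"><int name="tf">1</int></lst>
    </lst>
  </lst>
  <lst name="2">
    <str name="uniqueKey">2</str>
    <lst name="comments">
      <lst name="projects"><int name="tf">2</int></lst>
    </lst>
  </lst>
</lst>
"""


def sum_term_frequencies(xml_text, field="comments"):
    """Sum the per-document tf values for every term of one field."""
    totals = Counter()
    root = ET.fromstring(xml_text)
    for doc in root.findall("lst"):          # one <lst> per document
        for fld in doc.findall("lst"):       # one <lst> per field
            if fld.get("name") != field:
                continue
            for term in fld.findall("lst"):  # one <lst> per term
                tf = term.find("int[@name='tf']")
                if tf is not None:
                    totals[term.get("name")] += int(tf.text)
    return totals


if __name__ == "__main__":
    for word, count in sum_term_frequencies(SAMPLE).most_common():
        print(f"{word}: {count}")
```

The resulting Counter maps each word to its total count across documents, which is the shape a word-cloud renderer typically needs.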

Thanks

Rajesh


-----Original Message-----

From: Ahmet Arslan [mailto:iori...@yahoo.com]

Sent: Friday, May 6, 2016 12:39 PM

To: G, Rajesh <r...@cebglobal.com>; solr-user@lucene.apache.org

Subject: Re: Facet ignoring repeated word

Hi Rajesh,

Can you please explain what you mean by "tag cloud"?

How is it related to a query?

Please explain your requirements.

Ahmet

On Friday, May 6, 2016 8:44 AM, "G," <r...@cebglobal.com> wrote:

Hi,

Can you please help? If there is a solution in Solr it will be easy; otherwise I 
have to create a Python script that processes the results from 
TermVectorComponent and groups them by word across documents to find the word 
counts. The script would take the exported Solr result as input.

Thanks

Rajesh


-----Original Message-----

From: G, Rajesh [mailto:r...@cebglobal.com]

Sent: Thursday, May 5, 2016 4:29 PM

To: Ahmet Arslan <iori...@yahoo.com>; solr-user@lucene.apache.org; 
erickerick...@gmail.com

Subject: RE: Facet ignoring repeated word

Hi,

TermVectorComponent works. I am able to find the repeated words within the 
same document, which faceting was not able to do. The problem I see is that 
TermVectorComponent reports results per document, as in the example below, so 
I have to combine the counts myself; e.g., the count of the word "my" across 
these two documents is 6. Can you please suggest a way to group the counts by 
word across documents? Basically, we want to build a word cloud from the Solr 
result.

<lst name="1675">

<str name="uniqueKey">1675</str>

        <lst name="comments">

                <lst name="my">

                        <int name="tf">4</int>

                </lst>

        </lst>

</lst>

<lst name="1781">

<str name="uniqueKey">1781</str>

        <lst name="comments">

                <lst name="my">

                        <int name="tf">2</int>

                </lst>

        </lst>

</lst>

https://urldefense.proofpoint.com/v2/url?u=http-3A__localhost-3A8182_solr_dev_tvrh-3Fq-3D-2A-3A-2A-26tv-3Dtrue-26tv.fl-3Dcomments-26tv.tf-3Dtrue-26fl-3Dcomments-26rows-3D1000&d=CwICaQ&c=zzHkMf6HMoOvCB4yTPe0Gg&r=05YCVYE-IrDXcnbr1V8J9Q&m=lBNd_H5rkg46NYGJF0Kua46oVMy7Dr41Qbbregs1xjQ&s=W1Ti2_egOYFBVpBB11wxKQZqf8RGf5FkM22HrMI6eiY&e=

Hi Erick,

I need the count of repeated words to build the word cloud.

Thanks

Rajesh


-----Original Message-----

From: Ahmet Arslan [mailto:iori...@yahoo.com]

Sent: Tuesday, May 3, 2016 6:19 AM

To: solr-user@lucene.apache.org; G, Rajesh <r...@cebglobal.com>

Subject: Re: Facet ignoring repeated word

Hi,

TermsComponent does not respect the query parameter. StatsComponent does, and 
you can feed a function query (e.g., termfreq) to it.

Instead consider using TermVectors or MLT's interesting terms.
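
A minimal sketch of the function-query suggestion, assuming the host, port, and core name from the URLs quoted in this thread and the field names comments and questionid. The stats.field={!func}termfreq(...) form is the local-params syntax described in the Stats Component documentation; the helper name termfreq_stats_url is hypothetical, and the request is only built here, not sent. If StatsComponent applies the function to the documents matching q, the "sum" statistic would be the total number of occurrences of that one term, so this would answer the question one word at a time rather than for a whole word cloud.

```python
from urllib.parse import urlencode

# Assumed base URL; host, port and core name follow the URLs quoted in this thread.
SOLR_SELECT = "http://localhost:8182/solr/dev/select"


def termfreq_stats_url(term, question_id, field="comments"):
    """Build a select URL asking for stats over termfreq(field, term)."""
    params = {
        "q": f"questionid:{question_id}",  # restrict to one question
        "rows": 0,                          # only the stats section is needed
        "wt": "json",
        "stats": "true",
        "stats.field": f"{{!func}}termfreq({field},'{term}')",
    }
    return f"{SOLR_SELECT}?{urlencode(params)}"


if __name__ == "__main__":
    print(termfreq_stats_url("my", 3956))
```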

https://urldefense.proofpoint.com/v2/url?u=https-3A__cwiki.apache.org_confluence_display_solr_The-2BTerm-2BVector-2BComponent&d=CwICaQ&c=zzHkMf6HMoOvCB4yTPe0Gg&r=05YCVYE-IrDXcnbr1V8J9Q&m=lBNd_H5rkg46NYGJF0Kua46oVMy7Dr41Qbbregs1xjQ&s=96tOS2bK5hyC4pncDqAVvO4eUQ3uDFk_WE9xuOFqWck&e=

https://urldefense.proofpoint.com/v2/url?u=https-3A__cwiki.apache.org_confluence_display_solr_MoreLikeThis&d=CwICaQ&c=zzHkMf6HMoOvCB4yTPe0Gg&r=05YCVYE-IrDXcnbr1V8J9Q&m=lBNd_H5rkg46NYGJF0Kua46oVMy7Dr41Qbbregs1xjQ&s=Agd0JeOWCUWrCU2PxyFWTbwVxAP7mzVVVd7-105NJtM&e=

Ahmet

On Monday, May 2, 2016 9:31 AM, "G, Rajesh" <r...@cebglobal.com> wrote:

Hi Erick/ Ahmet,

Thanks for your suggestion. Can TermsComponent be restricted by a query? I need 
the word count of comments for one question id, not for all of them. When I 
include the query q=questionid=123 I still see the counts for all comments:

https://urldefense.proofpoint.com/v2/url?u=http-3A__localhost-3A8182_solr_dev_terms-3Fterms.fl-3Dcomments-26terms-3Dtrue-26terms.limit-3D1000-26q-3Dquestionid-3D123&d=CwICaQ&c=zzHkMf6HMoOvCB4yTPe0Gg&r=05YCVYE-IrDXcnbr1V8J9Q&m=lBNd_H5rkg46NYGJF0Kua46oVMy7Dr41Qbbregs1xjQ&s=Ya0KmfIVVtTMgcIYpXe0pN_VwdEwXqJkF9iDhF2xOOU&e=

StatsComponent does not support text fields:

Field type textcloud_en{class=org.apache.solr.schema.TextField,analyzer=org.apache.solr.analysis.TokenizerChain,args={positionIncrementGap=100, class=solr.TextField}} is not currently supported

  <fieldType name="textcloud_en" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.StopFilterFactory" words="lang/stopwords_en.txt" ignoreCase="true"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

Thanks

Rajesh


-----Original Message-----

From: Erick Erickson [mailto:erickerick...@gmail.com]

Sent: Friday, April 29, 2016 9:16 PM

To: solr-user <solr-user@lucene.apache.org>; Ahmet Arslan <iori...@yahoo.com>

Subject: Re: Facet ignoring repeated word

That's the way faceting is designed to work. It counts the _documents_ that a 
term appears in that satisfy your query; if a word appears multiple times in a 
doc, it'll only count it once.
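
The difference between the two counts can be illustrated outside Solr. The sketch below uses the two sample comments quoted in this thread and a deliberately naive tokenizer (lowercase, runs of letters and apostrophes), which is an assumption and not Solr's analysis chain. The hypothetical document_frequency mirrors what faceting reports, while total_term_frequency is the number a word cloud would need.

```python
import re
from collections import Counter

# Sample comments quoted in this thread.
COMMENTS = [
    "Projects, technology, features, performance",
    "Too many projects and technology, not enough people to run projects",
]


def tokenize(text):
    """Naive stand-in for an analyzer: lowercase, keep runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def document_frequency(docs):
    """Count each word once per document, as faceting does."""
    counts = Counter()
    for doc in docs:
        counts.update(set(tokenize(doc)))
    return counts


def total_term_frequency(docs):
    """Count every occurrence of each word across all documents."""
    counts = Counter()
    for doc in docs:
        counts.update(tokenize(doc))
    return counts


if __name__ == "__main__":
    df = document_frequency(COMMENTS)
    ttf = total_term_frequency(COMMENTS)
    for word in ("projects", "technology", "features"):
        print(f"{word}: facet-style={df[word]} total={ttf[word]}")
```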

For the general use-case it'd be unsettling for a user to see a facet count of 
500, then click on it and discover that the number of docs in the corpus was 
really 345 or something.

Ahmet's hints might help, but I'd really ask if counting words multiple times 
really satisfies the use case.

Best,

Erick

On Fri, Apr 29, 2016 at 7:10 AM, Ahmet Arslan <iori...@yahoo.com.invalid> wrote:

> Hi,

>

> Depending on your requirements, StatsComponent, TermsComponent, or 
> LukeRequestHandler can also be used.

>

>

> https://urldefense.proofpoint.com/v2/url?u=https-3A__cwiki.apache.org_confluence_display_solr_The-2BTerms-2BComponent&d=CwICaQ&c=zzHkMf6HMoOvCB4yTPe0Gg&r=05YCVYE-IrDXcnbr1V8J9Q&m=lBNd_H5rkg46NYGJF0Kua46oVMy7Dr41Qbbregs1xjQ&s=wumoMAx5ahS9S8tDmQAAOqTZCPa3t_VpgDtj7awpUfI&e=

> https://urldefense.proofpoint.com/v2/url?u=https-3A__wiki.apache.org_solr_LukeRequestHandler&d=CwICaQ&c=zzHkMf6HMoOvCB4yTPe0Gg&r=05YCVYE-IrDXcnbr1V8J9Q&m=lBNd_H5rkg46NYGJF0Kua46oVMy7Dr41Qbbregs1xjQ&s=Ca7XObSJb3GieteQwRbLQSmBThqpW3eovVMEkK4NnU4&e=

> https://urldefense.proofpoint.com/v2/url?u=https-3A__cwiki.apache.org_confluence_display_solr_The-2BStats-2BComponent&d=CwICaQ&c=zzHkMf6HMoOvCB4yTPe0Gg&r=05YCVYE-IrDXcnbr1V8J9Q&m=lBNd_H5rkg46NYGJF0Kua46oVMy7Dr41Qbbregs1xjQ&s=NgH0cqmhy8GcSfG4VDoxd5Y9tCAsoZEmwqE8_4UKISo&e=

> Ahmet

>

>

>

> On Friday, April 29, 2016 11:56 AM, "G, Rajesh" <r...@cebglobal.com> wrote:

> Hi,

>

> I am trying to implement word 
> cloud<https://urldefense.proofpoint.com/v2/url?u=https-3A__www.google.co.uk_imgres-3Fimgurl-3Dhttps-253A-252F-252Fwww.whitehouse.gov-252Fsites-252Fdefault-252Ffiles-252Fother-252Fsotu-5Fwordle.png-26imgrefurl-3Dhttps-253A-252F-252Fwww.whitehouse.gov-252Fblog-252F2011-252F01-252F26-252Fstate-2Dunion-2Dword-2Dcloud-2Djobs-2Damerica-2Dpeople-2Dnew-26docid-3DeZ-5FHvQpd9FRBKM-26tbnid-3DqyIc-2Delv6z-2D0iM-253A-26w-3D895-26h-3D406-26bih-3D643-26biw-3D1366-26ved-3D0ahUKEwie-5F8XjurPMAhXLaRQKHWiFDFAQMwgyKAAwAA-26iact-3Dmrc-26uact-3D8&d=CwICaQ&c=zzHkMf6HMoOvCB4yTPe0Gg&r=05YCVYE-IrDXcnbr1V8J9Q&m=lBNd_H5rkg46NYGJF0Kua46oVMy7Dr41Qbbregs1xjQ&s=Cjao8wJV-9kqmiNXxqmEkdzC746qLdQdiCbjlRAjaA0&e=
>  > using Solr. The problem I have is that a Solr facet query ignores repeated 
> words in a document, e.g.:

>

> I have indexed the text :

> It seems that the harder I work, the more work I get for the same 
> compensation and reward. The more work I take on gets absorbed into my 
> "normal" workload and I'm not recognized for working harder than my peers, 
> which makes me not want to work to my potential. I am very underwhelmed by 
> the evaluation process and bonus structure. I don't believe the current 
> structure rewards strong performers. I am confident that the company could 
> not hire someone with my talent to replace me if I left, but I don't think 
> the company realizes that.

>

> The indexed content contains the word "my" 3 times, but when I run the 
> query 
> https://urldefense.proofpoint.com/v2/url?u=http-3A__localhost-3A8182_solr_dev_select-3Ffacet-3Dtrue-26facet.field-3Dcomments-26rows-3D0-26indent-3Don-26q-3Dquestionid-3A3956-26wt-3Djson&d=CwICaQ&c=zzHkMf6HMoOvCB4yTPe0Gg&r=05YCVYE-IrDXcnbr1V8J9Q&m=lBNd_H5rkg46NYGJF0Kua46oVMy7Dr41Qbbregs1xjQ&s=eAPRQ47qzgCQed7F0hYces46xDxPvqeBxQG4JCM7RpE&e=
>   the count of the word "my" is 1 and not 3. Can you please help?

>

> Also, please suggest if there is a better way to implement a word cloud in 
> Solr than using facets.

>

>     "facet_fields":{

>       "comments":[

>         "absorbed",1,

>         "am",1,

>         "believe",1,

>         "bonus",1,

>         "company",1,

>         "compensation",1,

>         "confident",1,

>         "could",1,

>         "current",1,

>         "don't",1,

>         "evaluation",1,

>         "get",1,

>         "gets",1,

>         "harder",1,

>         "hire",1,

>         "i",1,

>         "i'm",1,

>         "left",1,

>         "makes",1,

>         "me",1,

>         "more",1,

>         "my",1,

>         "normal",1,

>         "peers",1,

>         "performers",1,

>         "potential",1,

>         "process",1,

>         "realizes",1,

>         "recognized",1,

>         "replace",1,

>         "reward",1,

>         "rewards",1,

>         "same",1,

>         "seems",1,

>         "someone",1,

>         "strong",1,

>         "structure",1,

>         "take",1,

>         "talent",1,

>         "than",1,

>         "think",1,

>         "underwhelmed",1,

>         "very",1,

>         "want",1,

>         "which",1,

>         "work",1,

>         "working",1,

>         "workload",1]

>     }

>

>

>

>

