Re: Search by similarity?

Josh Lincoln Tue, 29 Aug 2017 07:26:04 -0700

Darko,
Can you use edismax instead?

When using dismax, Solr parses the field name "title" as if it were a query term.
E.g. the query seems to be interpreted as
title "title-123123123-end"
(note the lack of a colon), which results in querying all your qf fields
for both "title" and "title-123123123-end".
I haven't used dismax in a very long time, so I don't know if this is
intentional, but it's not what I expected.
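
This would also account for the ~1 in the parsedquery quoted below: with two
top-level clauses ("title" and the phrase), mm=99% works out to one required
clause, and the explain output shows title:titl with docFreq=34584 out of
docCount=34584, so the bare "title" clause alone matches every document.
A minimal Python sketch of the percentage rule, assuming the behaviour
described in the Solr reference guide for mm (percentages are rounded down; a
negative percentage is the share of clauses that may be missing). The function
name is made up for illustration, and conditional specs such as "2<90%" are
not handled:

```python
def min_should_match(optional_clauses: int, percent: int) -> int:
    """Percentage-only approximation of Solr's mm rule (illustrative, not Solr's code)."""
    calc = optional_clauses * percent / 100
    if calc < 0:
        # Negative percentage: this many clauses may be missing.
        # int() truncates toward zero, so -0.03 becomes 0.
        required = optional_clauses + int(calc)
    else:
        # Positive percentage: rounded down.
        required = int(calc)
    # Clamp to the valid range [0, optional_clauses].
    return max(0, min(optional_clauses, required))


if __name__ == "__main__":
    print(min_should_match(2, 99))   # two clauses, as in the dismax parse
    print(min_should_match(3, 99))   # three clauses, as guessed earlier in the thread
    print(min_should_match(3, -1))   # negative percentage
```

Under this rule a negative percentage such as mm=-1% keeps every clause
required for small clause counts, which is the "round up" effect mentioned
further down the thread.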


I'm able to reproduce the issue in 6.4.2 using the default techproducts example.
Notice that in the output below, the parsedquery expands to both text:title and
text:name (df=text):
http://localhost:8983/solr/techproducts/select?indent=on&q=title:"name"&wt=json&debug=true&defType=dismax
rawquerystring: "title:"name"",
querystring: "title:"name"",
parsedquery: "(+(DisjunctionMaxQuery(((text:title)^1.0))
DisjunctionMaxQuery(((text:name)^1.0))) ())/no_coord",
parsedquery_toString: "+(((text:title)^1.0) ((text:name)^1.0)) ()"

But it's not an issue if you use edismax:
http://localhost:8983/solr/techproducts/select?indent=on&q=title:"name"&wt=json&debug=true&defType=edismax
rawquerystring: "title:"name"",
querystring: "title:"name"",
parsedquery: "(+title:name)/no_coord",
parsedquery_toString: "+title:name",
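
To compare the two parsers side by side without hand-editing the URL, the
request can be built with the query properly URL-encoded. A small Python
sketch; the helper name debug_url is hypothetical, and the host, port and
techproducts core are simply the ones from the example above:

```python
from urllib.parse import urlencode

# Assumed base URL: the local techproducts example used above.
BASE = "http://localhost:8983/solr/techproducts/select"


def debug_url(def_type: str, query: str, base: str = BASE) -> str:
    """Build a select URL with debug output for the given query parser."""
    params = {
        "indent": "on",
        "q": query,          # urlencode escapes the colon and the quotes
        "wt": "json",
        "debug": "true",
        "defType": def_type,
    }
    return base + "?" + urlencode(params)


if __name__ == "__main__":
    for def_type in ("dismax", "edismax"):
        print(debug_url(def_type, 'title:"name"'))
```

Opening both printed URLs and comparing the parsedquery entries should show
the same difference as in the output above.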



On Tue, Aug 29, 2017 at 8:44 AM Darko Todoric <[email protected]> wrote:

> Hi Erick,
>
> "debug":{ "rawquerystring":"title:\"title-123123123-end\"",
> "querystring":"title:\"title-123123123-end\"",
> "parsedquery":"(+(DisjunctionMaxQuery(((author_full:title)^7.0 |
> (abstract:titl)^2.0 | (title:titl)^3.0 | (keywords:titl)^5.0 |
> (authors:title)^4.0 | (doi:title:)^1.0))
> DisjunctionMaxQuery(((author_full:\"title 123123123 end\"~1)^7.0 |
> (abstract:\"titl 123123123 end\"~1)^2.0 | (title:\"titl 123123123
> end\"~1)^3.0 | (keywords:\"titl 123123123 end\"~1)^5.0 |
> (authors:\"title 123123123 end\"~1)^4.0 |
> (doi:title-123123123-end)^1.0)))~1 ())/no_coord",
> "parsedquery_toString":"+((((author_full:title)^7.0 |
> (abstract:titl)^2.0 | (title:titl)^3.0 | (keywords:titl)^5.0 |
> (authors:title)^4.0 | (doi:title:)^1.0) ((author_full:\"title 123123123
> end\"~1)^7.0 | (abstract:\"titl 123123123 end\"~1)^2.0 | (title:\"titl
> 123123123 end\"~1)^3.0 | (keywords:\"titl 123123123 end\"~1)^5.0 |
> (authors:\"title 123123123 end\"~1)^4.0 |
> (doi:title-123123123-end)^1.0))~1) ()", "explain":{ "23251":"\n16.848969
> = sum of:\n 16.848969 = sum of:\n 16.848969 = max of:\n 16.848969 =
> weight(abstract:titl in 23194) [], result of:\n 16.848969 =
> score(doc=23194,freq=1.0 = termFreq=1.0\n), product of:\n 2.0 = boost\n
> 5.503748 = idf(docFreq=74, docCount=18297)\n 1.5306814 = tfNorm,
> computed from:\n 1.0 = termFreq=1.0\n 1.2 = parameter k1\n 0.75 =
> parameter b\n 186.49593 = avgFieldLength\n 28.444445 = fieldLength\n
> 3.816711E-5 = weight(title:titl in 23194) [], result of:\n 3.816711E-5 =
> score(doc=23194,freq=1.0 = termFreq=1.0\n), product of:\n 3.0 = boost\n
> 1.4457239E-5 = idf(docFreq=34584, docCount=34584)\n 0.88 = tfNorm,
> computed from:\n 1.0 = termFreq=1.0\n 1.2 = parameter k1\n 0.75 =
> parameter b\n 3.0 = avgFieldLength\n 4.0 = fieldLength\n",
> "20495":"\n16.169483 = sum of:\n 16.169483 = sum of:\n 16.169483 = max
> of:\n 16.169483 = weight(abstract:titl in 20489) [], result of:\n
> 16.169483 = score(doc=20489,freq=1.0 = termFreq=1.0\n), product of:\n
> 2.0 = boost\n 5.503748 = idf(docFreq=74, docCount=18297)\n 1.468952 =
> tfNorm, computed from:\n 1.0 = termFreq=1.0\n 1.2 = parameter k1\n 0.75
> = parameter b\n 186.49593 = avgFieldLength\n 40.96 = fieldLength\n
> 3.816711E-5 = weight(title:titl in 20489) [], result of:\n 3.816711E-5 =
> score(doc=20489,freq=1.0 = termFreq=1.0\n), product of:\n 3.0 = boost\n
> 1.4457239E-5 = idf(docFreq=34584, docCount=34584)\n 0.88 = tfNorm,
> computed from:\n 1.0 = termFreq=1.0\n 1.2 = parameter k1\n 0.75 =
> parameter b\n 3.0 = avgFieldLength\n 4.0 = fieldLength\n",
> "28227":"\n15.670726 = sum of:\n 15.670726 = sum of:\n 15.670726 = max
> of:\n 15.670726 = weight(abstract:titl in 28156) [], result of:\n
> 15.670726 = score(doc=28156,freq=2.0 = termFreq=2.0\n), product of:\n
> 2.0 = boost\n 5.503748 = idf(docFreq=74, docCount=18297)\n 1.4236413 =
> tfNorm, computed from:\n 2.0 = termFreq=2.0\n 1.2 = parameter k1\n 0.75
> = parameter b\n 186.49593 = avgFieldLength\n 163.84 = fieldLength\n
> 3.816711E-5 = weight(title:titl in 28156) [], result of:\n 3.816711E-5 =
> score(doc=28156,freq=1.0 = termFreq=1.0\n), product of:\n 3.0 = boost\n
> 1.4457239E-5 = idf(docFreq=34584, docCount=34584)\n 0.88 = tfNorm,
> computed from:\n 1.0 = termFreq=1.0\n 1.2 = parameter k1\n 0.75 =
> parameter b\n 3.0 = avgFieldLength\n 4.0 = fieldLength\n",
> "20375":"\n15.052014 = sum of:\n 15.052014 = sum of:\n 15.052014 = max
> of:\n 15.052014 = weight(abstract:titl in 20369) [], result of:\n
> 15.052014 = score(doc=20369,freq=1.0 = termFreq=1.0\n), product of:\n
> 2.0 = boost\n 5.503748 = idf(docFreq=74, docCount=18297)\n 1.3674331 =
> tfNorm, computed from:\n 1.0 = termFreq=1.0\n 1.2 = parameter k1\n 0.75
> = parameter b\n 186.49593 = avgFieldLength\n 64.0 = fieldLength\n
> 3.816711E-5 = weight(title:titl in 20369) [], result of:\n 3.816711E-5 =
> score(doc=20369,freq=1.0 = termFreq=1.0\n), product of:\n 3.0 = boost\n
> 1.4457239E-5 = idf(docFreq=34584, docCount=34584)\n 0.88 = tfNorm,
> computed from:\n 1.0 = termFreq=1.0\n 1.2 = parameter k1\n 0.75 =
> parameter b\n 3.0 = avgFieldLength\n 4.0 = fieldLength\n",
> "20381":"\n15.052014 = sum of:\n 15.052014 = sum of:\n 15.052014 = max
> of:\n 15.052014 = weight(abstract:titl in 20375) [], result of:\n
> 15.052014 = score(doc=20375,freq=1.0 = termFreq=1.0\n), product of:\n
> 2.0 = boost\n 5.503748 = idf(docFreq=74, docCount=18297)\n 1.3674331 =
> tfNorm, computed from:\n 1.0 = termFreq=1.0\n 1.2 = parameter k1\n 0.75
> = parameter b\n 186.49593 = avgFieldLength\n 64.0 = fieldLength\n
> 3.816711E-5 = weight(title:titl in 20375) [], result of:\n 3.816711E-5 =
> score(doc=20375,freq=1.0 = termFreq=1.0\n), product of:\n 3.0 = boost\n
> 1.4457239E-5 = idf(docFreq=34584, docCount=34584)\n 0.88 = tfNorm,
> computed from:\n 1.0 = termFreq=1.0\n 1.2 = parameter k1\n 0.75 =
> parameter b\n 3.0 = avgFieldLength\n 4.0 = fieldLength\n",
> "29030":"\n13.699375 = sum of:\n 13.699375 = sum of:\n 13.699375 = max
> of:\n 13.699375 = weight(abstract:titl in 28959) [], result of:\n
> 13.699375 = score(doc=28959,freq=2.0 = termFreq=2.0\n), product of:\n
> 2.0 = boost\n 5.503748 = idf(docFreq=74, docCount=18297)\n 1.2445496 =
> tfNorm, computed from:\n 2.0 = termFreq=2.0\n 1.2 = parameter k1\n 0.75
> = parameter b\n 186.49593 = avgFieldLength\n 256.0 = fieldLength\n
> 3.816711E-5 = weight(title:titl in 28959) [], result of:\n 3.816711E-5 =
> score(doc=28959,freq=1.0 = termFreq=1.0\n), product of:\n 3.0 = boost\n
> 1.4457239E-5 = idf(docFreq=34584, docCount=34584)\n 0.88 = tfNorm,
> computed from:\n 1.0 = termFreq=1.0\n 1.2 = parameter k1\n 0.75 =
> parameter b\n 3.0 = avgFieldLength\n 4.0 = fieldLength\n",
> "31444":"\n13.699375 = sum of:\n 13.699375 = sum of:\n 13.699375 = max
> of:\n 13.699375 = weight(abstract:titl in 31373) [], result of:\n
> 13.699375 = score(doc=31373,freq=2.0 = termFreq=2.0\n), product of:\n
> 2.0 = boost\n 5.503748 = idf(docFreq=74, docCount=18297)\n 1.2445496 =
> tfNorm, computed from:\n 2.0 = termFreq=2.0\n 1.2 = parameter k1\n 0.75
> = parameter b\n 186.49593 = avgFieldLength\n 256.0 = fieldLength\n
> 3.816711E-5 = weight(title:titl in 31373) [], result of:\n 3.816711E-5 =
> score(doc=31373,freq=1.0 = termFreq=1.0\n), product of:\n 3.0 = boost\n
> 1.4457239E-5 = idf(docFreq=34584, docCount=34584)\n 0.88 = tfNorm,
> computed from:\n 1.0 = termFreq=1.0\n 1.2 = parameter k1\n 0.75 =
> parameter b\n 3.0 = avgFieldLength\n 4.0 = fieldLength\n",
> "30621":"\n13.096554 = sum of:\n 13.096554 = sum of:\n 13.096554 = max
> of:\n 13.096554 = weight(abstract:titl in 30550) [], result of:\n
> 13.096554 = score(doc=30550,freq=1.0 = termFreq=1.0\n), product of:\n
> 2.0 = boost\n 5.503748 = idf(docFreq=74, docCount=18297)\n 1.189785 =
> tfNorm, computed from:\n 1.0 = termFreq=1.0\n 1.2 = parameter k1\n 0.75
> = parameter b\n 186.49593 = avgFieldLength\n 113.77778 = fieldLength\n
> 3.816711E-5 = weight(title:titl in 30550) [], result of:\n 3.816711E-5 =
> score(doc=30550,freq=1.0 = termFreq=1.0\n), product of:\n 3.0 = boost\n
> 1.4457239E-5 = idf(docFreq=34584, docCount=34584)\n 0.88 = tfNorm,
> computed from:\n 1.0 = termFreq=1.0\n 1.2 = parameter k1\n 0.75 =
> parameter b\n 3.0 = avgFieldLength\n 4.0 = fieldLength\n",
> "32067":"\n13.096554 = sum of:\n 13.096554 = sum of:\n 13.096554 = max
> of:\n 13.096554 = weight(abstract:titl in 31996) [], result of:\n
> 13.096554 = score(doc=31996,freq=1.0 = termFreq=1.0\n), product of:\n
> 2.0 = boost\n 5.503748 = idf(docFreq=74, docCount=18297)\n 1.189785 =
> tfNorm, computed from:\n 1.0 = termFreq=1.0\n 1.2 = parameter k1\n 0.75
> = parameter b\n 186.49593 = avgFieldLength\n 113.77778 = fieldLength\n
> 3.816711E-5 = weight(title:titl in 31996) [], result of:\n 3.816711E-5 =
> score(doc=31996,freq=1.0 = termFreq=1.0\n), product of:\n 3.0 = boost\n
> 1.4457239E-5 = idf(docFreq=34584, docCount=34584)\n 0.88 = tfNorm,
> computed from:\n 1.0 = termFreq=1.0\n 1.2 = parameter k1\n 0.75 =
> parameter b\n 3.0 = avgFieldLength\n 4.0 = fieldLength\n",
> "1935":"\n11.583146 = sum of:\n 11.583146 = sum of:\n 11.583146 = max
> of:\n 11.583146 = weight(abstract:titl in 1934) [], result of:\n
> 11.583146 = score(doc=1934,freq=1.0 = termFreq=1.0\n), product of:\n 2.0
> = boost\n 5.503748 = idf(docFreq=74, docCount=18297)\n 1.0522962 =
> tfNorm, computed from:\n 1.0 = termFreq=1.0\n 1.2 = parameter k1\n 0.75
> = parameter b\n 186.49593 = avgFieldLength\n 163.84 = fieldLength\n
> 3.816711E-5 = weight(title:titl in 1934) [], result of:\n 3.816711E-5 =
> score(doc=1934,freq=1.0 = termFreq=1.0\n), product of:\n 3.0 = boost\n
> 1.4457239E-5 = idf(docFreq=34584, docCount=34584)\n 0.88 = tfNorm,
> computed from:\n 1.0 = termFreq=1.0\n 1.2 = parameter k1\n 0.75 =
> parameter b\n 3.0 = avgFieldLength\n 4.0 = fieldLength\n"},
> "QParser":"DisMaxQParser", "altquerystring":null, "boostfuncs":null,
>
> Kind regards,
> Darko Todoric
>
> On 08/28/2017 06:35 PM, Erick Erickson wrote:
> > What are the results of adding &debug=query to the URL? The parsed
> > query will be especially illuminating.
> >
> > Best,
> > Erick
> >
> > On Mon, Aug 28, 2017 at 4:37 AM, Emir Arnautovic
> > <[email protected]> wrote:
> >> Hi Darko,
> >>
> >> The issue is the wrong expectations: title-1-end is parsed to 3 tokens
> >> (guessing) and mm=99% of 3 tokens is 2.97, which is rounded down to 2. Since
> >> all your documents have 'title' and 'end' tokens, all match. If you want to
> >> round up, you can use mm=-1% - that will result in zero (or one match if you
> >> do not filter out original document).
> >>
> >> You have to play with your tokenizers and define what is similarity match
> >> percentage (if you want to stick with mm).
> >>
> >> Regards,
> >> Emir
> >>
> >>
> >>
> >> On 28.08.2017 09:17, Darko Todoric wrote:
> >>> Hm... I cannot make this DisMax work on my Solr...
> >>>
> >>> In Solr I have documents with titles:
> >>>   - "title-1-end"
> >>>   - "title-2-end"
> >>>   - "title-3-end"
> >>>   - ...
> >>>   - ...
> >>>   - "title-312-end"
> >>>
> >>> and when I run the query
> >>> http://localhost:8983/solr/SciLit/select?defType=dismax&indent=on&mm=99%&q=title:"title-123123123-end"&wt=json
> >>> I get all documents from Solr :\
> >>> What am I doing wrong?
> >>>
> >>> Also, I don't know if it affects the results, but on the "title" field I use
> >>> "WhitespaceTokenizerFactory".
> >>>
> >>> Kind regards,
> >>> Darko
> >>>
> >>>
> >>> On 08/25/2017 06:38 PM, Junte Zhang wrote:
> >>>> If you already have the title of the document, then you could run that
> >>>> title as a new query against the whole index and exclude the source
> >>>> document from the results as a filter.
> >>>>
> >>>> You could use the DisMax query parser:
> >>>> https://cwiki.apache.org/confluence/display/solr/The+DisMax+Query+Parser
> >>>>
> >>>> And then set the minimum match ratio of the OR clauses to 90%.
> >>>>
> >>>> /JZ
> >>>>
> >>>> -----Original Message-----
> >>>> From: Darko Todoric [mailto:[email protected]]
> >>>> Sent: Friday, August 25, 2017 5:49 PM
> >>>> To: [email protected]
> >>>> Subject: Search by similarity?
> >>>>
> >>>> Hi,
> >>>>
> >>>>
> >>>> I have 90,000,000 documents in Solr and I need to compare the "title" of
> >>>> this document and get all documents with more than 80% similarity. PHP has
> >>>> "similar_text", but it's not so smart to insert 90m documents into an
> >>>> array... Can I do some query in Solr which will give me the documents with
> >>>> more than 80% similarity?
> >>>>
> >>>>
> >>>> Kind regards,
> >>>> Darko Todoric
> >>>>
> >>>> --
> >>>> Darko Todoric
> >>>> Web Engineer, MDPI DOO
> >>>> Veljka Dugosevica 54, 11060 Belgrade, Serbia
> >>>> +381 65 43 90 620
> >>>> www.mdpi.com
> >>>>
> >>>> Disclaimer: The information and files contained in this message are
> >>>> confidential and intended solely for the use of the individual or entity
> >>>> to whom they are addressed.
> >>>> If you have received this message in error, please notify me and delete
> >>>> this message from your system.
> >>>> You may not copy this message in its entirety or in part, or disclose its
> >>>> contents to anyone.
> >>>>
> >> --
> >> Monitoring * Alerting * Anomaly Detection * Centralized Log Management
> >> Solr & Elasticsearch Support * http://sematext.com/
> >>
>
>
