Darko,

Can you use edismax instead? With dismax, Solr parses the field name in your query as if it were a query term. That is, the query seems to be interpreted as title "title-123123123-end" (note the lack of a colon), which results in querying all of your qf fields for both "title" and "title-123123123-end". I haven't used dismax in a very long time, so I don't know whether this is intentional, but it's not what I expected.
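
A quick way to compare the two parsers side by side is to ask Solr only for the parsed query. The following is a minimal Python sketch that builds the two request URLs; it assumes a local Solr on port 8983 with the techproducts example collection, and the helper name debug_url is hypothetical. It only constructs the URLs and does not contact Solr.

```python
from urllib.parse import urlencode

# Assumption: local Solr instance with the techproducts example collection.
BASE = "http://localhost:8983/solr/techproducts/select"


def debug_url(def_type: str, q: str = 'title:"name"') -> str:
    """Build a select URL that asks Solr to report how it parsed q."""
    params = {
        "q": q,
        "defType": def_type,   # "dismax" or "edismax"
        "debug": "query",      # return only the query-parsing debug section
        "wt": "json",
        "indent": "on",
    }
    # urlencode percent-encodes the colon and the quotes in q.
    return f"{BASE}?{urlencode(params)}"


for parser in ("dismax", "edismax"):
    print(debug_url(parser))
```

Opening both URLs and comparing the parsedquery values in the responses shows whether the field name was treated as a term.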

I'm able to reproduce the issue in 6.4.2 using the default techproducts example. Note that in the output below, the parsedquery expands to both text:title and text:name (df=text):

http://localhost:8983/solr/techproducts/select?indent=on&q=title:"name"&wt=json&debug=true&defType=dismax

rawquerystring: "title:"name"",
querystring: "title:"name"",
parsedquery: "(+(DisjunctionMaxQuery(((text:title)^1.0)) DisjunctionMaxQuery(((text:name)^1.0))) ())/no_coord",
parsedquery_toString: "+(((text:title)^1.0) ((text:name)^1.0)) ()"

But it's not an issue if you use edismax:

http://localhost:8983/solr/techproducts/select?indent=on&q=title:"name"&wt=json&debug=true&defType=edismax

rawquerystring: "title:"name"",
querystring: "title:"name"",
parsedquery: "(+title:name)/no_coord",
parsedquery_toString: "+title:name",

On Tue, Aug 29, 2017 at 8:44 AM Darko Todoric <todo...@mdpi.com> wrote:

> Hi Erick,
>
> "debug":{ "rawquerystring":"title:\"title-123123123-end\"",
> "querystring":"title:\"title-123123123-end\"",
> "parsedquery":"(+(DisjunctionMaxQuery(((author_full:title)^7.0 |
> (abstract:titl)^2.0 | (title:titl)^3.0 | (keywords:titl)^5.0 |
> (authors:title)^4.0 | (doi:title:)^1.0))
> DisjunctionMaxQuery(((author_full:\"title 123123123 end\"~1)^7.0 |
> (abstract:\"titl 123123123 end\"~1)^2.0 | (title:\"titl 123123123
> end\"~1)^3.0 | (keywords:\"titl 123123123 end\"~1)^5.0 |
> (authors:\"title 123123123 end\"~1)^4.0 |
> (doi:title-123123123-end)^1.0)))~1 ())/no_coord",
> "parsedquery_toString":"+((((author_full:title)^7.0 |
> (abstract:titl)^2.0 | (title:titl)^3.0 | (keywords:titl)^5.0 |
> (authors:title)^4.0 | (doi:title:)^1.0) ((author_full:\"title 123123123
> end\"~1)^7.0 | (abstract:\"titl 123123123 end\"~1)^2.0 | (title:\"titl
> 123123123 end\"~1)^3.0 | (keywords:\"titl 123123123 end\"~1)^5.0 |
> (authors:\"title 123123123 end\"~1)^4.0 |
> (doi:title-123123123-end)^1.0))~1) ()", "explain":{ "23251":"\n16.848969
> = sum of:\n 16.848969 = sum of:\n 16.848969 = max of:\n 16.848969 =
weight(abstract:titl in 23194) [], result of:\n 16.848969 = > score(doc=23194,freq=1.0 = termFreq=1.0\n), product of:\n 2.0 = boost\n > 5.503748 = idf(docFreq=74, docCount=18297)\n 1.5306814 = tfNorm, > computed from:\n 1.0 = termFreq=1.0\n 1.2 = parameter k1\n 0.75 = > parameter b\n 186.49593 = avgFieldLength\n 28.444445 = fieldLength\n > 3.816711E-5 = weight(title:titl in 23194) [], result of:\n 3.816711E-5 = > score(doc=23194,freq=1.0 = termFreq=1.0\n), product of:\n 3.0 = boost\n > 1.4457239E-5 = idf(docFreq=34584, docCount=34584)\n 0.88 = tfNorm, > computed from:\n 1.0 = termFreq=1.0\n 1.2 = parameter k1\n 0.75 = > parameter b\n 3.0 = avgFieldLength\n 4.0 = fieldLength\n", > "20495":"\n16.169483 = sum of:\n 16.169483 = sum of:\n 16.169483 = max > of:\n 16.169483 = weight(abstract:titl in 20489) [], result of:\n > 16.169483 = score(doc=20489,freq=1.0 = termFreq=1.0\n), product of:\n > 2.0 = boost\n 5.503748 = idf(docFreq=74, docCount=18297)\n 1.468952 = > tfNorm, computed from:\n 1.0 = termFreq=1.0\n 1.2 = parameter k1\n 0.75 > = parameter b\n 186.49593 = avgFieldLength\n 40.96 = fieldLength\n > 3.816711E-5 = weight(title:titl in 20489) [], result of:\n 3.816711E-5 = > score(doc=20489,freq=1.0 = termFreq=1.0\n), product of:\n 3.0 = boost\n > 1.4457239E-5 = idf(docFreq=34584, docCount=34584)\n 0.88 = tfNorm, > computed from:\n 1.0 = termFreq=1.0\n 1.2 = parameter k1\n 0.75 = > parameter b\n 3.0 = avgFieldLength\n 4.0 = fieldLength\n", > "28227":"\n15.670726 = sum of:\n 15.670726 = sum of:\n 15.670726 = max > of:\n 15.670726 = weight(abstract:titl in 28156) [], result of:\n > 15.670726 = score(doc=28156,freq=2.0 = termFreq=2.0\n), product of:\n > 2.0 = boost\n 5.503748 = idf(docFreq=74, docCount=18297)\n 1.4236413 = > tfNorm, computed from:\n 2.0 = termFreq=2.0\n 1.2 = parameter k1\n 0.75 > = parameter b\n 186.49593 = avgFieldLength\n 163.84 = fieldLength\n > 3.816711E-5 = weight(title:titl in 28156) [], result of:\n 3.816711E-5 = > score(doc=28156,freq=1.0 = 
termFreq=1.0\n), product of:\n 3.0 = boost\n > 1.4457239E-5 = idf(docFreq=34584, docCount=34584)\n 0.88 = tfNorm, > computed from:\n 1.0 = termFreq=1.0\n 1.2 = parameter k1\n 0.75 = > parameter b\n 3.0 = avgFieldLength\n 4.0 = fieldLength\n", > "20375":"\n15.052014 = sum of:\n 15.052014 = sum of:\n 15.052014 = max > of:\n 15.052014 = weight(abstract:titl in 20369) [], result of:\n > 15.052014 = score(doc=20369,freq=1.0 = termFreq=1.0\n), product of:\n > 2.0 = boost\n 5.503748 = idf(docFreq=74, docCount=18297)\n 1.3674331 = > tfNorm, computed from:\n 1.0 = termFreq=1.0\n 1.2 = parameter k1\n 0.75 > = parameter b\n 186.49593 = avgFieldLength\n 64.0 = fieldLength\n > 3.816711E-5 = weight(title:titl in 20369) [], result of:\n 3.816711E-5 = > score(doc=20369,freq=1.0 = termFreq=1.0\n), product of:\n 3.0 = boost\n > 1.4457239E-5 = idf(docFreq=34584, docCount=34584)\n 0.88 = tfNorm, > computed from:\n 1.0 = termFreq=1.0\n 1.2 = parameter k1\n 0.75 = > parameter b\n 3.0 = avgFieldLength\n 4.0 = fieldLength\n", > "20381":"\n15.052014 = sum of:\n 15.052014 = sum of:\n 15.052014 = max > of:\n 15.052014 = weight(abstract:titl in 20375) [], result of:\n > 15.052014 = score(doc=20375,freq=1.0 = termFreq=1.0\n), product of:\n > 2.0 = boost\n 5.503748 = idf(docFreq=74, docCount=18297)\n 1.3674331 = > tfNorm, computed from:\n 1.0 = termFreq=1.0\n 1.2 = parameter k1\n 0.75 > = parameter b\n 186.49593 = avgFieldLength\n 64.0 = fieldLength\n > 3.816711E-5 = weight(title:titl in 20375) [], result of:\n 3.816711E-5 = > score(doc=20375,freq=1.0 = termFreq=1.0\n), product of:\n 3.0 = boost\n > 1.4457239E-5 = idf(docFreq=34584, docCount=34584)\n 0.88 = tfNorm, > computed from:\n 1.0 = termFreq=1.0\n 1.2 = parameter k1\n 0.75 = > parameter b\n 3.0 = avgFieldLength\n 4.0 = fieldLength\n", > "29030":"\n13.699375 = sum of:\n 13.699375 = sum of:\n 13.699375 = max > of:\n 13.699375 = weight(abstract:titl in 28959) [], result of:\n > 13.699375 = score(doc=28959,freq=2.0 = termFreq=2.0\n), product 
of:\n > 2.0 = boost\n 5.503748 = idf(docFreq=74, docCount=18297)\n 1.2445496 = > tfNorm, computed from:\n 2.0 = termFreq=2.0\n 1.2 = parameter k1\n 0.75 > = parameter b\n 186.49593 = avgFieldLength\n 256.0 = fieldLength\n > 3.816711E-5 = weight(title:titl in 28959) [], result of:\n 3.816711E-5 = > score(doc=28959,freq=1.0 = termFreq=1.0\n), product of:\n 3.0 = boost\n > 1.4457239E-5 = idf(docFreq=34584, docCount=34584)\n 0.88 = tfNorm, > computed from:\n 1.0 = termFreq=1.0\n 1.2 = parameter k1\n 0.75 = > parameter b\n 3.0 = avgFieldLength\n 4.0 = fieldLength\n", > "31444":"\n13.699375 = sum of:\n 13.699375 = sum of:\n 13.699375 = max > of:\n 13.699375 = weight(abstract:titl in 31373) [], result of:\n > 13.699375 = score(doc=31373,freq=2.0 = termFreq=2.0\n), product of:\n > 2.0 = boost\n 5.503748 = idf(docFreq=74, docCount=18297)\n 1.2445496 = > tfNorm, computed from:\n 2.0 = termFreq=2.0\n 1.2 = parameter k1\n 0.75 > = parameter b\n 186.49593 = avgFieldLength\n 256.0 = fieldLength\n > 3.816711E-5 = weight(title:titl in 31373) [], result of:\n 3.816711E-5 = > score(doc=31373,freq=1.0 = termFreq=1.0\n), product of:\n 3.0 = boost\n > 1.4457239E-5 = idf(docFreq=34584, docCount=34584)\n 0.88 = tfNorm, > computed from:\n 1.0 = termFreq=1.0\n 1.2 = parameter k1\n 0.75 = > parameter b\n 3.0 = avgFieldLength\n 4.0 = fieldLength\n", > "30621":"\n13.096554 = sum of:\n 13.096554 = sum of:\n 13.096554 = max > of:\n 13.096554 = weight(abstract:titl in 30550) [], result of:\n > 13.096554 = score(doc=30550,freq=1.0 = termFreq=1.0\n), product of:\n > 2.0 = boost\n 5.503748 = idf(docFreq=74, docCount=18297)\n 1.189785 = > tfNorm, computed from:\n 1.0 = termFreq=1.0\n 1.2 = parameter k1\n 0.75 > = parameter b\n 186.49593 = avgFieldLength\n 113.77778 = fieldLength\n > 3.816711E-5 = weight(title:titl in 30550) [], result of:\n 3.816711E-5 = > score(doc=30550,freq=1.0 = termFreq=1.0\n), product of:\n 3.0 = boost\n > 1.4457239E-5 = idf(docFreq=34584, docCount=34584)\n 0.88 = tfNorm, > 
computed from:\n 1.0 = termFreq=1.0\n 1.2 = parameter k1\n 0.75 = > parameter b\n 3.0 = avgFieldLength\n 4.0 = fieldLength\n", > "32067":"\n13.096554 = sum of:\n 13.096554 = sum of:\n 13.096554 = max > of:\n 13.096554 = weight(abstract:titl in 31996) [], result of:\n > 13.096554 = score(doc=31996,freq=1.0 = termFreq=1.0\n), product of:\n > 2.0 = boost\n 5.503748 = idf(docFreq=74, docCount=18297)\n 1.189785 = > tfNorm, computed from:\n 1.0 = termFreq=1.0\n 1.2 = parameter k1\n 0.75 > = parameter b\n 186.49593 = avgFieldLength\n 113.77778 = fieldLength\n > 3.816711E-5 = weight(title:titl in 31996) [], result of:\n 3.816711E-5 = > score(doc=31996,freq=1.0 = termFreq=1.0\n), product of:\n 3.0 = boost\n > 1.4457239E-5 = idf(docFreq=34584, docCount=34584)\n 0.88 = tfNorm, > computed from:\n 1.0 = termFreq=1.0\n 1.2 = parameter k1\n 0.75 = > parameter b\n 3.0 = avgFieldLength\n 4.0 = fieldLength\n", > "1935":"\n11.583146 = sum of:\n 11.583146 = sum of:\n 11.583146 = max > of:\n 11.583146 = weight(abstract:titl in 1934) [], result of:\n > 11.583146 = score(doc=1934,freq=1.0 = termFreq=1.0\n), product of:\n 2.0 > = boost\n 5.503748 = idf(docFreq=74, docCount=18297)\n 1.0522962 = > tfNorm, computed from:\n 1.0 = termFreq=1.0\n 1.2 = parameter k1\n 0.75 > = parameter b\n 186.49593 = avgFieldLength\n 163.84 = fieldLength\n > 3.816711E-5 = weight(title:titl in 1934) [], result of:\n 3.816711E-5 = > score(doc=1934,freq=1.0 = termFreq=1.0\n), product of:\n 3.0 = boost\n > 1.4457239E-5 = idf(docFreq=34584, docCount=34584)\n 0.88 = tfNorm, > computed from:\n 1.0 = termFreq=1.0\n 1.2 = parameter k1\n 0.75 = > parameter b\n 3.0 = avgFieldLength\n 4.0 = fieldLength\n"}, > "QParser":"DisMaxQParser", "altquerystring":null, "boostfuncs":null, > > Kind regards, > Darko Todoric > > On 08/28/2017 06:35 PM, Erick Erickson wrote: > > What are the results of adding &debug=query to the URL? The parsed > > query will be especially illuminating. 
> >
> > Best,
> > Erick
> >
> > On Mon, Aug 28, 2017 at 4:37 AM, Emir Arnautovic
> > <emir.arnauto...@sematext.com> wrote:
> >> Hi Darko,
> >>
> >> The issue is wrong expectations: title-1-end is parsed into 3 tokens
> >> (guessing), and mm=99% of 3 tokens is 2.97, which is rounded down to 2.
> >> Since all your documents have the 'title' and 'end' tokens, they all match.
> >> If you want to round up, you can use mm=-1%; that will result in zero
> >> matches (or one match if you do not filter out the original document).
> >>
> >> You have to play with your tokenizers and define what the similarity
> >> match percentage is (if you want to stick with mm).
> >>
> >> Regards,
> >> Emir
> >>
> >>
> >> On 28.08.2017 09:17, Darko Todoric wrote:
> >>> Hm... I cannot make this DisMax work on my Solr...
> >>>
> >>> In Solr I have documents with titles:
> >>> - "title-1-end"
> >>> - "title-2-end"
> >>> - "title-3-end"
> >>> - ...
> >>> - ...
> >>> - "title-312-end"
> >>>
> >>> and when I run the query
> >>> http://localhost:8983/solr/SciLit/select?defType=dismax&indent=on&mm=99%&q=title:"title-123123123-end"&wt=json
> >>> I get all documents from Solr :\
> >>> What am I doing wrong?
> >>>
> >>> Also, I don't know if it affects the results, but on the "title" field I use
> >>> "WhitespaceTokenizerFactory".
> >>>
> >>> Kind regards,
> >>> Darko
> >>>
> >>>
> >>> On 08/25/2017 06:38 PM, Junte Zhang wrote:
> >>>> If you already have the title of the document, then you could run that
> >>>> title as a new query against the whole index and exclude the source
> >>>> document from the results as a filter.
> >>>>
> >>>> You could use the DisMax query parser:
> >>>> https://cwiki.apache.org/confluence/display/solr/The+DisMax+Query+Parser
> >>>>
> >>>> And then set the minimum match ratio of the OR clauses to 90%.
> >>>>
> >>>> /JZ
> >>>>
> >>>> -----Original Message-----
> >>>> From: Darko Todoric [mailto:todo...@mdpi.com]
> >>>> Sent: Friday, August 25, 2017 5:49 PM
> >>>> To: solr-user@lucene.apache.org
> >>>> Subject: Search by similarity?
> >>>>
> >>>> Hi,
> >>>>
> >>>> I have 90.000.000 documents in Solr and I need to compare the "title" of
> >>>> a document and get all documents with more than 80% similarity. PHP has
> >>>> "similar_text", but it's not so smart to insert 90m documents into an
> >>>> array...
> >>>> Can I run some query in Solr which will give me the documents with more
> >>>> than 80% similarity?
> >>>>
> >>>> Kind regards,
> >>>> Darko Todoric
> >>>>
> >>>> --
> >>>> Darko Todoric
> >>>> Web Engineer, MDPI DOO
> >>>> Veljka Dugosevica 54, 11060 Belgrade, Serbia
> >>>> +381 65 43 90 620
> >>>> www.mdpi.com
> >>>>
> >>>> Disclaimer: The information and files contained in this message are
> >>>> confidential and intended solely for the use of the individual or
> >>>> entity to whom they are addressed.
> >>>> If you have received this message in error, please notify me and delete
> >>>> this message from your system.
> >>>> You may not copy this message in its entirety or in part, or disclose
> >>>> its contents to anyone.
> >>>>
> >> --
> >> Monitoring * Alerting * Anomaly Detection * Centralized Log Management
> >> Solr & Elasticsearch Support * http://sematext.com/
> >>
>
> --
> Darko Todoric
> Web Engineer, MDPI DOO
> Veljka Dugosevica 54, 11060 Belgrade, Serbia
> +381 65 43 90 620
> www.mdpi.com
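
For reference, the mm rounding that Emir describes in the quoted thread can be sketched in a few lines of Python. This is a simplified illustration of the percentage case only, assuming the behaviour described in the Solr documentation: the computed value is rounded down, and a negative percentage is the share of optional clauses that may be missing. The function name min_should_match is hypothetical, and this is not Solr's actual code.

```python
def min_should_match(optional_clauses: int, percent: int) -> int:
    """Number of optional clauses that must match for a percentage mm value."""
    calc = optional_clauses * percent / 100
    if calc < 0:
        # Negative percentage: this many clauses may be missing.
        # int() truncates toward zero, so -0.03 becomes 0.
        return optional_clauses + int(calc)
    # Positive percentage: round down.
    return int(calc)


print(min_should_match(3, 99))   # 2
print(min_should_match(2, 99))   # 1
print(min_should_match(3, -1))   # 3
```

The second call corresponds to the debug output quoted above, where dismax produced two clauses and the parsed query ends in ~1, so a document matching only "title" already satisfies mm=99%.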