Re: Sorting on pseudo field (the one which is added during a DocTransformer)

2018-05-18 Thread prateek.agarwal
Hi Mikhail,

I think you forgot to link the reference.

Thanks


Regards,
Prateek

On 2018/05/17 13:18:22, Mikhail Khludnev  wrote: 
> Here is the reference I've found so far.
> 
> On Thu, May 17, 2018 at 12:26 PM, prateek.agar...@bigbasket.com <
> prateek.agar...@bigbasket.com> wrote:
> 
> >
> > Hi Mikhail,
> >
> > > You can either sort by function that needs to turn the logic into value
> > > source parser.
> >
> > But my requirement was to add a field dynamically, from a cache or an
> > external source, to the documents returned by Solr, and to sort on it in
> > Solr itself when required; otherwise sort by score.
> > So how would you advise going about this?
> >
> > And regarding your suggestion "to turn the logic into value source
> > parser": how would I do that in this case?
> >
> >
> > > If you need to re-sort just the result page, check rerank.
> >
> > I don't want to use it to rank the relevancy of results.
> >
> > Thanks for the response.
> >
> >
> >
> > Regards,
> > Prateek
> >
> 
> 
> 
> -- 
> Sincerely yours
> Mikhail Khludnev
> 


[ANNOUNCE] Apache Solr 6.6.4 released

2018-05-18 Thread Ishan Chattopadhyaya
18 May 2018, Apache Solr™ 6.6.4 available

The Lucene PMC is pleased to announce the release of Apache Solr 6.6.4

Solr is the popular, blazing fast, open source NoSQL search platform from
the Apache Lucene project. Its major features include powerful full-text
search, hit highlighting, faceted search and analytics, rich document
parsing, geospatial search, extensive REST APIs as well as parallel SQL.
Solr is enterprise grade, secure and highly scalable, providing fault
tolerant distributed search and indexing, and powers the search and
navigation features of many of the world's largest internet sites.

This release includes 1 bug fix since the 6.6.3 release:

* Disallow the use of absolute URIs for including other files in
solrconfig.xml and schema parsing

The release is available for immediate download at:

http://www.apache.org/dyn/closer.lua/lucene/solr/6.6.4

Please read CHANGES.txt for a detailed list of changes:

https://lucene.apache.org/solr/6_6_4/changes/Changes.html

Please report any feedback to the mailing lists (
http://lucene.apache.org/solr/discussion.html)

Note: The Apache Software Foundation uses an extensive mirroring network
for distributing releases. It is possible that the mirror you are using may
not have replicated the release yet. If that is the case, please try
another mirror. This also goes for Maven access.


Re: Multiple languages, boosting and, stemming and KeywordRepeat

2018-05-18 Thread Alessandro Benedetti
Hi Markus,
Can you show all the query parameters used when submitting the request to
the request handler?
Can you also include the parsed query (in the debug output)?

I am curious to investigate this case.

Cheers

--
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
www.sease.io

On Thu, May 17, 2018 at 10:53 PM, Markus Jelsma 
wrote:

> Hello,
>
> And sorry to disturb again. Do any of you have any meaningful opinion
> on this peculiar matter? The RemoveDuplicates filter exists for a reason,
> but with query-time KeywordRepeat filter it causes trouble in some cases.
> Is it normal for the clauses to be absent in the debug output, but the
> boost doubled in value?
>
> I like this behaviour, but is it a side effect that is considered a bug in
> later versions? And where is the documentation on this? I cannot find
> anything in the Lucene or Solr Javadocs, or the reference manual.
>
> Many thanks, again,
> Markus
>
>
>
> -Original message-
> > From:Markus Jelsma 
> > Sent: Wednesday 9th May 2018 17:39
> > To: solr-user 
> > Subject: Multiple languages, boosting and, stemming and KeywordRepeat
> >
> > Hello,
> >
> > First, apologies for the weird subject line.
> >
> > We index many languages and search over all those languages at once, but
> boost the language of the user's preference. To differentiate between
> stemmed tokens and unstemmed tokens we use KeywordRepeat and
> RemoveDuplicates, this works very well.
> >
> > However, we just stumbled over the following example, q=australia is not
> stemmed in English, but its suffix is removed by the Romanian stemmer,
> causing the Romanian results to be returned on top of English results,
> despite language boosting.
> >
> > This is because the Romanian part of the query consists of the stemmed
> and unstemmed version of the word, but the English part of the query is
> just one clause per field (title, content etc). Thus the Romanian results
> score roughly twice that of English results.
> >
> > Now, this is of course really obvious, but the 'solution' is not. To
> work around the problem I removed the RemoveDuplicates filter so I get two
> clauses for English as well; really ugly, but it works. What I don't
> understand is the debug output: it doesn't list two identical clauses;
> instead, it doubles the boost on the field. So instead of:
> >
> > 27.048403 = PayloadSpanQuery, product of:
> >   27.048403 = weight(title_en:australia in 15850)
> [SchemaSimilarity], result of:
> > 27.048403 = score(doc=15850,freq=4.0 = phraseFreq=4.0
> > ), product of:
> >   7.4 = boost
> >   3.084852 = idf(docFreq=14539, docCount=317894)
> >   1.1848832 = tfNorm, computed as (freq * (k1 + 1)) / (freq + k1
> * (1 - b + b * fieldLength / avgFieldLength)) from:
> > 4.0 = phraseFreq=4.0
> > 0.3 = parameter k1
> > 0.5 = parameter b
> > 15.08689 = avgFieldLength
> > 24.0 = fieldLength
> >   1.0 = AveragePayloadFunction.docScore()
> >
> > I now get
> >
> > 54.096806 = PayloadSpanQuery, product of:
> >   54.096806 = weight(title_en:australia in 15850)
> [SchemaSimilarity], result of:
> > 54.096806 = score(doc=15850,freq=4.0 = phraseFreq=4.0
> > ), product of:
> >   14.8 = boost
> >   3.084852 = idf(docFreq=14539, docCount=317894)
> >   1.1848832 = tfNorm, computed as (freq * (k1 + 1)) / (freq + k1
> * (1 - b + b * fieldLength / avgFieldLength)) from:
> > 4.0 = phraseFreq=4.0
> > 0.3 = parameter k1
> > 0.5 = parameter b
> > 15.08689 = avgFieldLength
> > 24.0 = fieldLength
> >   1.0 = AveragePayloadFunction.docScore()
> >
> > So instead of expecting two clauses in the debug, I get one but with a
> doubled boost.
> >
> > The question is, is this supposed to be like this?
> >
> > Also, are there any real solutions to this problem? Removing the
> RemoveDuplicates filter looks really silly.
> >
> > Many thanks!
> > Markus
> >
>


Re: Sorting on pseudo field (the one which is added during a DocTransformer)

2018-05-18 Thread Mikhail Khludnev
Right
https://wiki.apache.org/solr/SolrPlugins#ValueSourceParser
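
For a concrete picture, here is a minimal sketch of that plugin route,
written against the Solr/Lucene 7.x API. The class name, the "extscore"
function name, and the ExternalCache helper are illustrative, not an
existing API. Registered in solrconfig.xml as
<valueSourceParser name="extscore" class="com.example.ExternalScoreParser"/>,
it can be used as sort=extscore(id) desc, falling back to the default
sort=score desc when the external value is not wanted:

  package com.example;

  import java.io.IOException;
  import java.util.Map;

  import org.apache.lucene.index.LeafReaderContext;
  import org.apache.lucene.queries.function.FunctionValues;
  import org.apache.lucene.queries.function.ValueSource;
  import org.apache.lucene.queries.function.docvalues.FloatDocValues;
  import org.apache.solr.search.FunctionQParser;
  import org.apache.solr.search.SyntaxError;
  import org.apache.solr.search.ValueSourceParser;

  public class ExternalScoreParser extends ValueSourceParser {

    @Override
    public ValueSource parse(FunctionQParser fp) throws SyntaxError {
      // The wrapped source yields the per-document key, e.g. extscore(id).
      return new ExternalScoreValueSource(fp.parseValueSource());
    }

    private static final class ExternalScoreValueSource extends ValueSource {
      private final ValueSource keys;

      ExternalScoreValueSource(ValueSource keys) { this.keys = keys; }

      @Override
      public FunctionValues getValues(Map context, LeafReaderContext leaf)
          throws IOException {
        FunctionValues keyValues = keys.getValues(context, leaf);
        return new FloatDocValues(this) {
          @Override
          public float floatVal(int doc) throws IOException {
            // ExternalCache.lookup is a hypothetical hook into your cache
            // or external source; default to 0 when no value is present.
            Float v = ExternalCache.lookup(keyValues.strVal(doc));
            return v == null ? 0f : v;
          }
        };
      }

      @Override public boolean equals(Object o) {
        return o instanceof ExternalScoreValueSource
            && keys.equals(((ExternalScoreValueSource) o).keys);
      }
      @Override public int hashCode() { return 31 * keys.hashCode(); }
      @Override public String description() {
        return "extscore(" + keys.description() + ")";
      }
    }
  }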

On Fri, May 18, 2018 at 8:04 AM, prateek.agar...@bigbasket.com <
prateek.agar...@bigbasket.com> wrote:

> Hi Mikhail,
>
> I think you forgot to link the reference.
>
> Thanks
>
>
> Regards,
> Prateek
>
> On 2018/05/17 13:18:22, Mikhail Khludnev  wrote:
> > Here is the reference I've found so far.
> >
> > On Thu, May 17, 2018 at 12:26 PM, prateek.agar...@bigbasket.com <
> > prateek.agar...@bigbasket.com> wrote:
> >
> > >
> > > Hi Mikhail,
> > >
> > > > You can either sort by function that needs to turn the logic into
> > > > value source parser.
> > >
> > > But my requirement was to add a field dynamically, from a cache or an
> > > external source, to the documents returned by Solr, and to sort on it
> > > in Solr itself when required; otherwise sort by score.
> > > So how would you advise going about this?
> > >
> > > And regarding your suggestion "to turn the logic into value source
> > > parser": how would I do that in this case?
> > >
> > >
> > > > If you need to re-sort just the result page, check rerank.
> > >
> > > I don't want to use it to rank the relevancy of results.
> > >
> > > Thanks for the response.
> > >
> > >
> > >
> > > Regards,
> > > Prateek
> > >
> >
> >
> >
> > --
> > Sincerely yours
> > Mikhail Khludnev
> >
>



-- 
Sincerely yours
Mikhail Khludnev


Re: Solr admin Segments page legend

2018-05-18 Thread Erick Erickson
The proportions are off for deleted documents, see: SOLR-8839.

On Fri, May 18, 2018 at 12:48 AM, Asher Shih  wrote:
> unsubscribe
>
> On Thu, May 17, 2018 at 8:17 PM, Yasufumi Mizoguchi
>  wrote:
>> Hi,
>>
>> I found some information about the pink bar in the mail archive.
>> I think this should be written in the ref. guide.
>>
>>> I think that pink segments are those segments
>>> which the system thinks are most likely to be chosen for automatic
>>> merging, according to whatever merge policy you have active.  Most
>>> likely the merge policy is TieredMergePolicy.
>>
>> For details, check the following:
>> http://lucene.472066.n3.nabble.com/what-s-the-pink-segment-on-solr-UI-meaning-td4378940.html
>>
>> Thanks,
>> Yasufumi
>>
>>
>> On Fri, May 18, 2018 at 11:55, Nawab Zada Asad Iqbal :
>>
>>> Hi,
>>>
>>> Solr has a nice segments visualization at [core_name]/segments, but I am
>>> wondering what the colors mean?
>>>
>>> Gray color is probably deleted documents, but I couldn't guess the
>>> significance of the pink color:
>>>
>>>
>>>
>>> Thanks
>>> Nawab
>>>


Re: Solr admin Segments page legend

2018-05-18 Thread Erick Erickson
Asher:

Please follow the instructions here:
http://lucene.apache.org/solr/community.html#mailing-lists-irc. You
must use the _exact_ same e-mail as you used to subscribe.


If the initial try doesn't work and following the suggestions at the
"problems" link doesn't work for you, let us know. But note you need
to show us the _entire_ return header to allow anyone to diagnose the
problem.

Best,
Erick

On Fri, May 18, 2018 at 7:49 AM, Erick Erickson  wrote:
> The proportions are off for deleted documents, see: SOLR-8839.
>
> On Fri, May 18, 2018 at 12:48 AM, Asher Shih  wrote:
>> unsubscribe
>>
>> On Thu, May 17, 2018 at 8:17 PM, Yasufumi Mizoguchi
>>  wrote:
>>> Hi,
>>>
>>> I found some information about the pink bar in the mail archive.
>>> I think this should be written in the ref. guide.
>>>
 I think that pink segments are those segments
 which the system thinks are most likely to be chosen for automatic
 merging, according to whatever merge policy you have active.  Most
 likely the merge policy is TieredMergePolicy.
>>>
>>> For details, check the following:
>>> http://lucene.472066.n3.nabble.com/what-s-the-pink-segment-on-solr-UI-meaning-td4378940.html
>>>
>>> Thanks,
>>> Yasufumi
>>>
>>>
>>> On Fri, May 18, 2018 at 11:55, Nawab Zada Asad Iqbal :
>>>
 Hi,

 Solr has a nice segments visualization at [core_name]/segments, but I am
 wondering what the colors mean?

 Gray color is probably deleted documents, but I couldn't guess the
 significance of the pink color:



 Thanks
 Nawab



Solr import doubling space on disk

2018-05-18 Thread Darko Todoric

Hi guys,

We have about 250 GB of Solr data on one server, and when we start a full
import Solr doubles the space used on disk... This is a problem for us
because the server has a 500 GB SSD and we hit almost 100% disk usage while
the full import is running.
Since we don't use the "clean" option, is there a way to tell the
full/delta import to apply updates immediately instead of waiting until the
import finishes and then updating everything? That way the full import
would not need to create this temporary copy of the 250 GB.


Kind regards,
Darko Todoric


Re: Solr import doubling space on disk

2018-05-18 Thread Emir Arnautović
Hi Darko,
There is no in-place updating of data in Solr. Data is always written into
a new segment, and if an existing document has the same ID it will be
flagged as deleted but will not be removed until that segment is merged.
While merging, Solr keeps the old segments until the new one is done and
the searcher is updated. So in any case there is a chance that Solr might
need more space than the index; in some extreme cases it can be as much as
three times the size of the index.
I am a bit rusty on DIH, but based on your comment it seems that
full-import builds a temporary index and then switches. Delta import should
update the existing index, and if you can use delta import you should be
safe. With a 250GB index and a maximum segment size of 5GB, you should not
reach 500GB even if you delta-import all documents.
Please note that for full import it is advisable to create a new index so I 
would suggest that you start asking for bigger disks.

HTH,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 18 May 2018, at 14:28, Darko Todoric  wrote:
> 
> Hi guys,
> 
> We have about 250 GB of Solr data on one server, and when we start a full
> import Solr doubles the space used on disk... This is a problem for us
> because the server has a 500 GB SSD and we hit almost 100% disk usage
> while the full import is running.
> Since we don't use the "clean" option, is there a way to tell the
> full/delta import to apply updates immediately instead of waiting until
> the import finishes and then updating everything? That way the full
> import would not need to create this temporary copy of the 250 GB.
> 
> Kind regards,
> Darko Todoric



Re: Solr import doubling space on disk

2018-05-18 Thread Alexandre Rafalovitch
And, as an additional comment:

You may want to index into a completely separate collection and then
do alias switching to point to it when done. That indexing could even
be on a separate machine.
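
A sketch of that alias switch via the Collections API (the collection and
alias names are illustrative): CREATEALIAS atomically repoints an existing
alias, so queries against the alias never see a half-built index.

  # Build the new index in a fresh collection, then repoint the alias the
  # application queries:
  curl 'http://localhost:8983/solr/admin/collections?action=CREATEALIAS&name=products&collections=products_v2'

  # Once nothing references it any more, drop the old collection:
  curl 'http://localhost:8983/solr/admin/collections?action=DELETE&name=products_v1'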

Regards,
   Alex.

On 18 May 2018 at 08:47, Emir Arnautović  wrote:
> Hi Darko,
> There is no in-place updating of data in Solr. Data is always written
> into a new segment, and if an existing document has the same ID it will
> be flagged as deleted but will not be removed until that segment is
> merged. While merging, Solr keeps the old segments until the new one is
> done and the searcher is updated. So in any case there is a chance that
> Solr might need more space than the index; in some extreme cases it can
> be as much as three times the size of the index.
> I am a bit rusty on DIH, but based on your comment it seems that
> full-import builds a temporary index and then switches. Delta import
> should update the existing index, and if you can use delta import you
> should be safe. With a 250GB index and a maximum segment size of 5GB, you
> should not reach 500GB even if you delta-import all documents.
> Please note that for full import it is advisable to create a new index so I 
> would suggest that you start asking for bigger disks.
>
> HTH,
> Emir
> --
> Monitoring - Log Management - Alerting - Anomaly Detection
> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>
>
>
>> On 18 May 2018, at 14:28, Darko Todoric  wrote:
>>
>> Hi guys,
>>
>> We have about 250 GB of Solr data on one server, and when we start a
>> full import Solr doubles the space used on disk... This is a problem for
>> us because the server has a 500 GB SSD and we hit almost 100% disk usage
>> while the full import is running.
>> Since we don't use the "clean" option, is there a way to tell the
>> full/delta import to apply updates immediately instead of waiting until
>> the import finishes and then updating everything? That way the full
>> import would not need to create this temporary copy of the 250 GB.
>>
>> Kind regards,
>> Darko Todoric
>


SOLR: Array Key to Value on Result

2018-05-18 Thread Doss
Hi,

I am looking for a solution. Let's say we have the complete City, State and
Country values in an array (in multiple languages); based on user selection
I will store only the corresponding keys in the index.

If I place the complete array inside the Solr conf or the data directory,
is there any component which can map the key and add the corresponding
value in the result set?

Eg.

Array: {"1":{"EN":"India","TAM":"\u0b87\u0ba8\u0bcd\u0ba4\u0bbf\u0baf\u0bbe"}}

While getting the result, if I pass the country key as 1 and the language
as EN, then the result should have the country value "India".

Please help to crack this.


Thanks,
Doss.
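
There is no such built-in component, but one possible approach is a custom
DocTransformer that performs the lookup at response time. A hedged sketch
against the Solr 7.x API follows (exact DocTransformer signatures differ
across versions); the class, field, and helper names are illustrative.
Registered in solrconfig.xml as
<transformer name="cval" class="com.example.KeyToValueTransformerFactory"/>,
it could be requested with fl=*,country:[cval lang=EN]:

  package com.example;

  import java.io.IOException;

  import org.apache.solr.common.SolrDocument;
  import org.apache.solr.common.params.SolrParams;
  import org.apache.solr.request.SolrQueryRequest;
  import org.apache.solr.response.transform.DocTransformer;
  import org.apache.solr.response.transform.TransformerFactory;

  public class KeyToValueTransformerFactory extends TransformerFactory {

    @Override
    public DocTransformer create(String field, SolrParams params,
                                 SolrQueryRequest req) {
      final String lang = params.get("lang", "EN");
      return new DocTransformer() {
        @Override
        public String getName() { return field; }

        @Override
        public void transform(SolrDocument doc, int docid, float score)
            throws IOException {
          // CountryMap.lookup is a hypothetical helper that loads the JSON
          // array from the config directory once and caches key -> value
          // maps per language.
          Object key = doc.getFirstValue("country_key");
          if (key != null) {
            doc.setField(field, CountryMap.lookup(key.toString(), lang));
          }
        }
      };
    }
  }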


Caching Solr Grouping Results

2018-05-18 Thread rubi.hali
Hi All

Can somebody please explain whether Solr grouping results can be cached in
the query result cache? I don't see any inserts into the query result cache
once we enable grouping.





Solr Cloud config api invoking via solrj

2018-05-18 Thread Natarajan, Rajeswari
Hi,

I would like to change the CDCR configuration dynamically. I think the
SolrCloud Config API can be used for this.
Is there a way to invoke the Config API through SolrJ's CloudSolrClient?


Thanks,
Raji
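
CloudSolrClient can send requests to arbitrary paths, including Config API
commands, via GenericSolrRequest. A minimal sketch against the SolrJ 7.x
API follows (builder methods changed in later versions); the ZooKeeper
address, collection name, and the property being set are illustrative, and
a CDCR-related handler would be adjusted the same way with its own command
payload:

  import java.util.Collections;

  import org.apache.solr.client.solrj.SolrRequest;
  import org.apache.solr.client.solrj.impl.CloudSolrClient;
  import org.apache.solr.client.solrj.request.GenericSolrRequest;
  import org.apache.solr.common.params.ModifiableSolrParams;
  import org.apache.solr.common.util.ContentStreamBase;
  import org.apache.solr.common.util.NamedList;

  public class ConfigApiExample {
    public static void main(String[] args) throws Exception {
      try (CloudSolrClient client = new CloudSolrClient.Builder()
          .withZkHost("zk1:2181,zk2:2181,zk3:2181").build()) {
        // A Config API command body; CDCR settings would use their own
        // commands in the same fashion.
        String body =
            "{ \"set-user-property\": { \"my.custom.property\": \"42\" } }";
        GenericSolrRequest req = new GenericSolrRequest(
            SolrRequest.METHOD.POST, "/config", new ModifiableSolrParams());
        req.setContentStreams(Collections.singleton(
            new ContentStreamBase.StringStream(body)));
        NamedList<Object> rsp = client.request(req, "myCollection");
        System.out.println(rsp);
      }
    }
  }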



Why are port numbers not showing on Solr Admin Cloud Tab

2018-05-18 Thread THADC
Hello,

I created a two-node SolrCloud setup using the internal ZooKeeper. My setup
was as follows:

./bin/solr start -c -p 8983 -s /opt/solrDeployTest/nodeHome/node1/solr/ 

./bin/solr start -c -p 7574 -s /opt/solrDeployTest/nodeHome/node2/solr/

I then created my own configset and loaded it into ZooKeeper:

./bin/solr zk upconfig -n foobar_config -d
server/solr/configsets/foobar_config/ -z localhost:9983

I then created a collection as follows:

:8983/solr/admin/collections?action=CREATE&name=foobar-dev&numShards=2&replicationFactor=2&maxShardsPerNode=4&collection.configName=foobar_config

I am puzzled why I am not seeing the port numbers in the graphical tree of
collections, shards, and replicas on the Cloud tab. In other words, I am
only seeing the full IP for each replica (in my case identical) without the
colon and port number. Any insights are appreciated. Thank you





CDCR sensitive to network failures

2018-05-18 Thread Webster Homer
Recently I encountered some problems with CDCR after we experienced network
problems, so I thought I'd share.

I'm using Solr 7.2.0.
We have 3 SolrCloud instances: we update one cloud and use CDCR to forward
updates to the two SolrClouds that are hosted with a cloud provider.

Usually this works pretty well.
Recently we have experienced some serious but intermittent network issues.
When that occurs, we find that we get tons of CDCR warnings:

CdcrReplicator  Failed to forward update request to target:
bioreliance-catalog-assay
with errors like ClassCastException and/or NullPointerException, etc.

Updates accumulate on the server, and there are tons of errors in the
cdcr?action=errors output:
"2018-05-18T16:11:19.860Z","internal","2018-05-18T16:11:18.860Z","internal",
"2018-05-18T16:11:17.860Z","internal",
When I looked around on the source collection, I found tlog files like this:
-rw-r--r-- 1 apache apache 1376736 May 10 23:04
tlog.141.1600138985674375168
*-rw-r--r-- 1 apache apache   0 May 11 23:05
tlog.143.1600229645842644992*
*-rw-r--r-- 1 apache apache   65458 May 12 07:50
tlog.142.1600229582225539072*
-rw-r--r-- 1 apache apache 1355610 May 18 10:05
tlog.144.1600814785270644736
-rw-r--r-- 1 apache apache 1355610 May 18 10:16
tlog.145.1600815458585411584
-rw-r--r-- 1 apache apache 1355610 May 18 10:21
tlog.146.1600815785277652992
-rw-r--r-- 1 apache apache 1355610 May 18 10:29
tlog.147.1600816282070941696

Note the 0 length file, and the truncated file
tlog.142.1600229582225539072

The solution is to delete these files. Once these files are removed, the
updates start flowing again.

These errors show up as warnings in the log; I would have expected them to
be errors. CDCR doesn't seem able to detect that the tlog is
corrupted.

Hope this helps someone else. If there are better solutions, I'd like to
know.



Re: Why are port numbers not showing on Solr Admin Cloud Tab

2018-05-18 Thread THADC
I decided to simply create an external ZooKeeper ensemble and point my Solr
nodes to it instead of using the built-in ZooKeeper. For whatever reason,
this solved the issue: I now see the port numbers.

Please consider this thread closed.



--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


CDCR setup with Custom Document Routing

2018-05-18 Thread Atita Arora
Hi,

I am about to set up CDCR for a Solr cluster which uses custom document
routing.
Has anyone tried that before?
Are there any caveats to be aware of beforehand?
I will be setting up unidirectional CDCR in Solr 7.3.

Per documentation -

> The current design works most robustly if both the Source and target
> clusters have the same number of shards. There is no requirement that the
> shards in the Source and target collection have the same number of replicas.
> Having different numbers of shards on the Source and target cluster is
> possible, but is also an "expert" configuration as that option imposes
> certain constraints and is not recommended. Most of the scenarios where
> having differing numbers of shards are contemplated are better accomplished
> by hosting multiple shards on each target Solr instance.


I am particularly curious to know how this would fare if that
recommendation isn't followed.
I would highly appreciate any pointers on this.

Sincerely,
Atita


RE: Multiple languages, boosting and, stemming and KeywordRepeat

2018-05-18 Thread Markus Jelsma
Hi Alessandro,

I looked at the parsed_query again and spotted something that could be the
problem. We extend ExtendedDismaxQParser for payload support among other
things. I suspect something is going wrong with rewriting the clauses of
SynonymQuery there.

Thanks for letting me look at that part again; I clearly missed it the last
time.

Thanks,
Markus
 
 
-Original message-
> From:Alessandro Benedetti 
> Sent: Friday 18th May 2018 12:54
> To: solr-user@lucene.apache.org
> Subject: Re: Multiple languages, boosting and, stemming and KeywordRepeat
> 
> Hi Markus,
> Can you show all the query parameters used when submitting the request to
> the request handler?
> Can you also include the parsed query (in the debug output)?
> 
> I am curious to investigate this case.
> 
> Cheers
> 
> --
> Alessandro Benedetti
> Search Consultant, R&D Software Engineer, Director
> www.sease.io
> 
> On Thu, May 17, 2018 at 10:53 PM, Markus Jelsma 
> wrote:
> 
> > Hello,
> >
> > And sorry to disturb again. Do any of you have any meaningful opinion
> > on this peculiar matter? The RemoveDuplicates filter exists for a reason,
> > but with query-time KeywordRepeat filter it causes trouble in some cases.
> > Is it normal for the clauses to be absent in the debug output, but the
> > boost doubled in value?
> >
> > I like this behaviour, but is it a side effect that is considered a bug in
> > later versions? And where is the documentation on this? I cannot find
> > anything in the Lucene or Solr Javadocs, or the reference manual.
> >
> > Many thanks, again,
> > Markus
> >
> >
> >
> > -Original message-
> > > From:Markus Jelsma 
> > > Sent: Wednesday 9th May 2018 17:39
> > > To: solr-user 
> > > Subject: Multiple languages, boosting and, stemming and KeywordRepeat
> > >
> > > Hello,
> > >
> > > First, apologies for the weird subject line.
> > >
> > > We index many languages and search over all those languages at once, but
> > boost the language of the user's preference. To differentiate between
> > stemmed tokens and unstemmed tokens we use KeywordRepeat and
> > RemoveDuplicates, this works very well.
> > >
> > > However, we just stumbled over the following example, q=australia is not
> > stemmed in English, but its suffix is removed by the Romanian stemmer,
> > causing the Romanian results to be returned on top of English results,
> > despite language boosting.
> > >
> > > This is because the Romanian part of the query consists of the stemmed
> > and unstemmed version of the word, but the English part of the query is
> > just one clause per field (title, content etc). Thus the Romanian results
> > score roughly twice that of English results.
> > >
> > > Now, this is of course really obvious, but the 'solution' is not. To
> > work around the problem I removed the RemoveDuplicates filter so I get
> > two clauses for English as well; really ugly, but it works. What I don't
> > understand is the debug output: it doesn't list two identical clauses;
> > instead, it doubles the boost on the field. So instead of:
> > >
> > > 27.048403 = PayloadSpanQuery, product of:
> > >   27.048403 = weight(title_en:australia in 15850)
> > [SchemaSimilarity], result of:
> > > 27.048403 = score(doc=15850,freq=4.0 = phraseFreq=4.0
> > > ), product of:
> > >   7.4 = boost
> > >   3.084852 = idf(docFreq=14539, docCount=317894)
> > >   1.1848832 = tfNorm, computed as (freq * (k1 + 1)) / (freq + k1
> > * (1 - b + b * fieldLength / avgFieldLength)) from:
> > > 4.0 = phraseFreq=4.0
> > > 0.3 = parameter k1
> > > 0.5 = parameter b
> > > 15.08689 = avgFieldLength
> > > 24.0 = fieldLength
> > >   1.0 = AveragePayloadFunction.docScore()
> > >
> > > I now get
> > >
> > > 54.096806 = PayloadSpanQuery, product of:
> > >   54.096806 = weight(title_en:australia in 15850)
> > [SchemaSimilarity], result of:
> > > 54.096806 = score(doc=15850,freq=4.0 = phraseFreq=4.0
> > > ), product of:
> > >   14.8 = boost
> > >   3.084852 = idf(docFreq=14539, docCount=317894)
> > >   1.1848832 = tfNorm, computed as (freq * (k1 + 1)) / (freq + k1
> > * (1 - b + b * fieldLength / avgFieldLength)) from:
> > > 4.0 = phraseFreq=4.0
> > > 0.3 = parameter k1
> > > 0.5 = parameter b
> > > 15.08689 = avgFieldLength
> > > 24.0 = fieldLength
> > >   1.0 = AveragePayloadFunction.docScore()
> > >
> > > So instead of expecting two clauses in the debug, I get one but with a
> > doubled boost.
> > >
> > > The question is, is this supposed to be like this?
> > >
> > > Also, are there any real solutions to this problem? Removing the
> > RemoveDuplicates filter looks really silly.
> > >
> > > Many thanks!
> > > Markus
> > >
> >
> 


Getting more documents from resultsSet

2018-05-18 Thread root23
Hi all,
I am working on Solr 6. Our business requirement is that we need to return
2000 docs for every query we execute.
Normally, if I execute the same set of queries with start=0 and rows=10,
they return very fast (even for our most complex queries, in less than 3
seconds).
However, the moment I use start=0 and rows=2000, the response time is
around 30 seconds.

I understand that Solr probably has to do disk seeks to fetch the
documents, which might be the bottleneck in this case.

Is there a way I can optimize around this, knowing that I might have to get
2000 results in one go and might also have to paginate further, showing
2000 results on each page? We could go as deep as 50 pages.





Re: Getting more documents from resultsSet

2018-05-18 Thread Erick Erickson
If you only return fields that are docValues=true, that'll largely
eliminate the disk seeks. 30 seconds does seem kind of excessive even
with disk seeks though.

Here's a reference: https://lucene.apache.org/solr/guide/6_6/docvalues.html

Whenever I see anything like "...our business requirement is...", I
cringe. _Why_ is that a requirement? What is being done _for the user_
that requires 2000 documents? There may be legitimate reasons, but
there also may be better ways to get what you need. This may very well
be an XY problem.

For instance, if you want to take the top 2,000 docs from query X and
score just those, see:
https://lucene.apache.org/solr/guide/6_6/query-re-ranking.html,
specifically: ReRankQParserPlugin.
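
For illustration, both suggestions in query form; the collection, field,
and query strings are illustrative, and -g stops curl from globbing the
local-params braces:

  # Fetch only docValues fields to avoid stored-field disk seeks
  # (the requested fields must have docValues=true):
  curl 'http://localhost:8983/solr/mycoll/select?q=*:*&fl=id,price&rows=2000'

  # Score only the top 2000 docs of a cheap main query with a costlier one:
  curl -g 'http://localhost:8983/solr/mycoll/select?q=cheap+query&rqq=expensive+query&rq={!rerank+reRankQuery=$rqq+reRankDocs=2000+reRankWeight=3}'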

Best,
Erick

On Fri, May 18, 2018 at 11:09 AM, root23  wrote:
> Hi all,
> I am working on Solr 6. Our business requirement is that we need to
> return 2000 docs for every query we execute.
> Normally, if I execute the same set of queries with start=0 and rows=10,
> they return very fast (even for our most complex queries, in less than 3
> seconds).
> However, the moment I use start=0 and rows=2000, the response time is
> around 30 seconds.
>
> I understand that Solr probably has to do disk seeks to fetch the
> documents, which might be the bottleneck in this case.
>
> Is there a way I can optimize around this, knowing that I might have to
> get 2000 results in one go and might also have to paginate further,
> showing 2000 results on each page? We could go as deep as 50 pages.
>
>
>


Index filename while indexing JSON file

2018-05-18 Thread S.Ashwath
Hello,

I have 2 directories: one with txt files and the other with corresponding
JSON (metadata) files (around 9 of each). There is one JSON file for
each txt file, and they share the same name (they don't share any other
fields).

The txt files just have plain text. I mapped each line to a field called
'sentence' and included the file name as a field using the data import
handler. No problems here.

The JSON file has metadata: 3 tags: a URL, author and title (for the
content in the corresponding txt file).
When I index the JSON file (I just used the _default schema and posted the
fields to the schema, as explained in the official Solr tutorial), *I don't
know how to get the file name into the index as a field.* As far as I know,
there's no way to use the Data Import Handler for JSON files. I've read that
I can pass a literal through the bin/post tool, but again, as far as I
understand, I can't pass in the file name dynamically as a literal.

I NEED to get the file name; it is the only way in which I can associate
the metadata with each sentence in the txt files in my downstream Python
code.

So if anybody has a suggestion about how I should index the JSON file name
along with the JSON content (or even some workaround), I'd be eternally
grateful.

Regards,

Ash
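
One workaround sketch (paths, collection name, and field name are
illustrative): inject each metadata file's own name as an extra field
before posting, so no literal is needed at all:

  # Add a "filename" field to each JSON doc with jq, then post it:
  for f in /data/json/*.json; do
    jq --arg fn "$(basename "$f")" '. + {filename: $fn}' "$f" |
      curl -s 'http://localhost:8983/solr/mycoll/update/json/docs' \
           -H 'Content-Type: application/json' --data-binary @-
  done
  # Commit once at the end:
  curl 'http://localhost:8983/solr/mycoll/update?commit=true'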


Re: Getting more documents from resultsSet

2018-05-18 Thread Deepak Goel
I wonder if an in-memory filesystem would help...

On Sat, 19 May 2018, 01:03 Erick Erickson,  wrote:

> If you only return fields that are docValues=true, that'll largely
> eliminate the disk seeks. 30 seconds does seem kind of excessive even
> with disk seeks though.
>
> Here's a reference:
> https://lucene.apache.org/solr/guide/6_6/docvalues.html
>
> Whenever I see anything like "...our business requirement is...", I
> cringe. _Why_ is that a requirement? What is being done _for the user_
> that requires 2000 documents? There may be legitimate reasons, but
> there also may be better ways to get what you need. This may very well
> be an XY problem.
>
> For instance, if you want to take the top 2,000 docs from query X and
> score just those, see:
> https://lucene.apache.org/solr/guide/6_6/query-re-ranking.html,
> specifically: ReRankQParserPlugin.
>
> Best,
> Erick
>
> On Fri, May 18, 2018 at 11:09 AM, root23  wrote:
> > Hi all,
> > I am working on Solr 6. Our business requirement is that we need to
> > return 2000 docs for every query we execute.
> > Normally, if I execute the same set of queries with start=0 and rows=10,
> > they return very fast (even for our most complex queries, in less than
> > 3 seconds).
> > However, the moment I use start=0 and rows=2000, the response time is
> > around 30 seconds.
> >
> > I understand that Solr probably has to do disk seeks to fetch the
> > documents, which might be the bottleneck in this case.
> >
> > Is there a way I can optimize around this, knowing that I might have to
> > get 2000 results in one go and might also have to paginate further,
> > showing 2000 results on each page? We could go as deep as 50 pages.
> >
> >
> >
>


Re: Getting more documents from resultsSet

2018-05-18 Thread Pratik Patel
Using a cursor mark might help, as explained in this documentation:
https://lucene.apache.org/solr/guide/6_6/pagination-of-results.html
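
A sketch of deep paging with a cursor (the collection and sort fields are
illustrative); the sort must include the uniqueKey field as a tiebreaker,
and start must stay at 0:

  # First page: cursorMark=* starts the cursor:
  curl 'http://localhost:8983/solr/mycoll/select?q=*:*&rows=2000&sort=score+desc,id+asc&cursorMark=*'

  # Subsequent pages: pass the nextCursorMark value returned by the
  # previous response (the token below is a placeholder):
  curl 'http://localhost:8983/solr/mycoll/select?q=*:*&rows=2000&sort=score+desc,id+asc&cursorMark=AoEjR0JQ'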

On Fri, May 18, 2018 at 4:13 PM, Deepak Goel  wrote:

> I wonder if an in-memory filesystem would help...
>
> On Sat, 19 May 2018, 01:03 Erick Erickson, 
> wrote:
>
> > If you only return fields that are docValues=true, that'll largely
> > eliminate the disk seeks. 30 seconds does seem kind of excessive even
> > with disk seeks though.
> >
> > Here's a reference:
> > https://lucene.apache.org/solr/guide/6_6/docvalues.html
> >
> > Whenever I see anything like "...our business requirement is...", I
> > cringe. _Why_ is that a requirement? What is being done _for the user_
> > that requires 2000 documents? There may be legitimate reasons, but
> > there also may be better ways to get what you need. This may very well
> > be an XY problem.
> >
> > For instance, if you want to take the top 2,000 docs from query X and
> > score just those, see:
> > https://lucene.apache.org/solr/guide/6_6/query-re-ranking.html,
> > specifically: ReRankQParserPlugin.
> >
> > Best,
> > Erick
> >
> > On Fri, May 18, 2018 at 11:09 AM, root23  wrote:
> > > Hi all,
> > > I am working on Solr 6. Our business requirement is that we need to
> > > return 2000 docs for every query we execute.
> > > Normally, if I execute the same set of queries with start=0 and
> > > rows=10, they return very fast (even for our most complex queries, in
> > > less than 3 seconds).
> > > However, the moment I use start=0 and rows=2000, the response time is
> > > around 30 seconds.
> > >
> > > I understand that Solr probably has to do disk seeks to fetch the
> > > documents, which might be the bottleneck in this case.
> > >
> > > Is there a way I can optimize around this, knowing that I might have
> > > to get 2000 results in one go and might also have to paginate further,
> > > showing 2000 results on each page? We could go as deep as 50 pages.
> > >
> > >
> > >
> >
>