Synonyms relationships

2018-10-31 Thread Nicolas Paris
Hi

Does Solr provide a way to describe synonym relationships such as
"equivalent to", "narrower than", "broader than"?

It turns out that both Postgres and Oracle do, but I can't find any related
information in the Solr documentation.

This is useful to decide whether or not to generalize the search terms.
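For context, the closest built-in mechanism I found is the plain synonyms.txt
format used by the synonym filters, where a comma-separated line declares
equivalent terms and "=>" declares a one-way mapping that could roughly
emulate "narrower than" (a sketch, with made-up terms):

car, automobile
poodle => poodle, dog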

Thanks,


-- 
nicolas


Pagination Graph/SQL

2018-12-02 Thread Nicolas Paris
Hi

Solr pagination is incredible: you can provide the end user with a small set
of results together with the total number of documents found (numFound).

I am wondering if the "parallel SQL" and "Graph Traversal" features also
provide a pagination mechanism.

As an example, the SQL below:
SELECT id
FROM table
LIMIT 10

would then return 10 ids together with the total number of documents in
the table.
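For instance, something like this against a hypothetical "table" collection (a
sketch; as far as I can tell a separate COUNT(*) statement would be needed,
since LIMIT alone does not return a total):

curl http://localhost:8983/solr/table/sql --data-urlencode 'stmt=SELECT id FROM table LIMIT 10'
curl http://localhost:8983/solr/table/sql --data-urlencode 'stmt=SELECT COUNT(*) FROM table'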

Thanks,


-- 
nicolas


Highlighting Parent Documents

2018-12-09 Thread Nicolas Paris
Hi

I have read here [1] and here [2] that it is possible to highlight only
parent documents in block join queries, but I have not succeeded yet.

So here is my nested document example:
[
  {
    "id": "2",
    "type_s": "parent",
    "content_txt": ["apache"],
    "_childDocuments_": [
      {
        "id": "1",
        "type_s": "child",
        "content_txt": ["solr"]
      }
    ]
  }
]

Here is my query (= give me documents whose parent contains the term
"apache"):
curl http://localhost:8983/solr/col/query -d '
fl=id
&hl=on
&hl.fl=*
&q={!child of="type_s:parent"}type_s:parent AND content_txt:apache'

And here is the result:
{
...
  "response":{"numFound":1,"start":0,"docs":[
  {
"id":"1"}]
  },
  "highlighting":{
"1":{}}}


I was hoping to get this (= doc 1 returned, and doc 2, the parent, highlighted):

{
...
  "response":{"numFound":1,"start":0,"docs":[
  {
"id":"1"}]
  },
  "highlighting":{
"2":{
  "content_txt":["apache"]}}}


[1] 
http://lucene.472066.n3.nabble.com/Fwd-Standard-highlighting-doesn-t-work-for-Block-Join-td4260784.html
[2] 
http://lucene.472066.n3.nabble.com/highlighting-on-child-document-td4238236.html


Thanks in advance,

-- 
nicolas


Search only for single value of Solr multivalue field (part 2)

2018-12-16 Thread Nicolas Paris
hi

This question is highly related to a previous one found on the
mailing-list archive [1].

I have this document:

"content_txt":["001 first","002 second"]
I'd like the query below to return nothing:
> q=content_txt:(first AND second)

The method Erick proposed in [1] works fine for requiring BOTH first AND
second to occur within a single value, by setting the field's
positionIncrementGap high enough.

This query returns nothing as expected:
> q=content_txt:("first second"~99)
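For reference, the field type behind this looks roughly like the following (a
sketch via the Schema API; the name and gap value are mine, the gap just has
to be larger than the slop used above):

curl http://localhost:8983/solr/col/schema -H 'Content-Type: application/json' -d '{
  "add-field-type": {
    "name": "text_big_gap",
    "class": "solr.TextField",
    "positionIncrementGap": "10000",
    "analyzer": {
      "tokenizer": { "class": "solr.StandardTokenizerFactory" },
      "filters": [ { "class": "solr.LowerCaseFilterFactory" } ]
    }
  }
}'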


However, this is based on *phrase search*, and phrase search does not allow
the use of the standard query parser features below. That's a _HUGE_ limitation!
- regexp
- fuzzy
- wildcard
- ranges

So the query below won't match the first value:
> q=content_txt:("[000 TO 001] first"~99)
While this one does match the second value and shouldn't!
> q=content_txt:([000 TO 001] AND "second")

QUESTION:
Is there a chance such a feature will be developed in a future Solr version? I
mean something that allows the values of a multivalued field to be considered
independently. Would a new field attribute such as independentMultivalued=true
make sense?

Thanks,


[1]: 
http://lucene.472066.n3.nabble.com/Search-only-for-single-value-of-Solr-multivalue-field-td4309850.html#a4309893

-- 
nicolas


Re: Search only for single value of Solr multivalue field (part 2)

2018-12-16 Thread Nicolas Paris
On Sun, Dec 16, 2018 at 09:30:33AM -0800, Erick Erickson wrote:
> Have you looked at ComplexPhraseQueryParser here?
> https://lucene.apache.org/solr/guide/6_6/other-parsers.html

Sure. However, I am using multi-word synonyms, and so far complexphrase does
not handle them (maybe soon?).

> Depending on how many of these you have, you could do something with
> dynamic fields. Rather than use a single MV field, use N fields. You'd
> probably have to copyField or some such to a catch-all field for
> searches that you wanted to ignore the "mv nature" of the field. 

The problem is that a copyField fed from multiple fields acts like an MV
field. So the problem remains: dealing with MV fields. Doesn't it?

Thanks

-- 
nicolas


Re: Search only for single value of Solr multivalue field (part 2)

2018-12-17 Thread Nicolas Paris
On Sun, Dec 16, 2018 at 05:44:30PM -0800, Erick Erickson wrote:
> No, the idea is that you have N single valued fields, one for each of
> the MV entries you have. The copyField dest would be MV, and only used
> in those cases you wanted to match across values. Not saying that's a
> great solution, or if it would even necessarily work but thought it
> worth mentioning.

Ok, then my initial document with MV fields:
> "content_txt":["001 first","002 second"]
would become:
> "content1_t":"001 first"
> "content2_t":"002 second"
> "_copiedfield_":["001 first","002 second"]

And then the initial user query:
> content_txt:(first AND second)
would become:
> content1_t:(first AND second) OR content2_t:(first AND second)


Depending on the length of the initial array, each document will have a
different number of contentx_t fields. This requires some management, such as
a layer between the user and the parser that expands the query over the
maximum possible number of contentx_t fields in the collection (capped at,
say, 100 for performance reasons?).


QUESTION:

Is the MV limitation a *Solr parser* limitation or a *Lucene* limitation? If
it is the latter, writing my own parser would be an option, wouldn't it?


-- 
nicolas


MoreLikeThis & Synonyms

2018-12-26 Thread Nicolas Paris
Hi

It turns out that the MoreLikeThis handler does not use query-time synonym
expansion.

It is only compatible with index-time synonyms.

However, multi-word synonyms are only compatible with query-time synonym
expansion.

For this reason, it is not possible to use multi-word synonyms together with
the MoreLikeThis handler.

Is there any reason the MoreLikeThis feature is not compatible with multi-word
synonyms?

Thanks
-- 
nicolas


Re: MoreLikeThis & Synonyms

2018-12-27 Thread Nicolas Paris
On Wed, Dec 26, 2018 at 09:09:02PM -0800, Erick Erickson wrote:
> bq. However multiword synonyms are only compatible with queryTime synonyms
> expansion.
> 
> Why do you say this? What version of Solr? Query-time mult-word
> synonyms were _added_, but AFAIK the capability of multi-word synonyms
> was not taken away. 

From this blog post [1] I deduced that multi-word synonyms are only compatible
with query-time expansion.

> Or are you saying that MLT doesn't play nice at all with multi-word
> synonyms?

From my tests, MLT does not expand the query with synonyms. So it is not
possible to use query-time synonyms, nor multi-word ones. Only index-time
synonyms are possible, with the limitations they have [1].

> What version of Solr are you using?

I am running solr 7.6.

[1] 
https://lucidworks.com/2017/04/18/multi-word-synonyms-solr-adds-query-time-support/

-- 
nicolas


edismax: sorting on numeric fields

2019-02-14 Thread Nicolas Paris
Hi

I have a numeric field (say "weight") and I'd like to be able to get sorted
results:
q=kind:animal weight:50
pf=kind^2 weight^3

would return:
name=dog, kind=animal, weight=51
name=tiger, kind=animal,weight=150
name=elephant, kind=animal,weight=2000


In other words, how should I deal with numeric fields?

My first idea is to encode the numeric value into letters (one x per value):
dog x
tiger x
elephant 


and the query would be
kind:animal, weight:xxx


How should I deal with numeric fields?

Thanks
-- 
nicolas


Re: edismax: sorting on numeric fields

2019-02-16 Thread Nicolas Paris
Hi

Thanks.
To clarify, I don't want to sort by numeric fields; instead, I'd like the
results sorted by distance to my query.


On Thu, Feb 14, 2019 at 06:20:19PM -0500, Gus Heck wrote:
> Hi Niclolas,
> 
> Solr has no difficulty sorting on numeric fields if they are indexed as a
> numeric type. Just use "&sort=weight asc" If you're field is indexed as
> text of course it won't sort properly, but then you should fix your schema.
> 
> -Gus
> 
> On Thu, Feb 14, 2019 at 4:10 PM David Hastings 
> wrote:
> 
> > Not clearly understanding your question here.  if your query is
> > q=kind:animal weight:50 you will get no results, as nothing matches
> > (assuming a q.op of AND)
> >
> >
> > On Thu, Feb 14, 2019 at 4:06 PM Nicolas Paris 
> > wrote:
> >
> > > Hi
> > >
> > > I have a numeric field (say "weight") and I d'like to be able to get
> > > results sorted.
> > > q=kind:animal weight:50
> > > pf=kind^2 weight^3
> > >
> > > would return:
> > > name=dog, kind=animal, weight=51
> > > name=tiger, kind=animal,weight=150
> > > name=elephant, kind=animal,weight=2000
> > >
> > >
> > > In other terms how to deal with numeric fields ?
> > >
> > > My first idea is to encode numeric into letters (one x per value)
> > > dog x
> > > tiger x
> > > elephant
> > >
> > 
> > >
> > > and the query would be
> > > kind:animal, weight:xxx
> > >
> > >
> > > How to deal with numeric fields ?
> > >
> > > Thanks
> > > --
> > > nicolas
> > >
> >
> 
> 
> -- 
> http://www.the111shift.com

-- 
nicolas


Re: edismax: sorting on numeric fields

2019-02-17 Thread Nicolas Paris
hi

On Sun, Feb 17, 2019 at 09:40:43AM +0100, Jan Høydahl wrote:
> q=kind:animal&wantedweight=50&sort=abs(sub(weight,wantedweight)) asc

The *sort function* looks like a great way to go.

I did not mention clearly that I'd like to use edismax and add extra weight to
some fields (such as *description* in the example below). So *sorting* alone
does not look sufficient.

After playing around with distance functions it turns out this one looks
quite sufficient:
?defType=edismax&q=kind:animal&carnivore&qf=description^3&boost=div(1,abs(sub(weight,50)))

this means: "give me all animals having carnivore in the description, and
order them given their weight is near 50 kg and carnivore is relevant
in the description"


Does this make sense?



On Sun, Feb 17, 2019 at 09:40:43AM +0100, Jan Høydahl wrote:
> q=kind:animal&wantedweight=50&sort=abs(sub(weight,wantedweight)) asc
> 
> Jan Høydahl
> 
> > 16. feb. 2019 kl. 17:08 skrev Dave :
> > 
> > Sounds like you need to use code and post process your results as it sounds 
> > too specific to your use case. Just my opinion, unless you want to get into 
> > spacial queries which is a whole different animal and something I don’t 
> > think many have experience with, including myself 
> > 
> >> On Feb 16, 2019, at 10:10 AM, Nicolas Paris  
> >> wrote:
> >> 
> >> Hi
> >> 
> >> Thanks.
> >> To clarify, I don't want to sort by numeric fields, instead, I d'like to
> >> get sort by distance to my query.
> >> 
> >> 
> >>> On Thu, Feb 14, 2019 at 06:20:19PM -0500, Gus Heck wrote:
> >>> Hi Niclolas,
> >>> 
> >>> Solr has no difficulty sorting on numeric fields if they are indexed as a
> >>> numeric type. Just use "&sort=weight asc" If you're field is indexed as
> >>> text of course it won't sort properly, but then you should fix your 
> >>> schema.
> >>> 
> >>> -Gus
> >>> 
> >>> On Thu, Feb 14, 2019 at 4:10 PM David Hastings 
> >>> 
> >>> wrote:
> >>> 
> >>>> Not clearly understanding your question here.  if your query is
> >>>> q=kind:animal weight:50 you will get no results, as nothing matches
> >>>> (assuming a q.op of AND)
> >>>> 
> >>>> 
> >>>> On Thu, Feb 14, 2019 at 4:06 PM Nicolas Paris 
> >>>> wrote:
> >>>> 
> >>>>> Hi
> >>>>> 
> >>>>> I have a numeric field (say "weight") and I d'like to be able to get
> >>>>> results sorted.
> >>>>> q=kind:animal weight:50
> >>>>> pf=kind^2 weight^3
> >>>>> 
> >>>>> would return:
> >>>>> name=dog, kind=animal, weight=51
> >>>>> name=tiger, kind=animal,weight=150
> >>>>> name=elephant, kind=animal,weight=2000
> >>>>> 
> >>>>> 
> >>>>> In other terms how to deal with numeric fields ?
> >>>>> 
> >>>>> My first idea is to encode numeric into letters (one x per value)
> >>>>> dog x
> >>>>> tiger x
> >>>>> elephant
> >>>>> 
> >>>> 
> >>>>> 
> >>>>> and the query would be
> >>>>> kind:animal, weight:xxx
> >>>>> 
> >>>>> 
> >>>>> How to deal with numeric fields ?
> >>>>> 
> >>>>> Thanks
> >>>>> --
> >>>>> nicolas
> >>>>> 
> >>>> 
> >>> 
> >>> 
> >>> -- 
> >>> http://www.the111shift.com
> >> 
> >> -- 
> >> nicolas
> 

-- 
nicolas


query bag of word with negation

2018-04-22 Thread Nicolas Paris
Hello

I wonder if there is a plain-text query syntax to say:
give me all documents that match:

wonderful pizza NOT peperoni

with all those terms within a 5-word window.
Then:

pizza are wonderful -> would match
I made a wonderful pasta and pizza -> would match
Peperoni pizza are so wonderful -> would not match

I tested:
"wonderful pizza - peperoni"~5
without success

Thanks


Re: query bag of word with negation

2018-04-22 Thread Nicolas Paris
  1. Query terms containing other than just letters or digits may be placed
>> within double quotes so that  those other characters do not separate a term
>> into many terms. A dot (period) and white space are neither  letter nor
>> digit. Examples: "Now is the time for all good men" (spaces, quotes impose
>> ordering too), "goods.doc" (a dot).
>
>

> 2. Mode button "or" (the default) means match one or more terms, perhaps
>> scattered about. Mode button "and" means must match all terms, scattered or
>> not.
>
>

> 3. A one word query term may be prefixed by title: or url: to search on
>> those fields. A space must follow the colon, and the search term is case
>> sensitive. Examples: url: .ppt or title: Goodies. Many docs do not have a
>> formal internal title field, thus prefix title: may not work.
>
>

> 4. Compound queries can be built by joining terms with and or - and group
>> items with ( ). Not is expressed as a minus sign prefixing a term. A bare
>> space means use the Mode (or, and). Example: Nancy and Mary and -Jane and
>> -(Robert Daniel) which means both the first two and not Jane and neither of
>> the two guys.
>
>



5. A query of asterisk/star (*) means match everything. Examples: * for
>> everything (zero or more characters). Fussy, show all without term .pdf *
>> and -".pdf" For normal queries the program uses the edismax interface. A
>> few, such as url: foobar, reference the Lucene interface. This is specified
>> by the qagent= parameter, of edismax or empty respectively, in a search
>> request. Thus regular facilities can do most of this work.
>
>


> What this example does not address is your distance 5 critera. However,
>> the NOT facility may do the trick for you, though a minus sign is taken as
>> a literal minus sign or word separator if located within a quoted string.
>
>
Indeed, sadly, words can be anywhere in the document (no notion of distance).

Thanks, Joe D.
>
>
Thanks for the 5 details anyway.


Re: query bag of word with negation

2018-04-22 Thread Nicolas Paris
Hello Markus

Thanks !

The ComplexPhraseQueryParser syntax:
q={!complexphrase inOrder=false}collector:"wonderful pizza -peperoni"~5
answers my needs.

BTW,
Apparently it accepts both leading and trailing wildcards; that looks like a
powerful feature.

Any chance it would support "sow=false" in order to combine it with multi-word
synonyms?


2018-04-22 21:11 GMT+02:00 Markus Jelsma :

> Hello Nicolas,
>
> Yes you can! Check out ComplexPhaseQParser
> https://lucene.apache.org/solr/guide/6_6/other-parsers.html#OtherParsers-
> ComplexPhraseQueryParser
>
> Regards,
> Markus
>
>
>
> -Original message-
> > From:Nicolas Paris 
> > Sent: Sunday 22nd April 2018 20:04
> > To: solr-user@lucene.apache.org
> > Subject: query bag of word with negation
> >
> > Hello
> >
> > I wonder if there is a plain text query syntax to say:
> > give me all document that match:
> >
> > wonderful pizza NOT peperoni
> >
> > all those in a 5 distance word bag
> > then
> >
> > pizza are wonderful -> would match
> > I made a wonderful pasta and pizza -> would match
> > Peperoni pizza are so wonderful -> would not match
> >
> > I tested:
> > "wonderful pizza - peperoni"~5
> > without success
> >
> > Thanks
> >
>


Re: Is anybody using UIMA with Solr?

2018-06-19 Thread Nicolas Paris
Hi

Not really a direct answer - I have never used it; however, this feature was
attractive to me when I first looked at UIMA.

Right now, I would say UIMA connectors in general are by design a pain to
maintain. Sources and targets often have optimized ways to bulk export/import
data. For example, using a JDBC PostgreSQL connector is a bad idea compared to
using the optimized COPY command, and each database has its own optimized way
of doing this.

That's why the developers of UIMA should focus on improving what UIMA is good
at: processing text. The responsibility for exporting and importing text
should remain with the other tools.

Tell me if I am wrong.

2018-06-18 13:13 GMT+02:00 Alexandre Rafalovitch :

> Hi,
>
> Solr ships an UIMA component and examples that haven't worked for a
> while. Details are in:
> https://issues.apache.org/jira/browse/SOLR-11694
>
> The choices for developers are:
> 1) Rip UIMA out (and save space)
> 2) Update UIMA to latest 2.x version
> 3) Update UIMA to super-latest possibly-breaking 3.x
>
> The most likely choice at this point is 1. But I am curious (given
> that UIMA is in IBM Watson...) if anybody actually has a use-case that
> strongly votes for options 2 or 3, given that the update effort is
> probably not trivial.
>
> Note that if you use UIMA with Solr, but in a configuration completely
> different from that shipped (so the options 2/3 would still be
> irrelevant), it could be still fun to share the knowledge in this
> thread, with the appropriate disclaimer.
>
> Regards,
>Alex.
>


Re: Is anybody using UIMA with Solr?

2018-06-19 Thread Nicolas Paris
Sorry, I thought I was on the UIMA mailing list.
That being said, my position is the same:

let the UIMA folks load data into Solr using the most optimized way.
(What would be the best way? Loading JSON?)
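As a sketch of what I have in mind (collection and field names made up),
plain JSON batches posted to the update handler:

curl 'http://localhost:8983/solr/col/update?commit=true' \
  -H 'Content-Type: application/json' \
  -d '[{"id": "1", "text_txt": "first annotated document"},
       {"id": "2", "text_txt": "second annotated document"}]'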

2018-06-19 22:48 GMT+02:00 Nicolas Paris :

> Hi
>
> Not realy a direct answer - Never used it, however this feature have
> been attractive to me while first looking at uima.
>
> Right now, I would say UIMA connectors in general are by design
> a pain to maintain. Source and target often do have optimised
> way to bulk export/import data. For example, using a jdbc postgresql
> connector is a bad idea compared to using the optimzed COPY function.
> And each database has it's own optimized way of doing.
>
> That's why developpers of UIMA should focus on  improving what UIMA
> is good at: processing texts.
> Exporting and importing texts responsibility should remain to the other
> tools.
>
> Tell me if i am wrong
>
> 2018-06-18 13:13 GMT+02:00 Alexandre Rafalovitch :
>
>> Hi,
>>
>> Solr ships an UIMA component and examples that haven't worked for a
>> while. Details are in:
>> https://issues.apache.org/jira/browse/SOLR-11694
>>
>> The choices for developers are:
>> 1) Rip UIMA out (and save space)
>> 2) Update UIMA to latest 2.x version
>> 3) Update UIMA to super-latest possibly-breaking 3.x
>>
>> The most likely choice at this point is 1. But I am curious (given
>> that UIMA is in IBM Watson...) if anybody actually has a use-case that
>> strongly votes for options 2 or 3, given that the update effort is
>> probably not trivial.
>>
>> Note that if you use UIMA with Solr, but in a configuration completely
>> different from that shipped (so the options 2/3 would still be
>> irrelevant), it could be still fun to share the knowledge in this
>> thread, with the appropriate disclaimer.
>>
>> Regards,
>>Alex.
>>
>
>


Re: Storage/Volume type for Kubernetes Solr POD?

2020-02-07 Thread Nicolas PARIS
hi all

what about CephFS or Lustre distributed filesystems for such a purpose?


Karl Stoney  writes:

> we personally run solr on google cloud kubernetes engine and each node has a 
> 512Gb persistent ssd (network attached) storage which gives roughly this 
> performance (read/write):
>
> Sustained random IOPS limit 15,360.00 15,360.00
> Sustained throughput limit (MB/s) 245.76  245.76
>
> and we get very good performance.
>
> ultimately though it's going to depend on your workload
> 
> From: Susheel Kumar 
> Sent: 06 February 2020 13:43
> To: solr-user@lucene.apache.org 
> Subject: Storage/Volume type for Kubernetes Solr POD?
>
> Hello,
>
> Whats type of storage/volume is recommended to run Solr on Kubernetes POD?
> I know in the past Solr has issues with NFS storing its indexes and was not
> recommended.
>
> https://kubernetes.io/docs/concepts/storage/volumes/
>
> Thanks,
> Susheel


-- 
nicolas paris


multivalue faceting term optimization

2020-03-09 Thread Nicolas Paris
Hello,


Environment:
- SolrCloud 8.4.1
- 4 shards with Xmx = 120 GB and SSD disks
- 50M documents / 40 GB on disk per shard
- mainly large text fields, plus one multivalued/docValues/indexed string
field of 15 values per document


Goal:
I want to provide a terms facet on a multivalued string field. This lets the
client build a dynamic word cloud depending on filter queries. The words in
the array are extracted with TF-IDF from the large raw texts of neighbouring
fields.


Behavior:
The computing time is below 2 seconds when the FQ is selective enough (<2M
documents). However, it results in a timeout when the FQ matches >2M documents.


Question:
How can I improve the raw performance?
I tried:
- facet.limit / facet.threads / facet.method
- limiting the multivalue size (from 50 elements per document down to 15)
Is there any parameter I am missing?
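For reference, the request I issue looks roughly like this (field and
collection names are mine):

curl http://localhost:8983/solr/col/select \
  --data-urlencode 'q=*:*' \
  --data-urlencode 'fq=specialty_s:cardiology' \
  --data-urlencode 'rows=0' \
  --data-urlencode 'facet=true' \
  --data-urlencode 'facet.field=keywords_ss' \
  --data-urlencode 'facet.limit=200' \
  --data-urlencode 'facet.method=uif'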


Thought:
If there is no way to speed up the brute-force task, I guess I could
artificially keep the FQ below 2M documents for all queries by taking a sample
(I don't really care about having more than 2M documents to build the word
cloud).
I am wondering how I could filter the documents to get approximate facets.


Thanks !


-- 
nicolas paris


Re: multivalue faceting term optimization

2020-03-09 Thread Nicolas Paris


Toke Eskildsen  writes:
> JSON faceting allows you to skip the fine counting with the parameter
> refine: 

I also tried the facet.refine parameter, but didn't notice any improvement.


>> I am wondering how I could filter the documents to get approximate
>> facets ?
>
> Clunky idea: Introduce a hash field for each document. [...]
> [...]you could also create fields with random values

That's a pragmatic solution. Two steps:
1. get the count, highlighting and first matches
2. depending on the count, filter based on random/hash values
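For step 2, something like this (a sketch, assuming a hash_bucket_i integer
field populated at index time with hash(id) modulo 100):

fq=hash_bucket_i:[0 TO 19]

which would keep roughly a 20% sample of the matching documents.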

BTW, I wonder whether the first step will be cached: to get highlights I
cannot use fq, only q, and the latter is not meant to cache its results. So
this might lead to duplicated effort, right?


> It might help to have everything in a single shard, to avoid the
> secondary fine count. But your index is rather large

Yes, it's large, and growing by 1M each month. Merging into one shard is not
an option.

However, I suppose I could ask only one shard for the facet when the count is
above a threshold? This would reduce the number of documents by ~4x and avoid
the secondary fine count. That may be better than subsetting with extra random
fields.

-- 
nicolas paris


Re: multivalue faceting term optimization

2020-03-09 Thread Nicolas Paris


Erick Erickson  writes:
> Have you looked at the HyperLogLog stuff? Here’s at least a mention of
> it: https://lucene.apache.org/solr/guide/8_4/the-stats-component.html

I am used to HLL in the context of counting distinct values -- cardinality.

I have to admit that the section
https://lucene.apache.org/solr/guide/8_4/the-stats-component.html#local-parameters-with-the-stats-component
is about HLL and facets, but I am not sure it really meets the use case. I
also have to admit that part is quite cryptic to me.


-- 
nicolas paris


Status of solR / HDFS-v3 compatibility

2019-05-02 Thread Nicolas Paris
Hi

The Solr documentation [1] says it is only compatible with HDFS 2.x.
Is that true?


[1]: http://lucene.apache.org/solr/guide/7_7/running-solr-on-hdfs.html

-- 
nicolas


Re: Status of solR / HDFS-v3 compatibility

2019-05-02 Thread Nicolas Paris
Thanks Kevin, and also thanks for your blog post on the HDFS topic:

> https://risdenk.github.io/2018/10/23/apache-solr-running-on-apache-hadoop-hdfs.html

On Thu, May 02, 2019 at 09:37:59AM -0400, Kevin Risden wrote:
> For Apache Solr 7.x or older yes - Apache Hadoop 2.x was the dependency.
> Apache Solr 8.0+ has Hadoop 3 compatibility with SOLR-9515. I did some
> testing to make sure that Solr 8.0 worked on Hadoop 2 as well as Hadoop 3,
> but the libraries are Hadoop 3.
> 
> The reference guide for 8.0+ hasn't been released yet, but also don't think
> it was updated.
> 
> Kevin Risden
> 
> 
> On Thu, May 2, 2019 at 9:32 AM Nicolas Paris 
> wrote:
> 
> > Hi
> >
> > solr doc [1] says it's only compatible with hdfs 2.x
> > is that true ?
> >
> >
> > [1]: http://lucene.apache.org/solr/guide/7_7/running-solr-on-hdfs.html
> >
> > --
> > nicolas
> >

-- 
nicolas


Document Update performances Improvement

2019-10-15 Thread Nicolas Paris
Hi

I am looking for a way to speed up document updates.

In my context, the update replaces one of the many existing indexed fields and
keeps the others as is.

Right now, I am building the whole document, and replacing the existing
one by id.

I am wondering if the **atomic update feature** would speed up the process.

On the one hand, using this feature would save network because only a small
subset of the document would be sent from the client to the server.
On the other hand, the server will have to collect the values from disk and
reindex them. In addition, this implies storing the values of every field (I
am not storing every field) and using more space.
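For the record, the kind of atomic update I have in mind would look like this
(a sketch; field names are made up):

curl 'http://localhost:8983/solr/col/update?commit=true' \
  -H 'Content-Type: application/json' \
  -d '[{"id": "doc1", "tags_ss": {"add": ["newtag1", "newtag2"]}}]'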

Also, I have read that the ConcurrentUpdateSolrServer class might be an
optimized way of updating documents.

I am using the spark-solr library to talk to SolrCloud. If something exists to
speed up the process, I would be glad to implement it in that library.
Also, I have split the collection over multiple shards, and I admit this
speeds up the update process, but who knows?

Thoughts ?

-- 
nicolas


Solr-Cloud, join and collection collocation

2019-10-15 Thread Nicolas Paris
Hi

I have several large collections that cannot fit in a standalone Solr
instance. They are split over multiple shards in SolrCloud mode.

Those collections are supposed to be joined to another collection to retrieve
subsets. Because I am using distributed collections, I am not able to use the
Solr join feature.

For this reason, I denormalize the information by adding the joined
collection's data to every collection. Naturally, when I want to update the
joined collection, I have to update every one of the distributed collections.

In standalone mode, I would only have to update the joined collection.

I wonder if there is a way to overcome this limitation, for example by
replicating the joined collection to every shard, or some other method I am
not aware of.
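For example, if the joined collection could live as a single-shard collection
collocated with each shard, I imagine a query along these lines (a sketch;
field and collection names are made up):

q={!join from=patient_id to=patient_id fromIndex=joined_collection}cohort_id:42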

Any thought ? 
-- 
nicolas


Re: Solr-Cloud, join and collection collocation

2019-10-15 Thread Nicolas Paris
> You can certainly replicate the joined collection to every shard. It
> must fit in one shard and a replica of that shard must be co-located
> with every replica of the “to” collection.

Yes, I found this in the documentation, with a clear example, just after
sending this mail. I will test it today. I also read your blog post about join
performance [1], and I suspect the performance impact of joins will be huge
because the joined collection is about 10M documents (only two fields: a
unique id and an array of longs, with a filter applied to the array; the join
key is 10M unique IDs).

> Have you looked at streaming and “streaming expressions"? It does not
> have the same problem, although it does have its own limitations.

I have never tested them, and I am not very comfortable yet with how to test
them. Is it possible to mix query parsers and streaming expressions in the
client call via HTTP parameters, or are streaming expressions applied
programmatically only?

[1] https://lucidworks.com/post/solr-and-joins/
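For my own notes, my understanding is that a streaming equivalent of the join
would look roughly like this (a sketch posted to the /stream handler; names
are made up):

curl http://localhost:8983/solr/main_collection/stream --data-urlencode 'expr=
  innerJoin(
    search(main_collection, q="*:*", fl="id,patient_id,title", sort="patient_id asc", qt="/export"),
    search(joined_collection, q="cohort_id:42", fl="patient_id", sort="patient_id asc", qt="/export"),
    on="patient_id")'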

On Tue, Oct 15, 2019 at 07:12:25PM -0400, Erick Erickson wrote:
> You can certainly replicate the joined collection to every shard. It must fit 
> in one shard and a replica of that shard must be co-located with every 
> replica of the “to” collection.
> 
> Have you looked at streaming and “streaming expressions"? It does not have 
> the same problem, although it does have its own limitations.
> 
> Best,
> Erick
> 
> > On Oct 15, 2019, at 6:58 PM, Nicolas Paris  wrote:
> > 
> > Hi
> > 
> > I have several large collections that cannot fit in a standalone solr
> > instance. They are split over multiple shards in solr-cloud mode.
> > 
> > Those collections are supposed to be joined to an other collection to
> > retrieve subset. Because I am using distributed collections, I am not
> > able to use the solr join feature.
> > 
> > For this reason, I denormalize the information by adding the joined
> > collection within every collections. Naturally, when I want to update
> > the joined collection, I have to update every one of the distributed
> > collections.
> > 
> > In standalone mode, I only would have to update the joined collection.
> > 
> > I wonder if there is a way to overcome this limitation. For example, by
> > replicating the joined collection to every shard - or other method I am
> > ignoring.
> > 
> > Any thought ? 
> > -- 
> > nicolas
> 

-- 
nicolas


Re: Solr-Cloud, join and collection collocation

2019-10-16 Thread Nicolas Paris
Sadly, the join performance is poor. The joined collection is 12M documents,
and the query takes 6k ms versus 60 ms when I compare with the denormalized
field.

Apparently, the performance does not change when the filter on the joined
collection changes: it is still 6k ms whether the subset is 12M documents or a
single document. So the join performance looks correlated with the size of the
joined collection, not with the kind of filter applied to it.

I will explore the streaming expressions

On Wed, Oct 16, 2019 at 08:00:43AM +0200, Nicolas Paris wrote:
> > You can certainly replicate the joined collection to every shard. It
> > must fit in one shard and a replica of that shard must be co-located
> > with every replica of the “to” collection.
> 
> Yes, I found this in the documentation, with a clear example just after
> this mail. I will test it today. I also read your blog about join
> performances[1] and I suspect the performance impact of joins will be
> huge because the joined collection is about 10M documents (only two
> fields, unique id and an array of longs and a filter applied to the
> array, join key is 10M unique IDs).
> 
> > Have you looked at streaming and “streaming expressions"? It does not
> > have the same problem, although it does have its own limitations.
> 
> I never tested them, and I am not very confortable yet in how to test
> them. Is it possible to mix query parsers and streaming expression in
> the client call via http parameters - or is streaming expression apply
> programmatically only ?
> 
> [1] https://lucidworks.com/post/solr-and-joins/
> 
> On Tue, Oct 15, 2019 at 07:12:25PM -0400, Erick Erickson wrote:
> > You can certainly replicate the joined collection to every shard. It must 
> > fit in one shard and a replica of that shard must be co-located with every 
> > replica of the “to” collection.
> > 
> > Have you looked at streaming and “streaming expressions"? It does not have 
> > the same problem, although it does have its own limitations.
> > 
> > Best,
> > Erick
> > 
> > > On Oct 15, 2019, at 6:58 PM, Nicolas Paris  
> > > wrote:
> > > 
> > > Hi
> > > 
> > > I have several large collections that cannot fit in a standalone solr
> > > instance. They are split over multiple shards in solr-cloud mode.
> > > 
> > > Those collections are supposed to be joined to an other collection to
> > > retrieve subset. Because I am using distributed collections, I am not
> > > able to use the solr join feature.
> > > 
> > > For this reason, I denormalize the information by adding the joined
> > > collection within every collections. Naturally, when I want to update
> > > the joined collection, I have to update every one of the distributed
> > > collections.
> > > 
> > > In standalone mode, I only would have to update the joined collection.
> > > 
> > > I wonder if there is a way to overcome this limitation. For example, by
> > > replicating the joined collection to every shard - or other method I am
> > > ignoring.
> > > 
> > > Any thought ? 
> > > -- 
> > > nicolas
> > 
> 
> -- 
> nicolas
> 

-- 
nicolas


Re: Solr-Cloud, join and collection collocation

2019-10-16 Thread Nicolas Paris
> Note: adding score=none as a local param. Turns another algorithm
> dragging by from side join.

Indeed, with the score=none local param the query time is correlated with the
size of the joined-collection subset. For a subset of 100k documents the query
time is 1 second, 4 seconds for 1M, and I get a client timeout (15 sec) for
anything above 5M.

On this basis, I guess some redesign will be necessary to find the right
balance between normalization and denormalization for the insertion/selection
speed trade-off.

Thanks



On Wed, Oct 16, 2019 at 03:32:33PM +0300, Mikhail Khludnev wrote:
> Note: adding score=none as a local param. Turns another algorithm dragging
> by from side join.
> 
> On Wed, Oct 16, 2019 at 11:37 AM Nicolas Paris 
> wrote:
> 
> > Sadly, the join performances are poor.
> > The joined collection is 12M documents, and the performances are 6k ms
> > versus 60ms when I compare to the denormalized field.
> >
> > Apparently, the performances does not change when the filter on the
> > joined collection is changed. It is still 6k ms when the subset is 12M
> > or 1 document in size. So the performance of join looks correlated to
> > size of joined collection and not the kind of filter applied to it.
> >
> > I will explore the streaming expressions
> >
> > On Wed, Oct 16, 2019 at 08:00:43AM +0200, Nicolas Paris wrote:
> > > > You can certainly replicate the joined collection to every shard. It
> > > > must fit in one shard and a replica of that shard must be co-located
> > > > with every replica of the “to” collection.
> > >
> > > Yes, I found this in the documentation, with a clear example just after
> > > this mail. I will test it today. I also read your blog about join
> > > performances[1] and I suspect the performance impact of joins will be
> > > huge because the joined collection is about 10M documents (only two
> > > fields, unique id and an array of longs and a filter applied to the
> > > array, join key is 10M unique IDs).
> > >
> > > > Have you looked at streaming and “streaming expressions"? It does not
> > > > have the same problem, although it does have its own limitations.
> > >
> > > I never tested them, and I am not very confortable yet in how to test
> > > them. Is it possible to mix query parsers and streaming expression in
> > > the client call via http parameters - or is streaming expression apply
> > > programmatically only ?
> > >
> > > [1] https://lucidworks.com/post/solr-and-joins/
> > >
> > > On Tue, Oct 15, 2019 at 07:12:25PM -0400, Erick Erickson wrote:
> > > > You can certainly replicate the joined collection to every shard. It
> > must fit in one shard and a replica of that shard must be co-located with
> > every replica of the “to” collection.
> > > >
> > > > Have you looked at streaming and “streaming expressions"? It does not
> > have the same problem, although it does have its own limitations.
> > > >
> > > > Best,
> > > > Erick
> > > >
> > > > > On Oct 15, 2019, at 6:58 PM, Nicolas Paris 
> > wrote:
> > > > >
> > > > > Hi
> > > > >
> > > > > I have several large collections that cannot fit in a standalone solr
> > > > > instance. They are split over multiple shards in solr-cloud mode.
> > > > >
> > > > > Those collections are supposed to be joined to an other collection to
> > > > > retrieve subset. Because I am using distributed collections, I am not
> > > > > able to use the solr join feature.
> > > > >
> > > > > For this reason, I denormalize the information by adding the joined
> > > > > collection within every collections. Naturally, when I want to update
> > > > > the joined collection, I have to update every one of the distributed
> > > > > collections.
> > > > >
> > > > > In standalone mode, I only would have to update the joined
> > collection.
> > > > >
> > > > > I wonder if there is a way to overcome this limitation. For example,
> > by
> > > > > replicating the joined collection to every shard - or other method I
> > am
> > > > > ignoring.
> > > > >
> > > > > Any thought ?
> > > > > --
> > > > > nicolas
> > > >
> > >
> > > --
> > > nicolas
> > >
> >
> > --
> > nicolas
> >
> 
> 
> -- 
> Sincerely yours
> Mikhail Khludnev

-- 
nicolas


Re: Document Update performances Improvement

2019-10-19 Thread Nicolas Paris
Hi community,

Any advice to speed up updates?
Any advice on commits, memory, docValues, stored fields, or any tips to make
things faster?

Thanks


On Wed, Oct 16, 2019 at 12:47:47AM +0200, Nicolas Paris wrote:
> Hi
> 
> I am looking for a way to faster the update of documents.
> 
> In my context, the update replaces one of the many existing indexed
> fields, and keep the others as is.
> 
> Right now, I am building the whole document, and replacing the existing
> one by id.
> 
> I am wondering if **atomic update feature** would faster the process.
> 
> From one hand, using this feature would save network because only a
> small subset of the document would be send from the client to the
> server. 
> On the other hand, the server will have to collect the values from the
> disk and reindex them. In addition, this implies to store the values for
> every fields (I am not storing every fields) and use more space.
> 
> Also I have read about the ConcurrentUpdateSolrServer class might be an
> optimized way of updating documents.
> 
> I am using spark-solr library to deal with solr-cloud. If something
> exist to faster the process, I would be glad to implement it in that
> library.
> Also, I have split the collection over multiple shard, and I admit this
> faster the update process, but who knows ?
> 
> Thoughts ?
> 
> -- 
> nicolas
> 

-- 
nicolas


Re: Document Update performances Improvement

2019-10-19 Thread Nicolas Paris
> Maybe you need to give more details. I recommend always to try and
> test yourself as you know your own solution best. What performance do
> your use car needs and what is your current performance?

I have 10 collections on 4 shards (no replication). The collections are quite
large, ranging from 2 GB to 60 GB per shard. In every case, the update process
only adds several values to an indexed array field on a subset of documents in
each collection. The proportion of the subset ranges from 0 to 100%, and 95%
of the time it is below 20%. The array field is 1 of 20 fields, which are
mainly unstored fields with some large textual fields.

The 4 Solr instances are collocated with Spark. Right now I have tested with
40 Spark executors. The commit time and the commit document count are both set
to 2. Each shard has 20 GB of memory.
Loading/replacing the largest collection takes about 2 hours, which is quite
fast I guess. Updating 5% of the documents of each collection takes about half
an hour.

Because my need is "only" to append several values to an array, I suspect
there is some trick to make things faster.



On Sat, Oct 19, 2019 at 10:10:36PM +0200, Jörn Franke wrote:
> Maybe you need to give more details. I recommend always to try and test 
> yourself as you know your own solution best. Depending on your spark process 
> atomic updates  could be faster.
> 
> With Spark-Solr additional complexity comes. You could have too many 
> executors for your Solr instance(s), ie a too high parallelism.
> 
> Probably the most important question is:
> What performance do your use car needs and what is your current performance?
> 
> Once this is clear further architecture aspects can be derived, such as 
> number of spark executors, number of Solr instances, sharding, replication, 
> commit timing etc.
> 
> > Am 19.10.2019 um 21:52 schrieb Nicolas Paris :
> > 
> > Hi community,
> > 
> > Any advice to speed-up updates ?
> > Is there any advice on commit, memory, docvalues, stored or any tips to
> > faster things ?
> > 
> > Thanks
> > 
> > 
> >> On Wed, Oct 16, 2019 at 12:47:47AM +0200, Nicolas Paris wrote:
> >> Hi
> >> 
> >> I am looking for a way to faster the update of documents.
> >> 
> >> In my context, the update replaces one of the many existing indexed
> >> fields, and keep the others as is.
> >> 
> >> Right now, I am building the whole document, and replacing the existing
> >> one by id.
> >> 
> >> I am wondering if **atomic update feature** would faster the process.
> >> 
> >> From one hand, using this feature would save network because only a
> >> small subset of the document would be send from the client to the
> >> server. 
> >> On the other hand, the server will have to collect the values from the
> >> disk and reindex them. In addition, this implies to store the values for
> >> every fields (I am not storing every fields) and use more space.
> >> 
> >> Also I have read about the ConcurrentUpdateSolrServer class might be an
> >> optimized way of updating documents.
> >> 
> >> I am using spark-solr library to deal with solr-cloud. If something
> >> exist to faster the process, I would be glad to implement it in that
> >> library.
> >> Also, I have split the collection over multiple shard, and I admit this
> >> faster the update process, but who knows ?
> >> 
> >> Thoughts ?
> >> 
> >> -- 
> >> nicolas
> >> 
> > 
> > -- 
> > nicolas
> 

-- 
nicolas


Re: Document Update performances Improvement

2019-10-22 Thread Nicolas Paris
> We, at Auto-Suggest, also do atomic updates daily and specifically
> changing merge factor gave us a boost of ~4x

Interesting. What kind of change exactly did you make on the merge-factor side?


> At current configuration, our core atomically updates ~423 documents
> per second.

Would you say atomic updates are faster than regular replacement of documents?
(considering my first thoughts on this below)

> > I am wondering if **atomic update feature** would faster the process.
> > From one hand, using this feature would save network because only a
> > small subset of the document would be send from the client to the
> > server.
> > On the other hand, the server will have to collect the values from the
> > disk and reindex them. In addition, this implies to store the values
> > every fields (I am not storing every fields) and use more space.


Thanks Paras



On Tue, Oct 22, 2019 at 01:00:10PM +0530, Paras Lehana wrote:
> Hi Nicolas,
> 
> Have you tried playing with values of *IndexConfig*
> <https://lucene.apache.org/solr/guide/6_6/indexconfig-in-solrconfig.html>
> (merge factor, segment size, maxBufferedDocs, Merge Policies)? We, at
> Auto-Suggest, also do atomic updates daily and specifically changing merge
> factor gave us a boost of ~4x during indexing. At current configuration,
> our core atomically updates ~423 documents per second.
> 
> On Sun, 20 Oct 2019 at 02:07, Nicolas Paris 
> wrote:
> 
> > > Maybe you need to give more details. I recommend always to try and
> > > test yourself as you know your own solution best. What performance do
> > > your use car needs and what is your current performance?
> >
> > I have 10 collections on 4 shards (no replications). The collections are
> > quite large ranging from 2GB to 60 GB per shard. In every case, the
> > update process only add several values to an indexed array field on a
> > document subset of each collection. The proportion of the subset is from
> > 0 to 100%, and 95% of time below 20%. The array field represents 1 over
> > 20 fields which are mainly unstored fields with some large textual
> > fields.
> >
> > The 4 solr instance collocate with the spark. Right now I tested with 40
> > spark executors. Commit timing and commit number document are both set
> > to 2. Each shard has 20g of memory.
> > Loading/replacing the largest collection is about 2 hours - which is
> > quite fast I guess. Updating 5% percent of documents of each
> > collections, is about half an hour.
> >
> > Because my need is "only" to append several values to an array I suspect
> > there is some trick to make things faster.
> >
> >
> >
> > On Sat, Oct 19, 2019 at 10:10:36PM +0200, Jörn Franke wrote:
> > > Maybe you need to give more details. I recommend always to try and test
> > yourself as you know your own solution best. Depending on your spark
> > process atomic updates  could be faster.
> > >
> > > With Spark-Solr additional complexity comes. You could have too many
> > executors for your Solr instance(s), ie a too high parallelism.
> > >
> > > Probably the most important question is:
> > > What performance do your use car needs and what is your current
> > performance?
> > >
> > > Once this is clear further architecture aspects can be derived, such as
> > number of spark executors, number of Solr instances, sharding, replication,
> > commit timing etc.
> > >
> > > > Am 19.10.2019 um 21:52 schrieb Nicolas Paris  > >:
> > > >
> > > > Hi community,
> > > >
> > > > Any advice to speed-up updates ?
> > > > Is there any advice on commit, memory, docvalues, stored or any tips to
> > > > faster things ?
> > > >
> > > > Thanks
> > > >
> > > >
> > > >> On Wed, Oct 16, 2019 at 12:47:47AM +0200, Nicolas Paris wrote:
> > > >> Hi
> > > >>
> > > >> I am looking for a way to faster the update of documents.
> > > >>
> > > >> In my context, the update replaces one of the many existing indexed
> > > >> fields, and keep the others as is.
> > > >>
> > > >> Right now, I am building the whole document, and replacing the
> > existing
> > > >> one by id.
> > > >>
> > > >> I am wondering if **atomic update feature** would faster the process.
> > > >>
> > > >> From one hand, using this feature would save network because only a
> > > >> small subset of the document

Re: Document Update performances Improvement

2019-10-23 Thread Nicolas Paris
> <https://lucene.apache.org/solr/guide/6_6/indexconfig-in-solrconfig.html#merge-factors>.
> <https://lucene.apache.org/solr/guide/6_6/updatehandlers-in-solrconfig.html#UpdateHandlersinSolrConfig-autoCommit>

Thanks for those relevant pointers and the explanation.

> How often do you commit? Are you committing after each XML is
> indexed? If yes, what is your batch (XML) size? Review default settings of
> autoCommit and considering increasing it. 

I guess I do not use any XML under the hood: spark-solr uses SolrJ, which
serializes the documents as Java binary objects. However, the commit strategy
still applies; I have set it to 20,000 documents or 20,000 ms.
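For reference, the equivalent settings expressed through the Config API would
look roughly like this (a sketch):

curl http://localhost:8983/solr/col/config -H 'Content-Type: application/json' -d '{
  "set-property": {
    "updateHandler.autoCommit.maxDocs": 20000,
    "updateHandler.autoCommit.maxTime": 20000
  }
}'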

> Do you want real time reflection
> of updates? If no, you can compromise on commits and merge factors and do
> faster indexing. Don't so soft commits then.

Indeed, I'd like the documents to be accessible sooner. That being said, a
5-minute delay is acceptable.

> In our case, I have set autoCommit to commit after 50,000 documents are
> indexed. After EdgeNGrams tokenization, while full indexing, we have seen
> index to get over 60 GBs. Once we are done with full indexing, I optimize
> the index and the index size comes below 13 GB!

I guess I get the idea: "put the dollars in the bag as fast as possible, we
will clean up when back home".

Thanks

On Wed, Oct 23, 2019 at 11:34:44AM +0530, Paras Lehana wrote:
> Hi Nicolas,
> 
> What kind of change exactly on the merge factor side ?
> 
> 
> We increased maxMergeAtOnce and segmentsPerTier from 5 to 50. This will
> make Solr to merge segments less frequently after many index updates. Yes,
> you need to find the sweet spot here but do try increasing these values
> from the default ones. I strongly recommend you to give a 2 min read to this
> <https://lucene.apache.org/solr/guide/6_6/indexconfig-in-solrconfig.html#merge-factors>.
> Do note that increasing these values will require you to have larger
> physical storage until segments merge.
> 
> Besides this, do review your autoCommit config
> <https://lucene.apache.org/solr/guide/6_6/updatehandlers-in-solrconfig.html#UpdateHandlersinSolrConfig-autoCommit>
> or the frequency of your hard commits. In our case, we don't want real time
> updates - so we can always commit less frequently. This makes indexing
> faster. How often do you commit? Are you committing after each XML is
> indexed? If yes, what is your batch (XML) size? Review default settings of
> autoCommit and considering increasing it. Do you want real time reflection
> of updates? If no, you can compromise on commits and merge factors and do
> faster indexing. Don't so soft commits then.
> 
> In our case, I have set autoCommit to commit after 50,000 documents are
> indexed. After EdgeNGrams tokenization, while full indexing, we have seen
> index to get over 60 GBs. Once we are done with full indexing, I optimize
> the index and the index size comes below 13 GB! Since we can trade off
> space temporarily for increased indexing speed, we are still committed to
> find sweeter spots for faster indexing. For statistics purpose, we have
> over 250 million documents for indexing that converges to 60 million unique
> documents after atomic updates (full indexing).
> 
> 
> 
> > Would you say atomical update is faster than regular replacement of
> > documents?
> 
> 
> No, I don't say that. Either of the two configs (autoCommit, Merge Policy)
> will impact regular indexing too. In our case, non-atomic indexing is out
> of question.
> 
> On Wed, 23 Oct 2019 at 00:43, Nicolas Paris 
> wrote:
> 
> > > We, at Auto-Suggest, also do atomic updates daily and specifically
> > > changing merge factor gave us a boost of ~4x
> >
> > Interesting. What kind of change exactly on the merge factor side ?
> >
> >
> > > At current configuration, our core atomically updates ~423 documents
> > > per second.
> >
> > Would you say atomical update is faster than regular replacement of
> > documents ? (considering my first thought on this below)
> >
> > > > I am wondering if **atomic update feature** would faster the process.
> > > > From one hand, using this feature would save network because only a
> > > > small subset of the document would be send from the client to the
> > > > server.
> > > > On the other hand, the server will have to collect the values from the
> > > > disk and reindex them. In addition, this implies to store the values
> > > > every fields (I am not storing every fields) and use more space.
> >
> >
> > Thanks Paras
> >
> >
> >
> > On Tue, Oct 22, 2019 at 01:00:10PM +0530, Paras Lehana w

Re: Document Update performances Improvement

2019-10-23 Thread Nicolas Paris
> Set the first two to the same number, and the third to a minumum of three
> times what you set the other two.
> When I built a Solr setup, I increased maxMergeAtOnce and segmentsPerTier to
> 35, and maxMergeAtOnceExplicit to 105.  This made merging happen a lot less
> frequently.

Good to know the chef's recipes.
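If I read it correctly, that maps to something like this in the indexConfig
section of solrconfig.xml (a sketch with Shawn's numbers):

<indexConfig>
  <mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory">
    <int name="maxMergeAtOnce">35</int>
    <int name="segmentsPerTier">35</int>
    <int name="maxMergeAtOnceExplicit">105</int>
  </mergePolicyFactory>
</indexConfig>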

> On the Solr side, atomic updates will be slightly slower than indexing the
> whole document provided to Solr. 

This makes sense.

> If Solr can access the previous document faster than you can get the
> document from your source system, then atomic updates might be faster.

The documents are stored in Parquet files and need no extra processing. In
this case, atomic updates are not likely to make things faster.


Thanks

On Wed, Oct 23, 2019 at 07:49:44AM -0600, Shawn Heisey wrote:
> On 10/22/2019 1:12 PM, Nicolas Paris wrote:
> > > We, at Auto-Suggest, also do atomic updates daily and specifically
> > > changing merge factor gave us a boost of ~4x
> > 
> > Interesting. What kind of change exactly on the merge factor side ?
> 
> The mergeFactor setting is deprecated.  Instead, use maxMergeAtOnce,
> segmentsPerTier, and a setting that is not mentioned in the ref guide --
> maxMergeAtOnceExplicit.
> 
> Set the first two to the same number, and the third to a minumum of three
> times what you set the other two.
> 
> The default setting for maxMergeAtOnce and segmentsPerTier is 10, with 30
> for maxMergeAtOnceExplicit.  When you're trying to increase indexing speed
> and you think segment merging is interfering, you want to increase these
> values to something larger.  Note that increasing these values will increase
> the number of files that your Solr install keeps open.
> 
> https://lucene.apache.org/solr/guide/8_1/indexconfig-in-solrconfig.html#mergepolicyfactory
> 
> When I built a Solr setup, I increased maxMergeAtOnce and segmentsPerTier to
> 35, and maxMergeAtOnceExplicit to 105.  This made merging happen a lot less
> frequently.
> 
> > Would you say atomical update is faster than regular replacement of
> > documents ? (considering my first thought on this below)
> 
> On the Solr side, atomic updates will be slightly slower than indexing the
> whole document provided to Solr.  When an atomic update is done, Solr will
> find the existing document, then combine what's in that document with the
> changes you specify using the atomic update, and then index the whole
> combined document as a new document that replaces with original.
> 
> Whether or not atomic updates are faster or slower in practice than indexing
> the whole document will depend on how your source systems work, and that is
> not something we can know.  If Solr can access the previous document faster
> than you can get the document from your source system, then atomic updates
> might be faster.
> 
> Thanks,
> Shawn
> 

-- 
nicolas


Re: Document Update performances Improvement

2019-10-23 Thread Nicolas Paris
> With Spark-Solr additional complexity comes. You could have too many
> executors for your Solr instance(s), ie a too high parallelism.

I have reduced the parallelism of the spark-solr part by a factor of 5: I had
40 executors loading 4 shards; now only 8 executors load the 4 shards. As a
result, I see a 10x improvement in update speed, and I suspect the update
process had been overwhelmed by Spark.

I have been able to keep 40 executors for document preprocessing and reduce to
8 executors within the same Spark job by using the "dataframe.coalesce"
feature, which does not shuffle the data at all and keeps both the Spark
cluster and Solr quiet in terms of network.

Thanks

On Sat, Oct 19, 2019 at 10:10:36PM +0200, Jörn Franke wrote:
> Maybe you need to give more details. I recommend always to try and test 
> yourself as you know your own solution best. Depending on your spark process 
> atomic updates  could be faster.
> 
> With Spark-Solr additional complexity comes. You could have too many 
> executors for your Solr instance(s), ie a too high parallelism.
> 
> Probably the most important question is:
> What performance do your use car needs and what is your current performance?
> 
> Once this is clear further architecture aspects can be derived, such as 
> number of spark executors, number of Solr instances, sharding, replication, 
> commit timing etc.
> 
> > Am 19.10.2019 um 21:52 schrieb Nicolas Paris :
> > 
> > Hi community,
> > 
> > Any advice to speed-up updates ?
> > Is there any advice on commit, memory, docvalues, stored or any tips to
> > faster things ?
> > 
> > Thanks
> > 
> > 
> >> On Wed, Oct 16, 2019 at 12:47:47AM +0200, Nicolas Paris wrote:
> >> Hi
> >> 
> >> I am looking for a way to faster the update of documents.
> >> 
> >> In my context, the update replaces one of the many existing indexed
> >> fields, and keep the others as is.
> >> 
> >> Right now, I am building the whole document, and replacing the existing
> >> one by id.
> >> 
> >> I am wondering if **atomic update feature** would faster the process.
> >> 
> >> From one hand, using this feature would save network because only a
> >> small subset of the document would be send from the client to the
> >> server. 
> >> On the other hand, the server will have to collect the values from the
> >> disk and reindex them. In addition, this implies to store the values for
> >> every fields (I am not storing every fields) and use more space.
> >> 
> >> Also I have read about the ConcurrentUpdateSolrServer class might be an
> >> optimized way of updating documents.
> >> 
> >> I am using spark-solr library to deal with solr-cloud. If something
> >> exist to faster the process, I would be glad to implement it in that
> >> library.
> >> Also, I have split the collection over multiple shard, and I admit this
> >> faster the update process, but who knows ?
> >> 
> >> Thoughts ?
> >> 
> >> -- 
> >> nicolas
> >> 
> > 
> > -- 
> > nicolas
> 

-- 
nicolas


Re: POS Tagger

2019-10-25 Thread Nicolas Paris
We are also using the Stanford POS tagger for French. The processing
time is mitigated by the spark-corenlp package, which distributes the
processing over multiple nodes.
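
For reference, a minimal sketch of that kind of pipeline (the column
functions are the ones documented in the spark-corenlp README, and the
French CoreNLP models are assumed to be on the executor classpath):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.explode
    import com.databricks.spark.corenlp.functions.{ssplit, tokenize, pos}

    val spark = SparkSession.builder().appName("pos-tagging").getOrCreate()
    import spark.implicits._

    // hypothetical input: one text document per row
    val docs = Seq((1L, "Le patient présente une toux persistante."))
      .toDF("id", "text")

    // sentence splitting, tokenization and POS tagging run on the executors,
    // so the cost of the Stanford tagger is spread over the cluster
    val tagged = docs
      .select($"id", explode(ssplit($"text")).as("sentence"))
      .select($"id", tokenize($"sentence").as("tokens"),
              pos($"sentence").as("tags"))

    tagged.show(truncate = false)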

I am also interested in the way you use POS information within Solr
queries, or Solr fields.

Thanks,
On Fri, Oct 25, 2019 at 10:42:43AM -0400, David Hastings wrote:
> ah, yeah its not the fastest but it proved to be the best for my purposes,
> I use it to pre-process data before indexing, to apply more metadata to the
> documents in a separate field(s)
> 
> On Fri, Oct 25, 2019 at 10:40 AM Audrey Lorberfeld -
> audrey.lorberf...@ibm.com  wrote:
> 
> > No, I meant for part-of-speech tagging __ But that's interesting that you
> > use StanfordNLP. I've read that it's very slow, so we are concerned that it
> > might not work for us at query-time. Do you use it at query-time, or just
> > index-time?
> >
> > --
> > Audrey Lorberfeld
> > Data Scientist, w3 Search
> > IBM
> > audrey.lorberf...@ibm.com
> >
> >
> > On 10/25/19, 10:30 AM, "David Hastings" 
> > wrote:
> >
> > Do you mean for entity extraction?
> > I make a LOT of use from the stanford nlp project, and get out the
> > entities
> > and use them for different purposes in solr
> > -Dave
> >
> > On Fri, Oct 25, 2019 at 10:16 AM Audrey Lorberfeld -
> > audrey.lorberf...@ibm.com  wrote:
> >
> > > Hi All,
> > >
> > > Does anyone use a POS tagger with their Solr instance other than
> > > OpenNLP’s? We are considering OpenNLP, SpaCy, and Watson.
> > >
> > > Thanks!
> > >
> > > --
> > > Audrey Lorberfeld
> > > Data Scientist, w3 Search
> > > IBM
> > > audrey.lorberf...@ibm.com
> > >
> > >
> >
> >
> >

-- 
nicolas


Re: POS Tagger

2019-10-25 Thread Nicolas Paris
> Do you use the POS tagger at query time, or just at index time? 

I have the POS tagger pipeline ready but nothing done yet on the Solr
part. Right now I am wondering how to use it and I am still looking for
a relevant implementation.

I guess having the POS information ready before indexing gives the
flexibility to test multiple scenarios.

For acronyms, one possible way is indeed to treat the acronyms in the
user query as NOUNS and, on the index side, to only keep the acronyms
that are tagged as NOUNS (i.e. detect acronyms within the text, look up
their POS, and remove them when they are not tagged as a NOUN).
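
A minimal sketch of that filtering step (the data shapes and the
Penn-style "NN*" noun tags are assumptions):

    // a token together with the POS tag produced by the tagger
    case class Token(text: String, pos: String)

    // naive acronym detection: all-uppercase token longer than one character
    def isAcronym(t: String): Boolean = t.length > 1 && t.forall(_.isUpper)

    // keep an acronym only when the tagger saw it as a noun
    def keepToken(t: Token): Boolean = !isAcronym(t.text) || t.pos.startsWith("NN")

    val tokens = Seq(Token("IBM", "NNP"), Token("IT", "PRP"), Token("search", "NN"))
    tokens.filter(keepToken)  // drops "IT" (tagged here as a pronoun), keeps the rest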

I definitely prefer the pre-processing approach for this over creating
dedicated Solr analysers, because my context is batch processing, and it
also simplifies testing and debugging - while offering a large panel of
NLP tools to work with.

On Fri, Oct 25, 2019 at 04:09:29PM +, Audrey Lorberfeld - 
audrey.lorberf...@ibm.com wrote:
> Nicolas,
> 
> Do you use the POS tagger at query time, or just at index time? 
> 
> We are thinking of using it to filter the tokens we will eventually perform 
> ML on. Basically, we have a bunch of acronyms in our corpus. However, many 
> departments use the same acronyms but expand those acronyms to different 
> things. Eventually, we are thinking of using ML on our index to determine 
> which expansion is meant by a particular query according to the context we 
> find in certain documents. However, since we don't want to run ML on all 
> tokens in a query, and since we think that acronyms are usually the nouns in 
> a multi-token query, we want to only feed nouns to the ML model (TBD).
> 
> Does that make sense? So, we'd want both an index-side POS tagger (could be 
> slow), and also a query-side POS tagger (must be fast).
> 
> -- 
> Audrey Lorberfeld
> Data Scientist, w3 Search
> IBM
> audrey.lorberf...@ibm.com
>  
> 
> On 10/25/19, 11:57 AM, "Nicolas Paris"  wrote:
> 
> Also we are using stanford POS tagger for french. The processing time is
> mitigated by the spark-corenlp package which distribute the process over
> multiple node.
> 
> Also I am interesting in the way you use POS information within solr
> queries, or solr fields. 
> 
> Thanks,
> On Fri, Oct 25, 2019 at 10:42:43AM -0400, David Hastings wrote:
> > ah, yeah its not the fastest but it proved to be the best for my 
> purposes,
> > I use it to pre-process data before indexing, to apply more metadata to 
> the
> > documents in a separate field(s)
> > 
> > On Fri, Oct 25, 2019 at 10:40 AM Audrey Lorberfeld -
> > audrey.lorberf...@ibm.com  wrote:
> > 
> > > No, I meant for part-of-speech tagging __ But that's interesting that 
> you
> > > use StanfordNLP. I've read that it's very slow, so we are concerned 
> that it
> > > might not work for us at query-time. Do you use it at query-time, or 
> just
> > > index-time?
> > >
> > > --
> > > Audrey Lorberfeld
> > > Data Scientist, w3 Search
> > > IBM
> > > audrey.lorberf...@ibm.com
> > >
> > >
> > > On 10/25/19, 10:30 AM, "David Hastings" 
> > > wrote:
> > >
> > > Do you mean for entity extraction?
> > > I make a LOT of use from the stanford nlp project, and get out the
> > > entities
> > > and use them for different purposes in solr
> > > -Dave
> > >
> > > On Fri, Oct 25, 2019 at 10:16 AM Audrey Lorberfeld -
> > > audrey.lorberf...@ibm.com  wrote:
> > >
> > > > Hi All,
> > > >
> > > > Does anyone use a POS tagger with their Solr instance other than
> > > > OpenNLP’s? We are considering OpenNLP, SpaCy, and Watson.
> > > >
> > > > Thanks!
> > > >
> > > > --
> > > > Audrey Lorberfeld
> > > > Data Scientist, w3 Search
> > > > IBM
> > > > audrey.lorberf...@ibm.com
> > > >
> > > >
> > >
> > >
> > >
> 
> -- 
> nicolas
> 
> 

-- 
nicolas


Re: POS Tagger

2019-10-25 Thread Nicolas Paris
The OpenNLP Solr POS filter [1] also uses the TypeAsSynonymFilter to
store the POS: 

" Index the POS for each token as a synonym, after prefixing the POS with @ "

I am not sure how to deal with the POS after such indexing, but this
looks like an interesting approach?
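
For reference, a minimal field type sketch along the lines of that page
(the model file names are placeholders and must match the OpenNLP models
actually deployed):

    <fieldType name="text_pos" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <!-- OpenNLP sentence detection and tokenization -->
        <tokenizer class="solr.OpenNLPTokenizerFactory"
                   sentenceModel="fr-sent.bin"
                   tokenizerModel="fr-token.bin"/>
        <!-- sets each token's type attribute to its POS tag -->
        <filter class="solr.OpenNLPPOSFilterFactory"
                posTaggerModel="fr-pos-maxent.bin"/>
        <!-- indexes the POS as an extra token at the same position, prefixed with @ -->
        <filter class="solr.TypeAsSynonymFilterFactory" prefix="@"/>
      </analyzer>
    </fieldType>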

[1] 
http://lucene.apache.org/solr/guide/7_3/language-analysis.html#opennlp-part-of-speech-filter
On Fri, Oct 25, 2019 at 06:25:36PM +0200, Nicolas Paris wrote:
> > Do you use the POS tagger at query time, or just at index time? 
> 
> I have the POS tagger pipeline ready but nothing done yet on the solr
> part. Right now I am wondering how to use it but still looking for
> relevant implementation.
> 
> I guess having the POS information ready before indexation gives the
> flexibility to test multiple scenario.
> 
> In case of acronyms, one possible way is indeed to consider the user
> query as NOUNS, and from the index side, only keep the acronyms that
> are tagged with NOUNS. (i.e. detect acronyms within text, and look for
> it's POS; remove it in case it's not a NOUN)
> 
> Definitely, I prefer the pre-processing approach for this, than creating
> dedicated solr analysers because my context is batch processing, and
> also this simplifies testing and debugging - while offering large panel
> of NLP tools to deal with.
> 
> On Fri, Oct 25, 2019 at 04:09:29PM +, Audrey Lorberfeld - 
> audrey.lorberf...@ibm.com wrote:
> > Nicolas,
> > 
> > Do you use the POS tagger at query time, or just at index time? 
> > 
> > We are thinking of using it to filter the tokens we will eventually perform 
> > ML on. Basically, we have a bunch of acronyms in our corpus. However, many 
> > departments use the same acronyms but expand those acronyms to different 
> > things. Eventually, we are thinking of using ML on our index to determine 
> > which expansion is meant by a particular query according to the context we 
> > find in certain documents. However, since we don't want to run ML on all 
> > tokens in a query, and since we think that acronyms are usually the nouns 
> > in a multi-token query, we want to only feed nouns to the ML model (TBD).
> > 
> > Does that make sense? So, we'd want both an index-side POS tagger (could be 
> > slow), and also a query-side POS tagger (must be fast).
> > 
> > -- 
> > Audrey Lorberfeld
> > Data Scientist, w3 Search
> > IBM
> > audrey.lorberf...@ibm.com
> >  
> > 
> > On 10/25/19, 11:57 AM, "Nicolas Paris"  wrote:
> > 
> > Also we are using stanford POS tagger for french. The processing time is
> > mitigated by the spark-corenlp package which distribute the process over
> > multiple node.
> > 
> > Also I am interesting in the way you use POS information within solr
> > queries, or solr fields. 
> > 
> > Thanks,
> > On Fri, Oct 25, 2019 at 10:42:43AM -0400, David Hastings wrote:
> > > ah, yeah its not the fastest but it proved to be the best for my 
> > purposes,
> > > I use it to pre-process data before indexing, to apply more metadata 
> > to the
> > > documents in a separate field(s)
> > > 
> > > On Fri, Oct 25, 2019 at 10:40 AM Audrey Lorberfeld -
> > > audrey.lorberf...@ibm.com  wrote:
> > > 
> > > > No, I meant for part-of-speech tagging __ But that's interesting 
> > that you
> > > > use StanfordNLP. I've read that it's very slow, so we are concerned 
> > that it
> > > > might not work for us at query-time. Do you use it at query-time, 
> > or just
> > > > index-time?
> > > >
> > > > --
> > > > Audrey Lorberfeld
> > > > Data Scientist, w3 Search
> > > > IBM
> > > > audrey.lorberf...@ibm.com
> > > >
> > > >
> > > > On 10/25/19, 10:30 AM, "David Hastings" 
> > 
> > > > wrote:
> > > >
> > > > Do you mean for entity extraction?
> > > > I make a LOT of use from the stanford nlp project, and get out 
> > the
> > > > entities
> > > > and use them for different purposes in solr
> > > > -Dave
> > > >
> > > > On Fri, Oct 25, 2019 at 10:16 AM Audrey Lorberfeld -
> > > > audrey.lorberf...@ibm.com  wrote:
> > > >
> > > > > Hi All,
> > > > >
> > > > > Does anyone use a POS tagger with their Solr instance other 
> > than
> > > > > OpenNLP’s? We are considering OpenNLP, SpaCy, and Watson.
> > > > >
> > > > > Thanks!
> > > > >
> > > > > --
> > > > > Audrey Lorberfeld
> > > > > Data Scientist, w3 Search
> > > > > IBM
> > > > > audrey.lorberf...@ibm.com
> > > > >
> > > > >
> > > >
> > > >
> > > >
> > 
> > -- 
> > nicolas
> > 
> > 
> 
> -- 
> nicolas
> 

-- 
nicolas


Re: Solr Ref Guide Changes - now HTML only

2019-10-28 Thread Nicolas Paris
> If you are someone who wishes the PDF would continue, please share your
> feedback.

I have not particularly explored the documentation format, only the
content. However, here are my thoughts on this:

The PDF version of the Solr documentation has two advantages:
1. it is readable offline
2. it makes searching easier than the HTML version


If there were a "one page" version of the HTML documentation, this
would make searching across the whole guide easier. A monolithic HTML
page also makes things easier to access offline (and to transform back
to PDF, an ebook...?).

I admit I am not very happy with the search engine embedded in the HTML
documentation. I hope it is not Solr under the hood :S

-- 
nicolas


CloudSolrClient - basic auth - multi shard collection

2019-11-18 Thread Nicolas Paris
Hello,

I am having trouble with basic auth on a SolrCloud instance. When the
collection has only one shard, there is no problem. When the collection
has multiple shards, there is no problem until I send multiple queries
concurrently: then I get 401 errors asking for credentials on the
concurrent queries.

I have created a preemptive auth interceptor which should add the
credential information to every HTTP call.
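
Roughly, the interceptor looks like the sketch below (a minimal version
against Apache HttpClient 4.x, written here in Scala for brevity; how it
gets registered with the SolrJ/spring-data-solr client builder is
omitted and may differ):

    import java.util.Base64
    import org.apache.http.{HttpRequest, HttpRequestInterceptor}
    import org.apache.http.protocol.HttpContext

    class PreemptiveBasicAuthInterceptor(user: String, password: String)
        extends HttpRequestInterceptor {

      // adds the Basic credentials to every outgoing request, so Solr never
      // has to answer with a 401 challenge first
      override def process(request: HttpRequest, context: HttpContext): Unit = {
        val token = Base64.getEncoder
          .encodeToString(s"$user:$password".getBytes("UTF-8"))
        request.addHeader("Authorization", s"Basic $token")
      }
    }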

Thanks for any pointer,

solr:8.1
spring-data-solr:4.1.0
-- 
nicolas


Re: CloudSolrClient - basic auth - multi shard collection

2019-11-20 Thread Nicolas Paris
>  you can fix the issue by upgrading to 8.2 - both of those

Thanks, I will try ASAP

On Wed, Nov 20, 2019 at 01:58:31PM -0500, Jason Gerlowski wrote:
> Hi Nicholas,
> 
> I'm not really familiar with spring-data-solr, so I can't speak to
> that detail, but it sounds like you might be running into either
> https://issues.apache.org/jira/browse/SOLR-13510 or
> https://issues.apache.org/jira/browse/SOLR-13472.  There are partial
> workarounds on those issues that might help you.  If those aren't
> sufficient, you can fix the issue by upgrading to 8.2 - both of those
> bugs are fixed in that version.
> 
> Hope that helps,
> 
> Jason
> 
> 
> On Mon, Nov 18, 2019 at 8:26 AM Nicolas Paris  
> wrote:
> >
> > Hello,
> >
> > I am having trouble with basic auth on a solrcloud instance. When the
> > collection is only one shard, there is no problem. When the collection
> > is multiple shard, there is no problem until I ask multiple query
> > concurrently: I get 401 error and asking for credentials for concurrent
> > queries.
> >
> > I have created a Premptive Auth Interceptor which should add the
> > credential information for every http call.
> >
> > Thanks for any pointer,
> >
> > solr:8.1
> > spring-data-solr:4.1.0
> > --
> > nicolas
> 

-- 
nicolas


Re: A Last Message to the Solr Users

2019-12-01 Thread Nicolas Paris
Hi Mark,

Have you shared with the community all the weaknesses of SolrCloud you
have in mind, and advice on how to overcome them?
Apparently you wrote most of that code, and your feedback would be
helpful for the community.

Regards

On Sat, Nov 30, 2019 at 09:31:34PM -0600, Mark Miller wrote:
> I’d also like to say the last 5 years of my life have been spent being paid
> to upgrade Solr systems. I’ve made a lot of doing this.
> 
> As I said from the start - take this for what it’s worst. For his guys it’s
> not worth much. That’s cool.
> 
> And it’s a little inside joke that I’ll be back :) I joke a lot.
> 
> But seriously, you have a second chance here.
> 
> This mostly concerns SolrCloud. That’s why I recommend standalone mode. But
> key people know why to do. I know it will happen - but their lives will be
> easier if you help.
> 
> Lol.
> 
> - Mark
> 
> On Sat, Nov 30, 2019 at 9:25 PM Mark Miller  wrote:
> 
> > I said the key people understand :)
> >
> > I’ve worked in Lucene since 2006 and have an insane amount of the code
> > foot print in Solr and SolrCloud :) Look up the stats. Do you have any
> > contributions?
> >
> > I said the key people know.
> >
> > Solr stand-alone is and has been very capable. People are working around
> > SolrCloud too.All fine and good. Millions are being made and saved.
> > Everyone is comfortable. Some might thinks the sky looks clear and blue.
> > I’ve spent a lot of capital to make sure the wrong people don’t think that
> > anymore ;)
> >
> > Unless you are a Developer, you won’t understand the other issues. But you
> > don’t need too.
> >
> > Mark
> >
> > On Sat, Nov 30, 2019 at 7:05 PM Dave  wrote:
> >
> >> I’m young here I think, not even 40 and only been using solr since like
> >> 2008 or so, so like 1.4 give or take. But I know a really good therapist if
> >> you want to talk about it.
> >>
> >> > On Nov 30, 2019, at 6:56 PM, Mark Miller  wrote:
> >> >
> >> > Now I have sacrificed to give you a new chance. A little for my
> >> community.
> >> > It was my community. But it was mostly for me. The developer I started
> >> as
> >> > would kick my ass today.  Circumstances and luck has brought money to
> >> our
> >> > project. And it has corrupted our process, our community, and our code.
> >> >
> >> > In college i would talk about past Mark screwing future Mark and too bad
> >> > for him. What did he ever do for me? Well, he got me again ;)
> >> >
> >> > I’m out of steam, time and wife patentice.
> >> >
> >> > Enough key people are aware of the scope of the problem now that you
> >> won’t
> >> > need me. I was never actually part of the package. To the many, many
> >> people
> >> > that offered me private notes of encouragement and future help - thank
> >> you
> >> > so much. Your help will be needed.
> >> >
> >> > You will reset. You will fix this. Or I will be back.
> >> >
> >> > Mark
> >> >
> >> >
> >> > --
> >> > - Mark
> >> >
> >> > http://about.me/markrmiller
> >>
> > --
> > - Mark
> >
> > http://about.me/markrmiller
> >
> -- 
> - Mark
> 
> http://about.me/markrmiller

-- 
nicolas


does copyFields increase indexe size ?

2019-12-24 Thread Nicolas Paris
Hi

From my understanding, copyField creates a new index from the copied
fields.
From my tests, I copied 1k textual fields into _text_ with copyFields.
As a result there is no increase in the size of the collection. All the
source fields are indexed and stored. The _text_ field is indexed but
not stored.

This is a great surprise but is this behavior expected ?


-- 
nicolas


Re: does copyFields increase indexe size ?

2019-12-24 Thread Nicolas Paris
> The action of changing the schema makes zero changes in the index.  It
> merely changes how Solr interacts with the index.

Do you mean "copy fields" is only an action of changing the schema ?
I was thinking it was adding a new field and possibly a new index to
the collection

On Tue, Dec 24, 2019 at 10:59:03AM -0700, Shawn Heisey wrote:
> On 12/24/2019 10:45 AM, Nicolas Paris wrote:
> >  From my understanding, copy fields creates an new indexes from the
> > copied fields.
> >  From my tests, I copied 1k textual fields into _text_ with copyFields.
> > As a result there is no increase in the size of the collection. All the
> > source fields are indexed and stored. The _text_ field is indexed but
> > not stored.
> > 
> > This is a great surprise but is this behavior expected ?
> 
> The action of changing the schema makes zero changes in the index.  It
> merely changes how Solr interacts with the index.
> 
> If you want the index to change when the schema is changed, you need to
> restart or reload and then re-do the indexing after the change is saved.
> 
> https://cwiki.apache.org/confluence/display/solr/HowToReindex
> 
> Thanks,
> Shawn
> 

-- 
nicolas


Re: does copyFields increase indexe size ?

2019-12-25 Thread Nicolas Paris
> If you are redoing the indexing after changing the schema and
> reloading/restarting, then you can ignore me.

I am sorry to say that I have to ignore you. Indeed, my tests include
recreating the collection from scratch - with and without the copy
fields.
In both cases the index size is the same ! (while the _text_ field is
working correctly)

On Tue, Dec 24, 2019 at 05:32:09PM -0700, Shawn Heisey wrote:
> On 12/24/2019 5:11 PM, Nicolas Paris wrote:
> > Do you mean "copy fields" is only an action of changing the schema ?
> > I was thinking it was adding a new field and eventually a new index to
> > the collection
> 
> The copy that copyField does happens at index time.  Reindexing is required
> after changing the schema, or nothing happens.
> 
> If you are redoing the indexing after changing the schema and
> reloading/restarting, then you can ignore me.
> 
> Thanks,
> Shawn
> 

-- 
nicolas


Re: does copyFields increase indexe size ?

2019-12-25 Thread Nicolas Paris


On Wed, Dec 25, 2019 at 05:30:03AM -0500, Dave wrote:
> #2 you initially said you were talking about 1k documents. 

Hi Dave. Again, sorry for the confusion. This is 1k fields
(general_text) over 50M large documents, copied into one _text_ field.
4 shards, 40GB per shard in both cases, with and without the _text_ field

> 
> > On Dec 25, 2019, at 3:07 AM, Nicolas Paris  wrote:
> > 
> > 
> >> 
> >> If you are redoing the indexing after changing the schema and
> >> reloading/restarting, then you can ignore me.
> > 
> > I am sorry to say that I have to ignore you. Indeed, my tests include
> > recreating the collection from scratch - with and without the copy
> > fields.
> > In both cases the index size is the same ! (while the _text_ field is
> > working correctly)
> > 
> >> On Tue, Dec 24, 2019 at 05:32:09PM -0700, Shawn Heisey wrote:
> >>> On 12/24/2019 5:11 PM, Nicolas Paris wrote:
> >>> Do you mean "copy fields" is only an action of changing the schema ?
> >>> I was thinking it was adding a new field and eventually a new index to
> >>> the collection
> >> 
> >> The copy that copyField does happens at index time.  Reindexing is required
> >> after changing the schema, or nothing happens.
> >> 
> >> If you are redoing the indexing after changing the schema and
> >> reloading/restarting, then you can ignore me.
> >> 
> >> Thanks,
> >> Shawn
> >> 
> > 
> > -- 
> > nicolas
> 

-- 
nicolas


Re: does copyFields increase indexe size ?

2019-12-26 Thread Nicolas Paris
Anyway, the good news is that copyField does not increase the index size
in some circumstances:
- the copied fields and the target field share the same datatype
- the target field is not stored

This was tested on text fields


On Wed, Dec 25, 2019 at 11:42:23AM +0100, Nicolas Paris wrote:
> 
> On Wed, Dec 25, 2019 at 05:30:03AM -0500, Dave wrote:
> > #2 you initially said you were talking about 1k documents. 
> 
> Hi Dave. Again, sorry for the confusion. This is 1k fields
> (general_text), over 50M large  documents copied into one _text_ field. 
> 4 shards, 40GB per shard in both case, with/without the _text_ field
> 
> > 
> > > On Dec 25, 2019, at 3:07 AM, Nicolas Paris  
> > > wrote:
> > > 
> > > 
> > >> 
> > >> If you are redoing the indexing after changing the schema and
> > >> reloading/restarting, then you can ignore me.
> > > 
> > > I am sorry to say that I have to ignore you. Indeed, my tests include
> > > recreating the collection from scratch - with and without the copy
> > > fields.
> > > In both cases the index size is the same ! (while the _text_ field is
> > > working correctly)
> > > 
> > >> On Tue, Dec 24, 2019 at 05:32:09PM -0700, Shawn Heisey wrote:
> > >>> On 12/24/2019 5:11 PM, Nicolas Paris wrote:
> > >>> Do you mean "copy fields" is only an action of changing the schema ?
> > >>> I was thinking it was adding a new field and eventually a new index to
> > >>> the collection
> > >> 
> > >> The copy that copyField does happens at index time.  Reindexing is 
> > >> required
> > >> after changing the schema, or nothing happens.
> > >> 
> > >> If you are redoing the indexing after changing the schema and
> > >> reloading/restarting, then you can ignore me.
> > >> 
> > >> Thanks,
> > >> Shawn
> > >> 
> > > 
> > > -- 
> > > nicolas
> > 
> 
> -- 
> nicolas
> 

-- 
nicolas


Re: does copyFields increase indexe size ?

2019-12-26 Thread Nicolas Paris
Hi Erick

Below is a part of the managed-schema. There are 1k section* fields. For
the second experiment, I removed the copyField, dropped the collection
and re-indexed the whole corpus. To measure the index size, I went to
the SolrCloud admin UI and looked at the cloud panel: 40GB per shard. I
also looked at the folder size. I made some tests and the _text_ field
is indexed.

[the managed-schema excerpt was stripped by the mailing list archive; it
declared the section* text fields (indexed and stored), the _text_ field
(indexed but not stored, no docValues), and the copyField rules copying
section* into _text_]


On Thu, Dec 26, 2019 at 02:16:32PM -0500, Erick Erickson wrote:
> This simply cannot be true unless the destination copyField is indexed=false, 
> docValues=false stored=false. I.e. “some circumstances” means there’s really 
> no use in using the copyField in the first place. I suppose that if you don’t 
> store any term vectors, no position information nothing except, say, the 
> terms then maybe you’ll have extremely minimal size. But even in that case, 
> I’d use the original field in an “fq” clause which doesn’t use any scoring in 
> place of using the copyField.
> 
> Each field is stored in a separate part of the relevant files (.tim, .pos, 
> etc). Term frequencies are kept on a _per field_ basis for instance.
> 
> So this pretty much has to be small sample size or other measurement error.
> 
> Best,
> Erick
> 
> > On Dec 26, 2019, at 9:27 AM, Nicolas Paris  wrote:
> > 
> > Anyway, that´s good news copy field does not increase indexe size in
> > some circumstance:
> > - the copied fields and the target field share the same datatype
> > - the target field is not stored
> > 
> > this is tested on text fields
> > 
> > 
> > On Wed, Dec 25, 2019 at 11:42:23AM +0100, Nicolas Paris wrote:
> >> 
> >> On Wed, Dec 25, 2019 at 05:30:03AM -0500, Dave wrote:
> >>> #2 you initially said you were talking about 1k documents. 
> >> 
> >> Hi Dave. Again, sorry for the confusion. This is 1k fields
> >> (general_text), over 50M large  documents copied into one _text_ field. 
> >> 4 shards, 40GB per shard in both case, with/without the _text_ field
> >> 
> >>> 
> >>>> On Dec 25, 2019, at 3:07 AM, Nicolas Paris  
> >>>> wrote:
> >>>> 
> >>>> 
> >>>>> 
> >>>>> If you are redoing the indexing after changing the schema and
> >>>>> reloading/restarting, then you can ignore me.
> >>>> 
> >>>> I am sorry to say that I have to ignore you. Indeed, my tests include
> >>>> recreating the collection from scratch - with and without the copy
> >>>> fields.
> >>>> In both cases the index size is the same ! (while the _text_ field is
> >>>> working correctly)
> >>>> 
> >>>>> On Tue, Dec 24, 2019 at 05:32:09PM -0700, Shawn Heisey wrote:
> >>>>>> On 12/24/2019 5:11 PM, Nicolas Paris wrote:
> >>>>>> Do you mean "copy fields" is only an action of changing the schema ?
> >>>>>> I was thinking it was adding a new field and eventually a new index to
> >>>>>> the collection
> >>>>> 
> >>>>> The copy that copyField does happens at index time.  Reindexing is 
> >>>>> required
> >>>>> after changing the schema, or nothing happens.
> >>>>> 
> >>>>> If you are redoing the indexing after changing the schema and
> >>>>> reloading/restarting, then you can ignore me.
> >>>>> 
> >>>>> Thanks,
> >>>>> Shawn
> >>>>> 
> >>>> 
> >>>> -- 
> >>>> nicolas
> >>> 
> >> 
> >> -- 
> >> nicolas
> >> 
> > 
> > -- 
> > nicolas
> 

-- 
nicolas


Re: does copyFields increase indexe size ?

2019-12-28 Thread Nicolas Paris


> So what will be added is just another set of pointers to each relevant
> term. That's not going to be very large. Probably

Hi Shawn. This explains a lot! Thanks.
In the case of text fields, the highlighting is done on the source
fields and the _text_ field is only used for lookup. This behavior is
perfect for my needs.

On Fri, Dec 27, 2019 at 05:28:25PM -0700, Shawn Heisey wrote:
> On 12/26/2019 1:21 PM, Nicolas Paris wrote:
> > Below a part of the managed-schema. There is 1k section* fields. The
> > second experience, I removed the copyField, droped the collection and
> > re-indexed the whole. To mesure the index size, I went to solr-cloud and
> > looked in the cloud part: 40GO per shard. I also look at the folder
> > size. I made some tests and the _text_ field is indexed.
> 
> Your schema says that the destination field is not stored and doesn't have
> docValues.  So the only thing it has is indexed.
> 
> All of the terms generated by index analysis will already be in the index
> from the source fields.  So what will be added is just another set of
> pointers to each relevant term.  That's not going to be very large. Probably
> only a few bytes for each term.
> 
> So with this copyField, the index will get larger, but probably not
> significantly.
> 
> Thanks,
> Shawn
> 

-- 
nicolas


Re: Coming back to search after some time... SOLR or Elastic for text search?

2020-01-16 Thread Nicolas Paris
> We have implemented the content ingestion and processing pipelines already
> in python and SPARK, so most of the data will be pushed in using APIs.

I use the spark-solr library in production and have looked at the ES
equivalent, and the Solr connector looks much more advanced for both
loading and fetching data. In particular, the fetching part can use the
Solr export handler, which makes things incredibly fast. Also,
spark-solr uses the DataFrame API while the ES connector looks to be
stuck with the RDD API AFAIK.
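
For illustration, a minimal sketch of a DataFrame read through the
export handler (zkhost, collection and the option names are placeholders
taken from the spark-solr README; the /export handler requires docValues
on the requested fields):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("solr-read").getOrCreate()

    val df = spark.read
      .format("solr")
      .option("zkhost", "zk1:2181/solr")        // placeholder
      .option("collection", "my_collection")    // placeholder
      .option("query", "content_txt:apache")
      .option("fields", "id,title_s")           // docValues fields for /export
      .option("request_handler", "/export")     // stream the full result set
      .load()

    df.count()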

A good Spark connector offers a lot of possibilities in terms of data
transformation and advanced machine learning features on top of the
search engine.

On Tue, Jan 14, 2020 at 11:02:17PM -0500, Dc Tech wrote:
> I am SOLR fant and had implemented it in our company over 10 years ago.
> I moved away from that role and the new search team in the meanwhile
> implemented a proprietary (and expensive) nosql style search engine. That
> the project did not go well, and now I am back to project and reviewing the
> technology stack.
> 
> Some of the team think that ElasticSearch could be a good option,
> especially since we can easily get hosted versions with AWS where we have
> all the contractual stuff sorted out.
> 
> Whle SOLR definitely seems more advanced  (LTR, streaming expressions,
> graph, and all the knobs and dials for relevancy tuning), Elastic may be
> sufficient for our needs. It does not seem to have LTR out of the box but
> the relevancy tuning knobs and dials seem to be similar to what SOLR has.
> 
> The corpus size is not a challenge  - we have about one million document,
> of which about 1/2 have full text, while the test are simpler (i.e. company
> directory etc.).
> The query volumes are also quite low (max 5/second at peak).
> We have implemented the content ingestion and processing pipelines already
> in python and SPARK, so most of the data will be pushed in using APIs.
> 
> I would really appreciate any guidance from the community !!

-- 
nicolas