Synonyms relationships
Hi,

Does Solr provide a way to describe synonym relationships such as "equivalent to", "narrower than", or "broader than"? It turns out both Postgres and Oracle do, but I can't find any related information in the documentation. This would be useful to choose whether or not to generalize the search terms.

Thanks,
--
nicolas
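PS: as far as I can tell, the synonyms.txt format only distinguishes two kinds of rules, for example:

  # equivalent terms, expanded in both directions
  car, automobile
  # one-way mapping: "sneakers" is rewritten to the broader "shoes",
  # but "shoes" is never rewritten to "sneakers"
  sneakers => shoes

so narrower/broader relations can at best be approximated with "=>" rules, which is exactly what my question is about.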
Pagination Graph/SQL
Hi,

Solr pagination is incredible: you can provide the end user a small set of results together with the total number of documents found (numFound). I am wondering if both the "parallel SQL" and "Graph Traversal" features also provide a pagination mechanism. As an example, the following SQL:

  SELECT id FROM table LIMIT 10

would return both the 10 ids and the total number of documents within the table.

Thanks,
--
nicolas
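PS: for reference, this is how I would submit such a statement to the /sql handler (a sketch; the collection name is an example, and in Solr SQL the table name is the collection name):

  curl http://localhost:8983/solr/mycollection/sql \
    --data-urlencode 'stmt=SELECT id FROM mycollection LIMIT 10'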
Highlighting Parent Documents
Hi,

I have read here [1] and here [2] that it is possible to highlight only parent documents in block join queries, but I haven't succeeded yet.

Here is my nested document example:

  [
    {
      "id": "2",
      "type_s": "parent",
      "content_txt": ["apache"],
      "_childDocuments_": [
        {
          "id": "1",
          "type_s": "child",
          "content_txt": ["solr"]
        }
      ]
    }
  ]

Here is my query (give me the documents whose parent contains the term "apache"):

  curl http://localhost:8983/solr/col/query -d '
  fl=id
  &hl=on
  &hl.fl=*
  &q={!child of="type_s:parent"}type_s:parent AND content_txt:apache'

And here is the result:

  {
    ...
    "response":{"numFound":1,"start":0,"docs":[
      { "id":"1"}]
    },
    "highlighting":{
      "1":{}}}

I was hoping to get this instead (doc 1 returned, and doc 2, the parent, highlighted):

  {
    ...
    "response":{"numFound":1,"start":0,"docs":[
      { "id":"1"}]
    },
    "highlighting":{
      "2":{
        "content_txt":["apache"]}}}

[1] http://lucene.472066.n3.nabble.com/Fwd-Standard-highlighting-doesn-t-work-for-Block-Join-td4260784.html
[2] http://lucene.472066.n3.nabble.com/highlighting-on-child-document-td4238236.html

Thanks in advance,
--
nicolas
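PS: a workaround I am considering (an untested sketch): query the parent side directly so that the highlighter sees the parent document, and pull the children back with the [child] transformer:

  curl http://localhost:8983/solr/col/query -d '
  fl=id,[child parentFilter="type_s:parent"]
  &hl=on
  &hl.fl=content_txt
  &q=type_s:parent AND content_txt:apache'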
Search only for single value of Solr multivalue field (part 2)
Hi,

This question is highly related to a previous one found in the mailing-list archive [1].

I have this document:

  "content_txt":["001 first","002 second"]

I'd like the query below to return nothing:

  q=content_txt:(first AND second)

The method proposed by Erick in [1] works fine to look for a single value containing BOTH first AND second, by setting the field's positionIncrementGap high enough. This query returns nothing, as expected:

  q=content_txt:("first second"~99)

However, this is based on *phrase search*. Phrase search does not allow the use of the simple query parser features below. That's a _HUGE_ limitation!

- regexp
- fuzzy
- wildcard
- ranges

So the query below won't match the first value:

  q=content_txt:("[000 TO 001] first"~99)

While this one does match the document, and shouldn't!

  q=content_txt:([000 TO 001] AND "second")

QUESTION:
- Is there a chance such a feature will be developed in a future Solr version? I mean something that allows considering each value of a multivalued field independently. Would a new field attribute such as independentMultivalued=true be an option?

Thanks,

[1]: http://lucene.472066.n3.nabble.com/Search-only-for-single-value-of-Solr-multivalue-field-td4309850.html#a4309893
--
nicolas
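PS: for reference, here is roughly the field type declaration the trick relies on (a sketch; the gap value is arbitrary, it just has to exceed the phrase slop):

  <fieldType name="text_general" class="solr.TextField"
             positionIncrementGap="10000">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>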
Re: Search only for single value of Solr multivalue field (part 2)
On Sun, Dec 16, 2018 at 09:30:33AM -0800, Erick Erickson wrote:
> Have you looked at ComplexPhraseQueryParser here?
> https://lucene.apache.org/solr/guide/6_6/other-parsers.html

Sure. However, I am using multi-word synonyms, and so far complexphrase does not handle them (maybe soon?).

> Depending on how many of these you have, you could do something with
> dynamic fields. Rather than use a single MV field, use N fields. You'd
> probably have to copyField or some such to a catch-all field for
> searches that you wanted to ignore the "mv nature" of the field.

The problem is that a copyField destination fed from multiple source fields acts as an MV field itself. So the problem remains: dealing with MV fields. Isn't it?

Thanks
--
nicolas
Re: Search only for single value of Solr multivalue field (part 2)
On Sun, Dec 16, 2018 at 05:44:30PM -0800, Erick Erickson wrote:
> No, the idea is that you have N single valued fields, one for each of
> the MV entries you have. The copyField dest would be MV, and only used
> in those cases you wanted to match across values. Not saying that's a
> great solution, or if it would even necessarily work but thought it
> worth mentioning.

Ok, then my initial document with MV fields:

> "content_txt":["001 first","002 second"]

would become:

> "content1_t":"001 first"
> "content2_t":"002 second"
> "_copiedfield_":["001 first","002 second"]

And then the initial user query:

> content_txt:(first AND second)

would become:

> content1_t:(first AND second) OR content2_t:(first AND second)

Depending on the length of the initial array, each document will have a different number of contentN_t fields. This means some management layer between the user and the parser, to expand the query with the maximum possible number of contentN_t fields in the collection (capped at max=100 for performance reasons?).

QUESTION: is the MV limitation a *Solr parser* limitation, or a *Lucene* limitation? If it is the former, writing my own parser would be an option, wouldn't it?
--
nicolas
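PS: a sketch of the schema pieces this approach would need (field and type names are only examples):

  <dynamicField name="content*_t" type="text_general" indexed="true" stored="true"/>
  <field name="_copiedfield_" type="text_general" indexed="true" stored="false" multiValued="true"/>
  <copyField source="content*_t" dest="_copiedfield_"/>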
MoreLikeThis & Synonyms
Hi,

It turns out that the MoreLikeThis handler does not use query-time synonym expansion; it is only compatible with index-time synonyms. However, multi-word synonyms are only compatible with query-time synonym expansion. For this reason it is not possible to use multi-word synonyms together with the MoreLikeThis handler.

Is there any reason why the MoreLikeThis feature is not compatible with multi-word synonyms?

Thanks
--
nicolas
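PS: for context, by query-time expansion I mean this kind of analyzer (a sketch; the synonyms file name is an example):

  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>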
Re: MoreLikeThis & Synonyms
On Wed, Dec 26, 2018 at 09:09:02PM -0800, Erick Erickson wrote:
> bq. However multiword synonyms are only compatible with queryTime synonyms
> expansion.
>
> Why do you say this? What version of Solr? Query-time multi-word
> synonyms were _added_, but AFAIK the capability of multi-word synonyms
> was not taken away.

From this blog post [1] I deduced that multi-word synonyms are only compatible with query-time expansion.

> Or are you saying that MLT doesn't play nice at all with multi-word
> synonyms?

From my tests, MLT does not expand the query with synonyms. So it is not possible to use query-time synonyms, nor multi-word ones. Only index-time synonyms work, with the limitations they have [1].

> What version of Solr are you using?

I am running Solr 7.6.

[1] https://lucidworks.com/2017/04/18/multi-word-synonyms-solr-adds-query-time-support/
--
nicolas
edismax: sorting on numeric fields
Hi,

I have a numeric field (say "weight") and I'd like to get the results sorted by how close they are to the queried value.

  q=kind:animal weight:50
  pf=kind^2 weight^3

would return:

  name=dog, kind=animal, weight=51
  name=tiger, kind=animal, weight=150
  name=elephant, kind=animal, weight=2000

In other terms, how to deal with numeric fields?

My first idea is to encode the numeric value into letters (one x per value):

  dog x
  tiger x
  elephant

and the query would be:

  kind:animal, weight:xxx

How to deal with numeric fields?

Thanks
--
nicolas
Re: edismax: sorting on numeric fields
Hi,

Thanks. To clarify, I don't want to sort by the numeric field itself; instead, I'd like to get the results sorted by distance to my query.

On Thu, Feb 14, 2019 at 06:20:19PM -0500, Gus Heck wrote:
> Hi Nicolas,
>
> Solr has no difficulty sorting on numeric fields if they are indexed as a
> numeric type. Just use "&sort=weight asc". If your field is indexed as
> text of course it won't sort properly, but then you should fix your schema.
>
> -Gus
>
> On Thu, Feb 14, 2019 at 4:10 PM David Hastings wrote:
>
> > Not clearly understanding your question here. if your query is
> > q=kind:animal weight:50 you will get no results, as nothing matches
> > (assuming a q.op of AND)
> >
> > On Thu, Feb 14, 2019 at 4:06 PM Nicolas Paris wrote:
> > [...]
>
> --
> http://www.the111shift.com
--
nicolas
Re: edismax: sorting on numeric fields
Hi,

On Sun, Feb 17, 2019 at 09:40:43AM +0100, Jan Høydahl wrote:
> q=kind:animal&wantedweight=50&sort=abs(sub(weight,wantedweight)) asc

The *sort function* looks like a great way. I did not mention it very clearly, but I'd like to use edismax and add some weight to some fields (such as *description* in the query below), so *sorting* alone does not look sufficient.

After playing around with distance functions, this one turns out to look quite sufficient:

  ?defType=edismax&q=kind:animal&carnivore&qf=description^3&boost=div(1,abs(sub(weight,50)))

This means: "give me all animals having carnivore in the description, and order them given that their weight is near 50 kg and carnivore is relevant in the description."

Does this make sense?

> 16. feb. 2019 kl. 17:08 skrev Dave:
>
> Sounds like you need to use code and post process your results as it sounds
> too specific to your use case. Just my opinion, unless you want to get into
> spatial queries which is a whole different animal and something I don't
> think many have experience with, including myself
>
> [...]
--
nicolas
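PS: one detail I am not sure about: with boost=div(1,abs(sub(weight,50))), an exact match (weight=50) divides by zero. A recip() based boost might be safer (a sketch):

  boost=recip(abs(sub(weight,50)),1,1000,1000)

since recip(x,m,a,b) computes a/(m*x+b), this stays finite at weight=50 and decreases smoothly as the weight moves away from 50.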
query bag of words with negation
Hello,

I wonder if there is a plain-text query syntax to say: give me all documents that match

  wonderful pizza NOT peperoni

with all those terms within a 5-word window. Then:

  pizza are wonderful                 -> would match
  I made a wonderful pasta and pizza  -> would match
  Peperoni pizza are so wonderful     -> would not match

I tested:

  "wonderful pizza - peperoni"~5

without success.

Thanks
Re: query bag of words with negation
> 1. Query terms containing other than just letters or digits may be placed
> within double quotes so that those other characters do not separate a term
> into many terms. A dot (period) and white space are neither letter nor
> digit. Examples: "Now is the time for all good men" (spaces, quotes impose
> ordering too), "goods.doc" (a dot).
>
> 2. Mode button "or" (the default) means match one or more terms, perhaps
> scattered about. Mode button "and" means must match all terms, scattered or
> not.
>
> 3. A one word query term may be prefixed by title: or url: to search on
> those fields. A space must follow the colon, and the search term is case
> sensitive. Examples: url: .ppt or title: Goodies. Many docs do not have a
> formal internal title field, thus prefix title: may not work.
>
> 4. Compound queries can be built by joining terms with and or - and group
> items with ( ). Not is expressed as a minus sign prefixing a term. A bare
> space means use the Mode (or, and). Example: Nancy and Mary and -Jane and
> -(Robert Daniel) which means both the first two and not Jane and neither of
> the two guys.
>
> 5. A query of asterisk/star (*) means match everything. Examples: * for
> everything (zero or more characters). Fussy, show all without term .pdf: *
> and -".pdf". For normal queries the program uses the edismax interface. A
> few, such as url: foobar, reference the Lucene interface. This is specified
> by the qagent= parameter, of edismax or empty respectively, in a search
> request. Thus regular facilities can do most of this work.
>
> What this example does not address is your distance 5 criteria. However,
> the NOT facility may do the trick for you, though a minus sign is taken as
> a literal minus sign or word separator if located within a quoted string.

Indeed, sadly the words can be anywhere in the document (no notion of distance).

Thanks for the 5 details anyway, Joe D.
Re: query bag of words with negation
Hello Markus,

Thanks! The ComplexPhraseQueryParser syntax:

  q={!complexphrase inOrder=false}collector:"wonderful pizza -peperoni"~5

answers my needs. BTW, it apparently accepts both leading and trailing wildcards; that looks like a powerful feature. Any chance it would support "sow=false" in order to combine it with multi-word synonyms?

2018-04-22 21:11 GMT+02:00 Markus Jelsma:
> Hello Nicolas,
>
> Yes you can! Check out ComplexPhaseQParser
> https://lucene.apache.org/solr/guide/6_6/other-parsers.html#OtherParsers-ComplexPhraseQueryParser
>
> Regards,
> Markus
>
> [...]
Re: Is anybody using UIMA with Solr?
Hi,

Not really a direct answer - I have never used it. However, this feature has been attractive to me while first looking at UIMA.

Right now, I would say UIMA connectors in general are by design a pain to maintain. Source and target systems often have optimised ways to bulk export/import data. For example, using a JDBC PostgreSQL connector is a bad idea compared to using the optimized COPY command, and each database has its own optimized way of doing this.

That's why the developers of UIMA should focus on improving what UIMA is good at: processing texts. The responsibility for exporting and importing texts should remain with the other tools.

Tell me if I am wrong.

2018-06-18 13:13 GMT+02:00 Alexandre Rafalovitch:
> Hi,
>
> Solr ships an UIMA component and examples that haven't worked for a
> while. Details are in:
> https://issues.apache.org/jira/browse/SOLR-11694
>
> The choices for developers are:
> 1) Rip UIMA out (and save space)
> 2) Update UIMA to latest 2.x version
> 3) Update UIMA to super-latest possibly-breaking 3.x
>
> The most likely choice at this point is 1. But I am curious (given
> that UIMA is in IBM Watson...) if anybody actually has a use-case that
> strongly votes for options 2 or 3, given that the update effort is
> probably not trivial.
>
> Note that if you use UIMA with Solr, but in a configuration completely
> different from that shipped (so the options 2/3 would still be
> irrelevant), it could be still fun to share the knowledge in this
> thread, with the appropriate disclaimer.
>
> Regards,
>    Alex.
Re: Is anybody using UIMA with Solr?
Sorry, I thought I was on the UIMA mailing list. That being said, my position is the same: let the UIMA folks load data into Solr using the most optimized way. (What would be the best way? Loading JSON?)

2018-06-19 22:48 GMT+02:00 Nicolas Paris:
> [...]
Re: Storage/Volume type for Kubernetes Solr POD?
Hi all,

What about CephFS or Lustre distributed filesystems for such a purpose?

Karl Stoney writes:
> we personally run solr on google cloud kubernetes engine and each node has a
> 512Gb persistent ssd (network attached) storage which gives roughly this
> performance (read/write):
>
> Sustained random IOPS limit: 15,360.00 / 15,360.00
> Sustained throughput limit (MB/s): 245.76 / 245.76
>
> and we get very good performance.
>
> ultimately though it's going to depend on your workload
>
> From: Susheel Kumar
> Sent: 06 February 2020 13:43
> To: solr-user@lucene.apache.org
> Subject: Storage/Volume type for Kubernetes Solr POD?
>
> Hello,
>
> What type of storage/volume is recommended to run Solr on a Kubernetes POD?
> I know in the past Solr had issues with NFS storing its indexes and it was
> not recommended.
>
> https://kubernetes.io/docs/concepts/storage/volumes/
>
> Thanks,
> Susheel
--
nicolas paris
multivalue faceting term optimization
Hello,

Environment:
- SolrCloud 8.4.1
- 4 shards with Xmx = 120GB and SSD disks
- 50M documents / 40GB physical per shard
- mainly large text fields, and also one multivalued/docValues/indexed string field of 15 values per document

Goal:
I want to provide a terms facet on a string multivalued field. This lets the client build a dynamic word cloud depending on filter queries. The words within the array are extracted with TF-IDF from the large raw texts of neighbour fields.

Behavior:
The computing time is below 2 seconds when the FQ query is selective enough (<2M documents). However it results in a timeout when the FQ finds more than 2M documents.

Question:
How can I improve the raw performance? I tried:
- facet.limit / facet.threads / facet.method
- limiting the multivalue size (from 50 elements per document to 15)

Is there any parameter I am missing?

Thought:
If there is no way to speed up the raw task, I guess I could artificially keep the FQ under 2M for all queries by taking a sample (I don't really care about having more than 2M documents to build the word cloud). I am wondering how I could filter the documents to get approximate facets?

Thanks!
--
nicolas paris
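PS: to make the request concrete, this is roughly what I am sending (a sketch; the field names are placeholders):

  curl http://localhost:8983/solr/col/query -d '
  q=*:*
  &fq=some_filter:some_value
  &rows=0
  &facet=true
  &facet.field=words_ss
  &facet.limit=100
  &facet.method=uif
  &facet.threads=4'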
Re: multivalue faceting term optimization
Toke Eskildsen writes:
> JSON faceting allows you to skip the fine counting with the parameter
> refine:

I also tried the facet refine parameter, but didn't notice any improvement.

>> I am wondering how I could filter the documents to get approximate
>> facets ?
>
> Clunky idea: Introduce a hash field for each document. [...]
> [...] you could also create fields with random values

That's a pragmatic solution. Two steps:
1. get the count, the highlights and the first matches
2. depending on the count, filter based on random/hash values

BTW, I wonder if the first step will be cached: to get highlights I cannot use FQ, only Q, and the latter is not meant to cache its results. So this might lead to duplicated effort, isn't it?

> It might help to have everything in a single shard, to avoid the
> secondary fine count. But your index is rather large

Yes, it's large, and growing by 1M documents each month. Merging into one shard is not an option. However, I suppose I should be able to ask only one shard for the facet when the count is above a threshold? This would reduce the number of documents by ~4 and avoid the secondary fine count. That may be better than subsetting with extra random fields.
--
nicolas paris
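PS: on the query side, the sampling idea would boil down to something like this (a sketch; bucket_i is a hypothetical integer field holding a 0-99 hash bucket assigned at index time):

  fq=bucket_i:[0 TO 19]

i.e. roughly 20% of the matching documents, which should keep the facet computation under the 2M threshold.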
Re: multivalue faceting term optimization
Erick Erickson writes:
> Have you looked at the HyperLogLog stuff? Here's at least a mention of
> it: https://lucene.apache.org/solr/guide/8_4/the-stats-component.html

I am used to HLL in the context of counting distinct values -- cardinality. I have to admit that the section
https://lucene.apache.org/solr/guide/8_4/the-stats-component.html#local-parameters-with-the-stats-component
is about HLL and facets, but I am not sure it really meets the use case. I also have to admit that part is quite cryptic to me.
--
nicolas paris
Status of Solr / HDFS-v3 compatibility
Hi,

The Solr documentation [1] says it's only compatible with HDFS 2.x. Is that true?

[1]: http://lucene.apache.org/solr/guide/7_7/running-solr-on-hdfs.html
--
nicolas
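PS: for context, [1] describes starting Solr on HDFS with something along these lines (a sketch; the namenode address is an example):

  bin/solr start -c \
    -Dsolr.directoryFactory=HdfsDirectoryFactory \
    -Dsolr.lock.type=hdfs \
    -Dsolr.hdfs.home=hdfs://namenode:8020/solr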
Re: Status of Solr / HDFS-v3 compatibility
Thanks Kevin, and also thanks for your blog post on the HDFS topic:
https://risdenk.github.io/2018/10/23/apache-solr-running-on-apache-hadoop-hdfs.html

On Thu, May 02, 2019 at 09:37:59AM -0400, Kevin Risden wrote:
> For Apache Solr 7.x or older yes - Apache Hadoop 2.x was the dependency.
> Apache Solr 8.0+ has Hadoop 3 compatibility with SOLR-9515. I did some
> testing to make sure that Solr 8.0 worked on Hadoop 2 as well as Hadoop 3,
> but the libraries are Hadoop 3.
>
> The reference guide for 8.0+ hasn't been released yet, but I also don't think
> it was updated.
>
> Kevin Risden
>
> [...]
--
nicolas
Document Update performances Improvement
Hi,

I am looking for a way to speed up the update of documents.

In my context, the update replaces one of the many existing indexed fields and keeps the others as is. Right now, I am building the whole document and replacing the existing one by id.

I am wondering if the **atomic update feature** would speed up the process. On the one hand, using this feature would save network because only a small subset of the document would be sent from the client to the server. On the other hand, the server will have to collect the values from the disk and reindex them. In addition, this implies storing the values for every field (I am not storing every field) and using more space.

Also, I have read that the ConcurrentUpdateSolrServer class might be an optimized way of updating documents.

I am using the spark-solr library to deal with SolrCloud. If something exists to speed up the process, I would be glad to implement it in that library. Also, I have split the collection over multiple shards, and I admit this speeds up the update process, but who knows?

Thoughts?
--
nicolas
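PS: by atomic update I mean this kind of request (a sketch; the field names are placeholders):

  curl 'http://localhost:8983/solr/col/update?commitWithin=10000' \
    -H 'Content-Type: application/json' \
    -d '[{"id":"123","tags_ss":{"add":["new-value-1","new-value-2"]}}]'

so only the id and the modified field would travel over the network.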
Solr-Cloud, join and collection collocation
Hi,

I have several large collections that cannot fit in a standalone Solr instance. They are split over multiple shards in SolrCloud mode.

Those collections are supposed to be joined to another collection to retrieve subsets. Because I am using distributed collections, I am not able to use the Solr join feature.

For this reason, I denormalize the information by adding the joined collection within every collection. Naturally, when I want to update the joined collection, I have to update every one of the distributed collections. In standalone mode, I would only have to update the joined collection.

I wonder if there is a way to overcome this limitation, for example by replicating the joined collection to every shard - or another method I am not aware of.

Any thoughts?
--
nicolas
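PS: the kind of join I cannot use today looks like this (a sketch; the collection and field names are examples):

  fq={!join fromIndex=joined_collection from=id to=join_id}flag_s:true

my understanding is that in SolrCloud this only works if joined_collection is a single-shard collection with a replica co-located on every node hosting the main collection.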
Re: Solr-Cloud, join and collection collocation
> You can certainly replicate the joined collection to every shard. It
> must fit in one shard and a replica of that shard must be co-located
> with every replica of the “to” collection.

Yes, I found this in the documentation, with a clear example, just after sending this mail. I will test it today. I also read your blog post about join performance [1], and I suspect the performance impact of joins will be huge because the joined collection is about 10M documents (only two fields, a unique id and an array of longs, with a filter applied to the array; the join key is 10M unique IDs).

> Have you looked at streaming and “streaming expressions"? It does not
> have the same problem, although it does have its own limitations.

I never tested them, and I am not very comfortable yet with how to test them. Is it possible to mix query parsers and streaming expressions in the client call via HTTP parameters, or are streaming expressions applied programmatically only?

[1] https://lucidworks.com/post/solr-and-joins/

On Tue, Oct 15, 2019 at 07:12:25PM -0400, Erick Erickson wrote:
> [...]
--
nicolas
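PS: if I understand correctly, a streaming-expression join would look something like this (an untested sketch; collection and field names are examples):

  curl http://localhost:8983/solr/main_col/stream --data-urlencode 'expr=
    innerJoin(
      search(main_col, q="*:*", fl="id,join_id,title", sort="join_id asc", qt="/export"),
      search(joined_col, q="flag_s:true", fl="join_id", sort="join_id asc", qt="/export"),
      on="join_id")'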
Re: Solr-Cloud, join and collection collocation
Sadly, the join performance is poor. The joined collection is 12M documents, and the response time is 6,000 ms versus 60 ms when I compare to the denormalized field.

Apparently, the performance does not change when the filter on the joined collection changes. It is still 6,000 ms whether the subset is 12M or 1 document in size. So the join performance looks correlated to the size of the joined collection, and not to the kind of filter applied to it.

I will explore the streaming expressions.

On Wed, Oct 16, 2019 at 08:00:43AM +0200, Nicolas Paris wrote:
> [...]
--
nicolas
Re: Solr-Cloud, join and collection collocation
> Note: adding score=none as a local param. Turns another algorithm
> dragging by from side join.

Indeed, with the score=none local param the query time is correlated with the size of the joined collection subset. For a subset of 100k documents the query time is 1 second, and 4 seconds for 1M. I get a client timeout (15 sec) for anything above 5M.

On this basis, I guess some redesign will be necessary to find the right balance between normalization and de-normalization for the insertion/selection speed trade-off.

Thanks

On Wed, Oct 16, 2019 at 03:32:33PM +0300, Mikhail Khludnev wrote:
> [...]
>
> --
> Sincerely yours
> Mikhail Khludnev
--
nicolas
Re: Document Update performances Improvement
Hi community,

Any advice to speed up updates? Is there any advice on commits, memory, docValues, stored fields, or any other tips to make things faster?

Thanks

On Wed, Oct 16, 2019 at 12:47:47AM +0200, Nicolas Paris wrote:
> [...]
--
nicolas
Re: Document Update performances Improvement
> Maybe you need to give more details. I recommend always to try and
> test yourself as you know your own solution best. What performance does
> your use case need and what is your current performance?

I have 10 collections on 4 shards (no replication). The collections are quite large, ranging from 2 GB to 60 GB per shard. In every case, the update process only adds several values to an indexed array field on a subset of documents of each collection. The proportion of the subset ranges from 0 to 100%, and 95% of the time it is below 20%. The array field is 1 of the 20 fields, which are mainly unstored fields with some large textual fields.

The 4 Solr instances are collocated with Spark. Right now I tested with 40 Spark executors. Commit time and commit document count are both set to 2. Each shard has 20 GB of memory.

Loading/replacing the largest collection takes about 2 hours - which is quite fast I guess. Updating 5% of the documents of each collection takes about half an hour.

Because my need is "only" to append several values to an array, I suspect there is some trick to make things faster.

On Sat, Oct 19, 2019 at 10:10:36PM +0200, Jörn Franke wrote:
> Maybe you need to give more details. I recommend always to try and test
> yourself as you know your own solution best. Depending on your spark process
> atomic updates could be faster.
>
> With Spark-Solr additional complexity comes. You could have too many
> executors for your Solr instance(s), ie a too high parallelism.
>
> Probably the most important question is:
> What performance does your use case need and what is your current performance?
>
> Once this is clear further architecture aspects can be derived, such as
> number of spark executors, number of Solr instances, sharding, replication,
> commit timing etc.
>
> > Am 19.10.2019 um 21:52 schrieb Nicolas Paris:
> > [...]
--
nicolas
Re: Document Update performances Improvement
> We, at Auto-Suggest, also do atomic updates daily and specifically
> changing merge factor gave us a boost of ~4x

Interesting. What kind of change exactly on the merge factor side?

> At current configuration, our core atomically updates ~423 documents
> per second.

Would you say atomic updates are faster than a regular replacement of documents? (Considering my first thought on this, quoted below.)

> > I am wondering if the **atomic update feature** would speed up the process.
> > On the one hand, using this feature would save network because only a
> > small subset of the document would be sent from the client to the
> > server. On the other hand, the server will have to collect the values
> > from the disk and reindex them. In addition, this implies storing the
> > values for every field (I am not storing every field) and using more space.

Thanks Paras

On Tue, Oct 22, 2019 at 01:00:10PM +0530, Paras Lehana wrote:
> Hi Nicolas,
>
> Have you tried playing with values of *IndexConfig*
> <https://lucene.apache.org/solr/guide/6_6/indexconfig-in-solrconfig.html>
> (merge factor, segment size, maxBufferedDocs, Merge Policies)? We, at
> Auto-Suggest, also do atomic updates daily and specifically changing merge
> factor gave us a boost of ~4x during indexing. At current configuration,
> our core atomically updates ~423 documents per second.
>
> [...]
Re: Document Update performances Improvement
> <https://lucene.apache.org/solr/guide/6_6/indexconfig-in-solrconfig.html#merge-factors>
> <https://lucene.apache.org/solr/guide/6_6/updatehandlers-in-solrconfig.html#UpdateHandlersinSolrConfig-autoCommit>

Thanks for those relevant pointers and the explanation.

> How often do you commit? Are you committing after each XML is
> indexed? If yes, what is your batch (XML) size? Review default settings of
> autoCommit and consider increasing it.

I guess I do not use any XML under the hood: spark-solr relies on SolrJ, which serializes the documents into Java binary objects. However, the commit strategy applies all the same; I have set up 20,000 documents or 20,000 ms.

> Do you want real time reflection
> of updates? If no, you can compromise on commits and merge factors and do
> faster indexing. Don't do soft commits then.

Indeed, I'd like the documents to be accessible soon. That being said, a 5 minute delay is acceptable.

> In our case, I have set autoCommit to commit after 50,000 documents are
> indexed. After EdgeNGrams tokenization, while full indexing, we have seen
> the index get over 60 GBs. Once we are done with full indexing, I optimize
> the index and the index size comes below 13 GB!

I guess I get the idea: "put the dollars in the bag as fast as possible, we will clean up when back home."

Thanks

On Wed, Oct 23, 2019 at 11:34:44AM +0530, Paras Lehana wrote:
> Hi Nicolas,
>
> > What kind of change exactly on the merge factor side ?
>
> We increased maxMergeAtOnce and segmentsPerTier from 5 to 50. This will
> make Solr merge segments less frequently after many index updates. Yes,
> you need to find the sweet spot here but do try increasing these values
> from the default ones. I strongly recommend you to give a 2 min read to this
> <https://lucene.apache.org/solr/guide/6_6/indexconfig-in-solrconfig.html#merge-factors>.
> Do note that increasing these values will require you to have larger
> physical storage until segments merge.
>
> Besides this, do review your autoCommit config
> <https://lucene.apache.org/solr/guide/6_6/updatehandlers-in-solrconfig.html#UpdateHandlersinSolrConfig-autoCommit>
> or the frequency of your hard commits. [...] Since we can trade off
> space temporarily for increased indexing speed, we are still committed to
> finding sweeter spots for faster indexing. For statistics purposes, we have
> over 250 million documents for indexing that converge to 60 million unique
> documents after atomic updates (full indexing).
>
> > Would you say atomic updates are faster than a regular replacement of
> > documents?
>
> No, I don't say that. Either of the two configs (autoCommit, Merge Policy)
> will impact regular indexing too. In our case, non-atomic indexing is out
> of question.
>
> [...]
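PS: concretely, my commit settings correspond to this kind of solrconfig block (a sketch; the hard commit numbers are the ones mentioned above, and the soft commit delay is a guess based on the 5 minutes I can live with):

  <autoCommit>
    <maxDocs>20000</maxDocs>
    <maxTime>20000</maxTime>
    <openSearcher>false</openSearcher>
  </autoCommit>
  <autoSoftCommit>
    <maxTime>300000</maxTime>
  </autoSoftCommit>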
Re: Document Update performances Improvement
> Set the first two to the same number, and the third to a minimum of three
> times what you set the other two.
> When I built a Solr setup, I increased maxMergeAtOnce and segmentsPerTier to
> 35, and maxMergeAtOnceExplicit to 105. This made merging happen a lot less
> frequently.

Good to know the recommended recipe.

> On the Solr side, atomic updates will be slightly slower than indexing the
> whole document provided to Solr.

This makes sense.

> If Solr can access the previous document faster than you can get the
> document from your source system, then atomic updates might be faster.

The documents are stored within Parquet files, without any processing needed. In this case, atomic updates are not likely to speed things up.

Thanks

On Wed, Oct 23, 2019 at 07:49:44AM -0600, Shawn Heisey wrote:
> On 10/22/2019 1:12 PM, Nicolas Paris wrote:
> > > We, at Auto-Suggest, also do atomic updates daily and specifically
> > > changing merge factor gave us a boost of ~4x
> >
> > Interesting. What kind of change exactly on the merge factor side ?
>
> The mergeFactor setting is deprecated. Instead, use maxMergeAtOnce,
> segmentsPerTier, and a setting that is not mentioned in the ref guide --
> maxMergeAtOnceExplicit.
>
> Set the first two to the same number, and the third to a minimum of three
> times what you set the other two.
>
> The default setting for maxMergeAtOnce and segmentsPerTier is 10, with 30
> for maxMergeAtOnceExplicit. When you're trying to increase indexing speed
> and you think segment merging is interfering, you want to increase these
> values to something larger. Note that increasing these values will increase
> the number of files that your Solr install keeps open.
>
> https://lucene.apache.org/solr/guide/8_1/indexconfig-in-solrconfig.html#mergepolicyfactory
>
> When I built a Solr setup, I increased maxMergeAtOnce and segmentsPerTier to
> 35, and maxMergeAtOnceExplicit to 105. This made merging happen a lot less
> frequently.
>
> > Would you say atomic updates are faster than a regular replacement of
> > documents? (Considering my first thought on this, quoted below.)
>
> On the Solr side, atomic updates will be slightly slower than indexing the
> whole document provided to Solr. When an atomic update is done, Solr will
> find the existing document, then combine what's in that document with the
> changes you specify using the atomic update, and then index the whole
> combined document as a new document that replaces the original.
>
> Whether or not atomic updates are faster or slower in practice than indexing
> the whole document will depend on how your source systems work, and that is
> not something we can know. If Solr can access the previous document faster
> than you can get the document from your source system, then atomic updates
> might be faster.
>
> Thanks,
> Shawn
--
nicolas
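PS: just to make sure I translate this correctly into solrconfig.xml, the settings you describe would look like this (a sketch):

  <mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory">
    <int name="maxMergeAtOnce">35</int>
    <int name="segmentsPerTier">35</int>
    <int name="maxMergeAtOnceExplicit">105</int>
  </mergePolicyFactory>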
Re: Document Update performances Improvement
> With Spark-Solr additional complexity comes. You could have too many
> executors for your Solr instance(s), ie a too high parallelism.

I have reduced the parallelism of the spark-solr part by a factor of 5: I had 40 executors loading 4 shards, and right now only 8 executors loading 4 shards. As a result, I can see a 10x update improvement, and I suspect the update process had been overwhelmed by Spark.

I have been able to keep 40 executors for document preprocessing and reduce to 8 executors within the same Spark job by using the "dataframe.coalesce" feature, which does not shuffle the data at all and keeps both the Spark cluster and Solr quiet in terms of network.

Thanks

On Sat, Oct 19, 2019 at 10:10:36PM +0200, Jörn Franke wrote:
> [...]
--
nicolas
Re: POS Tagger
Also, we are using the Stanford POS tagger for French. The processing time is mitigated by the spark-corenlp package, which distributes the process over multiple nodes.

I am also interested in the way you use POS information within Solr queries, or Solr fields.

Thanks,

On Fri, Oct 25, 2019 at 10:42:43AM -0400, David Hastings wrote:
> ah, yeah its not the fastest but it proved to be the best for my purposes,
> I use it to pre-process data before indexing, to apply more metadata to the
> documents in a separate field(s)
>
> On Fri, Oct 25, 2019 at 10:40 AM Audrey Lorberfeld wrote:
>
> > No, I meant for part-of-speech tagging. But that's interesting that you
> > use StanfordNLP. I've read that it's very slow, so we are concerned that it
> > might not work for us at query-time. Do you use it at query-time, or just
> > index-time?
> >
> > --
> > Audrey Lorberfeld
> > Data Scientist, w3 Search
> > IBM
> > audrey.lorberf...@ibm.com
> >
> > On 10/25/19, 10:30 AM, "David Hastings" wrote:
> >
> >     Do you mean for entity extraction?
> >     I make a LOT of use from the stanford nlp project, and get out the
> >     entities and use them for different purposes in solr
> >     -Dave
> >
> >     On Fri, Oct 25, 2019 at 10:16 AM Audrey Lorberfeld wrote:
> >
> >     > Hi All,
> >     >
> >     > Does anyone use a POS tagger with their Solr instance other than
> >     > OpenNLP's? We are considering OpenNLP, SpaCy, and Watson.
> >     >
> >     > Thanks!
> >     >
> >     > --
> >     > Audrey Lorberfeld
> >     > Data Scientist, w3 Search
> >     > IBM
> >     > audrey.lorberf...@ibm.com
--
nicolas
Re: POS Tagger
> Do you use the POS tagger at query time, or just at index time? I have the POS tagger pipeline ready but nothing done yet on the solr part. Right now I am wondering how to use it and still looking for a relevant implementation. I guess having the POS information ready before indexing gives the flexibility to test multiple scenarios. In the case of acronyms, one possible way is indeed to consider the user query terms as NOUNs, and on the index side, only keep the acronyms that are tagged as NOUNs (i.e. detect acronyms within the text, look up their POS, and remove them when they are not NOUNs). I definitely prefer the pre-processing approach for this, rather than creating dedicated solr analyzers, because my context is batch processing, and also this simplifies testing and debugging - while offering a large panel of NLP tools to deal with. On Fri, Oct 25, 2019 at 04:09:29PM +, Audrey Lorberfeld - audrey.lorberf...@ibm.com wrote: > Nicolas, > > Do you use the POS tagger at query time, or just at index time? > > We are thinking of using it to filter the tokens we will eventually perform > ML on. Basically, we have a bunch of acronyms in our corpus. However, many > departments use the same acronyms but expand those acronyms to different > things. Eventually, we are thinking of using ML on our index to determine > which expansion is meant by a particular query according to the context we > find in certain documents. However, since we don't want to run ML on all > tokens in a query, and since we think that acronyms are usually the nouns in > a multi-token query, we want to only feed nouns to the ML model (TBD). > > Does that make sense? So, we'd want both an index-side POS tagger (could be > slow), and also a query-side POS tagger (must be fast). > > -- > Audrey Lorberfeld > Data Scientist, w3 Search > IBM > audrey.lorberf...@ibm.com > > > On 10/25/19, 11:57 AM, "Nicolas Paris" wrote: > > Also we are using stanford POS tagger for french. The processing time is > mitigated by the spark-corenlp package which distribute the process over > multiple node. > > Also I am interesting in the way you use POS information within solr > queries, or solr fields. > > Thanks, > On Fri, Oct 25, 2019 at 10:42:43AM -0400, David Hastings wrote: > > ah, yeah its not the fastest but it proved to be the best for my > purposes, > > I use it to pre-process data before indexing, to apply more metadata to > the > > documents in a separate field(s) > > > > On Fri, Oct 25, 2019 at 10:40 AM Audrey Lorberfeld - > > audrey.lorberf...@ibm.com wrote: > > > > > No, I meant for part-of-speech tagging __ But that's interesting that > you > > > use StanfordNLP. I've read that it's very slow, so we are concerned > that it > > > might not work for us at query-time. Do you use it at query-time, or > just > > > index-time? > > > > > > -- > > > Audrey Lorberfeld > > > Data Scientist, w3 Search > > > IBM > > > audrey.lorberf...@ibm.com > > > > > > > > > On 10/25/19, 10:30 AM, "David Hastings" > > > wrote: > > > > > > Do you mean for entity extraction? > > > I make a LOT of use from the stanford nlp project, and get out the > > > entities > > > and use them for different purposes in solr > > > -Dave > > > > > > On Fri, Oct 25, 2019 at 10:16 AM Audrey Lorberfeld - > > > audrey.lorberf...@ibm.com wrote: > > > > > > > Hi All, > > > > > > > > Does anyone use a POS tagger with their Solr instance other than > > > > OpenNLP’s? We are considering OpenNLP, SpaCy, and Watson. > > > > Thanks!
> > > > > > > > -- > > > > Audrey Lorberfeld > > > > Data Scientist, w3 Search > > > > IBM > > > > audrey.lorberf...@ibm.com > > > > > > > > > > > > > > > > > > > -- > nicolas > > -- nicolas
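The noun-only acronym filtering described in the message above could look like this on the pre-processing side. Pure Spark, assuming a per-sentence tokens/posTags structure such as the one produced by a CoreNLP tagging step; the acronym regex, the NOUN/NN tag check and the column names are only illustrative.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.udf

    object NounAcronyms {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("noun-acronyms").getOrCreate()
        import spark.implicits._

        // hypothetical input: parallel arrays of tokens and POS tags per sentence
        val tagged = spark.read.parquet("/data/tagged_documents")

        // keep a token only if it looks like an acronym (2-6 capital letters)
        // and its tag is a noun (universal "NOUN" or Penn-style "NN*")
        val nounAcronyms = udf { (tokens: Seq[String], tags: Seq[String]) =>
          tokens.zip(tags).collect {
            case (tok, tag) if tok.matches("[A-Z]{2,6}") && (tag == "NOUN" || tag.startsWith("NN")) => tok
          }
        }

        tagged
          .select($"id", nounAcronyms($"tokens", $"posTags").as("noun_acronyms"))
          .show(10, truncate = false)
      }
    }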
Re: POS Tagger
Also, the OpenNLP Solr POS filter [1] uses the TypeAsSynonymFilter to store the POS: "Index the POS for each token as a synonym, after prefixing the POS with @". Not sure how to deal with the POS after such indexing, but this looks like an interesting approach? [1] http://lucene.apache.org/solr/guide/7_3/language-analysis.html#opennlp-part-of-speech-filter On Fri, Oct 25, 2019 at 06:25:36PM +0200, Nicolas Paris wrote: > > Do you use the POS tagger at query time, or just at index time? > > I have the POS tagger pipeline ready but nothing done yet on the solr > part. Right now I am wondering how to use it but still looking for > relevant implementation. > > I guess having the POS information ready before indexation gives the > flexibility to test multiple scenario. > > In case of acronyms, one possible way is indeed to consider the user > query as NOUNS, and from the index side, only keep the acronyms that > are tagged with NOUNS. (i.e. detect acronyms within text, and look for > it's POS; remove it in case it's not a NOUN) > > Definitely, I prefer the pre-processing approach for this, than creating > dedicated solr analysers because my context is batch processing, and > also this simplifies testing and debugging - while offering large panel > of NLP tools to deal with. > > On Fri, Oct 25, 2019 at 04:09:29PM +, Audrey Lorberfeld - > audrey.lorberf...@ibm.com wrote: > > Nicolas, > > > > Do you use the POS tagger at query time, or just at index time? > > > > We are thinking of using it to filter the tokens we will eventually perform > > ML on. Basically, we have a bunch of acronyms in our corpus. However, many > > departments use the same acronyms but expand those acronyms to different > > things. Eventually, we are thinking of using ML on our index to determine > > which expansion is meant by a particular query according to the context we > > find in certain documents. However, since we don't want to run ML on all > > tokens in a query, and since we think that acronyms are usually the nouns > > in a multi-token query, we want to only feed nouns to the ML model (TBD). > > > > Does that make sense? So, we'd want both an index-side POS tagger (could be > > slow), and also a query-side POS tagger (must be fast). > > > > -- > > Audrey Lorberfeld > > Data Scientist, w3 Search > > IBM > > audrey.lorberf...@ibm.com > > > > > > On 10/25/19, 11:57 AM, "Nicolas Paris" wrote: > > > > Also we are using stanford POS tagger for french. The processing time is > > mitigated by the spark-corenlp package which distribute the process over > > multiple node. > > > > Also I am interesting in the way you use POS information within solr > > queries, or solr fields. > > > > Thanks, > > On Fri, Oct 25, 2019 at 10:42:43AM -0400, David Hastings wrote: > > > ah, yeah its not the fastest but it proved to be the best for my > > purposes, > > > I use it to pre-process data before indexing, to apply more metadata > > to the > > > documents in a separate field(s) > > > > > > On Fri, Oct 25, 2019 at 10:40 AM Audrey Lorberfeld - > > > audrey.lorberf...@ibm.com wrote: > > > > > > > No, I meant for part-of-speech tagging __ But that's interesting > > that you > > > > use StanfordNLP. I've read that it's very slow, so we are concerned > > that it > > > > might not work for us at query-time. Do you use it at query-time, > > or just > > > > index-time?
> > > > > > > > -- > > > > Audrey Lorberfeld > > > > Data Scientist, w3 Search > > > > IBM > > > > audrey.lorberf...@ibm.com > > > > > > > > > > > > On 10/25/19, 10:30 AM, "David Hastings" > > > > > > wrote: > > > > > > > > Do you mean for entity extraction? > > > > I make a LOT of use from the stanford nlp project, and get out > > the > > > > entities > > > > and use them for different purposes in solr > > > > -Dave > > > > > > > > On Fri, Oct 25, 2019 at 10:16 AM Audrey Lorberfeld - > > > > audrey.lorberf...@ibm.com wrote: > > > > > > > > > Hi All, > > > > > > > > > > Does anyone use a POS tagger with their Solr instance other > > than > > > > > OpenNLP’s? We are considering OpenNLP, SpaCy, and Watson. > > > > > > > > > > Thanks! > > > > > > > > > > -- > > > > > Audrey Lorberfeld > > > > > Data Scientist, w3 Search > > > > > IBM > > > > > audrey.lorberf...@ibm.com > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > nicolas > > > > > > -- > nicolas > -- nicolas
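For reference, the analysis chain implied by [1] looks roughly like the following fieldType. This is a sketch, not a tested configuration: the model file names are placeholders for whichever OpenNLP models are actually installed.

    <fieldType name="text_pos" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.OpenNLPTokenizerFactory"
                   sentenceModel="fr-sent.bin"
                   tokenizerModel="fr-token.bin"/>
        <filter class="solr.OpenNLPPOSFilterFactory" posTaggerModel="fr-pos-maxent.bin"/>
        <!-- index each POS tag as a synonym of its token, prefixed with @ -->
        <filter class="solr.TypeAsSynonymFilterFactory" prefix="@"/>
      </analyzer>
    </fieldType>

Since synonyms share a token position, each token and its @-prefixed tag end up at the same position, which is what makes position-aware queries over the tags possible.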
Re: Solr Ref Guide Changes - now HTML only
> If you are someone who wishes the PDF would continue, please share your > feedback. I have not particularly explored the documentation format, only the content. However, here are my thoughts on this: the PDF version of the solr documentation has two advantages: 1. it is readable offline 2. it makes searching easier than the html version If there were a "one page" version of the html documentation, this would mitigate the difficulty of searching within the whole guide. Also a monolithic html page makes things easier to access offline (and to transform back to pdf, ebook..?). I am not very happy with the search engine embedded within the html documentation I admit. Hope this is not solr under the hood :S -- nicolas
CloudSolrClient - basic auth - multi shard collection
Hello, I am having trouble with basic auth on a solrcloud instance. When the collection has only one shard, there is no problem. When the collection has multiple shards, there is no problem until I issue multiple queries concurrently: then I get 401 errors asking for credentials on the concurrent queries. I have created a Preemptive Auth Interceptor which should add the credential information to every http call. Thanks for any pointer, solr:8.1 spring-data-solr:4.1.0 -- nicolas
Re: CloudSolrClient - basic auth - multi shard collection
> you can fix the issue by upgrading to 8.2 - both of those Thanks, I will try ASAP On Wed, Nov 20, 2019 at 01:58:31PM -0500, Jason Gerlowski wrote: > Hi Nicholas, > > I'm not really familiar with spring-data-solr, so I can't speak to > that detail, but it sounds like you might be running into either > https://issues.apache.org/jira/browse/SOLR-13510 or > https://issues.apache.org/jira/browse/SOLR-13472. There are partial > workarounds on those issues that might help you. If those aren't > sufficient, you can fix the issue by upgrading to 8.2 - both of those > bugs are fixed in that version. > > Hope that helps, > > Jason > > > On Mon, Nov 18, 2019 at 8:26 AM Nicolas Paris > wrote: > > > > Hello, > > > > I am having trouble with basic auth on a solrcloud instance. When the > > collection is only one shard, there is no problem. When the collection > > is multiple shard, there is no problem until I ask multiple query > > concurrently: I get 401 error and asking for credentials for concurrent > > queries. > > > > I have created a Premptive Auth Interceptor which should add the > > credential information for every http call. > > > > Thanks for any pointer, > > > > solr:8.1 > > spring-data-solr:4.1.0 > > -- > > nicolas > -- nicolas
Re: A Last Message to the Solr Users
Hi Mark, Have you shared with the community all the weaknesses of solrcloud you have in mind and the advice to overcome that? Apparently you wrote most of that code and your feedback would be helpful for community. Regards On Sat, Nov 30, 2019 at 09:31:34PM -0600, Mark Miller wrote: > I’d also like to say the last 5 years of my life have been spent being paid > to upgrade Solr systems. I’ve made a lot of doing this. > > As I said from the start - take this for what it’s worst. For his guys it’s > not worth much. That’s cool. > > And it’s a little inside joke that I’ll be back :) I joke a lot. > > But seriously, you have a second chance here. > > This mostly concerns SolrCloud. That’s why I recommend standalone mode. But > key people know why to do. I know it will happen - but their lives will be > easier if you help. > > Lol. > > - Mark > > On Sat, Nov 30, 2019 at 9:25 PM Mark Miller wrote: > > > I said the key people understand :) > > > > I’ve worked in Lucene since 2006 and have an insane amount of the code > > foot print in Solr and SolrCloud :) Look up the stats. Do you have any > > contributions? > > > > I said the key people know. > > > > Solr stand-alone is and has been very capable. People are working around > > SolrCloud too.All fine and good. Millions are being made and saved. > > Everyone is comfortable. Some might thinks the sky looks clear and blue. > > I’ve spent a lot of capital to make sure the wrong people don’t think that > > anymore ;) > > > > Unless you are a Developer, you won’t understand the other issues. But you > > don’t need too. > > > > Mark > > > > On Sat, Nov 30, 2019 at 7:05 PM Dave wrote: > > > >> I’m young here I think, not even 40 and only been using solr since like > >> 2008 or so, so like 1.4 give or take. But I know a really good therapist if > >> you want to talk about it. > >> > >> > On Nov 30, 2019, at 6:56 PM, Mark Miller wrote: > >> > > >> > Now I have sacrificed to give you a new chance. A little for my > >> community. > >> > It was my community. But it was mostly for me. The developer I started > >> as > >> > would kick my ass today. Circumstances and luck has brought money to > >> our > >> > project. And it has corrupted our process, our community, and our code. > >> > > >> > In college i would talk about past Mark screwing future Mark and too bad > >> > for him. What did he ever do for me? Well, he got me again ;) > >> > > >> > I’m out of steam, time and wife patentice. > >> > > >> > Enough key people are aware of the scope of the problem now that you > >> won’t > >> > need me. I was never actually part of the package. To the many, many > >> people > >> > that offered me private notes of encouragement and future help - thank > >> you > >> > so much. Your help will be needed. > >> > > >> > You will reset. You will fix this. Or I will be back. > >> > > >> > Mark > >> > > >> > > >> > -- > >> > - Mark > >> > > >> > http://about.me/markrmiller > >> > > -- > > - Mark > > > > http://about.me/markrmiller > > > -- > - Mark > > http://about.me/markrmiller -- nicolas
does copyFields increase index size?
Hi From my understanding, copyField creates a new index from the copied fields. From my tests, I copied 1k textual fields into _text_ with copyField. As a result there is no increase in the size of the collection. All the source fields are indexed and stored. The _text_ field is indexed but not stored. This is a great surprise, but is this behavior expected? -- nicolas
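For context, a setup like the one discussed here (stored section_* source fields, an index-only catch-all field, as detailed later in this thread) would look roughly like the following in managed-schema. This is a hypothetical excerpt; the field and type names are illustrative.

    <dynamicField name="section_*" type="text_general" indexed="true" stored="true"/>
    <field name="_text_" type="text_general" indexed="true" stored="false" multiValued="true"/>
    <copyField source="section_*" dest="_text_"/>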
Re: does copyFields increase index size?
> The action of changing the schema makes zero changes in the index. It > merely changes how Solr interacts with the index. Do you mean adding a copyField is only a matter of changing the schema? I was thinking it was adding a new field, and possibly a new index, to the collection. On Tue, Dec 24, 2019 at 10:59:03AM -0700, Shawn Heisey wrote: > On 12/24/2019 10:45 AM, Nicolas Paris wrote: > > From my understanding, copy fields creates an new indexes from the > > copied fields. > > From my tests, I copied 1k textual fields into _text_ with copyFields. > > As a result there is no increase in the size of the collection. All the > > source fields are indexed and stored. The _text_ field is indexed but > > not stored. > > > > This is a great surprise but is this behavior expected ? > > The action of changing the schema makes zero changes in the index. It > merely changes how Solr interacts with the index. > > If you want the index to change when the schema is changed, you need to > restart or reload and then re-do the indexing after the change is saved. > > https://cwiki.apache.org/confluence/display/solr/HowToReindex > > Thanks, > Shawn > -- nicolas
Re: does copyFields increase index size?
> If you are redoing the indexing after changing the schema and > reloading/restarting, then you can ignore me. I am sorry to say that I have to ignore you. Indeed, my tests include recreating the collection from scratch - with and without the copyField. In both cases the index size is the same! (while the _text_ field is working correctly) On Tue, Dec 24, 2019 at 05:32:09PM -0700, Shawn Heisey wrote: > On 12/24/2019 5:11 PM, Nicolas Paris wrote: > > Do you mean "copy fields" is only an action of changing the schema ? > > I was thinking it was adding a new field and eventually a new index to > > the collection > > The copy that copyField does happens at index time. Reindexing is required > after changing the schema, or nothing happens. > > If you are redoing the indexing after changing the schema and > reloading/restarting, then you can ignore me. > > Thanks, > Shawn > -- nicolas
Re: does copyFields increase index size?
On Wed, Dec 25, 2019 at 05:30:03AM -0500, Dave wrote: > #2 you initially said you were talking about 1k documents. Hi Dave. Again, sorry for the confusion. This is 1k fields (general_text), over 50M large documents, copied into one _text_ field. 4 shards, 40GB per shard in both cases, with/without the _text_ field. > > > On Dec 25, 2019, at 3:07 AM, Nicolas Paris wrote: > > > > > >> > >> If you are redoing the indexing after changing the schema and > >> reloading/restarting, then you can ignore me. > > > > I am sorry to say that I have to ignore you. Indeed, my tests include > > recreating the collection from scratch - with and without the copy > > fields. > > In both cases the index size is the same ! (while the _text_ field is > > working correctly) > > > >> On Tue, Dec 24, 2019 at 05:32:09PM -0700, Shawn Heisey wrote: > >>> On 12/24/2019 5:11 PM, Nicolas Paris wrote: > >>> Do you mean "copy fields" is only an action of changing the schema ? > >>> I was thinking it was adding a new field and eventually a new index to > >>> the collection > >> > >> The copy that copyField does happens at index time. Reindexing is required > >> after changing the schema, or nothing happens. > >> > >> If you are redoing the indexing after changing the schema and > >> reloading/restarting, then you can ignore me. > >> > >> Thanks, > >> Shawn > >> > > > > -- > > nicolas > -- nicolas
Re: does copyFields increase index size?
Anyway, that's good news: copyField does not increase index size in some circumstances: - the copied fields and the target field share the same datatype - the target field is not stored. This was tested on text fields. On Wed, Dec 25, 2019 at 11:42:23AM +0100, Nicolas Paris wrote: > > On Wed, Dec 25, 2019 at 05:30:03AM -0500, Dave wrote: > > #2 you initially said you were talking about 1k documents. > > Hi Dave. Again, sorry for the confusion. This is 1k fields > (general_text), over 50M large documents copied into one _text_ field. > 4 shards, 40GB per shard in both case, with/without the _text_ field > > > > > > On Dec 25, 2019, at 3:07 AM, Nicolas Paris > > > wrote: > > > > > > > > >> > > >> If you are redoing the indexing after changing the schema and > > >> reloading/restarting, then you can ignore me. > > > > > > I am sorry to say that I have to ignore you. Indeed, my tests include > > > recreating the collection from scratch - with and without the copy > > > fields. > > > In both cases the index size is the same ! (while the _text_ field is > > > working correctly) > > > > > >> On Tue, Dec 24, 2019 at 05:32:09PM -0700, Shawn Heisey wrote: > > >>> On 12/24/2019 5:11 PM, Nicolas Paris wrote: > > >>> Do you mean "copy fields" is only an action of changing the schema ? > > >>> I was thinking it was adding a new field and eventually a new index to > > >>> the collection > > >> > > >> The copy that copyField does happens at index time. Reindexing is > > >> required > > >> after changing the schema, or nothing happens. > > >> > > >> If you are redoing the indexing after changing the schema and > > >> reloading/restarting, then you can ignore me. > > >> > > >> Thanks, > > >> Shawn > > >> > > > > > > -- > > > nicolas > > > > -- > nicolas > -- nicolas
Re: does copyFields increase index size?
Hi Erick, Below is a part of the managed-schema. There are 1k section* fields. For the second experiment, I removed the copyField, dropped the collection and re-indexed the whole dataset. To measure the index size, I went to the solr admin UI and looked at the cloud part: 40GB per shard. I also looked at the folder size. I made some tests and the _text_ field is indexed. On Thu, Dec 26, 2019 at 02:16:32PM -0500, Erick Erickson wrote: > This simply cannot be true unless the destination copyField is indexed=false, > docValues=false stored=false. I.e. “some circumstances” means there’s really > no use in using the copyField in the first place. I suppose that if you don’t > store any term vectors, no position information nothing except, say, the > terms then maybe you’ll have extremely minimal size. But even in that case, > I’d use the original field in an “fq” clause which doesn’t use any scoring in > place of using the copyField. > > Each field is stored in a separate part of the relevant files (.tim, .pos, > etc). Term frequencies are kept on a _per field_ basis for instance. > > So this pretty much has to be small sample size or other measurement error. > > Best, > Erick > > > On Dec 26, 2019, at 9:27 AM, Nicolas Paris wrote: > > > > Anyway, that´s good news copy field does not increase indexe size in > > some circumstance: > > - the copied fields and the target field share the same datatype > > - the target field is not stored > > > > this is tested on text fields > > > > > > On Wed, Dec 25, 2019 at 11:42:23AM +0100, Nicolas Paris wrote: > >> > >> On Wed, Dec 25, 2019 at 05:30:03AM -0500, Dave wrote: > >>> #2 you initially said you were talking about 1k documents. > >> > >> Hi Dave. Again, sorry for the confusion. This is 1k fields > >> (general_text), over 50M large documents copied into one _text_ field. > >> 4 shards, 40GB per shard in both case, with/without the _text_ field > >> > >>> > >>>> On Dec 25, 2019, at 3:07 AM, Nicolas Paris > >>>> wrote: > >>>> > >>>> > >>>>> > >>>>> If you are redoing the indexing after changing the schema and > >>>>> reloading/restarting, then you can ignore me. > >>>> > >>>> I am sorry to say that I have to ignore you. Indeed, my tests include > >>>> recreating the collection from scratch - with and without the copy > >>>> fields. > >>>> In both cases the index size is the same ! (while the _text_ field is > >>>> working correctly) > >>>> > >>>>> On Tue, Dec 24, 2019 at 05:32:09PM -0700, Shawn Heisey wrote: > >>>>>> On 12/24/2019 5:11 PM, Nicolas Paris wrote: > >>>>>> Do you mean "copy fields" is only an action of changing the schema ? > >>>>>> I was thinking it was adding a new field and eventually a new index to > >>>>>> the collection > >>>>> > >>>>> The copy that copyField does happens at index time. Reindexing is > >>>>> required > >>>>> after changing the schema, or nothing happens. > >>>>> > >>>>> If you are redoing the indexing after changing the schema and > >>>>> reloading/restarting, then you can ignore me. > >>>>> > >>>>> Thanks, > >>>>> Shawn > >>>>> > >>>> > >>>> -- > >>>> nicolas > >>> > >> > >> -- > >> nicolas > >> > > > > -- > > nicolas > -- nicolas
Re: does copyFields increase index size?
> So what will be added is just another set of pointers to each relevant > term. That's not going to be very large. Probably Hi Shawn. This explains a lot! Thanks. In the case of text fields, highlighting is done on the source fields and the _text_ field is only used for lookup. This behavior is perfect for my needs. On Fri, Dec 27, 2019 at 05:28:25PM -0700, Shawn Heisey wrote: > On 12/26/2019 1:21 PM, Nicolas Paris wrote: > > Below a part of the managed-schema. There is 1k section* fields. The > > second experience, I removed the copyField, droped the collection and > > re-indexed the whole. To mesure the index size, I went to solr-cloud and > > looked in the cloud part: 40GO per shard. I also look at the folder > > size. I made some tests and the _text_ field is indexed. > > Your schema says that the destination field is not stored and doesn't have > docValues. So the only thing it has is indexed. > > All of the terms generated by index analysis will already be in the index > from the source fields. So what will be added is just another set of > pointers to each relevant term. That's not going to be very large. Probably > only a few bytes for each term. > > So with this copyField, the index will get larger, but probably not > significantly. > > Thanks, > Shawn > -- nicolas
Re: Coming back to search after some time... SOLR or Elastic for text search?
> We have implemented the content ingestion and processing pipelines already > in python and SPARK, so most of the data will be pushed in using APIs. I use the spark-solr library in production and have looked at the ES equivalent; the solr connector looks much more advanced for both loading and fetching data. In particular, the fetching part uses the solr export handler, which makes things incredibly fast. Also, spark-solr uses the dataframe API while the ES connector seems to be stuck with the RDD api AFAIK. A good connector to spark offers a lot of possibilities in terms of data transformation and advanced machine learning features on top of the search engine. On Tue, Jan 14, 2020 at 11:02:17PM -0500, Dc Tech wrote: > I am SOLR fant and had implemented it in our company over 10 years ago. > I moved away from that role and the new search team in the meanwhile > implemented a proprietary (and expensive) nosql style search engine. That > the project did not go well, and now I am back to project and reviewing the > technology stack. > > Some of the team think that ElasticSearch could be a good option, > especially since we can easily get hosted versions with AWS where we have > all the contractual stuff sorted out. > > Whle SOLR definitely seems more advanced (LTR, streaming expressions, > graph, and all the knobs and dials for relevancy tuning), Elastic may be > sufficient for our needs. It does not seem to have LTR out of the box but > the relevancy tuning knobs and dials seem to be similar to what SOLR has. > > The corpus size is not a challenge - we have about one million document, > of which about 1/2 have full text, while the test are simpler (i.e. company > directory etc.). > The query volumes are also quite low (max 5/second at peak). > We have implemented the content ingestion and processing pipelines already > in python and SPARK, so most of the data will be pushed in using APIs. > > I would really appreciate any guidance from the community !! -- nicolas
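A minimal sketch of the fetching side with spark-solr. The connection details, collection, query and field names are placeholders; when the requested fields have docValues, the connector can stream results back through the /export handler, which is the fast path mentioned above.

    import org.apache.spark.sql.SparkSession

    object SolrRead {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("solr-read").getOrCreate()

        // hypothetical connection details; the listed fields should have docValues
        // so the results can be streamed back efficiently
        val results = spark.read.format("solr")
          .option("zkhost", "zk1:2181,zk2:2181,zk3:2181/solr")
          .option("collection", "my_collection")
          .option("query", "content_txt:apache")
          .option("fields", "id,type_s")
          .load()

        results.show(10)
      }
    }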