Uhm, actually, If you have copyField from multiple sources into that _text_ field, you may be accumulating/duplicating content on update.
Check what happens to the content of that _text_ field when you do full-text and then do an attribute update. If I am right, you may want to have a separate "original_text" field that you store and then have your aggregate copyField destination not stored. Regards, Alex. ---- http://www.solr-start.com/ - Resources for Solr users, new and experienced On 8 March 2017 at 13:41, Nicolas Bouillon <nico2000...@yahoo.com.invalid> wrote: > Guys > > A BIG thank you, it works perfectly!!! > > After so much research I finally got my solution working. > > That was the trick, _text_ is stored and it’s working as expected. > > Have a very nice day and thanks a lot for your contribution. > > Really appreciated > > Nico >> On 8 Mar 2017, at 18:26, Nicolas Bouillon <nico2000...@yahoo.com.INVALID> >> wrote: >> >> Hi Erick, Shawn, >> >> Thx really a lot for your swift reaction, it’s fantastic. >> Let me answer both your answers: >> >> 1) the df entry in solrconfig.xml has not been changed: >> >> <str name="df">_text_</str> >> >> 2)when I do a query for full-text search I don’t specify a field, I just >> enter the string I’m looking for in the q parameter: >> >> Like this: I have a ppt containing the word “Microsoft”that is called >> “Dynamics 365 Roadmap”, I do a query on “Microsoft”and it finds the document >> After update, it doesn’t find it unless I search for one of my custom fields >> or something in the title like “Dynamics” >> >> So, my conclusion would be that you suggest I mark “_text_” as stored=true >> in the schema, right? >> And reload core or even re-index. >> >> Thx a bunch >> >> >> >> >>> On 8 Mar 2017, at 17:46, Erick Erickson <erickerick...@gmail.com> wrote: >>> >>> bq: I wonder if it won’t be simpler for me to write a custom handler >>> >>> Probably not, that would be Java too ;)... >>> >>> OK, back up a bit. You can change your schema such that the full-text >>> field _is_ stored, I don't quite know what the default field is from >>> memory, but you must be searching against it ;). It sounds like you're >>> using the defaults and it's _probably_ _text_. And my guess is that >>> you're searching on that field even though you don't specify, see the >>> "df" entry in your solrconfig.xml file. There's no reason you can't >>> change that to stored="true" (reindex of course). >>> >>> Nothing that you've mentioned so far looks like it should take >>> anything except getting your configurations to be what you need, so >>> don't make more work for yourself than you need to ;). >>> >>> After that, see the link Shawn provided... >>> >>> Best, >>> Erick >>> >>> On Wed, Mar 8, 2017 at 8:22 AM, Nicolas Bouillon >>> <nico2000...@yahoo.com.invalid> wrote: >>>> Hi Erick >>>> >>>> Thanks a lot for the elaborated answer. Let me give some precisions: >>>> >>>> 1. I upload the docs using an AJAX post multiform to my server. >>>> 2. The PHP target of the post, takes the file and stores it on disk >>>> 3. If the file is moved successfully from TEMP files to final destination, >>>> I then call SOLR as follows: >>>> >>>> It’s a curl POST request: >>>> >>>> URL: http://my_server:8983/solr/my_core/update/extract/?" . $fields . >>>> "&literal.id=" . $id . "&filetypes=*&commit=true >>>> HEADER: Content-type: multipart/form-data >>>> POSTFIELDS: the entire file that has just been stored >>>> (BTW, it’s PHP specific but I send a CurlFile in an array as follows: >>>> array('myfile' => $cfile) >>>> >>>> In the URL, the parameter $fields contains the following: >>>> >>>> $fields = "literal.kref=" . $id . "&literal.ktype=" . $type . >>>> "&literal.kattachment=" . $attachment; >>>> >>>> Where kref, ktype and kattachment are my custom fields (that I added to >>>> the schema.xml previously) >>>> >>>> So, indeed it’s Tika that extracts the info. I didn’t change anything to >>>> the ExtractHandler. >>>> >>>> I read about the fact that all fields must be marked as stored=true but: >>>> >>>> - I checked in the schema, all the fields that matter (Tika default >>>> extracted fields) and my customer fields are stored=true. >>>> - I suppose that the full-text index is not stored in a field? And >>>> therefore cannot be marked as stored? >>>> >>>> I manage to upload files and mark my docs with metadata but I have >>>> existing files where I would like to update my fields (kref, …) without >>>> re-extracting and I’d like also to allow for re-indexing if needed without >>>> overriding my fields. >>>> >>>> I’m stuck… I wonder if it won’t be simpler for me to write a custom >>>> handler of some sort but I don’t really program in Java. >>>> >>>> Cheers >>>> >>>> Nico >>>> >>>>> On 8 Mar 2017, at 17:03, Erick Erickson <erickerick...@gmail.com> wrote: >>>>> >>>>> Nico: >>>>> >>>>> This is the place for such questions! I'm not quite sure the source >>>>> of the docs. When you say you "extract", does that mean you're using >>>>> the ExtractingRequestHandler, i.e. uploading PDF or Word etc. to Solr >>>>> and letting Tika parse it out? IOW, where is the fulltext coming from? >>>>> >>>>> For adding tags any time, Solr has "Atomic Updates" that has a couple >>>>> of requirements, mainly you have to set stored="true" for all your >>>>> fields _except_ the destinations for any <copyField> directives. Under >>>>> the covers this pulls the stored data from Solr, overlays it with the >>>>> new data you've sent and re-indexes it. The expense here is that your >>>>> index will increase in size, but storing the data doesn't mean much of >>>>> an increase in JVM requirements. That is, say your index doubles in >>>>> size. Your JVM heap requirements may increase 5% (and, frankly I doubt >>>>> that much, but I've never measured). FWIW, the on-disk size should >>>>> increase by roughly 50% of the raw data size. WARNING: "raw data size" >>>>> is the size _after_ extraction, so say you're indexing a 1K XML doc >>>>> where the tags are taking up .75K. Then the on-disk memory should go >>>>> up roughly .125K (50% of .25K).. >>>>> >>>>> Don't worry about "thousands" of docs ;) On my laptop I index over 1K >>>>> Wikipedia articles a second (YMMV of course). Without any particular >>>>> tuning. Without sharding. Very often the most expensive part of >>>>> indexing is acquiring the data in the first place, i.e. getting it >>>>> from a DB or extracting it from Tika. Solr will handle quite a load. >>>>> >>>>> And, if you're using the ExtractingRequestHandler, I'd seriously think >>>>> about moving it to a Client. Here's a Java example: >>>>> https://lucidworks.com/2012/02/14/indexing-with-solrj/ >>>>> >>>>> Best, >>>>> Erick >>>>> >>>>> On Wed, Mar 8, 2017 at 7:46 AM, Nicolas Bouillon >>>>> <nico2000...@yahoo.com.invalid> wrote: >>>>>> Dear SOLR friends, >>>>>> >>>>>> I developed a small ERP. I produce PDF documents linked to objects in my >>>>>> ERP: invoices, timesheets, contracts, etc... >>>>>> I have also the possibility to attach documents to a particular object >>>>>> and when I view an invoice for instance, I can see the attached >>>>>> documents. >>>>>> >>>>>> Until now, I was adding reference to these documents in my DB and store >>>>>> docs on the server. >>>>>> Still, I found it cumbersome and not flexible enough, so I removed the >>>>>> table documents from my DB and decided to use SOLR to add metadata to >>>>>> the documents in the index. >>>>>> >>>>>> Currently, I have the following custom fields: >>>>>> - ktype (string): invoice, contract, etc… >>>>>> - kattachment (int): 0 or 1 >>>>>> - kref (int): reference in DB of linked object, ex: 10 (for contract 10 >>>>>> in DB) >>>>>> - ktags (strings, mutifield): free tags, ex: customerX, consulting, >>>>>> development >>>>>> >>>>>> Each time I upload a document, I store in on server and then add it to >>>>>> SOLR using "extract" adding the metadata at the same time. It works fine. >>>>>> >>>>>> I would like now 3 things: >>>>>> >>>>>> - For existing documents that have not been extracted with metadata >>>>>> altogether at upload (documents uploaded before I developed the >>>>>> functionality), I'd like to update them with the proper metadata without >>>>>> losing the full-text search >>>>>> - Be able to add anytime tags to the ktags field after upload whilst >>>>>> keeping full-text search >>>>>> - In case I have to re-index, I want to be sure I don't have to restart >>>>>> everything from scratch. >>>>>> In a few months, I expect to have thousands of docs in my >>>>>> system....and then I'll add emails >>>>>> >>>>>> I have very little experience in SOLR. I know I can re-perform an >>>>>> extract instead of an update when I modify a field but I'm pretty sure >>>>>> it's not the right thing to do + performance problems can arise. >>>>>> >>>>>> What do you suggest me to do? >>>>>> >>>>>> I thought about storing the metadata linked to each document separately >>>>>> (in DB or separate XML file individually or one XML for all) but I'm >>>>>> pretty sure it will be very slow after a while. >>>>>> >>>>>> Thx a lot in advance fro your precious help. >>>>>> This is my first message to the user list, please excuse anything I may >>>>>> have done wrong…I learn fast, don’t worry.. >>>>>> >>>>>> Regards >>>>>> >>>>>> Nico >>>>>> >>>>>> My configuration: >>>>>> >>>>>> Synology 1511 running DSM 6.1 >>>>>> Docker container for SOLR using latest stable version >>>>>> 1 core called “katalyst” containing index of all documents >>>>>> >>>>>> ERP is written in PHP/Mysql for backend and Jquery/Bootstrap for >>>>>> front-end >>>>>> >>>>>> I have a test env on OSX Sierra running docker, a prod environment on >>>>>> Synology >>>>>> >>>>>> >>>> >> >