Guys, a BIG thank you, it works perfectly!!!
After so much research I finally got my solution working. That was the
trick: _text_ is stored and it’s working as expected. Have a very nice
day and thanks a lot for your contribution. Really appreciated.

Nico

> On 8 Mar 2017, at 18:26, Nicolas Bouillon <nico2000...@yahoo.com.INVALID> wrote:
>
> Hi Erick, Shawn,
>
> Thanks a lot for your swift reaction, it’s fantastic.
> Let me reply to both your answers:
>
> 1) The df entry in solrconfig.xml has not been changed:
>
> <str name="df">_text_</str>
>
> 2) When I do a query for full-text search I don’t specify a field, I just
> enter the string I’m looking for in the q parameter.
>
> Like this: I have a ppt containing the word “Microsoft” that is called
> “Dynamics 365 Roadmap”. I do a query on “Microsoft” and it finds the
> document. After the update, it doesn’t find it unless I search for one of
> my custom fields or for something in the title like “Dynamics”.
>
> So, my conclusion would be that you suggest I mark “_text_” as stored=true
> in the schema, right?
> And reload the core, or even re-index.
>
> Thx a bunch
>
>> On 8 Mar 2017, at 17:46, Erick Erickson <erickerick...@gmail.com> wrote:
>>
>> bq: I wonder if it won’t be simpler for me to write a custom handler
>>
>> Probably not, that would be Java too ;)...
>>
>> OK, back up a bit. You can change your schema such that the full-text
>> field _is_ stored. I don't quite know what the default field is from
>> memory, but you must be searching against it ;). It sounds like you're
>> using the defaults and it's _probably_ _text_. And my guess is that
>> you're searching on that field even though you don't specify it; see the
>> "df" entry in your solrconfig.xml file. There's no reason you can't
>> change that to stored="true" (reindex of course).
>>
>> Nothing that you've mentioned so far looks like it should take
>> anything except getting your configurations to be what you need, so
>> don't make more work for yourself than you need to ;).
>>
>> After that, see the link Shawn provided...
>>
>> Best,
>> Erick
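For later readers: the change this thread converges on is a single attribute
in the schema. A minimal sketch, assuming the _text_ catch-all field
definition that ships with the default Solr 6.x configsets (your field name
and type may differ; only the stored attribute changes, followed by a
re-index):

    <!-- managed-schema / schema.xml: change stored="false" to
         stored="true" on the catch-all search field, then re-index
         existing documents so the extracted text is preserved -->
    <field name="_text_" type="text_general" indexed="true"
           stored="true" multiValued="true"/>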
>> On Wed, Mar 8, 2017 at 8:22 AM, Nicolas Bouillon
>> <nico2000...@yahoo.com.invalid> wrote:
>>> Hi Erick,
>>>
>>> Thanks a lot for the detailed answer. Let me give some details:
>>>
>>> 1. I upload the docs using an AJAX multipart POST to my server.
>>> 2. The PHP target of the POST takes the file and stores it on disk.
>>> 3. If the file is moved successfully from TEMP files to its final
>>> destination, I then call SOLR as follows.
>>>
>>> It’s a curl POST request:
>>>
>>> URL: http://my_server:8983/solr/my_core/update/extract/?" . $fields .
>>> "&literal.id=" . $id . "&filetypes=*&commit=true
>>> HEADER: Content-type: multipart/form-data
>>> POSTFIELDS: the entire file that has just been stored
>>> (BTW, it’s PHP-specific, but I send a CurlFile in an array as follows:
>>> array('myfile' => $cfile))
>>>
>>> In the URL, the parameter $fields contains the following:
>>>
>>> $fields = "literal.kref=" . $id . "&literal.ktype=" . $type .
>>> "&literal.kattachment=" . $attachment;
>>>
>>> Where kref, ktype and kattachment are my custom fields (which I added to
>>> the schema.xml previously).
>>>
>>> So, indeed, it’s Tika that extracts the info. I didn’t change anything in
>>> the ExtractingRequestHandler.
>>>
>>> I read that all fields must be marked as stored=true, but:
>>>
>>> - I checked in the schema: all the fields that matter (the default
>>> Tika-extracted fields) and my custom fields are stored=true.
>>> - I suppose that the full-text index is not stored in a field and
>>> therefore cannot be marked as stored?
>>>
>>> I manage to upload files and mark my docs with metadata, but I have
>>> existing files where I would like to update my fields (kref, …) without
>>> re-extracting, and I’d also like to allow for re-indexing if needed
>>> without overriding my fields.
>>>
>>> I’m stuck… I wonder if it won’t be simpler for me to write a custom
>>> handler of some sort, but I don’t really program in Java.
>>>
>>> Cheers
>>>
>>> Nico
>>>
>>>> On 8 Mar 2017, at 17:03, Erick Erickson <erickerick...@gmail.com> wrote:
>>>>
>>>> Nico:
>>>>
>>>> This is the place for such questions! I'm not quite sure of the source
>>>> of the docs. When you say you "extract", does that mean you're using
>>>> the ExtractingRequestHandler, i.e. uploading PDF or Word etc. to Solr
>>>> and letting Tika parse it out? IOW, where is the fulltext coming from?
>>>>
>>>> For adding tags at any time, Solr has "Atomic Updates", which has a
>>>> couple of requirements: mainly, you have to set stored="true" for all
>>>> your fields _except_ the destinations of any <copyField> directives.
>>>> Under the covers this pulls the stored data from Solr, overlays it with
>>>> the new data you've sent, and re-indexes it. The expense here is that
>>>> your index will increase in size, but storing the data doesn't mean
>>>> much of an increase in JVM requirements. That is, say your index
>>>> doubles in size. Your JVM heap requirements may increase 5% (and,
>>>> frankly, I doubt even that much, but I've never measured). FWIW, the
>>>> on-disk size should increase by roughly 50% of the raw data size.
>>>> WARNING: "raw data size" is the size _after_ extraction, so say you're
>>>> indexing a 1K XML doc where the tags take up .75K. Then the on-disk
>>>> size should go up by roughly .125K (50% of .25K).
>>>>
>>>> Don't worry about "thousands" of docs ;) On my laptop I index over 1K
>>>> Wikipedia articles a second (YMMV of course). Without any particular
>>>> tuning. Without sharding. Very often the most expensive part of
>>>> indexing is acquiring the data in the first place, i.e. getting it
>>>> from a DB or extracting it with Tika. Solr will handle quite a load.
>>>>
>>>> And, if you're using the ExtractingRequestHandler, I'd seriously think
>>>> about moving it to a client. Here's a Java example:
>>>> https://lucidworks.com/2012/02/14/indexing-with-solrj/
>>>>
>>>> Best,
>>>> Erick
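To make the "Atomic Updates" mechanics above concrete in PHP (the
poster's language), here is a minimal, hypothetical sketch. The field
names come from this thread; the document id and values are placeholders,
and the request goes to the standard /update endpoint with the JSON
"set"/"add" modifiers:

    <?php
    // Atomic update: change metadata fields on an already-indexed
    // document without re-sending or re-extracting the file. Solr
    // rebuilds the doc from its stored fields under the covers.
    $doc = [[
        'id'    => '42',                   // existing doc id (placeholder)
        'kref'  => ['set' => 10],          // overwrite a single-valued field
        'ktags' => ['add' => 'customerX'], // append to the multivalued tags
    ]];

    $ch = curl_init('http://my_server:8983/solr/my_core/update?commit=true');
    curl_setopt($ch, CURLOPT_POST, true);
    curl_setopt($ch, CURLOPT_HTTPHEADER, ['Content-Type: application/json']);
    curl_setopt($ch, CURLOPT_POSTFIELDS, json_encode($doc));
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    echo curl_exec($ch);
    curl_close($ch);

Note this only works once every field you want to keep (including the
extracted full text, per the fix confirmed at the top of the thread) is
stored.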
>>>> On Wed, Mar 8, 2017 at 7:46 AM, Nicolas Bouillon
>>>> <nico2000...@yahoo.com.invalid> wrote:
>>>>> Dear SOLR friends,
>>>>>
>>>>> I developed a small ERP. I produce PDF documents linked to objects in
>>>>> my ERP: invoices, timesheets, contracts, etc.
>>>>> I also have the possibility to attach documents to a particular
>>>>> object, and when I view an invoice, for instance, I can see the
>>>>> attached documents.
>>>>>
>>>>> Until now, I was adding references to these documents in my DB and
>>>>> storing the docs on the server.
>>>>> Still, I found it cumbersome and not flexible enough, so I removed the
>>>>> documents table from my DB and decided to use SOLR to add metadata to
>>>>> the documents in the index.
>>>>>
>>>>> Currently, I have the following custom fields:
>>>>> - ktype (string): invoice, contract, etc.
>>>>> - kattachment (int): 0 or 1
>>>>> - kref (int): reference in the DB of the linked object, e.g. 10 (for
>>>>> contract 10 in the DB)
>>>>> - ktags (string, multivalued): free tags, e.g. customerX, consulting,
>>>>> development
>>>>>
>>>>> Each time I upload a document, I store it on the server and then add
>>>>> it to SOLR using "extract", adding the metadata at the same time. It
>>>>> works fine.
>>>>>
>>>>> I would now like 3 things:
>>>>>
>>>>> - For existing documents that were not extracted with metadata at
>>>>> upload (documents uploaded before I developed the functionality), I’d
>>>>> like to update them with the proper metadata without losing full-text
>>>>> search.
>>>>> - To be able to add tags to the ktags field at any time after upload,
>>>>> whilst keeping full-text search.
>>>>> - In case I have to re-index, I want to be sure I don’t have to
>>>>> restart everything from scratch.
>>>>> In a few months I expect to have thousands of docs in my system...
>>>>> and then I’ll add emails.
>>>>>
>>>>> I have very little experience with SOLR. I know I can re-perform an
>>>>> extract instead of an update when I modify a field, but I’m pretty
>>>>> sure it’s not the right thing to do, plus performance problems can
>>>>> arise.
>>>>>
>>>>> What do you suggest I do?
>>>>>
>>>>> I thought about storing the metadata linked to each document
>>>>> separately (in the DB, or in a separate XML file per document, or one
>>>>> XML file for all), but I’m pretty sure it would become very slow after
>>>>> a while.
>>>>>
>>>>> Thanks a lot in advance for your precious help.
>>>>> This is my first message to the user list, so please excuse anything I
>>>>> may have done wrong… I learn fast, don’t worry.
>>>>>
>>>>> Regards,
>>>>>
>>>>> Nico
>>>>>
>>>>> My configuration:
>>>>>
>>>>> Synology 1511 running DSM 6.1
>>>>> Docker container for SOLR using the latest stable version
>>>>> 1 core called “katalyst” containing the index of all documents
>>>>>
>>>>> The ERP is written in PHP/MySQL for the backend and jQuery/Bootstrap
>>>>> for the front-end.
>>>>>
>>>>> I have a test env on macOS Sierra running Docker, and a prod
>>>>> environment on Synology.
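For completeness, here is the upload step Nico describes piecewise later in
the thread, assembled into one runnable sketch. The host, core, file path,
and field values are placeholders; the literal.* parameters and the
'myfile' => CURLFile pattern are taken from his own description:

    <?php
    // Send a stored file to the ExtractingRequestHandler (Tika) and
    // attach the custom metadata as literal.* parameters in the same
    // request.
    $id = 10; $type = 'invoice'; $attachment = 0;  // placeholder values

    $fields = 'literal.kref=' . $id
            . '&literal.ktype=' . urlencode($type)
            . '&literal.kattachment=' . $attachment;

    $url = 'http://my_server:8983/solr/my_core/update/extract/?' . $fields
         . '&literal.id=' . $id . '&filetypes=*&commit=true';

    $cfile = new CURLFile('/path/to/stored/file.pdf'); // file just saved

    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_POST, true);
    // Passing an array makes curl build the multipart/form-data body
    // (including the boundary) automatically.
    curl_setopt($ch, CURLOPT_POSTFIELDS, ['myfile' => $cfile]);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    echo curl_exec($ch);
    curl_close($ch);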