Hi Erick, Shawn,

Thx really a lot for your swift reaction, it's fantastic. Let me answer
both your points:

1) The df entry in solrconfig.xml has not been changed:

   <str name="df">_text_</str>

2) When I do a query for full-text search, I don't specify a field; I
just enter the string I'm looking for in the q parameter. Like this: I
have a ppt called "Dynamics 365 Roadmap" that contains the word
"Microsoft". I do a query on "Microsoft" and it finds the document.
After an update, it no longer finds it unless I search on one of my
custom fields or on something in the title, like "Dynamics".

So my conclusion would be that you suggest I mark "_text_" as
stored=true in the schema, right? And reload the core, or even
re-index.
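If so, and assuming my schema still has the stock _text_ definition
(I'm on the default configs, so this is just my guess at what to edit;
the field type may differ in other setups), the change would be
flipping stored on that one field:

   <field name="_text_" type="text_general" indexed="true" stored="true"
          multiValued="true"/>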
Thx a bunch

> On 8 Mar 2017, at 17:46, Erick Erickson <erickerick...@gmail.com> wrote:
>
> bq: I wonder if it won't be simpler for me to write a custom handler
>
> Probably not, that would be Java too ;)...
>
> OK, back up a bit. You can change your schema so that the full-text
> field _is_ stored. I don't quite know from memory what the default
> field is, but you must be searching against it ;). It sounds like
> you're using the defaults, so it's _probably_ _text_. And my guess is
> that you're searching on that field even though you don't specify it;
> see the "df" entry in your solrconfig.xml file. There's no reason you
> can't change that field to stored="true" (and reindex, of course).
>
> Nothing you've mentioned so far looks like it should take anything
> except getting your configuration to be what you need, so don't make
> more work for yourself than you need to ;).
>
> After that, see the link Shawn provided...
>
> Best,
> Erick
>
> On Wed, Mar 8, 2017 at 8:22 AM, Nicolas Bouillon
> <nico2000...@yahoo.com.invalid> wrote:
>> Hi Erick
>>
>> Thanks a lot for the detailed answer. Let me give some precisions:
>>
>> 1. I upload the docs using an AJAX multipart POST to my server.
>> 2. The PHP target of the POST takes the file and stores it on disk.
>> 3. If the file is moved successfully from the TEMP files to its final
>> destination, I then call SOLR as follows.
>>
>> It's a curl POST request:
>>
>> URL: http://my_server:8983/solr/my_core/update/extract/?" . $fields .
>> "&literal.id=" . $id . "&filetypes=*&commit=true
>> HEADER: Content-type: multipart/form-data
>> POSTFIELDS: the entire file that has just been stored
>> (BTW, it's PHP-specific, but I send a CurlFile in an array, as
>> follows: array('myfile' => $cfile).)
>>
>> In the URL, the parameter $fields contains the following:
>>
>> $fields = "literal.kref=" . $id . "&literal.ktype=" . $type .
>> "&literal.kattachment=" . $attachment;
>>
>> where kref, ktype and kattachment are my custom fields (which I added
>> to the schema.xml previously).
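>> Putting those pieces together, the PHP side looks roughly like this
>> (a trimmed-down sketch of the call; error handling is omitted, and
>> $path stands for the path of the file stored in step 2):
>>
>> $fields = "literal.kref=" . $id . "&literal.ktype=" . $type .
>>           "&literal.kattachment=" . $attachment;
>> $url = "http://my_server:8983/solr/my_core/update/extract/?" . $fields .
>>        "&literal.id=" . $id . "&filetypes=*&commit=true";
>>
>> // Send the stored file as multipart/form-data, as described above
>> $cfile = new CURLFile($path);
>> $ch = curl_init($url);
>> curl_setopt($ch, CURLOPT_POST, true);
>> curl_setopt($ch, CURLOPT_HTTPHEADER, array('Content-type: multipart/form-data'));
>> curl_setopt($ch, CURLOPT_POSTFIELDS, array('myfile' => $cfile));
>> curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
>> $response = curl_exec($ch);
>> curl_close($ch);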
>> So, indeed, it's Tika that extracts the info. I didn't change
>> anything in the ExtractingRequestHandler.
>>
>> I read that all fields must be marked as stored=true, but:
>>
>> - I checked in the schema: all the fields that matter (the default
>> Tika-extracted fields) and my custom fields are stored=true.
>> - I suppose that the full-text index is not stored in a field? And
>> therefore cannot be marked as stored?
>>
>> I manage to upload files and mark my docs with metadata, but I have
>> existing files where I would like to update my fields (kref, ...)
>> without re-extracting, and I'd also like to allow for re-indexing if
>> needed without overwriting my fields.
>>
>> I'm stuck... I wonder if it won't be simpler for me to write a custom
>> handler of some sort, but I don't really program in Java.
>>
>> Cheers
>>
>> Nico
>>
>>> On 8 Mar 2017, at 17:03, Erick Erickson <erickerick...@gmail.com> wrote:
>>>
>>> Nico:
>>>
>>> This is the place for such questions! I'm not quite sure of the
>>> source of the docs. When you say you "extract", does that mean
>>> you're using the ExtractingRequestHandler, i.e. uploading PDF or
>>> Word etc. to Solr and letting Tika parse it out? IOW, where is the
>>> full text coming from?
>>>
>>> For adding tags at any time, Solr has "Atomic Updates", which has a
>>> couple of requirements, mainly that you set stored="true" for all
>>> your fields _except_ the destinations of any <copyField> directives.
>>> Under the covers this pulls the stored data from Solr, overlays it
>>> with the new data you've sent, and re-indexes the document. The
>>> expense here is that your index will increase in size, but storing
>>> the data doesn't mean much of an increase in JVM requirements. That
>>> is, say your index doubles in size; your JVM heap requirements may
>>> increase 5% (and frankly I doubt even that much, but I've never
>>> measured). FWIW, the on-disk size should increase by roughly 50% of
>>> the raw data size. WARNING: "raw data size" is the size _after_
>>> extraction, so if you're indexing a 1K XML doc where the tags take
>>> up .75K, the on-disk size should go up roughly .125K (50% of .25K).
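>>> As a concrete sketch (untested, using the field names and example id
>>> from your earlier mails), adding tags to an existing doc without
>>> touching its other fields would look something like:
>>>
>>> curl -X POST -H 'Content-Type: application/json' \
>>>   'http://my_server:8983/solr/my_core/update?commit=true' \
>>>   --data-binary '[{"id":"10", "ktags":{"add":["customerX","consulting"]}}]'
>>>
>>> ("add" appends to a multiValued field; "set" would replace a field's
>>> value outright.)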
>>> Don't worry about "thousands" of docs ;). On my laptop I index over
>>> 1K Wikipedia articles a second (YMMV, of course), without any
>>> particular tuning and without sharding. Very often the most
>>> expensive part of indexing is acquiring the data in the first place,
>>> i.e. getting it from a DB or extracting it with Tika. Solr will
>>> handle quite a load.
>>>
>>> And if you're using the ExtractingRequestHandler, I'd seriously
>>> think about moving that work to a client. Here's a Java example:
>>> https://lucidworks.com/2012/02/14/indexing-with-solrj/
>>>
>>> Best,
>>> Erick
>>>
>>> On Wed, Mar 8, 2017 at 7:46 AM, Nicolas Bouillon
>>> <nico2000...@yahoo.com.invalid> wrote:
>>>> Dear SOLR friends,
>>>>
>>>> I developed a small ERP. I produce PDF documents linked to objects
>>>> in my ERP: invoices, timesheets, contracts, etc.
>>>> I also have the possibility to attach documents to a particular
>>>> object, and when I view an invoice, for instance, I can see the
>>>> attached documents.
>>>>
>>>> Until now, I was adding references to these documents in my DB and
>>>> storing the docs on the server.
>>>> Still, I found it cumbersome and not flexible enough, so I removed
>>>> the documents table from my DB and decided to use SOLR to add
>>>> metadata to the documents in the index.
>>>>
>>>> Currently, I have the following custom fields:
>>>> - ktype (string): invoice, contract, etc.
>>>> - kattachment (int): 0 or 1
>>>> - kref (int): reference in the DB of the linked object, e.g. 10
>>>> (for contract 10 in the DB)
>>>> - ktags (string, multivalued): free tags, e.g. customerX,
>>>> consulting, development
>>>>
>>>> Each time I upload a document, I store it on the server and then
>>>> add it to SOLR using "extract", adding the metadata at the same
>>>> time. It works fine.
>>>>
>>>> I would now like 3 things:
>>>>
>>>> - For existing documents that were not extracted with metadata at
>>>> upload (documents uploaded before I developed the functionality),
>>>> I'd like to update them with the proper metadata without losing
>>>> the full-text search.
>>>> - To be able to add tags to the ktags field at any time after
>>>> upload, whilst keeping full-text search.
>>>> - In case I have to re-index, I want to be sure I don't have to
>>>> restart everything from scratch.
>>>> In a few months, I expect to have thousands of docs in my
>>>> system... and then I'll add emails.
>>>>
>>>> I have very little experience with SOLR. I know I can re-perform an
>>>> extract instead of an update when I modify a field, but I'm pretty
>>>> sure it's not the right thing to do, and performance problems can
>>>> arise.
>>>>
>>>> What do you suggest I do?
>>>>
>>>> I thought about storing the metadata linked to each document
>>>> separately (in the DB, in a separate XML file per document, or in
>>>> one XML file for all), but I'm pretty sure it will become very slow
>>>> after a while.
>>>>
>>>> Thx a lot in advance for your precious help.
>>>> This is my first message to the user list, please excuse anything I
>>>> may have done wrong... I learn fast, don't worry.
>>>>
>>>> Regards
>>>>
>>>> Nico
>>>>
>>>> My configuration:
>>>>
>>>> Synology 1511 running DSM 6.1
>>>> Docker container for SOLR using the latest stable version
>>>> 1 core called "katalyst" containing the index of all documents
>>>>
>>>> The ERP is written in PHP/MySQL for the backend and
>>>> jQuery/Bootstrap for the front-end.
>>>>
>>>> I have a test env on OSX Sierra running Docker, and a prod
>>>> environment on Synology.