How are you updating? All of the advice about stored fields assumes you're
using "Atomic Updates".
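
If you re-POST the whole file (or re-run extract), that's a full replace,
not an atomic update. An atomic update is a small JSON request against
/update, roughly like this (the id and tag value are made up; ktags is
from your earlier mail):

[{"id":"42", "ktags":{"add":"customerX"}}]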



On Wed, Mar 8, 2017 at 11:15 AM, Alexandre Rafalovitch
<arafa...@gmail.com> wrote:
> Uhm, actually, if you have a copyField from multiple sources into that
> _text_ field, you may be accumulating/duplicating content on update.
>
> Check what happens to the content of that _text_ field when you do the
> full-text extract and then do an attribute update.
>
> If I am right, you may want to have a separate "original_text" field
> that you store and then have your aggregate copyField destination not
> stored.
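>
> Roughly like this in the schema (untested sketch, "original_text" is just
> an example name; check the types against your managed-schema):
>
>   <field name="original_text" type="text_general" indexed="false" stored="true"/>
>   <field name="_text_" type="text_general" indexed="true" stored="false" multiValued="true"/>
>   <copyField source="original_text" dest="_text_"/>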
>
> Regards,
>    Alex.
> ----
> http://www.solr-start.com/ - Resources for Solr users, new and experienced
>
>
> On 8 March 2017 at 13:41, Nicolas Bouillon
> <nico2000...@yahoo.com.invalid> wrote:
>> Guys
>>
>> A BIG thank you, it works perfectly!!!
>>
>> After so much research I finally got my solution working.
>>
>> That was the trick: _text_ is stored and it’s working as expected.
>>
>> Have a very nice day and thanks a lot for your contribution.
>>
>> Really appreciated
>>
>> Nico
>>> On 8 Mar 2017, at 18:26, Nicolas Bouillon <nico2000...@yahoo.com.INVALID> 
>>> wrote:
>>>
>>> Hi Erick, Shawn,
>>>
>>> Thx a lot for your swift reaction, it’s fantastic.
>>> Let me reply to both of your answers:
>>>
>>> 1) the df entry in solrconfig.xml has not been changed:
>>>
>>> <str name="df">_text_</str>
>>>
>>> 2) when I do a query for full-text search I don’t specify a field; I just 
>>> enter the string I’m looking for in the q parameter:
>>>
>>> Like this: I have a ppt called “Dynamics 365 Roadmap” containing the word 
>>> “Microsoft”. I do a query on “Microsoft” and it finds the document.
>>> After an update, it no longer finds it unless I search for one of my custom 
>>> fields or something in the title like “Dynamics”.
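>>>
>>> (So the request is essentially just something like
>>> http://my_server:8983/solr/my_core/select?q=Microsoft
>>> with no field named anywhere; it relies entirely on df. Host and core
>>> name here are from my setup.)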
>>>
>>> So, my conclusion would be that you suggest I mark “_text_” as stored=true 
>>> in the schema, right?
>>> And then reload the core, or even re-index.
>>>
>>> Thx a bunch
>>>
>>>
>>>
>>>
>>>> On 8 Mar 2017, at 17:46, Erick Erickson <erickerick...@gmail.com> wrote:
>>>>
>>>> bq: I wonder if it won’t be simpler for me to write a custom handler
>>>>
>>>> Probably not, that would be Java too ;)...
>>>>
>>>> OK, back up a bit. You can change your schema such that the full-text
>>>> field _is_ stored, I don't quite know what the default field is from
>>>> memory, but you must be searching against it ;). It sounds like you're
>>>> using the defaults and it's _probably_ _text_. And my guess is that
>>>> you're searching on that field even though you don't specify, see the
>>>> "df" entry in your solrconfig.xml file. There's no reason you can't
>>>> change that to stored="true" (reindex of course).
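>>>>
>>>> If I'm remembering the default managed-schema right, the definition is
>>>> something like this (check yours, I may be off on the type):
>>>>
>>>> <field name="_text_" type="text_general" indexed="true" stored="false" multiValued="true"/>
>>>>
>>>> and flipping stored to "true" is the whole change.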
>>>>
>>>> Nothing that you've mentioned so far looks like it should take
>>>> anything except getting your configurations to be what you need, so
>>>> don't make more work for yourself than you need to ;).
>>>>
>>>> After that, see the link Shawn provided...
>>>>
>>>> Best,
>>>> Erick
>>>>
>>>> On Wed, Mar 8, 2017 at 8:22 AM, Nicolas Bouillon
>>>> <nico2000...@yahoo.com.invalid> wrote:
>>>>> Hi Erick
>>>>>
>>>>> Thanks a lot for the elaborated answer. Let me give some precisions:
>>>>>
>>>>> 1. I upload the docs using an AJAX multipart POST to my server.
>>>>> 2. The PHP target of the post takes the file and stores it on disk.
>>>>> 3. If the file is moved successfully from the TEMP dir to its final 
>>>>> destination, I then call SOLR as follows:
>>>>>
>>>>> It’s a curl POST request:
>>>>>
>>>>> URL: "http://my_server:8983/solr/my_core/update/extract?" . $fields . 
>>>>> "&literal.id=" . $id . "&filetypes=*&commit=true"
>>>>> HEADER: Content-type: multipart/form-data
>>>>> POSTFIELDS: the entire file that has just been stored
>>>>> (BTW, it’s PHP-specific, but I send a CurlFile in an array as follows: 
>>>>> array('myfile' => $cfile))
>>>>>
>>>>> In the URL, the variable $fields contains the following:
>>>>>
>>>>> $fields = "literal.kref=" . $id . "&literal.ktype=" . $type . 
>>>>> "&literal.kattachment=" . $attachment;
>>>>>
>>>>> Where kref, ktype and kattachment are my custom fields (that I added to 
>>>>> the schema.xml previously)
>>>>>
>>>>> So, indeed it’s Tika that extracts the info. I didn’t change anything to 
>>>>> the ExtractHandler.
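>>>>>
>>>>> For completeness, the whole call is roughly this (a simplified sketch of
>>>>> my code; $path, $mimetype and the other variable values are placeholders):
>>>>>
>>>>> // note: $id, $type and $attachment should be urlencode()d if they
>>>>> // can ever contain spaces or special characters
>>>>> $fields = "literal.kref=" . $id . "&literal.ktype=" . $type
>>>>>     . "&literal.kattachment=" . $attachment;
>>>>> $url = "http://my_server:8983/solr/my_core/update/extract?" . $fields
>>>>>     . "&literal.id=" . $id . "&filetypes=*&commit=true";
>>>>> $cfile = new CURLFile($path, $mimetype, basename($path));
>>>>> $ch = curl_init($url);
>>>>> // passing an array to POSTFIELDS makes curl send multipart/form-data
>>>>> curl_setopt($ch, CURLOPT_POSTFIELDS, array('myfile' => $cfile));
>>>>> curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
>>>>> $response = curl_exec($ch);
>>>>> curl_close($ch);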
>>>>>
>>>>> I read that all fields must be marked as stored=true, but:
>>>>>
>>>>> - I checked in the schema: all the fields that matter (the default 
>>>>> Tika-extracted fields) and my custom fields are stored=true.
>>>>> - I suppose that the full-text index is not stored in a field? And 
>>>>> therefore cannot be marked as stored?
>>>>>
>>>>> I manage to upload files and mark my docs with metadata, but I have 
>>>>> existing files where I would like to update my fields (kref, …) without 
>>>>> re-extracting, and I’d also like to allow re-indexing if needed 
>>>>> without overwriting my fields.
>>>>>
>>>>> I’m stuck… I wonder if it won’t be simpler for me to write a custom 
>>>>> handler of some sort but I don’t really program in Java.
>>>>>
>>>>> Cheers
>>>>>
>>>>> Nico
>>>>>
>>>>>> On 8 Mar 2017, at 17:03, Erick Erickson <erickerick...@gmail.com> wrote:
>>>>>>
>>>>>> Nico:
>>>>>>
>>>>>> This is the place for such questions! I'm not quite sure of the source
>>>>>> of the docs. When you say you "extract", does that mean you're using
>>>>>> the ExtractingRequestHandler, i.e. uploading PDF or Word etc. to Solr
>>>>>> and letting Tika parse it out? IOW, where is the fulltext coming from?
>>>>>>
>>>>>> For adding tags any time, Solr has "Atomic Updates", which have a couple
>>>>>> of requirements; mainly you have to set stored="true" for all your
>>>>>> fields _except_ the destinations of any <copyField> directives. Under
>>>>>> the covers this pulls the stored data from Solr, overlays it with the
>>>>>> new data you've sent and re-indexes the result. The expense here is that
>>>>>> your index will increase in size, but storing the data doesn't mean much
>>>>>> of an increase in JVM requirements. That is, say your index doubles in
>>>>>> size; your JVM heap requirements may increase 5% (and, frankly, I doubt
>>>>>> even that much, but I've never measured). FWIW, the on-disk size should
>>>>>> increase by roughly 50% of the raw data size. WARNING: "raw data size"
>>>>>> is the size _after_ extraction, so say you're indexing a 1K XML doc
>>>>>> where the tags take up .75K. Then the on-disk size should go up by
>>>>>> roughly .125K (50% of the remaining .25K).
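>>>>>>
>>>>>> In your PHP, an atomic update would look something like this (untested
>>>>>> sketch; the id and field values are made up, only your field names are
>>>>>> real):
>>>>>>
>>>>>> // send only the id plus the fields being changed; Solr rebuilds the
>>>>>> // rest of the doc from its stored values
>>>>>> $doc = array(array(
>>>>>>     'id'    => '42',
>>>>>>     'kref'  => array('set' => 10),
>>>>>>     'ktags' => array('add' => 'customerX'),
>>>>>> ));
>>>>>> $ch = curl_init('http://my_server:8983/solr/my_core/update?commit=true');
>>>>>> curl_setopt($ch, CURLOPT_HTTPHEADER, array('Content-Type: application/json'));
>>>>>> curl_setopt($ch, CURLOPT_POSTFIELDS, json_encode($doc));
>>>>>> curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
>>>>>> $response = curl_exec($ch);
>>>>>> curl_close($ch);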
>>>>>>
>>>>>> Don't worry about "thousands" of docs ;) On my laptop I index over 1K
>>>>>> Wikipedia articles a second (YMMV of course). Without any particular
>>>>>> tuning. Without sharding. Very often the most expensive part of
>>>>>> indexing is acquiring the data in the first place, i.e. getting it
>>>>>> from a DB or extracting it from Tika. Solr will handle quite a load.
>>>>>>
>>>>>> And, if you're using the ExtractingRequestHandler, I'd seriously think
>>>>>> about moving it to a Client. Here's a Java example:
>>>>>> https://lucidworks.com/2012/02/14/indexing-with-solrj/
>>>>>>
>>>>>> Best,
>>>>>> Erick
>>>>>>
>>>>>> On Wed, Mar 8, 2017 at 7:46 AM, Nicolas Bouillon
>>>>>> <nico2000...@yahoo.com.invalid> wrote:
>>>>>>> Dear SOLR friends,
>>>>>>>
>>>>>>> I developed a small ERP. I produce PDF documents linked to objects in 
>>>>>>> my ERP: invoices, timesheets, contracts, etc.
>>>>>>> I can also attach documents to a particular object, and when I view an 
>>>>>>> invoice, for instance, I can see the attached documents.
>>>>>>>
>>>>>>> Until now, I was adding references to these documents in my DB and 
>>>>>>> storing the docs on the server.
>>>>>>> Still, I found it cumbersome and not flexible enough, so I removed the 
>>>>>>> documents table from my DB and decided to use SOLR to add metadata to 
>>>>>>> the documents in the index.
>>>>>>>
>>>>>>> Currently, I have the following custom fields:
>>>>>>> - ktype (string): invoice, contract, etc…
>>>>>>> - kattachment (int): 0 or 1
>>>>>>> - kref (int): reference in the DB of the linked object, e.g. 10 (for 
>>>>>>> contract 10 in the DB)
>>>>>>> - ktags (string, multiValued): free tags, e.g. customerX, consulting, 
>>>>>>> development
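>>>>>>>
>>>>>>> In the schema I declared them roughly like this (typed from memory, so 
>>>>>>> the exact types may be slightly off):
>>>>>>>
>>>>>>> <field name="ktype" type="string" indexed="true" stored="true"/>
>>>>>>> <field name="kattachment" type="int" indexed="true" stored="true"/>
>>>>>>> <field name="kref" type="int" indexed="true" stored="true"/>
>>>>>>> <field name="ktags" type="string" indexed="true" stored="true" multiValued="true"/>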
>>>>>>>
>>>>>>> Each time I upload a document, I store it on the server and then add it 
>>>>>>> to SOLR using "extract", adding the metadata at the same time. It works 
>>>>>>> fine.
>>>>>>>
>>>>>>> I would now like 3 things:
>>>>>>>
>>>>>>> - For existing documents that were not given metadata at upload time 
>>>>>>> (documents uploaded before I developed the functionality), I'd like to 
>>>>>>> update them with the proper metadata without losing the full-text search
>>>>>>> - To be able to add tags to the ktags field at any time after upload, 
>>>>>>> whilst keeping full-text search
>>>>>>> - In case I have to re-index, to be sure I don't have to restart 
>>>>>>> everything from scratch.
>>>>>>>      In a few months, I expect to have thousands of docs in my 
>>>>>>> system... and then I'll add emails
>>>>>>>
>>>>>>> I have very little experience with SOLR. I know I can re-run an 
>>>>>>> extract instead of an update when I modify a field, but I'm pretty sure 
>>>>>>> that's not the right thing to do, and performance problems can arise.
>>>>>>>
>>>>>>> What do you suggest I do?
>>>>>>>
>>>>>>> I thought about storing the metadata linked to each document separately 
>>>>>>> (in the DB, in a separate XML file per document, or in one XML file for 
>>>>>>> all), but I'm pretty sure it would become very slow after a while.
>>>>>>>
>>>>>>> Thx a lot in advance for your precious help.
>>>>>>> This is my first message to the user list, please excuse anything I may 
>>>>>>> have done wrong… I learn fast, don’t worry.
>>>>>>>
>>>>>>> Regards
>>>>>>>
>>>>>>> Nico
>>>>>>>
>>>>>>> My configuration:
>>>>>>>
>>>>>>> Synology 1511 running DSM 6.1
>>>>>>> Docker container for SOLR using latest stable version
>>>>>>> 1 core called “katalyst” containing the index of all documents
>>>>>>>
>>>>>>> The ERP is written in PHP/MySQL on the backend and jQuery/Bootstrap on 
>>>>>>> the front-end
>>>>>>>
>>>>>>> I have a test env on OSX Sierra running Docker, and a prod environment 
>>>>>>> on the Synology
>>>>>>>
>>>>>>>
>>>>>
>>>
>>
