Hi Erick, Shawn,

Thx a lot for your swift reactions, it’s fantastic.
Let me answer both your points:

1) The df entry in solrconfig.xml has not been changed:

<str name="df">_text_</str>

2) When I do a query for full-text search, I don’t specify a field; I just 
enter the string I’m looking for in the q parameter.

For example: I have a PPT called “Dynamics 365 Roadmap” that contains the 
word “Microsoft”. Before the update, a query on “Microsoft” found the 
document. After the update, it doesn’t find it unless I search for one of my 
custom fields or for something in the title, like “Dynamics”.
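
For reference, the query is roughly this (placeholder server/core names as in 
my earlier mail; only q is set, so the df field is what gets searched):

http://my_server:8983/solr/my_core/select?q=Microsoft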

So, my conclusion would be that you suggest I mark “_text_” as stored="true" 
in the schema, right?
And then reload the core, or even re-index.
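
If so, I assume the change would look something like this in the schema (I’m 
guessing at the stock definition of _text_, with only stored flipped):

<field name="_text_" type="text_general" indexed="true" stored="true" multiValued="true"/>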

Thx a bunch




> On 8 Mar 2017, at 17:46, Erick Erickson <erickerick...@gmail.com> wrote:
> 
> bq: I wonder if it won’t be simpler for me to write a custom handler
> 
> Probably not, that would be Java too ;)...
> 
> OK, back up a bit. You can change your schema so that the full-text
> field _is_ stored. I don't quite know what the default field is from
> memory, but you must be searching against it ;). It sounds like you're
> using the defaults, so it's _probably_ _text_. And my guess is that
> you're searching on that field even though you don't specify one; see
> the "df" entry in your solrconfig.xml file. There's no reason you can't
> change that field to stored="true" (re-index, of course).
> 
> Nothing that you've mentioned so far looks like it should take
> anything except getting your configurations to be what you need, so
> don't make more work for yourself than you need to ;).
> 
> After that, see the link Shawn provided...
> 
> Best,
> Erick
> 
> On Wed, Mar 8, 2017 at 8:22 AM, Nicolas Bouillon
> <nico2000...@yahoo.com.invalid> wrote:
>> Hi Erick
>> 
>> Thanks a lot for the detailed answer. Let me clarify a few points:
>> 
>> 1. I upload the docs using an AJAX multipart POST to my server.
>> 2. The PHP target of the POST takes the file and stores it on disk.
>> 3. If the file is moved successfully from the TEMP directory to its final 
>> destination, I then call Solr as follows:
>> 
>> It’s a curl POST request:
>> 
>> URL: "http://my_server:8983/solr/my_core/update/extract/?" . $fields . 
>> "&literal.id=" . $id . "&filetypes=*&commit=true"
>> HEADER: Content-Type: multipart/form-data
>> POSTFIELDS: the entire file that has just been stored
>> (BTW, it’s PHP-specific, but I send a CURLFile in an array as follows: 
>> array('myfile' => $cfile))
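>> 
>> Putting it together, the call is roughly this sketch (variable names 
>> illustrative: $path, $mime and $name stand for whatever the upload handler 
>> saved; $fields and $id are built as described below):
>> 
>> $url = "http://my_server:8983/solr/my_core/update/extract/?" . $fields
>>     . "&literal.id=" . $id . "&filetypes=*&commit=true";
>> $cfile = new CURLFile($path, $mime, $name); // the file just written to disk
>> $ch = curl_init($url);
>> curl_setopt($ch, CURLOPT_POST, true);
>> // curl adds the multipart/form-data Content-Type (with its boundary)
>> // by itself when CURLOPT_POSTFIELDS is an array, so no manual header
>> curl_setopt($ch, CURLOPT_POSTFIELDS, array('myfile' => $cfile));
>> curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
>> $response = curl_exec($ch);
>> curl_close($ch);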
>> 
>> In the URL, the parameter $fields contains the following:
>> 
>> $fields = "literal.kref=" . $id . "&literal.ktype=" . $type . 
>> "&literal.kattachment=" . $attachment;
>> 
>> Where kref, ktype and kattachment are my custom fields (that I added to the 
>> schema.xml previously)
>> 
>> So, indeed, it’s Tika that extracts the info. I didn’t change anything in 
>> the ExtractingRequestHandler.
>> 
>> I read that all fields must be marked as stored="true", but:
>> 
>> - I checked in the schema: all the fields that matter (the default 
>> Tika-extracted fields) and my custom fields are stored="true".
>> - I suppose the full-text index itself is not stored in a field, and 
>> therefore cannot be marked as stored?
>> 
>> I manage to upload files and mark my docs with metadata, but I have existing 
>> files where I would like to update my fields (kref, …) without re-extracting, 
>> and I’d also like to allow for re-indexing if needed without overwriting my 
>> fields.
>> 
>> I’m stuck… I wonder if it wouldn’t be simpler for me to write a custom 
>> handler of some sort, but I don’t really program in Java.
>> 
>> Cheers
>> 
>> Nico
>> 
>>> On 8 Mar 2017, at 17:03, Erick Erickson <erickerick...@gmail.com> wrote:
>>> 
>>> Nico:
>>> 
>>> This is the place for such questions! I'm not quite sure of the source
>>> of the docs. When you say you "extract", does that mean you're using
>>> the ExtractingRequestHandler, i.e. uploading PDF or Word etc. to Solr
>>> and letting Tika parse it out? IOW, where is the full-text coming from?
>>> 
>>> For adding tags at any time, Solr has "Atomic Updates", which have a
>>> couple of requirements: mainly, you have to set stored="true" for all
>>> your fields _except_ the destinations of any <copyField> directives.
>>> Under the covers this pulls the stored data from Solr, overlays it with
>>> the new data you've sent, and re-indexes it. The expense here is that
>>> your index will increase in size, but storing the data doesn't mean much
>>> of an increase in JVM requirements. That is, say your index doubles in
>>> size: your JVM heap requirements may increase 5% (and, frankly, I doubt
>>> even that much, but I've never measured). FWIW, the on-disk size should
>>> increase by roughly 50% of the raw data size. WARNING: "raw data size"
>>> is the size _after_ extraction. So say you're indexing a 1K XML doc
>>> where the tags take up .75K; then the on-disk size should go up roughly
>>> .125K (50% of the remaining .25K of content).
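>>> 
>>> For instance, sticking with your PHP/curl setup, adding a tag to an
>>> existing doc (hypothetical doc id 42) would look something like:
>>> 
>>> $update = array(array('id' => '42', 'ktags' => array('add' => 'customerX')));
>>> $ch = curl_init("http://my_server:8983/solr/my_core/update?commit=true");
>>> curl_setopt($ch, CURLOPT_POST, true);
>>> curl_setopt($ch, CURLOPT_HTTPHEADER, array('Content-Type: application/json'));
>>> // 'add' appends to a multiValued field; use 'set' to replace a value
>>> curl_setopt($ch, CURLOPT_POSTFIELDS, json_encode($update));
>>> curl_exec($ch);
>>> curl_close($ch);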
>>> 
>>> Don't worry about "thousands" of docs ;). On my laptop I index over 1K
>>> Wikipedia articles a second (YMMV of course), without any particular
>>> tuning and without sharding. Very often the most expensive part of
>>> indexing is acquiring the data in the first place, i.e. getting it
>>> from a DB or extracting it with Tika. Solr will handle quite a load.
>>> 
>>> And, if you're using the ExtractingRequestHandler, I'd seriously think
>>> about moving that work to a client. Here's a Java example:
>>> https://lucidworks.com/2012/02/14/indexing-with-solrj/
>>> 
>>> Best,
>>> Erick
>>> 
>>> On Wed, Mar 8, 2017 at 7:46 AM, Nicolas Bouillon
>>> <nico2000...@yahoo.com.invalid> wrote:
>>>> Dear SOLR friends,
>>>> 
>>>> I developed a small ERP. I produce PDF documents linked to objects in my 
>>>> ERP: invoices, timesheets, contracts, etc.
>>>> I can also attach documents to a particular object, and when I view an 
>>>> invoice, for instance, I can see the attached documents.
>>>> 
>>>> Until now, I was adding references to these documents in my DB and 
>>>> storing the docs on the server.
>>>> Still, I found that cumbersome and not flexible enough, so I removed the 
>>>> documents table from my DB and decided to use SOLR to add metadata to the 
>>>> documents in the index instead.
>>>> 
>>>> Currently, I have the following custom fields (declared roughly as 
>>>> sketched below):
>>>> - ktype (string): invoice, contract, etc…
>>>> - kattachment (int): 0 or 1
>>>> - kref (int): reference in the DB of the linked object, e.g. 10 (for 
>>>> contract 10 in the DB)
>>>> - ktags (string, multiValued): free tags, e.g. customerX, consulting, 
>>>> development
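>>>> 
>>>> In schema.xml they are declared roughly like this (field type names are 
>>>> a guess and may differ by Solr version):
>>>> 
>>>> <field name="ktype" type="string" indexed="true" stored="true"/>
>>>> <field name="kattachment" type="int" indexed="true" stored="true"/>
>>>> <field name="kref" type="int" indexed="true" stored="true"/>
>>>> <field name="ktags" type="string" indexed="true" stored="true" multiValued="true"/>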
>>>> 
>>>> Each time I upload a document, I store it on the server and then add it 
>>>> to SOLR using "extract", adding the metadata at the same time. It works 
>>>> fine.
>>>> 
>>>> I would now like 3 things:
>>>> 
>>>> - For existing documents that were uploaded before I developed this 
>>>> functionality (and so were extracted without metadata), I'd like to 
>>>> update them with the proper metadata without losing full-text search.
>>>> - To be able to add tags to the ktags field at any time after upload, 
>>>> whilst keeping full-text search.
>>>> - In case I have to re-index, I want to be sure I don't have to restart 
>>>> everything from scratch. In a few months, I expect to have thousands of 
>>>> docs in my system… and then I'll add emails.
>>>> 
>>>> I have very little experience with SOLR. I know I can re-run an extract 
>>>> instead of an update when I modify a field, but I'm pretty sure that's 
>>>> not the right thing to do, and performance problems can arise.
>>>> 
>>>> What do you suggest I do?
>>>> 
>>>> I thought about storing the metadata linked to each document separately 
>>>> (in the DB, in individual XML files, or in one XML file for everything), 
>>>> but I'm pretty sure that would become very slow after a while.
>>>> 
>>>> Thx a lot in advance for your precious help.
>>>> This is my first message to the user list, so please excuse anything I 
>>>> may have done wrong… I learn fast, don’t worry.
>>>> 
>>>> Regards
>>>> 
>>>> Nico
>>>> 
>>>> My configuration:
>>>> 
>>>> Synology 1511 running DSM 6.1
>>>> Docker container for SOLR using latest stable version
>>>> 1 core called “katalyst” containing index of all documents
>>>> 
>>>> The ERP is written in PHP/MySQL for the backend and jQuery/Bootstrap for 
>>>> the front-end.
>>>> 
>>>> I have a test env on OSX Sierra running Docker, and a prod environment 
>>>> on the Synology.
>>>> 
>>>> 
>> 
