Nico:

This is the place for such questions! I'm not quite sure of the source
of the docs. When you say you "extract", does that mean you're using
the ExtractingRequestHandler, i.e. uploading PDF or Word etc. to Solr
and letting Tika parse it out? IOW, where is the full text coming from?

For adding tags at any time, Solr has "Atomic Updates", which have a
couple of requirements: mainly, you have to set stored="true" for all
your fields _except_ the destinations of any <copyField> directives.
Under the covers this pulls the stored data from Solr, overlays it with
the new data you've sent, and re-indexes it. The expense here is that
your index will increase in size, but storing the data doesn't mean
much of an increase in JVM requirements. That is, say your index
doubles in size. Your JVM heap requirements may increase 5% (and,
frankly, I doubt even that much, but I've never measured). FWIW, the
on-disk size should increase by roughly 50% of the raw data size.
WARNING: "raw data size" is the size _after_ extraction, so say you're
indexing a 1K XML doc where the tags take up .75K. Then the on-disk
size should go up by roughly .125K (50% of .25K).
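To make the atomic-update mechanics concrete, here's a minimal sketch of the JSON "partial document" you'd send to append tags to your multivalued ktags field. The doc id and tag values are made up for illustration; you'd POST the payload to your core's /update handler (for you, the "katalyst" core) with Content-Type: application/json:

```python
import json

def add_tags_update(doc_id, tags):
    """Build a Solr atomic-update partial document that appends values
    to the multiValued ktags field without re-sending the whole doc."""
    return {"id": doc_id, "ktags": {"add": tags}}

# The update body is a JSON list of partial documents:
payload = json.dumps([add_tags_update("invoice-10.pdf", ["customerX", "consulting"])])
print(payload)
# POST this to http://localhost:8983/solr/katalyst/update?commit=true
```

Besides "add", atomic updates support "set", "remove", and "inc" operations on a field, so the same shape covers replacing or deleting tags later.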

Don't worry about "thousands" of docs ;) On my laptop I index over 1K
Wikipedia articles a second (YMMV, of course), without any particular
tuning and without sharding. Very often the most expensive part of
indexing is acquiring the data in the first place, i.e. getting it
from a DB or extracting it with Tika. Solr will handle quite a load.

And, if you're using the ExtractingRequestHandler, I'd seriously think
about moving that work to a client. Here's a Java example:
https://lucidworks.com/2012/02/14/indexing-with-solrj/
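The idea in that link is: run Tika (or any extractor) in your own client, then send Solr an ordinary update document instead of the raw PDF. As a rough sketch, assuming the text has already been extracted client-side and assuming a full-text field named "text" (check your schema; the field names below are the custom ones from your message):

```python
import json

def build_doc(doc_id, fulltext, ktype, kref, kattachment=0, ktags=None):
    """Assemble a plain Solr update document from client-side-extracted
    text, so Solr itself never has to run Tika on the upload."""
    return {
        "id": doc_id,
        "text": fulltext,            # assumed name of the full-text field
        "ktype": ktype,
        "kref": kref,
        "kattachment": kattachment,
        "ktags": ktags or [],
    }

doc = build_doc("contract-10.pdf", "(text extracted client-side)",
                "contract", 10, ktags=["customerX"])
payload = json.dumps([doc])
# POST the payload to http://localhost:8983/solr/katalyst/update?commit=true
# with Content-Type: application/json.
```

This keeps extraction failures (corrupt PDFs, huge files) out of the Solr JVM and lets you batch many documents per request.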

Best,
Erick

On Wed, Mar 8, 2017 at 7:46 AM, Nicolas Bouillon
<nico2000...@yahoo.com.invalid> wrote:
> Dear SOLR friends,
>
> I developed a small ERP. I produce PDF documents linked to objects in my ERP: 
> invoices, timesheets, contracts, etc...
> I have also the possibility to attach documents to a particular object and 
> when I view an invoice for instance, I can see the attached documents.
>
> Until now, I was adding references to these documents in my DB and storing the 
> docs on the server.
> Still, I found it cumbersome and not flexible enough, so I removed the documents 
> table from my DB and decided to use SOLR to add metadata to the documents 
> in the index.
>
> Currently, I have the following custom fields:
> - ktype (string): invoice, contract, etc…
> - kattachment (int): 0 or 1
> - kref (int): reference in DB of linked object, ex: 10 (for contract 10 in DB)
> - ktags (strings, multivalued): free tags, ex: customerX, consulting, 
> development
>
> Each time I upload a document, I store it on the server and then add it to SOLR 
> using "extract", adding the metadata at the same time. It works fine.
>
> I would like now 3 things:
>
> - For existing documents that have not been extracted with metadata 
> altogether at upload (documents uploaded before I developed the 
> functionality), I'd like to update them with the proper metadata without 
> losing the full-text search
> - Be able to add tags to the ktags field at any time after upload whilst keeping 
> full-text search
> - In case I have to re-index, I want to be sure I don't have to restart 
> everything from scratch.
>         In a few months, I expect to have thousands of docs in my 
> system....and then I'll add emails
>
> I have very little experience in SOLR. I know I can re-perform an extract 
> instead of an update when I modify a field but I'm pretty sure it's not the 
> right thing to do + performance problems can arise.
>
> What do you suggest I do?
>
> I thought about storing the metadata linked to each document separately (in 
> DB or separate XML file individually or one XML for all) but I'm pretty sure 
> it will be very slow after a while.
>
> Thx a lot in advance for your precious help.
> This is my first message to the user list, please excuse anything I may have 
> done wrong… I learn fast, don't worry.
>
> Regards
>
> Nico
>
> My configuration:
>
> Synology 1511 running DSM 6.1
> Docker container for SOLR using latest stable version
> 1 core called “katalyst” containing index of all documents
>
> ERP is written in PHP/Mysql for backend and Jquery/Bootstrap for front-end
>
> I have a test env on OSX Sierra running docker, a prod environment on Synology
