Re: indexing multiple levels of data

Jan Høydahl Fri, 16 Nov 2018 06:29:27 -0800

Hi Martin,

For a use case as complex as this, I would recommend writing a separate indexer 
application that crawls the files, looks up the correct metadata XMLs based on 
the given business rules, and then constructs the full Solr document to send to 
Solr.
I would even recommend parsing the full text from PDFs etc. in such an indexer 
application instead of relying on Solr's built-in Tika.


This gives you all the control you need, and the burden of building and running 
a separate app will probably be worth it.
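
As a rough illustration of such an indexer, here is a minimal Python sketch. It 
assumes a metadata layout that is not shown in this thread: the element names 
(case_id, title, name), the field names and the synthetic id format are all 
hypothetical. It builds one Solr document per file and copies the case metadata 
onto it, so the shared case identifier acts as the join key at index time. 
Full-text extraction from the files themselves is left out.

```python
import json
import urllib.request
import xml.etree.ElementTree as ET

# Hypothetical metadata layout; the real XML schemas are not shown in this thread.
CASES_XML = """<cases>
  <case><case_id>case1</case_id><title>Building permit</title></case>
</cases>"""

FILES_XML = """<files>
  <file><case_id>case1</case_id><name>file1.pdf</name></file>
  <file><case_id>case1</case_id><name>file2.pdf</name></file>
</files>"""


def build_documents(cases_xml, files_xml):
    """Return one Solr document per file, with its case metadata copied in."""
    # Index the case metadata by the case identifier shared with the files.
    cases = {}
    for case in ET.fromstring(cases_xml).iter("case"):
        cases[case.findtext("case_id")] = {"case_title": case.findtext("title")}

    docs = []
    for f in ET.fromstring(files_xml).iter("file"):
        case_id = f.findtext("case_id")
        name = f.findtext("name")
        doc = {
            "id": f"{case_id}/{name}",  # synthetic unique key built by the indexer
            "case_id": case_id,
            "file_name": name,
        }
        doc.update(cases.get(case_id, {}))  # denormalize case metadata onto the file
        docs.append(doc)
    return docs


def post_to_solr(docs, update_url):
    """POST the documents as a JSON array to a Solr update handler."""
    request = urllib.request.Request(
        update_url,
        data=json.dumps(docs).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return response.status


docs = build_documents(CASES_XML, FILES_XML)
```

The documents could then be sent with something like 
post_to_solr(docs, "http://localhost:8983/solr/cases/update?commit=true"), 
where the host and the collection name "cases" are placeholders. Because each 
file becomes its own document, a query returns the matching files directly 
rather than the whole case.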

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com

> On 16 Nov 2018, at 12:24, Martin Frank Hansen (MHQ) <m...@kmd.dk> wrote:
> 
> Hi,
> 
> I am trying to add metadata and files to Solr, but am experiencing some 
> problems.
> 
> The data is divided into two levels, cases and files. For each case the 
> metadata is given in an XML document, while metadata for the files is given 
> in another XML document, and the actual files are kept in yet another place.
> For each case multiple files might exist.
> There is no unique key between the cases and the files.
> There is, however, an identifier for each of the cases which is present at 
> file level as well.
> 
> 
>  1.  I tried using atomic updates, but that did not work since a unique key 
> is required.
>  2.  I thought about using a multivalued field for the files within a 
> case document. The problem is that it is the files that I am interested in, 
> and if I query for a specific file, the entire document is returned, which 
> is not very helpful. Is there a way to tell which of the files within a 
> document actually match a query (see example below)? I was thinking about 
> the highlighting component, but I am not sure if it will work.
> 
> {
> 
> id:case1
> 
> file:{file1, file2, file3…}
> 
> }
> 
>  3.  Another option was using a join at query time, but it seems a bit 
> tedious. Is there a way to make a join at index time?
> 
> Any suggestions are much appreciated.
> 
> Best regards
> 
> Martin
> 
