Indexing both meta-data and full content of HTML

Davis, Daniel (NIH/NLM) [C] Sat, 19 Mar 2016 19:41:23 -0700

I have some XML that references a stylesheet, maintained by another 
organization, that renders it to HTML. The HTML is pretty good - it is not 
"structured" in RDFa or schema.org, but it has classes and anchors that can be 
used to find some key data. So, I can probably get all the meta-data I want 
from the HTML, but this may be brittle.


The content *must* come from the HTML - the stylesheet transformation combines 
content that may be split across elements and attributes in the XML into 
semantically rich sentences a user will want to query, such as "90 pills per 
bottle, 25mg per unit".

I think I could do this with Data Import Handler, but I really don't want to do 
so because I want this to work with SolrCloud so that I can continue to index 
even if any of my Solr nodes are down.

If the real best practice is to do this outside of Solr, and POST just fields, 
then I can do that - there are too many factors to ask this list for more 
details on how.   Plain old Java and plain old Python have plenty of libraries 
for me to apply, and I'll see what data extraction options are available to me 
in free versions of Talend.
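
As a rough illustration of the "outside of Solr" route, here is a minimal 
sketch in plain Python (standard library only). It pulls the text of elements 
carrying particular CSS classes into named fields and keeps the full visible 
text as the content. The class names ("drug-name", "packaging") and the field 
names ("name_s", "packaging_s", "content_txt") are hypothetical, not taken from 
the real stylesheet or schema.

```python
from html.parser import HTMLParser


class FieldExtractor(HTMLParser):
    """Collect all visible text, plus the text inside elements with chosen classes."""

    def __init__(self, class_to_field):
        super().__init__()
        self.class_to_field = class_to_field  # hypothetical mapping: CSS class -> field name
        self.stack = []    # (tag, field or None) for each currently open element
        self.fields = {}   # field name -> list of text fragments
        self.text = []     # every visible text fragment, for the full-content field

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        field = next((self.class_to_field[c] for c in classes
                      if c in self.class_to_field), None)
        self.stack.append((tag, field))

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag; tolerates unclosed elements.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        chunk = data.strip()
        if not chunk:
            return
        if self.stack and self.stack[-1][0] in ("script", "style"):
            return  # not visible text
        self.text.append(chunk)
        for _, field in self.stack:
            if field:
                self.fields.setdefault(field, []).append(chunk)


def build_solr_doc(doc_id, html, class_to_field):
    """Return a dict of plain fields suitable for a JSON update request."""
    parser = FieldExtractor(class_to_field)
    parser.feed(html)
    parser.close()
    doc = {"id": doc_id, "content_txt": " ".join(parser.text)}
    for field, chunks in parser.fields.items():
        doc[field] = " ".join(chunks)
    return doc
```

The resulting dict could be serialized with json.dumps and POSTed to the 
collection's update handler, so Solr only ever sees plain fields. Matching on 
class names is exactly the brittle part mentioned above.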

What I'm really interested in is the best solutions where Solr does the 
extraction of text from the HTML, e.g.:

*        Is there any way I can POST both the HTML and the meta-data to Solr?   
 How will that work with SolrCloud (I'd prefer to avoid having each node 
running Tika on each document)?

*        Are there Update Processor Chain components that will run Tika and 
stuff the results into a field?

*        I've used Tika naively in the past - can I instruct Tika to extract 
meta-data from the HTML as well as the text?

*        Can I POST the meta-data and the HTML as updates to the same document, 
taking care that they are posted together in the same batch?

*        Data Import Handler could handle this by extracting meta-data using 
one XSLT, generating HTML with another, and then running Tika on the HTML.   
However, how would that work with SolrCloud?    Do I have to use Data Import 
Handler to solve this problem with Solr?
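
To make the first bullet concrete: one possibility, assuming the collection 
has the ExtractingRequestHandler (Solr Cell) configured at /update/extract, is 
to send the HTML as the request body and pass the meta-data as literal.* 
request parameters, so both land in the same document in one request. The 
sketch below only builds such a URL. The base URL, collection name, and field 
names are hypothetical, and whether this suits the SolrCloud setup described 
here is still the open question.

```python
from urllib.parse import urlencode


def extract_url(base_url, collection, doc_id, metadata, commit=False):
    """Build a Solr Cell request URL that carries meta-data as literal.* params."""
    params = [("literal.id", doc_id)]
    for field, value in sorted(metadata.items()):
        params.append(("literal." + field, value))
    # Map Tika's extracted body text into a schema field (field name is hypothetical).
    params.append(("fmap.content", "content_txt"))
    # Prefix fields that Tika emits but the schema does not define.
    params.append(("uprefix", "ignored_"))
    if commit:
        params.append(("commit", "true"))
    return "{}/{}/update/extract?{}".format(
        base_url.rstrip("/"), collection, urlencode(params))
```

The HTML itself would then be POSTed to that URL with a Content-Type of 
text/html, using whatever HTTP client is at hand.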

Thanks,

Dan Davis, Systems/Applications Architect (Contractor),
Office of Computer and Communications Systems,
National Library of Medicine, NIH
