I have some XML that includes a stylesheet, maintained by another organization, which renders it to HTML. The HTML is pretty good - it is not "structured" in RDFa or schema.org, but it has classes and anchors that can be used to find some key data. So I can probably get all the meta-data I want from the HTML, but this may be brittle.
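To make "classes and anchors" concrete, here is a minimal sketch of the kind of class-based extraction I mean, using only Python's standard library. The class names (`package-size`, `strength`), the sample HTML, and the helper `extract_fields` are all invented for illustration; the real stylesheet's output will differ.

```python
from html.parser import HTMLParser

# Elements that never get a closing tag, so they must not be tracked as open.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class ClassTextExtractor(HTMLParser):
    """Collect the text found inside elements carrying a wanted class name."""

    def __init__(self, wanted_classes):
        super().__init__()
        self.wanted = set(wanted_classes)
        self.fields = {}   # class name -> accumulated text
        self._open = []    # per open element: matched class name, or None

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        match = next((c for c in classes if c in self.wanted), None)
        self._open.append(match)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self._open:
            self._open.pop()

    def handle_data(self, data):
        # Credit the text to every enclosing element that matched.
        for name in self._open:
            if name is not None:
                self.fields[name] = self.fields.get(name, "") + data


def extract_fields(html, wanted_classes):
    """Return {class name: stripped text} for the wanted classes (hypothetical helper)."""
    parser = ClassTextExtractor(wanted_classes)
    parser.feed(html)
    parser.close()
    return {name: text.strip() for name, text in parser.fields.items()}


# Invented sample; the real rendered HTML will use different markup.
SAMPLE_HTML = """
<div id="product">
  <span class="package-size">90 pills per bottle</span>,
  <span class="strength">25mg per unit</span>
</div>
"""

fields = extract_fields(SAMPLE_HTML, ["package-size", "strength"])
```

The brittleness is visible here: the sketch assumes well-nested markup, and if the other organization renames a class, the field silently disappears from `fields` rather than raising an error.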
The content *must* come from the HTML - the stylesheet transformation takes content that may be split across elements and attributes in the XML and assembles it into semantically rich sentences a user will want to query, such as "90 pills per bottle, 25mg per unit".

I think I could do this with Data Import Handler, but I really don't want to, because I want this to work with SolrCloud so that I can continue to index even if any of my Solr nodes are down.

If the real best practice is to do this outside of Solr and POST just the fields, then I can do that - there are too many factors to ask this list for more details on how. Plain old Java and plain old Python have plenty of libraries for me to apply, and I'll see what data extraction options are available to me in the free versions of Talend.

What I'm really interested in is the best solutions with Solr doing the extraction of text from the HTML, e.g.:

* Is there any way I can POST both the HTML and the meta-data to Solr? How will that work with SolrCloud (I'd prefer to avoid having each node run Tika on each document)?
* Are there Update Processor Chain components that will run Tika and stuff the results into a field?
* I've used Tika naively in the past - can I instruct Tika to extract meta-data from the HTML as well as the text?
* Can I POST the meta-data and the HTML as updates to the same document, taking care that they are posted together in the same batch?
* Data Import Handler could handle this by extracting meta-data using one XSLT, generating HTML with another, and then running Tika on the HTML. However, how would that work with SolrCloud? Do I have to use Data Import Handler to solve this problem with Solr?

Thanks,

Dan Davis, Systems/Applications Architect (Contractor),
Office of Computer and Communications Systems,
National Library of Medicine, NIH