Unzipping things might be an issue; you may need to do that as part of a
batch job outside of Solr. For the rest, go through the documentation first -
it answers a bunch of questions. There is also a page on the Wiki, not just
in the reference guide.
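Something along these lines could work for the batch part - it is only an
untested sketch, and the feeds directory, core name, handler path, and entity
name are assumptions for the example (the entity url attributes for the yearly
feeds would have to point at the unzipped files, e.g. file:// URLs):

  #!/bin/sh
  # Rough sketch of a batch job outside Solr: fetch one of the NVD ZIP feeds,
  # unzip it, then ask DIH to import it without wiping other entities' documents.

  FEEDS_DIR=/data/nvd                                  # assumed: where the unzipped XML goes
  DIH_URL="http://localhost:8983/solr/nvd/dataimport"  # assumed core name and handler path

  curl -s -o /tmp/nvdcve-2.0-2014.xml.zip \
       http://static.nvd.nist.gov/feeds/xml/cve/nvdcve-2.0-2014.xml.zip
  unzip -o /tmp/nvdcve-2.0-2014.xml.zip -d "$FEEDS_DIR"
  # ...repeat for the other yearly feeds...

  # clean=false stops DIH from deleting what is already in the index;
  # entity=... (a made-up entity name here) runs only that one entity.
  curl "$DIH_URL?command=full-import&clean=false&entity=nvd-2014&commit=true"

The clean=false and entity parameters are the same ones discussed further down
in the thread.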
Regards,
   Alex.
----
Sign up for my Solr resources newsletter at http://www.solr-start.com/


On 23 January 2015 at 14:51, Carl Roberts <carl.roberts.zap...@gmail.com> wrote:
> Excellent - thanks Shalin. But how does delta-import work? Does it also do
> a clean? Does it require a unique id? Does it update existing records and
> only add new ones when necessary?
>
> And how would I go about unzipping the content from a URL so that I can
> then import the unzipped XML? Is the recommended way to extend the
> URLDataSource class, or is there any built-in logic to plug in
> pre-processing handlers?
>
> On 1/23/15, 2:39 PM, Shalin Shekhar Mangar wrote:
>>
>> If you add clean=false as a parameter to the full-import, then deletion
>> is disabled. Since you are ingesting RSS, there is no need for deletion
>> at all, I guess.
>>
>> On Fri, Jan 23, 2015 at 7:31 PM, Carl Roberts
>> <carl.roberts.zap...@gmail.com> wrote:
>>>
>>> OK - thanks for the doc.
>>>
>>> Is it possible to just provide an empty value for preImportDeleteQuery
>>> to disable the delete prior to import?
>>>
>>> Will the data still be deleted for each entity during a delta-import
>>> instead of a full-import?
>>>
>>> Is there any capability in the handler to unzip an XML file from a URL
>>> prior to reading it, or can I perhaps hook in a custom pre-processing
>>> handler?
>>>
>>> Regards,
>>>
>>> Joe
>>>
>>> On 1/23/15, 1:40 PM, Alexandre Rafalovitch wrote:
>>>>
>>>> https://cwiki.apache.org/confluence/display/solr/Uploading+Structured+Data+Store+Data+with+the+Data+Import+Handler
>>>>
>>>> The Admin UI has the interface, so you can play there once you define
>>>> it.
>>>>
>>>> You do have to use curl; there is no built-in scheduler.
>>>>
>>>> Regards,
>>>>    Alex.
>>>> ----
>>>> Sign up for my Solr resources newsletter at http://www.solr-start.com/
>>>>
>>>> On 23 January 2015 at 13:29, Carl Roberts
>>>> <carl.roberts.zap...@gmail.com> wrote:
>>>>>
>>>>> Hi Alex,
>>>>>
>>>>> If I am understanding this correctly, I can define multiple entities
>>>>> like this?
>>>>>
>>>>> <document>
>>>>>     <entity/>
>>>>>     <entity/>
>>>>>     <entity/>
>>>>>     ...
>>>>> </document>
>>>>>
>>>>> How would I trigger loading certain entities during the initial load?
>>>>>
>>>>> How would I trigger loading other entities during an update?
>>>>>
>>>>> Is there a way to set an auto-update for certain entities so that I
>>>>> don't have to invoke an update via curl?
>>>>>
>>>>> Where / how do I specify the preImportDeleteQuery to avoid deleting
>>>>> everything upon each update?
>>>>>
>>>>> Is there an example or doc that shows how to do all this?
>>>>>
>>>>> Regards,
>>>>>
>>>>> Joe
>>>>>
>>>>> On 1/23/15, 11:24 AM, Alexandre Rafalovitch wrote:
>>>>>>
>>>>>> You can define both multiple entities in the same file and nested
>>>>>> entities if your list comes from an external source (e.g. a text
>>>>>> file of URLs).
>>>>>>
>>>>>> You can also trigger DIH with the name of a specific entity to load
>>>>>> just that one.
>>>>>>
>>>>>> You can even pass a DIH configuration file when you trigger
>>>>>> processing, so you can have completely different files for the
>>>>>> initial load and the update - though you can also just do the same
>>>>>> with entities.
>>>>>>
>>>>>> The only thing to be aware of is that before an entity definition is
>>>>>> processed, a delete command is run. By default it is "delete all",
>>>>>> so executing one entity will delete everything and then populate
>>>>>> only that one entity's results.
>>>>>> You can avoid that by defining preImportDeleteQuery and having a
>>>>>> clear identifier on the content generated by each entity (e.g. a
>>>>>> source field, either extracted or added manually with
>>>>>> TemplateTransformer).
>>>>>>
>>>>>> Regards,
>>>>>>    Alex.
>>>>>> ----
>>>>>> Sign up for my Solr resources newsletter at http://www.solr-start.com/
>>>>>>
>>>>>> On 23 January 2015 at 11:15, Carl Roberts
>>>>>> <carl.roberts.zap...@gmail.com> wrote:
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I have the RSS DIH example working with my own RSS feed - here is
>>>>>>> the configuration for it:
>>>>>>>
>>>>>>> <dataConfig>
>>>>>>>     <dataSource type="URLDataSource" />
>>>>>>>     <document>
>>>>>>>         <entity name="nvd-rss"
>>>>>>>                 pk="link"
>>>>>>>                 url="https://nvd.nist.gov/download/nvd-rss.xml"
>>>>>>>                 processor="XPathEntityProcessor"
>>>>>>>                 forEach="/RDF/item"
>>>>>>>                 transformer="DateFormatTransformer">
>>>>>>>
>>>>>>>             <field column="id" xpath="/RDF/item/title" commonField="true" />
>>>>>>>             <field column="link" xpath="/RDF/item/link" commonField="true" />
>>>>>>>             <field column="summary" xpath="/RDF/item/description" commonField="true" />
>>>>>>>             <field column="date" xpath="/RDF/item/date" commonField="true" />
>>>>>>>
>>>>>>>         </entity>
>>>>>>>     </document>
>>>>>>> </dataConfig>
>>>>>>>
>>>>>>> However, my problem is that I also have to load multiple XML feeds
>>>>>>> into the same core. Here is one example (there are about 10 of them):
>>>>>>>
>>>>>>> http://static.nvd.nist.gov/feeds/xml/cve/nvdcve-2.0-2014.xml.zip
>>>>>>>
>>>>>>> Is there any built-in functionality that would allow me to do this?
>>>>>>> Basically, the use case is to load and index all the XML ZIP files
>>>>>>> first, and then check the RSS feed every two hours and update the
>>>>>>> index with any new entries.
>>>>>>>
>>>>>>> Regards,
>>>>>>>
>>>>>>> Joe
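P.S. For the "check the RSS feed every two hours" part: since there is no
built-in scheduler, that also has to come from outside Solr, e.g. cron. A
minimal sketch, assuming the same core name and handler path as in the batch
sketch above (nvd-rss is the entity name from your own config):

  # Hypothetical crontab entry: every two hours, re-run only the RSS entity,
  # with clean=false so documents loaded from the yearly feeds are left alone.
  0 */2 * * * curl -s "http://localhost:8983/solr/nvd/dataimport?command=full-import&clean=false&entity=nvd-rss&commit=true"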