https://cwiki.apache.org/confluence/display/solr/Uploading+Structured+Data+Store+Data+with+the+Data+Import+Handler

The Admin UI has an interface for DIH, so you can play with it there
once you have the handler defined.

You do have to use curl for scheduled updates, though; there is no
built-in scheduler.
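
A minimal sketch of the cron approach (the core name "nvd" is an
assumption; adjust the URL to your setup). This crontab entry kicks
off an import of just the nvd-rss entity every two hours, and
clean=false keeps it from wiping the rest of the index:

0 */2 * * * curl -s "http://localhost:8983/solr/nvd/dataimport?command=full-import&entity=nvd-rss&clean=false"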

Regards,
   Alex.
----
Sign up for my Solr resources newsletter at http://www.solr-start.com/


On 23 January 2015 at 13:29, Carl Roberts <carl.roberts.zap...@gmail.com> wrote:
> Hi Alex,
>
> If I am understanding this correctly, I can define multiple entities like
> this?
>
> <document>
>     <entity/>
>     <entity/>
>     <entity/>
>     ...
> </document>
>
> How would I trigger loading certain entities during start?
>
> How would I trigger loading other entities during update?
>
> Is there a way to set an auto-update for certain entities so that I don't
> have to invoke an update via curl?
>
> Where / how do I specify the preImportDeleteQuery to avoid deleting
> everything upon each update?
>
> Is there an example or doc that shows how to do all this?
>
> Regards,
>
> Joe
>
>
> On 1/23/15, 11:24 AM, Alexandre Rafalovitch wrote:
>>
>> You can define multiple entities in the same file, and also nested
>> entities if your list comes from an external source (e.g. a text file
>> of URLs).
>> You can also trigger DIH with the name of a specific entity to load
>> just that one.
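>>
>> For instance, with a core named "nvd" (a made-up name here), this
>> would run only the nvd-rss entity and leave any other entities in the
>> same document alone:
>>
>> curl "http://localhost:8983/solr/nvd/dataimport?command=full-import&entity=nvd-rss"
>>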
>> You can even pass a DIH configuration file when you trigger the
>> import, so you can have completely different files for the initial
>> load and for updates. Though you can achieve the same with entities.
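>>
>> One way to get completely separate files, sketched here with made-up
>> handler and file names, is to register two DIH request handlers in
>> solrconfig.xml, each pointing at its own config:
>>
>> <requestHandler name="/dataimport-initial"
>>                 class="org.apache.solr.handler.dataimport.DataImportHandler">
>>     <lst name="defaults">
>>         <str name="config">dih-initial.xml</str>
>>     </lst>
>> </requestHandler>
>> <requestHandler name="/dataimport-update"
>>                 class="org.apache.solr.handler.dataimport.DataImportHandler">
>>     <lst name="defaults">
>>         <str name="config">dih-update.xml</str>
>>     </lst>
>> </requestHandler>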
>>
>> The only thing to be aware of is that before an entity definition is
>> processed, a delete command is run. By default it is "delete all", so
>> executing one entity will delete everything and then populate only
>> that one entity's results. You can avoid that by defining a
>> preImportDeleteQuery and putting a clear identifier on the content
>> generated by each entity (e.g. a source field, either extracted or
>> added manually with a TemplateTransformer).
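>>
>> A sketch of that pattern applied to your nvd-rss entity (the "source"
>> field is a made-up name and would also need to exist in your schema):
>>
>> <entity name="nvd-rss"
>>         pk="link"
>>         url="https://nvd.nist.gov/download/nvd-rss.xml"
>>         processor="XPathEntityProcessor"
>>         forEach="/RDF/item"
>>         preImportDeleteQuery="source:nvd-rss"
>>         transformer="DateFormatTransformer,TemplateTransformer">
>>     <field column="source" template="nvd-rss"/>
>>     <!-- the existing field definitions stay as they are -->
>> </entity>
>>
>> With that, importing nvd-rss first deletes only documents whose
>> source is nvd-rss instead of clearing the whole index.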
>>
>> Regards,
>>     Alex.
>>
>> ----
>> Sign up for my Solr resources newsletter at http://www.solr-start.com/
>>
>>
>> On 23 January 2015 at 11:15, Carl Roberts <carl.roberts.zap...@gmail.com>
>> wrote:
>>>
>>> Hi,
>>>
>>> I have the RSS DIH example working with my own RSS feed - here is the
>>> configuration for it.
>>>
>>> <dataConfig>
>>>     <dataSource type="URLDataSource"/>
>>>     <document>
>>>         <entity name="nvd-rss"
>>>                 pk="link"
>>>                 url="https://nvd.nist.gov/download/nvd-rss.xml"
>>>                 processor="XPathEntityProcessor"
>>>                 forEach="/RDF/item"
>>>                 transformer="DateFormatTransformer">
>>>             <field column="id" xpath="/RDF/item/title" commonField="true"/>
>>>             <field column="link" xpath="/RDF/item/link" commonField="true"/>
>>>             <field column="summary" xpath="/RDF/item/description" commonField="true"/>
>>>             <field column="date" xpath="/RDF/item/date" commonField="true"/>
>>>         </entity>
>>>     </document>
>>> </dataConfig>
>>>
>>> However, my problem is that I also have to load multiple XML feeds into
>>> the
>>> same core.  Here is one example (there are about 10 of them):
>>>
>>> http://static.nvd.nist.gov/feeds/xml/cve/nvdcve-2.0-2014.xml.zip
>>>
>>>
>>> Is there any built-in functionality that would allow me to do this?
>>> Basically, the use case is to load and index all the XML ZIP files
>>> first, and then check the RSS feed every two hours and update the
>>> index with any new items.
>>>
>>> Regards,
>>>
>>> Joe
>>>
>>>
>
