If you add clean=false as a parameter to the full-import command, deletion is
disabled. Since you are ingesting RSS, there is probably no need for deletion
at all.
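
For example, a full-import call that skips the delete step might look roughly
like this (the core name and the /dataimport handler path are just
placeholders for whatever your setup uses):

curl "http://localhost:8983/solr/yourcore/dataimport?command=full-import&clean=false"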

On Fri, Jan 23, 2015 at 7:31 PM, Carl Roberts <carl.roberts.zap...@gmail.com> wrote:

> OK - Thanks for the doc.
>
> Is it possible to just provide an empty value to preImportDeleteQuery to
> disable the delete prior to import?
>
> Will the data still be deleted for each entity during a delta-import
> instead of full-import?
>
> Is there any capability in the handler to unzip an XML file from a URL
> prior to reading it, or can I perhaps hook in a custom pre-processing
> handler?
>
> Regards,
>
> Joe
>
>
>
> On 1/23/15, 1:40 PM, Alexandre Rafalovitch wrote:
>
>> https://cwiki.apache.org/confluence/display/solr/Uploading+Structured+Data+Store+Data+with+the+Data+Import+Handler
>>
>> The Admin UI has an interface for DIH, so you can play with it there once
>> you define the handler.
>>
>> You do have to use curl; there is no built-in scheduler.
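>>
>> For example, a cron entry along these lines would hit DIH every two hours
>> (core name and the /dataimport path are placeholders for your setup):
>>
>> 0 */2 * * * curl -s "http://localhost:8983/solr/yourcore/dataimport?command=full-import&entity=nvd-rss"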
>>
>> Regards,
>>     Alex.
>> ----
>> Sign up for my Solr resources newsletter at http://www.solr-start.com/
>>
>>
>> On 23 January 2015 at 13:29, Carl Roberts <carl.roberts.zap...@gmail.com>
>> wrote:
>>
>>> Hi Alex,
>>>
>>> If I am understanding this correctly, I can define multiple entities like
>>> this?
>>>
>>> <document>
>>>      <entity/>
>>>      <entity/>
>>>      <entity/>
>>>      ...
>>> </document>
>>>
>>> How would I trigger loading certain entities during start?
>>>
>>> How would I trigger loading other entities during update?
>>>
>>> Is there a way to set an auto-update for certain entities so that I don't
>>> have to invoke an update via curl?
>>>
>>> Where / how do I specify the preImportDeleteQuery to avoid deleting
>>> everything upon each update?
>>>
>>> Is there an example or doc that shows how to do all this?
>>>
>>> Regards,
>>>
>>> Joe
>>>
>>>
>>> On 1/23/15, 11:24 AM, Alexandre Rafalovitch wrote:
>>>
>>>> You can define both multiple entities in the same file and nested
>>>> entities, if your list comes from an external source (e.g. a text file
>>>> of URLs).
>>>> You can also trigger DIH with the name of a specific entity to load
>>>> just that one.
>>>> You can even pass a DIH configuration file when you trigger the
>>>> processing start, so you can have completely different files for the
>>>> initial load and the update. Though you can just do the same with
>>>> entities.
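>>>>
>>>> For example, something along these lines should import only the nvd-rss
>>>> entity (core name and /dataimport handler path are placeholders):
>>>>
>>>> curl "http://localhost:8983/solr/yourcore/dataimport?command=full-import&entity=nvd-rss"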
>>>>
>>>> The only thing to be aware of is that before an entity definition is
>>>> processed, a delete command is run. By default, it's "delete all", so
>>>> executing one entity will delete everything but then just populate
>>>> that one entity's results. You can avoid that by defining
>>>> preImportDeleteQuery and having a clear identifier on the content
>>>> generated by each entity (e.g. a source field, either extracted from
>>>> the data or added manually with TemplateTransformer).
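>>>>
>>>> A rough sketch of what that could look like, assuming a source field
>>>> exists in your schema (the names here are only illustrative):
>>>>
>>>> <entity name="nvd-rss"
>>>>         preImportDeleteQuery="source:nvd-rss"
>>>>         transformer="TemplateTransformer,DateFormatTransformer"
>>>>         ...>
>>>>     <field column="source" template="nvd-rss" />
>>>>     <!-- other fields as before -->
>>>> </entity>
>>>>
>>>> Then a full-import of that entity only deletes documents whose source
>>>> field is nvd-rss before repopulating them.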
>>>>
>>>> Regards,
>>>>      Alex.
>>>>
>>>> ----
>>>> Sign up for my Solr resources newsletter at http://www.solr-start.com/
>>>>
>>>>
>>>> On 23 January 2015 at 11:15, Carl Roberts <carl.roberts.zap...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I have the RSS DIH example working with my own RSS feed - here is the
>>>>> configuration for it.
>>>>>
>>>>> <dataConfig>
>>>>>     <dataSource type="URLDataSource" />
>>>>>     <document>
>>>>>         <entity name="nvd-rss"
>>>>>                 pk="link"
>>>>>                 url="https://nvd.nist.gov/download/nvd-rss.xml"
>>>>>                 processor="XPathEntityProcessor"
>>>>>                 forEach="/RDF/item"
>>>>>                 transformer="DateFormatTransformer">
>>>>>
>>>>>             <field column="id" xpath="/RDF/item/title" commonField="true" />
>>>>>             <field column="link" xpath="/RDF/item/link" commonField="true" />
>>>>>             <field column="summary" xpath="/RDF/item/description" commonField="true" />
>>>>>             <field column="date" xpath="/RDF/item/date" commonField="true" />
>>>>>
>>>>>         </entity>
>>>>>     </document>
>>>>> </dataConfig>
>>>>>
>>>>> However, my problem is that I also have to load multiple XML feeds into
>>>>> the same core. Here is one example (there are about 10 of them):
>>>>>
>>>>> http://static.nvd.nist.gov/feeds/xml/cve/nvdcve-2.0-2014.xml.zip
>>>>>
>>>>>
>>>>> Is there any built-in functionality that would allow me to do this?
>>>>> Basically, the use-case is to load and index all the XML ZIP files first,
>>>>> and then check the RSS feed every two hours and update the indexes with
>>>>> any new ones.
>>>>>
>>>>> Regards,
>>>>>
>>>>> Joe
>>>>>
>>>>>
>>>>>
>


-- 
Regards,
Shalin Shekhar Mangar.
