Unzipping things might be an issue; you may need to do that as part of a
batch job outside of Solr. For the rest, go through the documentation first -
it answers a bunch of questions. There is also a page on the Wiki, not just
in the reference guide.
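Something along these lines could work for the batch part - it is only an
untested sketch, and the feeds directory, core name, handler path, and entity
name are assumptions for the example (the entity url attributes for the yearly
feeds would have to point at the unzipped files, e.g. file:// URLs):

  #!/bin/sh
  # Rough sketch of a batch job outside Solr: fetch one of the NVD ZIP feeds,
  # unzip it, then ask DIH to import it without wiping other entities' documents.

  FEEDS_DIR=/data/nvd                                  # assumed: where the unzipped XML goes
  DIH_URL="http://localhost:8983/solr/nvd/dataimport"  # assumed core name and handler path

  curl -s -o /tmp/nvdcve-2.0-2014.xml.zip \
       http://static.nvd.nist.gov/feeds/xml/cve/nvdcve-2.0-2014.xml.zip
  unzip -o /tmp/nvdcve-2.0-2014.xml.zip -d "$FEEDS_DIR"
  # ...repeat for the other yearly feeds...

  # clean=false stops DIH from deleting what is already in the index;
  # entity=... (a made-up entity name here) runs only that one entity.
  curl "$DIH_URL?command=full-import&clean=false&entity=nvd-2014&commit=true"

The clean=false and entity parameters are the same ones discussed further down
in the thread.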
Regards,
   Alex.
----
Sign up for my Solr resources newsletter at http://www.solr-start.com/


On 23 January 2015 at 14:51, Carl Roberts <carl.roberts.zap...@gmail.com> wrote:
> Excellent - thanks Shalin. But how does delta-import work? Does it also do
> a clean? Does it require a unique id? Does it update existing records and
> only add new ones when necessary?
>
> And how would I go about unzipping the content from a URL so that I can
> then import the unzipped XML? Is the recommended way to extend the
> URLDataSource class, or is there any built-in logic to plug in
> pre-processing handlers?
>
> On 1/23/15, 2:39 PM, Shalin Shekhar Mangar wrote:
>>
>> If you add clean=false as a parameter to the full-import, then deletion
>> is disabled. Since you are ingesting RSS, there is no need for deletion
>> at all, I guess.
>>
>> On Fri, Jan 23, 2015 at 7:31 PM, Carl Roberts
>> <carl.roberts.zap...@gmail.com> wrote:
>>>
>>> OK - thanks for the doc.
>>>
>>> Is it possible to just provide an empty value for preImportDeleteQuery
>>> to disable the delete prior to import?
>>>
>>> Will the data still be deleted for each entity during a delta-import
>>> instead of a full-import?
>>>
>>> Is there any capability in the handler to unzip an XML file from a URL
>>> prior to reading it, or can I perhaps hook in a custom pre-processing
>>> handler?
>>>
>>> Regards,
>>>
>>> Joe
>>>
>>> On 1/23/15, 1:40 PM, Alexandre Rafalovitch wrote:
>>>>
>>>> https://cwiki.apache.org/confluence/display/solr/Uploading+Structured+Data+Store+Data+with+the+Data+Import+Handler
>>>>
>>>> The Admin UI has the interface, so you can play there once you define
>>>> it.
>>>>
>>>> You do have to use curl; there is no built-in scheduler.
>>>>
>>>> Regards,
>>>>    Alex.
>>>> ----
>>>> Sign up for my Solr resources newsletter at http://www.solr-start.com/
>>>>
>>>> On 23 January 2015 at 13:29, Carl Roberts
>>>> <carl.roberts.zap...@gmail.com> wrote:
>>>>>
>>>>> Hi Alex,
>>>>>
>>>>> If I am understanding this correctly, I can define multiple entities
>>>>> like this?
>>>>>
>>>>> <document>
>>>>>     <entity/>
>>>>>     <entity/>
>>>>>     <entity/>
>>>>>     ...
>>>>> </document>
>>>>>
>>>>> How would I trigger loading certain entities during the initial load?
>>>>>
>>>>> How would I trigger loading other entities during an update?
>>>>>
>>>>> Is there a way to set an auto-update for certain entities so that I
>>>>> don't have to invoke an update via curl?
>>>>>
>>>>> Where / how do I specify the preImportDeleteQuery to avoid deleting
>>>>> everything upon each update?
>>>>>
>>>>> Is there an example or doc that shows how to do all this?
>>>>>
>>>>> Regards,
>>>>>
>>>>> Joe
>>>>>
>>>>> On 1/23/15, 11:24 AM, Alexandre Rafalovitch wrote:
>>>>>>
>>>>>> You can define both multiple entities in the same file and nested
>>>>>> entities if your list comes from an external source (e.g. a text
>>>>>> file of URLs).
>>>>>>
>>>>>> You can also trigger DIH with the name of a specific entity to load
>>>>>> just that one.
>>>>>>
>>>>>> You can even pass a DIH configuration file when you trigger
>>>>>> processing, so you can have completely different files for the
>>>>>> initial load and the update - though you can also just do the same
>>>>>> with entities.
>>>>>>
>>>>>> The only thing to be aware of is that before an entity definition is
>>>>>> processed, a delete command is run. By default it is "delete all",
>>>>>> so executing one entity will delete everything and then populate
>>>>>> only that one entity's results.
>>>>>> You can avoid that by defining preImportDeleteQuery and having a
>>>>>> clear identifier on the content generated by each entity (e.g. a
>>>>>> source field, either extracted or added manually with
>>>>>> TemplateTransformer).
>>>>>>
>>>>>> Regards,
>>>>>>    Alex.
>>>>>> ----
>>>>>> Sign up for my Solr resources newsletter at http://www.solr-start.com/
>>>>>>
>>>>>> On 23 January 2015 at 11:15, Carl Roberts
>>>>>> <carl.roberts.zap...@gmail.com> wrote:
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I have the RSS DIH example working with my own RSS feed - here is
>>>>>>> the configuration for it:
>>>>>>>
>>>>>>> <dataConfig>
>>>>>>>     <dataSource type="URLDataSource" />
>>>>>>>     <document>
>>>>>>>         <entity name="nvd-rss"
>>>>>>>                 pk="link"
>>>>>>>                 url="https://nvd.nist.gov/download/nvd-rss.xml"
>>>>>>>                 processor="XPathEntityProcessor"
>>>>>>>                 forEach="/RDF/item"
>>>>>>>                 transformer="DateFormatTransformer">
>>>>>>>
>>>>>>>             <field column="id" xpath="/RDF/item/title" commonField="true" />
>>>>>>>             <field column="link" xpath="/RDF/item/link" commonField="true" />
>>>>>>>             <field column="summary" xpath="/RDF/item/description" commonField="true" />
>>>>>>>             <field column="date" xpath="/RDF/item/date" commonField="true" />
>>>>>>>
>>>>>>>         </entity>
>>>>>>>     </document>
>>>>>>> </dataConfig>
>>>>>>>
>>>>>>> However, my problem is that I also have to load multiple XML feeds
>>>>>>> into the same core. Here is one example (there are about 10 of them):
>>>>>>>
>>>>>>> http://static.nvd.nist.gov/feeds/xml/cve/nvdcve-2.0-2014.xml.zip
>>>>>>>
>>>>>>> Is there any built-in functionality that would allow me to do this?
>>>>>>> Basically, the use case is to load and index all the XML ZIP files
>>>>>>> first, and then check the RSS feed every two hours and update the
>>>>>>> index with any new entries.
>>>>>>>
>>>>>>> Regards,
>>>>>>>
>>>>>>> Joe
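P.S. For the "check the RSS feed every two hours" part: since there is no
built-in scheduler, that also has to come from outside Solr, e.g. cron. A
minimal sketch, assuming the same core name and handler path as in the batch
sketch above (nvd-rss is the entity name from your own config):

  # Hypothetical crontab entry: every two hours, re-run only the RSS entity,
  # with clean=false so documents loaded from the yearly feeds are left alone.
  0 */2 * * * curl -s "http://localhost:8983/solr/nvd/dataimport?command=full-import&clean=false&entity=nvd-rss&commit=true"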