Re: How to index data from multiple data source

Diego Pino Wed, 21 Jan 2015 08:09:12 -0800

Hi Yusniel,

Solr manages documents as a whole, so updating an existing document means 
replacing it. So you could (and probably should) index the metadata and the 
full text in one step, as one Solr document under one unique ID. That would be 
the simplest case.

You could also use nested child documents with block joins (depending on which 
version of Solr you are using; more info here: 
http://blog.griddynamics.com/2013/09/solr-block-join-support.html), but in my 
opinion that would be overkill.

We also mimic a kind of "semantic / linked data" setup, using additional fields 
(named after real ontology predicate/property names) to join documents that are 
related; see https://wiki.apache.org/solr/Join. So you could add the full text 
as an additional document with its own ID and fill a field of that document 
with the ID of the parent metadata document. Then at query time you can join 
them. Joins in Solr always return only the joined-to documents (the "to" side), 
not both sides (it is not like a SQL join, more like an inner query), so we 
experimented with self joins (the field holding the parent document's ID also 
holds the document's own ID), but as you can understand this is in no way 
optimal.
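
A minimal sketch of the two options above, in Python, only to make the document 
shapes concrete. The field names fulltext and parent_id, the "-fulltext" ID 
suffix and the function names are assumptions for illustration; they are not 
part of any stock schema.xml and would have to be declared in yours. The 
{!join from=... to=...} syntax is the one described on the Join wiki page.

```python
def merged_document(doc_id, metadata, fulltext):
    """Simplest case: one Solr document holding metadata and full text."""
    doc = dict(metadata)          # e.g. title, authors, keywords, abstract
    doc["fulltext"] = fulltext    # hypothetical field name
    doc["id"] = doc_id            # the single unique ID shared by both parts
    return doc


def linked_documents(doc_id, metadata, fulltext):
    """Alternative: a separate full-text document pointing at its parent."""
    parent = dict(metadata)
    parent["id"] = doc_id
    child = {
        "id": doc_id + "-fulltext",  # the child's own ID (made-up scheme)
        "parent_id": doc_id,         # hypothetical field holding the parent ID
        "fulltext": fulltext,
    }
    return [parent, child]


def join_query_params(term):
    """Match the full-text documents, return the parent metadata documents.

    The inner query runs against the child documents; Solr then returns the
    documents whose id equals a matching child's parent_id (the "to" side).
    """
    return {"q": "{!join from=parent_id to=id}fulltext:" + term}
```

The dictionaries returned by the first two functions are what you would send to 
Solr's update handler with whatever client you use; join_query_params("solr") 
gives the q parameter for a search that hits the full text but returns the 
metadata documents.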

Related: We are using a digital object repository (Fedora Commons + Islandora) 
to achieve exactly what you want to do. Our PDF files, and also many other 
types of data and metadata, are ingested as objects inside the repository, 
including technical metadata, MODS, DC, the binary stream and the full text. 
Then the whole object (as FOXML) goes through an XSLT transformation and into 
Solr. If you are interested, you can browse Islandora's Google group 
(https://groups.google.com/forum/#!forum/islandora) and visit Islandora's wiki 
(https://wiki.duraspace.org/display/ISLANDORA714/Islandora). There is a lot of 
documentation under the fedoragsearch module, which does the actual indexing. 
You can see our schemas and Solr config there.
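
To give an idea of what that transformation produces, here is a rough sketch in 
Python of mapping a Dublin Core record to a Solr XML update message. This is 
not the fedoragsearch XSLT and it does not read real FOXML (where the DC record 
sits inside a datastream); the function name, the sample input and the "dc." 
field prefix are assumptions made only for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core elements (dc:title, dc:creator, ...).
DC_NS = "{http://purl.org/dc/elements/1.1/}"


def dc_to_solr_add(object_id, dc_xml):
    """Turn a Dublin Core record into a Solr <add><doc> update message."""
    record = ET.fromstring(dc_xml)
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    id_field = ET.SubElement(doc, "field", name="id")
    id_field.text = object_id
    for element in record.iter():
        # Copy every non-empty DC element into a field named dc.<element>.
        if element.tag.startswith(DC_NS) and element.text and element.text.strip():
            local_name = element.tag[len(DC_NS):]
            field = ET.SubElement(doc, "field", name="dc." + local_name)
            field.text = element.text.strip()
    return ET.tostring(add, encoding="unicode")


# Hypothetical input, much smaller than a real repository object.
sample = (
    '<record xmlns:dc="http://purl.org/dc/elements/1.1/">'
    "<dc:title>A paper</dc:title><dc:creator>Someone</dc:creator>"
    "</record>"
)
print(dc_to_solr_add("demo:1", sample))
```

The full text extracted from the PDF would simply be one more field in the same 
doc element, which is the "one document, one ID" case described above.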

Feel free to write to me if you need or want more information.

Cheers

Diego Pino Navarro
Krayon Media
Pedro de Valdivia 575
Pucón - Chile
F:+56-45-2442469

On Jan 21, 2015, at 2:43 AM, Yusniel Hidalgo Delgado <yhdelg...@uci.cu> wrote:

> 
> 
> Dear Solr community, 
> 
> 
> 
> 
> I have started working with Solr recently and I need help with the following 
> usage scenario. I am working on a project to extract and search 
> bibliographic metadata from PDF files. First, my PDF files are processed to 
> extract bibliographic metadata such as title, authors, affiliations, keywords 
> and abstract. This metadata is stored in a relational database and then 
> indexed in Solr via DIH. However, I also need to index the full text of each 
> PDF and keep the same ID for the indexed metadata and the indexed full text 
> in the Solr index. How can I do that? How should I configure solrconfig.xml 
> and schema.xml to do it? 
> 
> 
> 
> 
> Thanks in advance. 
> 
> 
> 
> 
> Best regards 
> 
> Yusniel Hidalgo Delgado 
> Semantic Web Research Group 
> University of Informatics Sciences 
> http://gws-uci.blogspot.com/ 
> Havana, Cuba 
> 
> 
> 
> 
> ---------------------------------------------------
> 12th anniversary of the founding of the University of Informatics Sciences. 
> 12 years of history alongside Fidel. December 12, 2014.
