What is the difference between SolrCell-based Tika and Tika in Nutch?

2015-03-21 Thread zhangxin0804
Hi All,

 I am new to Solr. I have a question as follows:
 Is there any difference between extracting metadata using Tika in Nutch
and extracting metadata using SolrCell-based Tika? I used both ways to
extract metadata from PDF and PNG files, and the results are almost the
same. Can anyone tell me about this?
Thank you so much.






Re: SOLR indexing strategy

2015-03-21 Thread varun sharma
It's more of a financial message, where for each customer there are various
fields that specify various aspects of the transaction.


On Friday, 20 March 2015 8:09 PM, Priceputu Cristian wrote:

Why would you need 1000 fields?
C

On Fri, Mar 20, 2015 at 1:12 PM, varun sharma  wrote:

The requirements of the system that we are trying to build: for each date we
need to create a SOLR index containing about 350-500 million documents, where
each document is a single structured record having about 1000 fields. We then
query it based on index keys and date; for instance, we will try to search
records related to a particular user where the date is between Jan-1-2015 and
Jan-31-2015. This query should load only the indexes within this date range
into memory and return the rows matching the search pattern. Please suggest
how this can be implemented using SOLR/Lucene. Thank you, Varun.





-- 
Regards,
Cristian.


  

Re: SOLR indexing strategy

2015-03-21 Thread varun sharma
1. All fields should be retrievable and are populated for each row, maybe
with default values for some.
2. Out of 1000 fields, 10-15 need to be indexed.

In our current proprietary solution, the index and the (compressed) data files
reside together on SAN storage; based on the date range, date-specific index
files are loaded into memory, which in turn are used to fetch data blocks.


On Saturday, 21 March 2015 12:08 PM, Jack Krupansky wrote:

1. With 1000 fields, you may only get 10 to 25 million rows per node, so a
single date may take 15 to 50 nodes.
2. How many of the fields need to be indexed for reference in a query?
3. Are all the fields populated for each row?
4. Maybe you could split each row, so that one Solr collection would have a
slice of the fields. Then separate Solr clusters could be used for each of
the slices.
-- Jack Krupansky
On Fri, Mar 20, 2015 at 7:12 AM, varun sharma  wrote:

The requirements of the system that we are trying to build: for each date we
need to create a SOLR index containing about 350-500 million documents, where
each document is a single structured record having about 1000 fields. We then
query it based on index keys and date; for instance, we will try to search
records related to a particular user where the date is between Jan-1-2015 and
Jan-31-2015. This query should load only the indexes within this date range
into memory and return the rows matching the search pattern. Please suggest
how this can be implemented using SOLR/Lucene. Thank you, Varun.





  

Re: SOLR indexing strategy

2015-03-21 Thread Jack Krupansky
Don't you have a number of "types" of transactions, where some fields may
be common to all transactions, but with plenty of fields that are not
common to all transactions? The point is that if the number of fields that
need to be populated for each document type is relatively low, it becomes
much more practical. But if all 1000 fields must always be populated...
that's much, much harder.

Default values? Try as hard as you can to not store default values in the
index - they take up space and transfer time. Lucene is much more efficient
at storing empty field values.

If you are only indexing 10-15 fields, that's a very good thing, but not
enough by itself.

An alternate model: use Solr to index your 10-15 fields and only store the
native key for each record in Solr. That will keep your Solr index much
smaller. Then, you perform your query in Solr and get back only the native
keys for the matching records, and then you would do a database lookup in
your bulk storage engine directly by those keys to fetch just the records
that match the query results.
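
A minimal schema.xml sketch of that model (field names and types here are
illustrative, assuming the example schema's tdate/tdouble types, not anything
from this thread): the native key is the only stored field, and the handful
of searchable fields are indexed but not stored, which keeps the index small.

    <!-- native key from the bulk store: the only stored field -->
    <field name="record_key" type="string" indexed="true" stored="true"
           required="true"/>
    <!-- the 10-15 searchable fields: indexed only, never stored -->
    <field name="customer_id" type="string" indexed="true" stored="false"/>
    <field name="txn_date" type="tdate" indexed="true" stored="false"/>
    <field name="amount" type="tdouble" indexed="true" stored="false"/>
    <uniqueKey>record_key</uniqueKey>

A query would then request only fl=record_key, with a filter such as
fq=txn_date:[2015-01-01T00:00:00Z TO 2015-01-31T23:59:59Z], and the returned
keys drive direct lookups in the bulk store.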

What do your queries tend to look like?


-- Jack Krupansky

On Sat, Mar 21, 2015 at 5:36 AM, varun sharma 
wrote:

> It's more of a financial message, where for each customer there are various
> fields that specify various aspects of the transaction.
>
>
> On Friday, 20 March 2015 8:09 PM, Priceputu Cristian <priceputu.crist...@gmail.com> wrote:
>
>
> Why would you need 1000 fields?
> C
>
> On Fri, Mar 20, 2015 at 1:12 PM, varun sharma 
> wrote:
>
> The requirements of the system that we are trying to build: for each date
> we need to create a SOLR index containing about 350-500 million documents,
> where each document is a single structured record having about 1000 fields.
> We then query it based on index keys and date; for instance, we will try to
> search records related to a particular user where the date is between
> Jan-1-2015 and Jan-31-2015. This query should load only the indexes within
> this date range into memory and return the rows matching the search
> pattern. Please suggest how this can be implemented using SOLR/Lucene.
> Thank you, Varun.
>
>
>
>
>
> --
> Regards,
> Cristian.
>
>
>
>


Re: What is the difference between SolrCell-based Tika and Tika in Nutch?

2015-03-21 Thread Erick Erickson
Well, they could be different versions of Tika; I don't know. You can
tell from the respective jars in the two projects.

But more importantly, _how_ the fields from Nutch-based Tika map into
Solr fields and how they're mapped in SolrCell may be different, but
that would be because your configurations are different. What I'm
saying is that _you_ have to ensure that your configs do the same
mapping of extracted metadata to Solr fields.
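
For the SolrCell side, that mapping typically lives in the /update/extract
handler in solrconfig.xml; on the Nutch side, conf/solrindex-mapping.xml
plays the analogous role. A minimal sketch of the SolrCell half, with
illustrative target field names:

    <requestHandler name="/update/extract"
                    class="solr.extraction.ExtractingRequestHandler">
      <lst name="defaults">
        <!-- map Tika's extracted body to the "text" field -->
        <str name="fmap.content">text</str>
        <!-- lowercase metadata names so mappings are predictable -->
        <str name="lowernames">true</str>
        <!-- route any unmapped metadata to ignored_* (pair with an
             ignored_* dynamic field to drop it silently) -->
        <str name="uprefix">ignored_</str>
      </lst>
    </requestHandler>

Unless both sides map the same metadata names to the same Solr fields, the
indexed results will differ even with identical Tika versions.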

Best,
Erick

On Fri, Mar 20, 2015 at 9:11 PM, zhangxin0804  wrote:
> Hi All,
>
>  I am new to Solr. I have a question as follows:
>  Is there any difference between extracting metadata using Tika in Nutch
> and extracting metadata using SolrCell-based Tika? I used both ways to
> extract metadata from PDF and PNG files, and the results are almost the
> same. Can anyone tell me about this?
> Thank you so much.


Re: What is the difference between SolrCell-based Tika and Tika in Nutch?

2015-03-21 Thread Furkan KAMACI
Hi,

Which versions of Solr and Nutch do you use? Both support Tika 1.7 in their
recent versions.

Kind Regards,
Furkan KAMACI

On Sat, Mar 21, 2015 at 6:46 PM, Erick Erickson 
wrote:

> Well, they could be different versions of Tika; I don't know. You can
> tell from the respective jars in the two projects.
>
> But more importantly, _how_ the fields from Nutch-based Tika map into
> Solr fields and how they're mapped in SolrCell may be different, but
> that would be because your configurations are different. What I'm
> saying is that _you_ have to ensure that your configs do the same
> mapping of extracted metadata to Solr fields.
>
> Best,
> Erick
>
> On Fri, Mar 20, 2015 at 9:11 PM, zhangxin0804 wrote:
> > Hi All,
> >
> >  I am new to Solr. I have a question as follows:
> >  Is there any difference between extracting metadata using Tika in Nutch
> > and extracting metadata using SolrCell-based Tika? I used both ways to
> > extract metadata from PDF and PNG files, and the results are almost the
> > same. Can anyone tell me about this?
> > Thank you so much.


Need help using DIH with FileListEntityProcessor with XPathEntityProcessor

2015-03-21 Thread Martin Wunderlich
Hi all, 

I am trying to create a data import handler (DIH) to import XML files. The 
source XML should be transformed using XSLT into the standard Solr import 
format. I have tested the XSLT and successfully imported data using the 
Java-based simple import tool. However, when I try to import the same XML files 
with the same XSLT pre-processing using a DIH configured in solrconfig.xml, it 
doesn’t work. I can execute the DIH from the admin interface, but no documents 
get imported. The logging console doesn’t give any errors. 

Could someone who has managed to successfully set up a similar configuration 
(XML import via DIH with XSLT pre-processing) provide the basic 
configuration, so that I can check what might be wrong in mine? 

Thanks a lot. 

Cheers, 

Martin
 



Re: Need help using DIH with FileListEntityProcessor with XPathEntityProcessor

2015-03-21 Thread Alexandre Rafalovitch
What do you mean by using DIH and XSLT together? DIH uses a basic XPath
parser, not a full XSLT engine.

So, it's not very clear what the question actually means. How did you
configure it all?
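
That said, XPathEntityProcessor does accept an optional xsl attribute (the
transform runs through the JVM's XSLT processor, separately from DIH's
streaming XPath parser), and useSolrAddSchema="true" tells it the transformed
output is already in Solr's <add><doc> format. A hedged data-config.xml
sketch along those lines, with all paths illustrative:

    <dataConfig>
      <dataSource type="FileDataSource" encoding="UTF-8"/>
      <document>
        <!-- walk a directory of source XML files -->
        <entity name="files" processor="FileListEntityProcessor"
                baseDir="/path/to/xml" fileName=".*\.xml"
                rootEntity="false" dataSource="null">
          <!-- run each file through the XSLT, then read the resulting
               <add><doc> stream directly; no field declarations needed -->
          <entity name="records" processor="XPathEntityProcessor"
                  url="${files.fileAbsolutePath}"
                  xsl="xslt/to-solr-add.xsl"
                  useSolrAddSchema="true" stream="true"/>
        </entity>
      </document>
    </dataConfig>

If a setup like this imports zero documents silently, it is worth checking
first whether the fileName regex actually matches your files and whether the
xsl path resolves from the core's conf directory.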

Regards,
   Alex.

Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
http://www.solr-start.com/


On 21 March 2015 at 14:14, Martin Wunderlich  wrote:
> Hi all,
>
> I am trying to create a data import handler (DIH) to import XML files. The 
> source XML should be transformed using XSLT into the standard Solr import 
> format. I have tested the XSLT and successfully imported data using the 
> Java-based simple import tool. However, when I try to import the same XML 
> files with the same XSLT pre-processing using a DIH configured in 
> solrconfig.xml, it doesn’t work. I can execute the DIH from the admin 
> interface, but no documents get imported. The logging console doesn’t give 
> any errors.
>
> Could someone who has managed to successfully set up a similar configuration 
> (XML import via DIH with XSLT pre-processing) provide the basic 
> configuration, so that I can check what might be wrong in mine?
>
> Thanks a lot.
>
> Cheers,
>
> Martin
>
>