What is the difference between SolrCell-based Tika and Tika in Nutch?
Hi All,

I am new to Solr and have a question: is there any difference between extracting metadata using Tika in Nutch and extracting metadata using SolrCell-based Tika? I used both ways to extract metadata from PDF and PNG files, and the results are almost the same. Can anyone explain?

Thank you so much.
Re: SOLR indexing strategy
It's more of a financial message where, for each customer, various fields specify various aspects of the transaction.

On Friday, 20 March 2015 8:09 PM, Priceputu Cristian wrote:

> Why would you need 1000 fields?
> C
>
> On Fri, Mar 20, 2015 at 1:12 PM, varun sharma wrote:
>
>> The requirement of the system we are trying to build is that for each
>> date we need to create a SOLR index containing about 350-500 million
>> documents, where each document is a single structured record with about
>> 1000 fields. We then query it by index keys and date; for instance, we
>> will search for records related to a particular user where the date is
>> between Jan-1-2015 and Jan-31-2015. This query should load only the
>> indexes within this date range into memory and return the rows matching
>> the search pattern. Please suggest how this can be implemented using
>> SOLR/Lucene. Thank you, Varun.
>
> --
> Regards,
> Cristian.
Re: SOLR indexing strategy
1. All fields should be retrievable and are populated for each row, possibly with default values for some.
2. Out of the 1000 fields, 10-15 need to be indexed.

In our current proprietary solution, the index and the (compressed) data files reside together on SAN storage; based on the date range, the date-specific index files are loaded into memory, and these in turn are used to fetch the data blocks. (A sketch of the date field this implies follows the quoted text below.)

On Saturday, 21 March 2015 12:08 PM, Jack Krupansky wrote:

> 1. With 1000 fields, you may only get 10 to 25 million rows per node. So a
> single date may take 15 to 50 nodes.
> 2. How many of the fields need to be indexed for reference in a query?
> 3. Are all the fields populated for each row?
> 4. Maybe you could split each row, so that one Solr collection would have
> a slice of the fields. Then separate Solr clusters could be used for each
> of the slices.
>
> -- Jack Krupansky
>
> On Fri, Mar 20, 2015 at 7:12 AM, varun sharma wrote:
>
>> The requirement of the system we are trying to build is that for each
>> date we need to create a SOLR index containing about 350-500 million
>> documents, where each document is a single structured record with about
>> 1000 fields. We then query it by index keys and date; for instance, we
>> will search for records related to a particular user where the date is
>> between Jan-1-2015 and Jan-31-2015. This query should load only the
>> indexes within this date range into memory and return the rows matching
>> the search pattern. Please suggest how this can be implemented using
>> SOLR/Lucene. Thank you, Varun.
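For reference, the date restriction in the requirement above would normally map to a single indexed date field plus a range filter. A minimal sketch, with a hypothetical field name and the tdate type from the stock example schema:

    <!-- hypothetical schema.xml excerpt: one indexed-only date field per record -->
    <field name="tx_date" type="tdate" indexed="true" stored="false"/>
    <!-- a January 2015 search would then combine a user query with a range
         filter restricting matches to that date range, e.g.
         q=user_id:12345&fq=tx_date:[2015-01-01T00:00:00Z TO 2015-01-31T23:59:59Z] -->

Note that a filter only restricts matches; "loading only the relevant index into memory" would more likely be approximated in Solr by partitioning into one collection or core per time period and querying just the relevant ones.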
Re: SOLR indexing strategy
Don't you have a number of "types" of transactions, where some fields are common to all transactions, but plenty of fields are not common to all? The point is that if the number of fields that must be populated for each document type is relatively low, this becomes much more practical. But if all 1000 fields must always be populated... that's much, much harder.

Default values? Try as hard as you can not to store default values in the index - they take up space and transfer time. Lucene is much more efficient at storing empty field values.

If you are only indexing 10-15 fields, that's a very good thing, but not enough by itself. An alternate model: use Solr to index your 10-15 fields and store only the native key for each record in Solr. That will keep your Solr index much smaller. You then run your query against Solr, get back only the native keys of the matching records, and do a lookup in your bulk storage engine directly by those keys to fetch just the records that match. (A schema sketch for this model follows the quoted text below.)

What do your queries tend to look like?

-- Jack Krupansky

On Sat, Mar 21, 2015 at 5:36 AM, varun sharma wrote:

> It's more of a financial message where, for each customer, various fields
> specify various aspects of the transaction.
>
> On Friday, 20 March 2015 8:09 PM, Priceputu Cristian <
> priceputu.crist...@gmail.com> wrote:
>
>> Why would you need 1000 fields?
>> C
>>
>> On Fri, Mar 20, 2015 at 1:12 PM, varun sharma wrote:
>>
>>> The requirement of the system we are trying to build is that for each
>>> date we need to create a SOLR index containing about 350-500 million
>>> documents, where each document is a single structured record with about
>>> 1000 fields. We then query it by index keys and date; for instance, we
>>> will search for records related to a particular user where the date is
>>> between Jan-1-2015 and Jan-31-2015. This query should load only the
>>> indexes within this date range into memory and return the rows matching
>>> the search pattern. Please suggest how this can be implemented using
>>> SOLR/Lucene. Thank you, Varun.
>>
>> --
>> Regards,
>> Cristian.
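To make the alternate model concrete, here is a minimal, untested schema.xml sketch. The field names (native_key, user_id, tx_date, amount) stand in for whichever 10-15 searchable fields and external key you actually have:

    <!-- hypothetical schema.xml excerpt: the only stored field is the key
         used to look the full record up in the external bulk store -->
    <field name="native_key" type="string" indexed="true" stored="true" required="true"/>
    <!-- indexed-only search fields; never returned to the client -->
    <field name="user_id" type="string" indexed="true" stored="false"/>
    <field name="tx_date" type="tdate" indexed="true" stored="false"/>
    <field name="amount" type="tfloat" indexed="true" stored="false"/>
    <uniqueKey>native_key</uniqueKey>

Queries would then request fl=native_key only, and the application fetches the matching records from the bulk store by key.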
Re: What is the difference between SolrCell-based Tika and Tika in Nutch?
Well, they could be different versions of Tika; I don't know. You can tell this from the respective jars in the two projects.

But more importantly, _how_ the fields from Nutch-based Tika map into Solr fields and how they're mapped in SolrCell may be different, but this would be because your configurations are different. What I'm saying is that _you_ have to ensure that your configs do the same mapping of extracted metadata to Solr fields. (A SolrCell mapping sketch follows the quoted text below.)

Best,
Erick

On Fri, Mar 20, 2015 at 9:11 PM, zhangxin0804 wrote:

> Hi All,
>
> I am new to Solr and have a question: is there any difference between
> extracting metadata using Tika in Nutch and extracting metadata using
> SolrCell-based Tika? I used both ways to extract metadata from PDF and
> PNG files, and the results are almost the same. Can anyone explain?
>
> Thank you so much.
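On the SolrCell side, that mapping is controlled by the /update/extract handler in solrconfig.xml. A minimal, untested sketch - the fmap targets ("text", "date") are placeholder field names for whatever your schema actually defines:

    <!-- hypothetical solrconfig.xml excerpt: map Tika-extracted metadata
         to explicit Solr fields so both pipelines can be made to agree -->
    <requestHandler name="/update/extract"
                    class="solr.extraction.ExtractingRequestHandler">
      <lst name="defaults">
        <str name="lowernames">true</str>         <!-- lowercase metadata names -->
        <str name="fmap.content">text</str>       <!-- body text into "text" -->
        <str name="fmap.last_modified">date</str> <!-- Tika date into "date" -->
        <str name="uprefix">ignored_</str>        <!-- unmapped names get a prefix -->
      </lst>
    </requestHandler>

The uprefix trick assumes an ignored_* dynamic field in the schema; the Nutch side has its own metadata-indexing configuration that would need to target the same fields.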
Re: What is the difference between SolrCell-based Tika and Tika in Nutch?
Hi,

Which versions of Solr and Nutch do you use? Both Nutch and Solr support Tika 1.7 in their recent versions.

Kind Regards,
Furkan KAMACI

On Sat, Mar 21, 2015 at 6:46 PM, Erick Erickson wrote:

> Well, they could be different versions of Tika; I don't know. You can
> tell this from the respective jars in the two projects.
>
> But more importantly, _how_ the fields from Nutch-based Tika map into
> Solr fields and how they're mapped in SolrCell may be different, but
> this would be because your configurations are different. What I'm
> saying is that _you_ have to ensure that your configs do the same
> mapping of extracted metadata to Solr fields.
>
> Best,
> Erick
>
> On Fri, Mar 20, 2015 at 9:11 PM, zhangxin0804 wrote:
>
>> Hi All,
>>
>> I am new to Solr and have a question: is there any difference between
>> extracting metadata using Tika in Nutch and extracting metadata using
>> SolrCell-based Tika? I used both ways to extract metadata from PDF and
>> PNG files, and the results are almost the same. Can anyone explain?
>>
>> Thank you so much.
Need help using DIH with FileListEntityProcessor and XPathEntityProcessor
Hi all,

I am trying to create a data import handler (DIH) to import XML files. The source XML should be transformed using XSLT into the standard Solr import format. I have tested the XSLT and successfully imported the data using the Java-based simple import tool. However, when I try to import the same XML files with the same XSLT pre-processing using a DIH configured in solrconfig.xml, it doesn't work. I can execute the DIH from the admin interface, but no documents get imported, and the logging console doesn't show any errors.

Could someone who has managed to set up a similar configuration (XML import via DIH with XSLT pre-processing) provide their basic configuration, so that I can check what might be wrong in mine?

Thanks a lot.

Cheers,

Martin
Re: Need help using DIH with FileListEntityProcessor and XPathEntityProcessor
What do you mean by using DIH and XSLT together? DIH uses a basic XPath parser, not full XSLT, so it's not entirely clear what the question means. How did you configure it all? (A minimal configuration sketch follows the quoted text below.)

Regards,
Alex.

Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter: http://www.solr-start.com/

On 21 March 2015 at 14:14, Martin Wunderlich wrote:

> Hi all,
>
> I am trying to create a data import handler (DIH) to import XML files.
> The source XML should be transformed using XSLT into the standard Solr
> import format. I have tested the XSLT and successfully imported the data
> using the Java-based simple import tool. However, when I try to import
> the same XML files with the same XSLT pre-processing using a DIH
> configured in solrconfig.xml, it doesn't work. I can execute the DIH from
> the admin interface, but no documents get imported, and the logging
> console doesn't show any errors.
>
> Could someone who has managed to set up a similar configuration (XML
> import via DIH with XSLT pre-processing) provide their basic
> configuration, so that I can check what might be wrong in mine?
>
> Thanks a lot.
>
> Cheers,
>
> Martin
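For reference, XPathEntityProcessor does accept an xsl attribute, so a configuration along these lines is sometimes used when the XSLT emits the standard Solr <add><doc> format. This is an untested sketch; baseDir, the fileName pattern, and the stylesheet path are placeholders:

    <!-- hypothetical data-config.xml: list XML files, transform each with XSLT -->
    <dataConfig>
      <dataSource type="FileDataSource" encoding="UTF-8"/>
      <document>
        <!-- outer entity walks the directory; baseDir is a placeholder path -->
        <entity name="files" processor="FileListEntityProcessor"
                baseDir="/path/to/xml" fileName=".*\.xml"
                recursive="true" rootEntity="false" dataSource="null">
          <!-- inner entity applies the stylesheet; because the XSLT output is
               already in <add><doc> format, useSolrAddSchema maps it directly -->
          <entity name="docs" processor="XPathEntityProcessor"
                  url="${files.fileAbsolutePath}"
                  xsl="xslt/transform.xsl"
                  useSolrAddSchema="true"/>
        </entity>
      </document>
    </dataConfig>

If no documents appear and the log is silent, it is also worth confirming that the import was run with commit=true and that the entity name matches what the admin UI invokes.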