Thanks a lot, Alexandre, for the response. Much appreciated.

Thanks,
Saurabh

On Fri, Mar 28, 2014 at 8:56 AM, Alexandre Rafalovitch <arafa...@gmail.com> wrote:
> 1. You don't actually put PDF/Word into Solr. Instead, the file is run
> through a content and metadata extraction process, and the extracted
> text is what gets indexed. This is important because "a computer" does
> not understand what you are looking for when you open a PDF. It only
> understands whatever text is possible to extract. In the case of PDF
> that is often not much at all, unless it was generated with an
> accessibility layer in place. You can experiment with what you can
> extract by downloading a standalone Apache Tika install, which has a
> command line version, or by using Solr's extractOnly flag. Solr uses
> Tika internally, so the results should be the same.
>
> 2. When you do a search you can do "field:(Keyword1 Keyword2 Keyword3
> Keyword4)" and you get as results any document that matches one of
> those. Not sure about 1000 of them in one go, but certainly a large
> number.
>
> On the other hand, if you have the same keywords all the time and you
> are trying to match documents against them, you might be more
> interested in Elastic Search's percolator
> (http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-percolate.html)
> or in Luwak (https://github.com/flaxsearch/luwak).
>
> Regards,
> Alex.
> Personal website: http://www.outerthoughts.com/
> Current project: http://www.solr-start.com/ - Accelerating your Solr
> proficiency
>
>
> On Fri, Mar 28, 2014 at 10:05 AM, Saurabh Agarwal
> <sagarwal1...@gmail.com> wrote:
>> Thanks a lot Alex for your reply, appreciate it.
>>
>> So, if I leave out the line number part:
>> 1. I guess putting PDF/Word in Solr for search can be done. These
>> documents will go in Solr.
>> 2. For search, is there any automatic way to give an Excel sheet or a
>> large list of keywords to search for?
>> I.e. I have 1000s of words that I want to search for in the docs. Can
>> I do it collectively, or do I have to send search queries one by one?
>>
>> Thanks
>> Saurabh
>>
>>
>>
>> On Fri, Mar 28, 2014 at 6:48 AM, Alexandre Rafalovitch
>> <arafa...@gmail.com> wrote:
>>> This feels somewhat backwards. It's very hard to extract line-number
>>> information out of MS Word and next to impossible from PDF. So the
>>> question here is not whether Solr is a good fit; it is that maybe
>>> your whole architecture has a major issue. Can you do what you want
>>> by hand at least once? Down to the precision you want?
>>>
>>> If you can, then yes, you probably can automate the searching with
>>> Solr, though you will still have serious issues (sentences crossing
>>> line boundaries, etc). But I suspect your whole approach will change
>>> once you try to do this manually.
>>>
>>> Regards,
>>> Alex.
>>> Personal website: http://www.outerthoughts.com/
>>> Current project: http://www.solr-start.com/ - Accelerating your Solr
>>> proficiency
>>>
>>>
>>> On Thu, Mar 27, 2014 at 11:46 PM, Saurabh Agarwal
>>> <sagarwal1...@gmail.com> wrote:
>>>> Can anyone help me please.
>>>>
>>>> Hi All,
>>>>
>>>> I am new to Solr, and from initial reading I am quite convinced Solr
>>>> will be of great help. Can anyone help in making that decision?
>>>>
>>>> Use case:
>>>> 1. I will have PDF and Word docs generated daily/weekly (a lot of
>>>> them), which get overwritten frequently.
>>>> 2. I have a dictionary kind of thing (a list of which words/small
>>>> sentences should be part of the above docs, words which cannot be,
>>>> and alternatives for some).
>>>> 3. Now I want Solr to search the docs produced in step 1 for the
>>>> words/small sentences from step 2 and give me the doc name/line
>>>> number in which they exist.
>>>>
>>>> Will Solr be a good help to me? If anybody can help by giving some
>>>> examples, that will be great.
>>>>
>>>> Appreciate your help and patience.
>>>>
>>>> Thanks
>>>> Saurabh
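
For reference, the multi-keyword search Alexandre describes in point 2 (one "field:(Keyword1 Keyword2 ...)" query instead of one query per word) could be prepared along these lines. This is only a rough sketch and not something taken from the thread: the field name "content", the batch size of 200, and the helper names are assumptions made for illustration. Terms are joined with an explicit OR so the result does not depend on the default query operator, and the list is split into batches because Solr limits the number of boolean clauses per query (the maxBooleanClauses setting).

```python
# Sketch only: turn a long keyword list into a few Solr query strings of
# the form  field:("kw1" OR "kw2" OR ...).
# Assumptions (not from the thread): the field is called "content" and a
# batch size of 200 is small enough for the target Solr configuration.

def quote_term(term):
    """Wrap a keyword or short phrase in double quotes for a phrase match,
    escaping any backslashes and double quotes it contains."""
    escaped = term.replace("\\", "\\\\").replace('"', '\\"')
    return '"' + escaped + '"'


def build_queries(keywords, field="content", batch_size=200):
    """Return one query string per batch of non-empty keywords."""
    terms = [k.strip() for k in keywords if k.strip()]
    queries = []
    for start in range(0, len(terms), batch_size):
        batch = terms[start:start + batch_size]
        joined = " OR ".join(quote_term(t) for t in batch)
        queries.append(field + ":(" + joined + ")")
    return queries


if __name__ == "__main__":
    # Hypothetical dictionary entries; a real list could be read from a
    # CSV export of the Excel sheet.
    words = ["must include", "forbidden", "alternative"]
    for q in build_queries(words, batch_size=2):
        print(q)
```

Each resulting string would be sent as the q parameter of a normal select request. The response tells you which documents matched, not the line number, which is the limitation Alexandre points out earlier in the thread.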