1. You don't actually put the PDF/Word files themselves into Solr. Instead, each file is run through a content and metadata extraction process, and Solr indexes the result. This is important because "a computer" does not understand what you see when you open a PDF. It only understands whatever text it is possible to extract. With PDFs that is often not much at all, unless the file was generated with an accessibility layer in place. You can experiment with what can be extracted by downloading a standalone Apache Tika install, which has a command-line version, or by using Solr's extractOnly flag. Solr uses Tika internally, so the results should be the same.
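To make the extractOnly suggestion concrete, here is a minimal sketch of calling Solr's extracting handler from Python. The host, port, and core name (`collection1`) are assumptions for illustration, as are the helper names `build_extract_url` and `extract_only`; adjust them to your install. It only asks Solr what it can extract and does not index anything.

```python
import json
import urllib.parse
import urllib.request


def build_extract_url(solr_base, core):
    """Build the /update/extract URL with extractOnly switched on."""
    params = urllib.parse.urlencode({
        "extractOnly": "true",    # return the extracted content, index nothing
        "extractFormat": "text",  # plain text rather than XHTML
        "wt": "json",             # ask for a JSON response
    })
    return solr_base.rstrip("/") + "/" + core + "/update/extract?" + params


def extract_only(path, solr_base="http://localhost:8983/solr", core="collection1"):
    """POST a local file to Solr and return the parsed JSON response.

    The default URL and core name are assumptions, not taken from the thread.
    """
    with open(path, "rb") as fh:
        body = fh.read()
    request = urllib.request.Request(
        build_extract_url(solr_base, core),
        data=body,  # supplying data makes this a POST
        headers={"Content-Type": "application/octet-stream"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `extract_only("report.pdf")` against a running Solr and printing the result should show roughly what text and metadata Tika managed to pull out of the file, which is a quick way to see how little some PDFs yield.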
2. When you search, you can use "field:(Keyword1 Keyword2 Keyword3 Keyword4)" and you get back any document that matches at least one of those keywords. I am not sure about 1000 of them in one go, but certainly a large number. On the other hand, if you have the same keywords all the time and you are trying to match documents against them, you might be more interested in Elasticsearch's percolator (http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-percolate.html ) or in Luwak (https://github.com/flaxsearch/luwak).

Regards,
   Alex.
Personal website: http://www.outerthoughts.com/
Current project: http://www.solr-start.com/ - Accelerating your Solr proficiency

On Fri, Mar 28, 2014 at 10:05 AM, Saurabh Agarwal <sagarwal1...@gmail.com> wrote:
> Thanks a lot Alex for your reply, Appreciate the same.
>
> So if i leave the line no part.
> 1. I guess putting pdf/word in solr for search can be done, These
> documents will go go in solr.
> 2. For search any automatic way to give a excel sheet or large search
> keywords to search for .
> ie i have 1000's of words that i want to search in doc can i do it
> collectively or send search queries one by one.
>
> Thanks
> Saurabh
>
>
>
> On Fri, Mar 28, 2014 at 6:48 AM, Alexandre Rafalovitch
> <arafa...@gmail.com> wrote:
>> This feels somewhat backwards. It's very hard to extract Line-Number
>> information out of MSWord and next to impossible from PDF. So, it's
>> not whether the Solr is a good fit or not here is that maybe your
>> whole architecture has a major issue. Can you do this/what you want by
>> hand at least once? Down to the precision you want?
>>
>> If you can, then yes you probably can automate the searching with
>> Solr, though you will still have serious issues (sentence crossing
>> line-boundaries, etc). But I suspect your whole approach will change
>> once you try to do this manually.
>>
>> Regards,
>> Alex.
>> Personal website: http://www.outerthoughts.com/
>> Current project: http://www.solr-start.com/ - Accelerating your Solr
>> proficiency
>>
>>
>> On Thu, Mar 27, 2014 at 11:46 PM, Saurabh Agarwal
>> <sagarwal1...@gmail.com> wrote:
>>> Can anyone help me please.
>>>
>>> Hi All,
>>>
>>> I am new to Solr and from initial reading i am quite convinced Solr
>>> will be of great help. Can anyone help in making that decision.
>>>
>>> Usecase:
>>> 1. I will have PDF,Word docs generated daily/weekly ( lot of them )
>>> which kinds of get overwritten frequently.
>>> 2. I have a dictionary kind of thing ( having a list of which
>>> words/small sentences should be part of above docs , words which
>>> cannot be and alternatives for some ).
>>> 3. Now i want Solr to search my Docs produced in step 1 to be searched
>>> for words/small sentences from step 2 and give me my Doc Name/line no
>>> in which they exist.
>>>
>>> Will Solr be a good help to me, If anybody can help giving some
>>> examples that will be great.
>>>
>>> Appreciate your help and patience.
>>>
>>> Thanks
>>> Saurabh
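
P.S. On the multi-keyword search from point 2 at the top of this message: a rough sketch of turning a long keyword list into a handful of "field:(...)" queries, so you neither send 1000 requests nor one enormous one. The field name `text`, the batch size, and the helper names are assumptions for illustration. As far as I know, Solr's maxBooleanClauses setting defaults to 1024, which is one reason to batch. Every entry is quoted as a phrase so that "small sentences" stay together, and OR is written out rather than relying on the default operator.

```python
def quote_term(term):
    """Wrap a keyword or short sentence in quotes for the Lucene/Solr parser.

    Inside a quoted phrase only backslash and double quote need escaping.
    """
    escaped = term.replace("\\", "\\\\").replace('"', '\\"')
    return '"' + escaped + '"'


def build_queries(field, keywords, batch_size=500):
    """Return one 'field:(a OR b OR ...)' query string per batch of keywords."""
    queries = []
    for start in range(0, len(keywords), batch_size):
        batch = keywords[start:start + batch_size]
        clause = " OR ".join(quote_term(k) for k in batch)
        queries.append(field + ":(" + clause + ")")
    return queries


if __name__ == "__main__":
    # Hypothetical dictionary entries; in practice load them from your sheet.
    for q in build_queries("text", ["alpha", "small sentence", "gamma"], batch_size=2):
        print(q)
```

Each resulting string can be sent as the q parameter of a normal select request. This only tells you which documents match somewhere in the batch; working out which keyword matched in which document still needs per-keyword queries or highlighting.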