Re: Start up script for solr?
I blogged about it last month, here ya go.

http://www.digital39.com/programming/solr-chkconfig-and-startstop-scripts/2007/07/304/

- Pete

On 8/19/07, Jack L <[EMAIL PROTECTED]> wrote:
> Hi,
>
> Sorry that this is not strictly a solr specific question -
>
> I wonder if anyone has a script to start solr on Linux when the
> system boots up? Or better yet, supports shutdown and restart?
>
> --
> Thanks,
> Jack
Re: Start up script for solr?
I forgot to mention, that is for a RHEL box, but it can easily be adapted. It will work like the standard scripts for RHEL:

/etc/init.d/solr start
/etc/init.d/solr stop
/etc/init.d/solr restart

or you can just run the solr.start and solr.stop scripts individually.

On 8/19/07, Peter Manis <[EMAIL PROTECTED]> wrote:
> I blogged about it last month, here ya go.
>
> http://www.digital39.com/programming/solr-chkconfig-and-startstop-scripts/2007/07/304/
>
> - Pete
>
> On 8/19/07, Jack L <[EMAIL PROTECTED]> wrote:
> > Hi,
> >
> > Sorry that this is not strictly a solr specific question -
> >
> > I wonder if anyone has a script to start solr on Linux when the
> > system boots up? Or better yet, supports shutdown and restart?
> >
> > --
> > Thanks,
> > Jack
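[The blog post above has the actual scripts. As a rough idea of their shape, a stripped-down sketch of such an init wrapper might look like the following. The paths, stop port, and key are placeholders, not the blog's values, and a real RHEL script would also need the chkconfig header to run at boot.]

    #!/bin/bash
    # /etc/init.d/solr -- minimal sketch, not the scripts from the blog post
    SOLR_HOME=/opt/solr/example                      # placeholder: Jetty-based Solr example dir
    JAVA_OPTS="-DSTOP.PORT=8079 -DSTOP.KEY=secret"   # placeholder stop port/key

    start() {
        cd "$SOLR_HOME" || exit 1
        echo "Starting Solr..."
        # assumes $SOLR_HOME/logs exists
        nohup java $JAVA_OPTS -jar start.jar > "$SOLR_HOME/logs/solr.log" 2>&1 &
    }

    stop() {
        cd "$SOLR_HOME" || exit 1
        echo "Stopping Solr..."
        java $JAVA_OPTS -jar start.jar --stop
    }

    case "$1" in
        start)   start ;;
        stop)    stop ;;
        restart) stop; sleep 5; start ;;
        *)       echo "Usage: $0 {start|stop|restart}"; exit 1 ;;
    esac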
Re: Re[2]: Start up script for solr?
Interesting, it worked fine on the server. Try moving the -stop at the end of the line to just before the -jar.

- Pete

On 8/19/07, Jack L <[EMAIL PROTECTED]> wrote:
> Hello Peter,
>
> Many thanks!
>
> solr.start works fine but I'm getting an error with solr.stop and solr is not
> being stopped:
> (I've replaced my app dir with /opt/directory in the log:)
>
> $ /etc/init.d/solr stop
> Stopping Solr... java.net.BindException: Address already in use
> WARN: Not listening on monitor port: 8079
> 2007-08-19 15:23:02.533::INFO: Logging to STDERR via org.mortbay.log.StdErrLog
> 2007-08-19 15:23:02.563::WARN: EXCEPTION
> java.io.FileNotFoundException: /opt/directory/-stop (No such file or directory)
>         at java.io.FileInputStream.open(Native Method)
>         at java.io.FileInputStream.<init>(FileInputStream.java:106)
>         at java.io.FileInputStream.<init>(FileInputStream.java:66)
>         at sun.net.www.protocol.file.FileURLConnection.connect(FileURLConnection.java:70)
>         at sun.net.www.protocol.file.FileURLConnection.getInputStream(FileURLConnection.java:161)
>         at com.sun.org.apache.xerces.internal.impl.XMLEntityManager.setupCurrentEntity(XMLEntityManager.java:653)
>         at com.sun.org.apache.xerces.internal.impl.XMLVersionDetector.determineDocVersion(XMLVersionDetector.java:186)
>         at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:771)
>         at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:737)
>         at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:107)
>         at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1205)
>         at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:522)
>         at javax.xml.parsers.SAXParser.parse(SAXParser.java:395)
>         at org.mortbay.xml.XmlParser.parse(XmlParser.java:188)
>         at org.mortbay.xml.XmlParser.parse(XmlParser.java:204)
>         at org.mortbay.xml.XmlConfiguration.<init>(XmlConfiguration.java:100)
>         at org.mortbay.xml.XmlConfiguration.main(XmlConfiguration.java:916)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>         at java.lang.reflect.Method.invoke(Method.java:597)
>         at org.mortbay.start.Main.invokeMain(Main.java:183)
>         at org.mortbay.start.Main.start(Main.java:497)
>         at org.mortbay.start.Main.main(Main.java:115)
> OK
>
> --
> Best regards,
> Jack
>
> Sunday, August 19, 2007, 11:43:16 AM, you wrote:
>
> > I blogged about it last month, here ya go.
> >
> > http://www.digital39.com/programming/solr-chkconfig-and-startstop-scripts/2007/07/304/
> >
> > - Pete
> >
> > On 8/19/07, Jack L <[EMAIL PROTECTED]> wrote:
> >> Hi,
> >>
> >> Sorry that this is not strictly a solr specific question -
> >>
> >> I wonder if anyone has a script to start solr on Linux when the
> >> system boots up? Or better yet, supports shutdown and restart?
> >>
> >> --
> >> Thanks,
> >> Jack
Re: Re[4]: Start up script for solr?
Sorry about that, I left out the 2nd dash when I added it to the blog. Glad it is working now.

On 8/19/07, Jack L <[EMAIL PROTECTED]> wrote:
> Actually it's --stop. Thanks!
>
> > Interesting, it worked fine on the server. Try moving the -stop at
> > the end of the line to just before the -jar.
> >
> > - Pete
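[For reference, with Jetty's stock start.jar the shutdown call looks roughly like this. STOP.PORT must match whatever the server was started with (8079 matches the monitor port in the log above); the key value is only a placeholder.]

    # start with a stop port/key so the instance can be shut down later
    java -DSTOP.PORT=8079 -DSTOP.KEY=secret -jar start.jar &

    # shut down: note the double dash on --stop, placed after "-jar start.jar"
    java -DSTOP.PORT=8079 -DSTOP.KEY=secret -jar start.jar --stop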
Re: Indexing large documents
Fouad,

I would check the error log or console for any possible errors first. They may not show up; it really depends on how you are processing the Word document (custom Solr, feeding the text to it, etc.). We are using a custom version of Solr with PDF, DOC, XLS, etc. text extraction, and I have successfully indexed 40 MB documents. I did have indexing problems with a large document or two, and simply increasing the heap size fixed the problem.

- Pete

On 8/20/07, Fouad Mardini <[EMAIL PROTECTED]> wrote:
> Hello,
>
> I am using solr to index text extracted from word documents, and it is
> working really well.
> Recently i started noticing that some documents are not indexed, that is i
> know that the word foobar is in a document, but when i search for foobar the
> id of that document is not returned.
> I suspect that this has to do with the size of the document, and that
> documents with a lot of text are not being indexed.
> Please advise.
>
> thanks,
> fmardini
Re: Indexing large documents
That should show some errors if something goes wrong; if not, the console usually will. The errors will look like a Java stack trace. Did increasing the heap do anything for you? Changing mine to a 256 MB max worked fine for all of our files.

On 8/20/07, Fouad Mardini <[EMAIL PROTECTED]> wrote:
> Well, I am using the java textmining library to extract text from documents,
> then i do a post to solr
> I do not have an error log, i only have *.request.log files in the logs
> directory
>
> Thanks
>
> On 8/20/07, Peter Manis <[EMAIL PROTECTED]> wrote:
> >
> > Fouad,
> >
> > I would check the error log or console for any possible errors first.
> > They may not show up, it really depends on how you are processing the
> > word document (custom solr, feeding the text to it, etc). We are
> > using a custom version of solr with PDF, DOC, XLS, etc text extraction
> > and I have successfully indexed 40mb documents. I did have indexing
> > problems with a large document or two and simply increasing the heap
> > size fixed the problem.
> >
> > - Pete
> >
> > On 8/20/07, Fouad Mardini <[EMAIL PROTECTED]> wrote:
> > > Hello,
> > >
> > > I am using solr to index text extracted from word documents, and it is
> > > working really well.
> > > Recently i started noticing that some documents are not indexed, that is i
> > > know that the word foobar is in a document, but when i search for foobar the
> > > id of that document is not returned.
> > > I suspect that this has to do with the size of the document, and that
> > > documents with a lot of text are not being indexed.
> > > Please advise.
> > >
> > > thanks,
> > > fmardini
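[For reference, the heap increase mentioned above is just a JVM flag on whatever command launches Solr. A sketch assuming the Jetty example distribution; the 256 MB figure is the one quoted above.]

    # raise the maximum heap before indexing very large documents
    java -Xmx256m -jar start.jar

    # (under Tomcat or another container, the same -Xmx flag would go in JAVA_OPTS)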
Re: Indexing Doc, PDF, ... from filesystem (Newbie Question)
Christian,

Eric Pugh implemented this functionality for a project we were doing and has released the code on JIRA. We have had very good results with it. If I can be of any help using it beyond the Java code itself, let me know. The last revision I used it with was 552853, so if the build happens to fail you can roll back to that and it will work.

https://issues.apache.org/jira/browse/SOLR-284

- Pete

On 8/21/07, Christian Klinger <[EMAIL PROTECTED]> wrote:
> Hi Solr Users,
>
> i have set up a Solr-Server with a custom Schema.
> Now i have updated the index with some content from
> xml-files.
>
> Now i try to update the contents of a folder.
> The folder consists of various document-types
> (pdf,doc,xls,...).
>
> Is there anywhere a howto on how i can parse the
> documents, make an xml of the parsed content
> and post it to the solr server?
>
> Thanks in advance.
>
> Christian
Re: Indexing Doc, PDF, ... from filesystem (Newbie Question)
Installing the patch requires downloading the latest Solr via Subversion and applying the patch to the source. Eric has updated his patch against various Subversion revisions; to make sure it will compile, I suggest getting the revision he lists.

As for using the features of this patch, this is the URL that would be called:

/solr/update/rich?stream.file=filename&stream.type=filetype&id=id&stream.fieldname=storagefield&fieldnames=cat,desc,type,name&type=filetype&cat=category&name=name&desc=description

Breaking this down:

stream.file is the absolute path to the file you want to index. stream.type specifies the type of file, which currently supports pdf, xls, doc, ppt. The next field is the id, where you specify the unique value for the id in your schema. For example, we had a document referenced in a database whose id was 103, so we would pass the value 103 to identify which document it was in the index. stream.fieldname is the name of the field in your index that will actually store the text from the document. We had the field 'data', so it would be stream.fieldname=data in the URL.

The parameter fieldnames lists any additional fields in your index that need to be filled. We were passing a category, a description for the document, a name, and the type, so you just specify the names of those fields. Solr will then look for corresponding parameters with those names, which you can see at the end of my URL. The values passed for the additional parameters need to be URL encoded.

I'm not a Java programmer, so if you have questions about the internals of the code, definitely direct those to Eric as I cannot help; I have only implemented it in web applications. If you have any other questions about the use of the patch, I can answer those. Enjoy!

- Pete

On 8/21/07, Vish D. <[EMAIL PROTECTED]> wrote:
> There seems to be some code out for Tika now (not packaged/announced yet,
> but...). Could someone please take a look at it and see if that could fit
> in? I am eagerly waiting for a reply back from tika-dev, but no luck yet.
>
> http://svn.apache.org/repos/asf/incubator/tika/trunk/src/main/java/org/apache/tika/
>
> I see that Eric's patch uses POI (for most of it), so that's great! I have
> seen too many duplicated efforts, even in Apache projects alone, and this is
> one step closer to fixing it (other than Tika, which isn't 'complete' yet).
> Are there any plans on releasing this patch with the Solr dist? Or any
> instructions on using/installing the patch itself?
>
> Thanks
> Vish
>
> On 8/21/07, Peter Manis <[EMAIL PROTECTED]> wrote:
> >
> > Christian,
> >
> > Eric Pugh implemented this functionality for a project we were
> > doing and has released the code on JIRA. We have had very good results
> > with it. If I can be of any help using it beyond the Java code itself,
> > let me know. The last revision I used it with was 552853, so if the
> > build happens to fail you can roll back to that and it will work.
> >
> > https://issues.apache.org/jira/browse/SOLR-284
> >
> > - Pete
> >
> > On 8/21/07, Christian Klinger <[EMAIL PROTECTED]> wrote:
> > > Hi Solr Users,
> > >
> > > i have set up a Solr-Server with a custom Schema.
> > > Now i have updated the index with some content from
> > > xml-files.
> > >
> > > Now i try to update the contents of a folder.
> > > The folder consists of various document-types
> > > (pdf,doc,xls,...).
> > >
> > > Is there anywhere a howto on how i can parse the
> > > documents, make an xml of the parsed content
> > > and post it to the solr server?
> > >
> > > Thanks in advance.
> > >
> > > Christian
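[To make the parameter breakdown above concrete, a hypothetical call to the rich handler via curl might look like the one below. The host/port, file path, id value, and every field name here are illustrations of the pattern described, not values defined by the patch itself.]

    curl "http://localhost:8983/solr/update/rich?stream.file=/docs/whitepaper.pdf&stream.type=pdf&id=103&stream.fieldname=data&fieldnames=cat,desc,type,name&type=pdf&cat=whitepapers&name=Example+Whitepaper&desc=A+sample+PDF+document"

[The trailing parameters (type, cat, name, desc) line up with the names listed in fieldnames, and their values are URL encoded as described above.]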
Re: Indexing Doc, PDF, ... from filesystem (Newbie Question)
I am a little confused about how you have things set up. So these meta-data files contain certain information, and there may or may not be a pdf, xls, or doc associated with each one?

If that is the case, if it were me I would write something to parse the meta-data files, and if there is a binary file associated with one, submit it using the URL I showed you. If the meta-data is just that and has no associated document, submit it in XML form. The script shouldn't be too complicated, but that depends on the complexity of the meta-data you are parsing.

To give you an idea how I use it: we have hundreds of documents in PDF, DOC, XLS, HTML, TXT, CSV, and PPT formats. When a document is to be indexed by Solr we look at the extension; if it is txt or html/htm we read the data in and submit it with the XML handler. If the document is one of the binary formats we submit it with the URL I showed you. All information about these files is stored in a database, and some of the 'documents' in the database are just links to external documents. In that case we are only indexing a description, title, and category.

You are correct, it would overwrite the data by doing an update unless you parsed the meta-data, and if you are parsing the meta-data you might as well just parse it from the start and index once.

How are you handling these meta-data files right now? Are they simply xml files like in the solr example where you are just running the bash script on them, or is something parsing the contents already?

- Pete

On 8/21/07, Vish D. <[EMAIL PROTECTED]> wrote:
> Pete,
>
> Thanks for the great explanation.
>
> Thinking it through my process, I am not sure how to use it:
>
> I have a bunch of docs that pretty much contain a lot of meta-data, some
> of which include full-text files (.pdf, .ppt, etc...). I use these docs
> correctly to index/update into Solr. The next step now is to somehow index
> the text from the full-text files. One way to think about it is, I could
> have a placeholder field 'data' and keep it empty for the first pass, and
> then run update/rich to index the actual full-text, but using the same
> unique doc id. But this would actually overwrite the doc in the index, won't
> it? And there really isn't a 'merge' operation, right?
>
> There might be a better way to use this full-text indexing option,
> schema-wise, say:
>
> - have a new option richData that will take in a source field name,
> - validate its value (valid filename/file),
> - recognize the file type,
> - and put the 'data' into another field
>
> What do you think? I am not a true Java developer, so not sure if I could
> do it myself, but only hope that someone else on the project could ;-)...
>
> Rao
>
> On 8/21/07, Peter Manis <[EMAIL PROTECTED]> wrote:
> >
> > Installing the patch requires downloading the latest Solr via
> > Subversion and applying the patch to the source. Eric has updated his
> > patch against various Subversion revisions; to make sure it will
> > compile, I suggest getting the revision he lists.
> >
> > As for using the features of this patch, this is the URL that would be called:
> >
> > /solr/update/rich?stream.file=filename&stream.type=filetype&id=id&stream.fieldname=storagefield&fieldnames=cat,desc,type,name&type=filetype&cat=category&name=name&desc=description
> >
> > Breaking this down:
> >
> > stream.file is the absolute path to the file you want to index.
> > stream.type specifies the type of file, which currently supports
> > pdf, xls, doc, ppt. The next field is the id, where you specify the
> > unique value for the id in your schema. For example, we had a document
> > referenced in a database whose id was 103, so we would pass the value
> > 103 to identify which document it was in the index. stream.fieldname
> > is the name of the field in your index that will actually store the
> > text from the document. We had the field 'data', so it would be
> > stream.fieldname=data in the URL.
> >
> > The parameter fieldnames lists any additional fields in your index that
> > need to be filled. We were passing a category, a description for the
> > document, a name, and the type, so you just specify the names of those
> > fields. Solr will then look for corresponding parameters with those
> > names, which you can see at the end of my URL. The values passed for
> > the additional parameters need to be URL encoded.
> >
> > I'm not a Java programmer, so if you have questions about the internals
> > of the code, definitely direct those t
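[A rough shell sketch of the dispatch-by-extension workflow Pete describes above. Everything here is illustrative: the Solr URL, the 'data' field name, and the id handling are assumptions, and the XML escaping is deliberately minimal.]

    #!/bin/bash
    # index_one.sh <file> <id> -- decide per extension how to hand a document to Solr
    FILE="$1"; ID="$2"
    SOLR="http://localhost:8983/solr"        # placeholder Solr location
    EXT="${FILE##*.}"

    case "$EXT" in
      txt|htm|html)
        # plain text / HTML: read the contents and post them through the XML update handler
        # (escaping here handles only & and <; values assumed otherwise XML-safe)
        BODY=$(sed -e 's/&/\&amp;/g' -e 's/</\&lt;/g' "$FILE")
        curl -s "$SOLR/update" -H 'Content-Type: text/xml' --data-binary \
          "<add><doc><field name=\"id\">$ID</field><field name=\"data\">$BODY</field></doc></add>"
        ;;
      pdf|doc|xls|ppt)
        # binary formats: let the SOLR-284 rich handler do the text extraction
        # (path and id assumed URL-safe; extra metadata fields would be added as in the URL above)
        curl -s "$SOLR/update/rich?stream.file=$FILE&stream.type=$EXT&id=$ID&stream.fieldname=data"
        ;;
      *)
        echo "skipping unsupported type: $FILE" >&2
        ;;
    esac
    # a separate <commit/> is still needed before the documents show up in searches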
Re: Indexing Doc, PDF, ... from filesystem (Newbie Question)
I can't find the documentation, but I believe Apache's max URL length is 8192, so I would assume a lot of other apps like Tomcat and Jetty would be similar. I haven't run into any problems yet. Maybe shoot Eric an email and see if he would be interested in adapting the code to take XML as well, so that you can just include the file location in the XML.

- Pete

On 8/21/07, Vish D. <[EMAIL PROTECTED]> wrote:
> On 8/21/07, Peter Manis <[EMAIL PROTECTED]> wrote:
> >
> > I am a little confused about how you have things set up. So these meta-data
> > files contain certain information, and there may or may not be a pdf,
> > xls, or doc associated with each one?
>
> Yes, you have it right.
>
> > If that is the case, if it were me I would write something to parse
> > the meta-data files, and if there is a binary file associated with one,
> > submit it using the URL I showed you. If the meta-data is just that
> > and has no associated document, submit it in XML form. The script
> > shouldn't be too complicated, but that depends on the complexity
> > of the meta-data you are parsing.
> >
> > To give you an idea how I use it: we have hundreds of documents in
> > PDF, DOC, XLS, HTML, TXT, CSV, and PPT formats. When a document is to
> > be indexed by Solr we look at the extension; if it is txt or
> > html/htm we read the data in and submit it with the XML handler. If
> > the document is one of the binary formats we submit it with the URL I
> > showed you. All information about these files is stored in a database,
> > and some of the 'documents' in the database are just links to external
> > documents. In that case we are only indexing a description, title,
> > and category.
> >
> > You are correct, it would overwrite the data by doing an update unless
> > you parsed the meta-data, and if you are parsing the meta-data you
> > might as well just parse it from the start and index once.
> >
> > How are you handling these meta-data files right now? Are they simply
> > xml files like in the solr example where you are just running the bash
> > script on them, or is something parsing the contents already?
>
> Yes, I am running a similar bash script to index these meta-data xml docs.
> The big downside in using the url way is that, for one thing, it has the
> characters limit (1024, is it?). So, if I had a lot of meta-data, or even a
> long description for a record, that might not work all that well. I am
> guessing you haven't run into this issue yet, right?
>
> > - Pete
>
> The proposed schema additions might not make sense for everyone, since the
> actual requirements might be more complex than just that (i.e., say you want
> to extract text, structure it in various elements, update your doc xml, and
> then index). But it goes well with Solr's search-engine-in-a-box
> perception, now with a full-text- prefix to it. Another way I can see it
> happen is to extend the default handler and still take in an xml doc, but
> look out for, say, a field name ''. From here on, within the handler,
> you can validate the filename, handle it any way you want (create extra
> elements, create '' for pdf files and '' for html files, etc..),
> etc... This strips out having to deal with if/else scripting outside of
> Solr.
>
> Rao
>
> > On 8/21/07, Vish D. <[EMAIL PROTECTED]> wrote:
> > > Pete,
> > >
> > > Thanks for the great explanation.
> > >
> > > Thinking it through my process, I am not sure how to use it:
> > >
> > > I have a bunch of docs that pretty much contain a lot of meta-data, some
> > > of which include full-text files (.pdf, .ppt, etc...). I use these docs
> > > correctly to index/update into Solr. The next step now is to somehow
> > > index the text from the full-text files. One way to think about it is,
> > > I could have a placeholder field 'data' and keep it empty for the first
> > > pass, and then run update/rich to index the actual full-text, but using
> > > the same unique doc id. But this would actually overwrite the doc in the
> > > index, won't it? And there really isn't a 'merge' operation, right?
> > >
> > > There might be a better way to use this full-text indexing option,
> > > schema-wise, say:
> > >
> > > - have a new option richData that will take in a source field name,
> > > - validate its value (valid filename/file),
> > > - recognize the file type,
> > > - and put the 'data' into another field
Re: Indexing HTML and other doc types
A coworker of mine posted the code that we used for adding pdf, doc, xls, etc. documents into Solr. You can find the files at the following location:

https://issues.apache.org/jira/browse/SOLR-284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel

Just apply the patch, put the lib files in the lib directory, run `ant compile`, yada yada, and you should be good to go. If the build fails, update to revision 552853; that is the latest revision I have compiled with the patch, so I know it works. Usually if the build fails it is something unrelated to Eric's code and will be fixed in a few new revisions.

- Peter Manis

On 7/3/07, Teruhiko Kurosaka <[EMAIL PROTECTED]> wrote:

Solr looks very good for indexing and searching structured data. But I noticed there is no tool in the Solr distribution with which documents of other doc types can be indexed. Are there other side projects that develop Solr clients for indexing documents of other doc types? Or is the generic full-text search really a wrong area to apply Solr, and should I be using something like Nutch?

-kuro
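[For anyone reproducing that build, the sequence would be roughly the following. Only the revision number (552853) and the `ant compile` step come from the message above; the repository path reflects where the Solr trunk lived at the time, and the patch/jar filenames are placeholders.]

    # check out the Solr trunk at the revision known to work with the patch
    svn checkout -r 552853 http://svn.apache.org/repos/asf/lucene/solr/trunk solr-trunk
    cd solr-trunk

    # apply the SOLR-284 patch and drop its jars into lib/ (filenames are placeholders)
    patch -p0 < /path/to/rich-document-handler.patch
    cp /path/to/extra-libs/*.jar lib/

    # build
    ant compile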
Re: Indexing HTML and other doc types
I guess I misread your original question. I believe Nutch would be the choice for crawling; however, I do not know about its abilities for indexing other document types. If you needed to index multiple document types such as PDF, DOC, etc. and Nutch does not provide functionality to do so, you would probably need to write a script or program that can feed the crawl results to Solr. I believe there is a python script somewhere that is simple and will crawl sites; it would of course need modification, but it would provide a starting point.

I have not worked with Nutch so I may be speaking incorrectly, but by having a separate script/application handle the crawl you may have more control over what is sent to Solr to be indexed. Nutch may already include a lot of functionality to process incoming content.

- Pete

On 7/5/07, Teruhiko Kurosaka <[EMAIL PROTECTED]> wrote:

Thank you, Otis and Peter, for your replies.

> From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]
> doc of some type -> parse content into various fields -> post to Solr

I understand this part, but the question is who should do this. I was under the assumption that it's the Solr client's job to crawl the net, read documents, parse them, put the contents into different fields (the "contents", title, author, date, URL, etc.), then post the result to Solr via HTTP in XML or CSV. And I was asking if there are open-source projects to build such clients.

Peter's approach is different; he adds the intelligence of parsing documents to Solr itself. (I guess the crawling has to be done by clients.) I wonder if this fits in the model that Solr has. Or is it just my illusion?

-kuro
Re: most popular/most commonly accessed records
Maybe create a snippet of code in the video information page so that if the page was accessed from search results it increments a counter in a database (sqlite, mysql, etc.). You can then update Solr every so often (daily, hourly, twice a day, etc.) and include the hits. This would then allow you to run a query on Solr similar to "q=video&sort=hits desc&rows=10" that would return the top 10 results. Depending on the number of videos and the data you are indexing, I think it would be a quick update.

If it is a lot of information, you could even set up a secondary Solr instance that only contains the information needed to build such queries. For example, if you were only returning the hits and the name+id (to build the link), the schema would be simple and updates should be quick, maybe even quick enough to update as the video is accessed. Someone more familiar with Solr would know better than I would how intense that would be on resources.

The first choice would probably give you more options, since you can include category breakdowns or other breakdowns within your query.

- Pete

On 7/6/07, Walter Underwood <[EMAIL PROTECTED]> wrote:

Solr doesn't have a record of what documents were accessed. The document cache shows which documents were in the parts of the search result list that were served, but probably not a count of those inclusions.

Luckily, this information is trivial to get from HTTP server access logs. Look for documents with a referrer that is the search page. Odd, I'm grepping our logs for that sort of data today.

wunder

On 7/6/07 6:59 AM, "Karen Loughran" <[EMAIL PROTECTED]> wrote:

> Hi all,
> Is there a way through solr to find out about "most commonly accessed" solr
> documents? So for example, my client may wish to list the top 10 most
> popular videos, based on previous accesses to them in the solr server db.
>
> If there are any solr features to help with this can someone point me to
> them? Had a browse through the user documentation, but can't see anything
> obvious?
>
> Many thanks
> Karen Loughran
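[A minimal sketch of the first suggestion above: push accumulated hit counts from a local database into Solr on a schedule, then sort on them. The SQLite table, the id/name/hits field names, and the host/port are all assumptions; the hits field would need to be a sortable type in the schema, and values are assumed XML-safe for brevity.]

    #!/bin/bash
    # periodic job: copy hit counts into Solr, commit, then query by popularity
    SOLR="http://localhost:8983/solr"

    sqlite3 hits.db "SELECT id, name, hits FROM video_hits;" | \
    while IFS='|' read -r ID NAME HITS; do
      curl -s "$SOLR/update" -H 'Content-Type: text/xml' --data-binary \
        "<add><doc><field name=\"id\">$ID</field><field name=\"name\">$NAME</field><field name=\"hits\">$HITS</field></doc></add>"
    done
    curl -s "$SOLR/update" -H 'Content-Type: text/xml' --data-binary '<commit/>'

    # the "top 10" query from the message above then looks like:
    curl "$SOLR/select?q=video&sort=hits+desc&rows=10"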