Images for the DataImportHandler page
There is some very useful information on the http://wiki.apache.org/solr/DataImportHandler page about indexing database contents, but the page contains three images whose links are broken. The descriptions of those images sound like it would be quite handy to see them in the page. Could someone please fix the links so the images are displayed? Thanks, Mike
Identifying common text in documents
I am looking for a way to identify blocks of text that occur in several documents in a corpus, for a research project with electronic medical records. These blocks can be sections copied and pasted from one document into another, text from an earlier email in the corpus that is repeated in a follow-up email, text templates that get inserted into groups of documents, or multiple occurrences of the same template within a single document. Any of these duplicated text blocks may contain minor differences from one instance to another.

I read in a document called "What's new in Solr 1.4" that Solr has supported duplicate text detection since 1.4, using the SignatureUpdateProcessor and TextProfileSignature classes. Can these be used to detect portions of documents that are alike or nearly alike, or are they intended to detect entire documents that are alike or nearly alike? Has additional support for duplicate detection been added to Solr since 1.4?

It seems like some of the features of Solr and Lucene, such as term positions and shingling, could help in finding sections of matching or nearly matching text in documents. Does anyone have any experience in this area that they would be willing to share?
Thanks,
Mike
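For reference, the SignatureUpdateProcessor support mentioned above is configured in solrconfig.xml as an update request processor chain. A minimal sketch, assuming a note_text source field and a signature field declared in schema.xml (both field names are illustrative), looks roughly like this:

    <updateRequestProcessorChain name="dedupe">
      <processor class="solr.processor.SignatureUpdateProcessorFactory">
        <bool name="enabled">true</bool>
        <str name="signatureField">signature</str>
        <bool name="overwriteDupes">false</bool>
        <str name="fields">note_text</str>
        <str name="signatureClass">solr.processor.TextProfileSignature</str>
      </processor>
      <processor class="solr.LogUpdateProcessorFactory"/>
      <processor class="solr.RunUpdateProcessorFactory"/>
    </updateRequestProcessorChain>

The signature is computed over the listed fields of each incoming document, so as configured here it flags whole documents (or whole fields) that are identical or nearly identical, rather than matching smaller blocks inside otherwise different documents.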
Getting started with indexing a database
I am trying to index the contents of a database for the first time, and only the primary key of the table represented by the top-level entity in my data-config.xml file is being indexed.

The database I am starting with has three tables:
- The table called docs has columns called doc_id, type and last_modified. The primary key is doc_id.
- The table called codes has columns called id, doc_id, origin, type, code and last_modified. The primary key is id. doc_id is a foreign key to the doc_id column in the docs table.
- The table called texts has columns called id, doc_id, origin, type, text and last_modified. The primary key is id. doc_id is a foreign key to the doc_id column in the docs table.

My data-config.xml file maps these tables to index fields using nested entities (a simplified sketch of this kind of configuration is below), and I added field definitions for DOC_ID and NOTE_TEXT to the schema.xml file.

When I run the full-import operation, only the DOC_ID values are written to the index. When I run a program that dumps the index contents as an XML string, the output contains nothing but the DOC_ID values.

Since this is new to me, I am sure that I have simply left something out or specified something the wrong way, but I haven't been able to spot what I have been doing wrong when I have gone over the configuration files that I am using. Can anyone help me figure out why the other database contents are not being indexed?
Thanks,
Mike
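A data-config.xml along the lines described above, reconstructed as a sketch (the JDBC driver, URL and credentials are placeholders, and only the docs and texts tables are wired up here), might look roughly like this:

    <dataConfig>
      <dataSource type="JdbcDataSource"
                  driver="com.microsoft.sqlserver.jdbc.SQLServerDriver"
                  url="jdbc:sqlserver://localhost;databaseName=records"
                  user="user" password="password"/>
      <document>
        <!-- one index document per row of the docs table -->
        <entity name="doc" query="SELECT doc_id, type, last_modified FROM docs">
          <field column="doc_id" name="DOC_ID"/>
          <!-- nested entity: note text for each doc from the texts table -->
          <entity name="text"
                  query="SELECT origin, type, text FROM texts WHERE doc_id='${doc.doc_id}'">
            <field column="text" name="NOTE_TEXT"/>
          </entity>
        </entity>
      </document>
    </dataConfig>

One thing worth checking with a setup like this is that the column attributes in the field elements match the case of the column labels the JDBC driver actually returns, since DataImportHandler's column-to-field mapping is case-sensitive.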
Setting up logging for a Solr project that isn't in tomcat/webapps/solr
I set up a Solr project to run with Tomcat for indexing the contents of a database by following a web tutorial that described how to put the project directory anywhere you want and then put a context file named after the project (in my case solr_db.xml) in the tomcat/conf/Catalina/localhost directory, containing a Context element that points at the project (a sketch of that file is below).

I got this working, and now I would like to create a logging.properties file for Solr only, as described in the Apache Solr Reference Guide distributed by Lucid. It says:

  To change logging settings for Solr only, edit tomcat/webapps/solr/WEB-INF/classes/logging.properties. You will need to create the classes directory and the logging.properties file. You can set levels from FINEST to SEVERE for a class or an entire package. Here are a couple of examples:
  org.apache.commons.digester.Digester.level = FINEST
  org.apache.solr.level = WARNING

I think this explanation assumes that the Solr project is in tomcat/webapps/solr. I tried putting a logging.properties file in various locations where I hoped Tomcat would pick it up, but none of them worked.

If I have a solr_db.xml file in tomcat/conf/Catalina/localhost that points to a Solr project in C:/projects/solr_apps/solr_db (created by copying the contents of the apache-solr-3.5.0/example/solr directory to C:/projects/solr_apps/solr_db and going from there), where is the right place to put a "Solr only" logging.properties file?
Thanks,
Mike
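As a sketch, the Tomcat context fragment used in this kind of setup looks roughly like this (the war location and directory paths are examples based on the layout described above):

    <?xml version="1.0" encoding="utf-8"?>
    <!-- saved as tomcat/conf/Catalina/localhost/solr_db.xml -->
    <Context docBase="C:/projects/solr_apps/solr_db/solr.war" debug="0" crossContext="true">
      <Environment name="solr/home" type="java.lang.String"
                   value="C:/projects/solr_apps/solr_db" override="true"/>
    </Context>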
Recovering from database connection resets in DataImportHandler
I am trying to use Solr's DataImportHandler to index a large number of database records in a SQL Server database that is owned and managed by a group we are collaborating with. The indexing jobs I have run so far, except for the initial very small test runs, have failed due to database connection resets. I have gotten indexing jobs to go further by using CachedSqlEntityProcessor and specifying responseBuffering=adaptive in the connection URL, but I think that in order to index this data I'm going to have to work out how to catch database connection reset exceptions and resubmit the queries that failed.

Can anyone suggest a good way to approach this? Or have any of you encountered this problem and worked out a solution to it already?
Thanks,
Mike
Using nested entities in FileDataSource import of xml file contents
Can anybody help me understand the right way to define a data-config.xml file with nested entities for indexing the contents of an XML file?

I used a data-config.xml file with nested entities to index a database containing sample patient records, and I would like to do the same thing with an XML file containing the same data as the database. For each doc, the XML file has a set of code elements (with values such as 786.2) and a set of text elements (with note text such as "Seventeen year old with cough." and "Normal.").

I tried a data-config.xml file that preserves the nested entity structure I used in the database case, but it is wrong, and it fails to index any of the code and text blocks in the XML file. I'm sure that part of the problem must be that the xpath expressions such as "/docs/doc[@id='${doc.doc_id}']/texts/text/@origin" fail to match anything in the XML file, because when I try the same import without nested entities, using a data-config.xml file with the same kind of xpath expressions, the code and text blocks are also not indexed.

However, when I use a data-config.xml file that doesn't use nested entities (along the lines of the sketch below), all of the fields are included in the index, but I don't think any correspondence is maintained between the code_origin, code_type and code_value field values and the note_origin, note_type and note_text field values that are grouped together in the input XML file.

It has taken me a while to get this far, and obviously I don't have it right yet. Can anybody help me define a data-config.xml file with nested entities for indexing an XML file?
Thanks,
Mike
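For reference, a flat single-entity data-config.xml along the lines of the last configuration described above might look roughly like this; the file path is a placeholder, and the element layout of the XML file is assumed from the xpath expressions mentioned in the message:

    <dataConfig>
      <dataSource type="FileDataSource" encoding="UTF-8"/>
      <document>
        <!-- one index document per doc element -->
        <entity name="doc" processor="XPathEntityProcessor"
                url="C:/data/patient_records.xml" forEach="/docs/doc">
          <field column="doc_id"      xpath="/docs/doc/@id"/>
          <field column="code_origin" xpath="/docs/doc/codes/code/@origin"/>
          <field column="code_type"   xpath="/docs/doc/codes/code/@type"/>
          <field column="code_value"  xpath="/docs/doc/codes/code"/>
          <field column="note_origin" xpath="/docs/doc/texts/text/@origin"/>
          <field column="note_type"   xpath="/docs/doc/texts/text/@type"/>
          <field column="note_text"   xpath="/docs/doc/texts/text"/>
        </entity>
      </document>
    </dataConfig>

With forEach="/docs/doc", the repeated code and text elements end up as parallel multi-valued fields on each doc's index document, which is what produces the loss of grouping described above.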
Do nested entities have a representation in Solr indexes?
The data-config.xml file that I have for indexing database contents has nested entity nodes within a document node, and each of the entities contains field nodes. Lucene indexes consist of documents that contain fields. What about entities? If you change the way entities are structured in a data-config.xml file, in what way (if any) does it change how the contents are stored in the index?

When I created the entities I am using and defined the fields in one of the inner entities to be multivalued, I thought that the fields of that entity type would be grouped logically somehow in the index. But then I remembered that Lucene doesn't have a concept of sub-documents (that I know of), so each of the field values will be added to a list, and the extent of the logical grouping would be that the field values that were indexed together would be at the same position in their respective lists.

Am I understanding this right, or do entities as defined in data-config.xml have some kind of representation in the index like document and field do?
Thanks,
Mike
RE: Recovering from database connection resets in DataImportHandler
Could you point me to the most non-intimidating introduction to SolrJ that you know of? I have a passing familiarity with Javascript and, with few exceptions, I haven't developed software that has a graphical user interface of any kind in about 25 years. I like the idea of having finer control over data imported from a database, though.
Thanks,
Mike

-----Original Message-----
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: Monday, February 13, 2012 6:19 AM
To: solr-user@lucene.apache.org
Subject: Re: Recovering from database connection resets in DataImportHandler

I'd seriously consider using SolrJ and your favorite JDBC driver instead. It's actually quite easy to create one, although as always it may be a bit intimidating to get started. This allows you much finer control over error conditions than DIH does, so may be more suited to your needs.

Best
Erick

On Sat, Feb 11, 2012 at 2:40 AM, Mike O'Leary wrote:
> I am trying to use Solr's DataImportHandler to index a large number of
> database records in a SQL Server database that is owned and managed by a
> group we are collaborating with. The indexing jobs I have run so far, except
> for the initial very small test runs, have failed due to database connection
> resets. I have gotten indexing jobs to go further by using
> CachedSqlEntityProcessor and specifying responseBuffering=adaptive in the
> connection URL, but I think that in order to index this data I'm going to have to
> work out how to catch database connection reset exceptions and resubmit the
> queries that failed. Can anyone suggest a good way to approach this? Or
> have any of you encountered this problem and worked out a solution to it
> already?
> Thanks,
> Mike
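A bare-bones SolrJ-plus-JDBC loop along the lines Erick suggests, written against the Solr 3.x API, might look like the sketch below. The Solr URL, connection URL, table, column and field names are all placeholders:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class DbIndexer {
        public static void main(String[] args) throws Exception {
            // Solr core URL and JDBC settings are examples only
            SolrServer solr = new CommonsHttpSolrServer("http://localhost:8080/solr");
            Connection conn = DriverManager.getConnection(
                    "jdbc:sqlserver://dbhost;databaseName=records", "user", "password");
            Statement stmt = conn.createStatement();
            ResultSet rs = stmt.executeQuery("SELECT doc_id, text FROM texts");
            while (rs.next()) {
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("doc_id", rs.getString("doc_id"));
                doc.addField("note_text", rs.getString("text"));
                solr.add(doc);   // a connection reset here can be caught and the row retried
            }
            solr.commit();
            rs.close();
            stmt.close();
            conn.close();
        }
    }

Because the JDBC and SolrJ calls are ordinary Java code, a connection reset can be caught around the query or around individual adds and retried, which is the finer control over error conditions that DIH doesn't offer.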
Is there a way to write a DataImportHandler deltaQuery that compares contents still to be imported to contents in the index?
I am working on indexing the contents of a database that I don't have permission to alter. The DataImportHandler examples that show how to specify a deltaQuery attribute value use database tables that have a last_modified column, and the deltaQuery compares those values with the last_index_time value stored in the dataimport.properties file. The tables in the database I am working with don't have anything like a last_modified column.

An indexing job I was running yesterday failed, and I would like to restart it so that it only imports the data that it hasn't already indexed. As a one-off, I could create a list of the keys of the database records that have been indexed and hack in something that reads that list as part of how it figures out what to index, but I was wondering if there is something built in that would allow me to do the same kind of comparison in a likely far more elegant way.

What kinds of information do the deltaQuery attributes have access to, apart from the database tables, columns, etc., and do they have access to any information that would help me with what I want to do?
Thanks,
Mike

P.S. While we're on the subject of delta... attributes, can someone explain to me what the difference is between the deltaQuery and the deltaImportQuery attributes?
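For comparison, the usual shape of a delta-enabled entity (following the style of the wiki examples, which rely on a last_modified column that this particular database doesn't have) is roughly:

    <entity name="doc" pk="doc_id"
            query="SELECT doc_id, type FROM docs"
            deltaQuery="SELECT doc_id FROM docs
                        WHERE last_modified > '${dataimporter.last_index_time}'"
            deltaImportQuery="SELECT doc_id, type FROM docs
                              WHERE doc_id = '${dataimporter.delta.doc_id}'">
      <field column="doc_id" name="DOC_ID"/>
    </entity>

In this arrangement, deltaQuery only has to return the primary keys of the rows that have changed since the last import, and deltaImportQuery is then run once per returned key (via ${dataimporter.delta.doc_id}) to fetch the full rows to index, which is the difference asked about in the P.S.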
RE: Indexing taking so much time to complete.
What's your secret? OK, that question is not the kind recommended in the UsingMailingLists suggestions, so I will write again soon with a description of my data and what I am trying to do, and ask more specific questions. And I don't mean to hijack the thread, but I am in the same boat as the poster.

I just started working with Solr less than two months ago, and after beginning with a completely naïve approach to indexing database contents with DataImportHandler and then making small adjustments to improve performance as I learned about them, I have gotten some smaller datasets to import in a reasonable amount of time, but the 60GB data set that I will need to index for the project I am working on would take over three days to import using the configuration that I have now. Obviously you're doing something different than I am...

What things would you say have made the biggest improvement in indexing performance with the 32GB data set that you mentioned? How long do you think it would take to index that same data set if you used Solr more or less out of the box with no attempts to improve its performance?
Thanks,
Mike

-----Original Message-----
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: Saturday, February 25, 2012 2:51 PM
To: solr-user@lucene.apache.org
Subject: Re: Indexing taking so much time to complete.

You have to tell us a lot more about what you're trying to do. I can import 32G in about 20 minutes, so obviously you're doing something different than I am...

Perhaps you might review: http://wiki.apache.org/solr/UsingMailingLists

Best
Erick

On Sat, Feb 25, 2012 at 12:00 AM, Suneel wrote:
> Hi All,
>
> I am using Apache solr 3.1 and trying to caching 50 gb records but it
> is taking more then 20 hours this is very painful to update records.
>
> 1. Is there any way to reduce caching time or this time is ok for 50
> gb records ?.
>
> 2. What is the delta-import, this will be helpful for me cache only
> updated record not rather then caching all records ?.
>
> Please help me in above mentioned question.
>
> Thanks & Regards,
>
> -
> Suneel Pandey
> Sr. Software Developer
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Indexing-taking-so-much-time-to-complete-tp3774464p3774464.html
> Sent from the Solr - User mailing list archive at Nabble.com.
Including an attribute value from a higher level entity when using DIH to index an XML file
I have an XML file that I would like to index, in which each user element has an id attribute (an [id-num] value) and contains a series of message elements holding [message text].

I would like the documents in the index to correspond to the messages in the XML file, and to have the user's [id-num] value stored as a field in each of that user's documents. I think this means that I have to define an entity for message with forEach="/data/user/message/", but I don't know where to put the field definition for the user id. It looks like I can't put it within the message entity, because the entity is defined with forEach="/data/user/message/" and the id field's xpath value is outside of the entity's scope; putting the id field definition there causes a null pointer exception.

I don't think I want to create a "user" entity that the "message" entity is nested inside of, or is there a way to do that and still have the index documents correspond to messages from the file? Are there one or more attributes or attribute values that I haven't run across in my searching that provide a way to do what I need to do?
Thanks,
Mike
RE: Including an attribute value from a higher level entity when using DIH to index an XML file
I found an answer to my question, but it comes with a cost.

With an XML file like the one I described (user elements carrying an id attribute, each containing message elements with [message text]), I can index the user id as a field in the documents that represent each of the user's messages by marking the id field with commonField="true" (a sketch of the entity definition is below). I didn't realize that commonField would work for cases in which the previously encountered field is in an element that encompasses the other elements, but it does.

The forEach value has to be "/data/user/message | /data/user" in order for the user id to be located, since it is not under /data/user/message. By specifying forEach="/data/user/message | /data/user" I am saying that each /data/user or /data/user/message element is a document in the index, but I don't really want /data/user elements to be treated this way. As luck would have it, those documents are filtered out, but only because date and text are required fields, and they have not been assigned values yet when a document is created for a /data/user element, so an exception is thrown. I could live with this, but it's kind of ugly. I don't see any other way of doing what I need to do with embedded XML elements, though.

I tried creating nested entities in the data-config file, but each one of them is required to have a url attribute, and I think that caused the input file to be read twice. The only other possibility I could see from reading the DataImportHandler documentation was to specify an XSL file and change the XML file's structure so that the user id attribute is moved down to be an attribute of the message element. I'm not sure it's worth doing something like that for what seems like a small problem, and I wonder how much it would slow down the importing of a large XML file.

Are there any other ways of handling cases like this, where an attribute of an outer element is to be included in an index document that corresponds to an element nested inside it?
Thanks,
Mike

-----Original Message-----
From: Mike O'Leary [mailto:tmole...@uw.edu]
Sent: Friday, March 02, 2012 3:30 PM
To: Solr-User (solr-user@lucene.apache.org)
Subject: Including an attribute value from a higher level entity when using DIH to index an XML file

I have an XML file that I would like to index, in which each user element has an id attribute (an [id-num] value) and contains a series of message elements holding [message text]. I would like the documents in the index to correspond to the messages in the XML file, and to have the user's [id-num] value stored as a field in each of that user's documents. I think this means that I have to define an entity for message with forEach="/data/user/message/", but I don't know where to put the field definition for the user id. It looks like I can't put it within the message entity, because the entity is defined with forEach="/data/user/message/" and the id field's xpath value is outside of the entity's scope; putting the id field definition there causes a null pointer exception. I don't think I want to create a "user" entity that the "message" entity is nested inside of, or is there a way to do that and still have the index documents correspond to messages from the file? Are there one or more attributes or attribute values that I haven't run across in my searching that provide a way to do what I need to do?
Thanks,
Mike
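A sketch of the entity definition described above (the file path and the date attribute are assumptions; the forEach union and the commonField usage are as described in the message):

    <dataConfig>
      <dataSource type="FileDataSource" encoding="UTF-8"/>
      <document>
        <entity name="message" processor="XPathEntityProcessor"
                url="C:/data/messages.xml"
                forEach="/data/user/message | /data/user">
          <!-- commonField carries the most recently seen user id forward into
               the documents built for the message elements that follow it -->
          <field column="user_id" xpath="/data/user/@id" commonField="true"/>
          <field column="date"    xpath="/data/user/message/@date"/>
          <field column="text"    xpath="/data/user/message"/>
        </entity>
      </document>
    </dataConfig>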
SolrJ updating indexed documents?
I am working on a component for indexing documents from a database that contains medical records. The information is organized across several tables, and I am supposed to index records for varying sizes of sets of patients for others to do IR experiments with. Each patient record has one or more main documents associated with it, and each main document has zero or more addenda associated with it. (The main documents and addenda are treated alike for the most part, except for a parent record field that is null for main documents and has the number of a main document for addenda. Addenda cannot have addenda.) Also, each main document has one or more diagnosis records. I am trying to figure out the best performing way to select all of the records for each patient, including the main documents, addenda and diagnoses.

I tried indexing sets of these records using DataImportHandler and nested entity blocks in a way similar to the Full Import example on the http://wiki.apache.org/solr/DataImportHandler page, with a select for all patients and main records in a data set, and nested selects that get all of the addenda and all of the diagnoses for each patient. It didn't run very fast, and a database resource person who looked into it with me said that issuing a million SQL queries for addenda and a million queries for diagnoses, one each for the million patient documents in a typical set of 10,000 patients, was very inefficient, and that I should look for a different way of getting the data.

I switched to using SolrJ, and I am trying to figure out which of two ways to use to index this data. One would be to use one large SQL statement to get all of the data for a patient set. The results would contain duplication, due to the way the tables are joined together, that I would need to sort out in the Java code, but that is doable. The other way would be to:

1. Get all of the main document data with one SQL query, create index documents with the data they contain and store them in the index.
2. Issue another SQL query that gets all of the addenda for all of the patients in the data set, along with an id number for each one that tells which main document an addendum belongs with; retrieve the main documents from the index, add the addenda fields to them and put them back in the index.
3. Do the same with the diagnosis data.

It would be great to be able to keep the main document data that is retrieved from the database in a hash table, update each of those objects with addenda and diagnoses, and write completely filled out documents to the index once, but I don't have enough memory available to do this for the patient sets I am working with now, and they want this indexing process to scale up to patient sets that are ten times as large and eventually much larger than that.

Essentially, for the second approach I am wondering if a Lucene index can be made to serve as a hash table for storing intermediate results, and whether SolrJ has an API for retrieving individual index documents so they can be updated. Basically it would be shifting from iterating over SQL queries to iterating over Lucene index updates. If this way of doing things is also likely to be slow, or the SolrJ API doesn't provide a way to do this, or there are other problems with it, I can go with selecting all of the data in one large query and dealing with the duplication.
Thanks,
Mike
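For the second approach, a read-modify-rewrite step with SolrJ (Solr 3.x API) would look something like the sketch below; the field names and the surrounding class are illustrative, and every field involved has to be a stored field, since only stored values can be read back out of the index:

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.common.SolrDocument;
    import org.apache.solr.common.SolrInputDocument;

    public class AddendumUpdater {
        // Reads a previously indexed main document back out of the index, copies its
        // stored fields, adds an addendum field, and re-adds it. Re-adding a document
        // with the same unique key replaces the earlier version.
        public static void appendAddendum(SolrServer solr, String parentId, String addendumText)
                throws Exception {
            SolrDocument existing =
                    solr.query(new SolrQuery("doc_id:" + parentId)).getResults().get(0);
            SolrInputDocument updated = new SolrInputDocument();
            for (String field : existing.getFieldNames()) {
                for (Object value : existing.getFieldValues(field)) {
                    updated.addField(field, value);
                }
            }
            updated.addField("addendum_text", addendumText);
            solr.add(updated);
        }
    }

Two caveats with this pattern: the main documents from step 1 have to be committed before they can be queried back out, and Solr 3.x has no partial updates, so each addendum and diagnosis causes the whole document to be rewritten.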
waitFlush and waitSearcher with SolrServer.add(docs, commitWithinMs)
If you index a set of documents with SolrJ and use StreamingUpdateSolrServer.add(Collection<SolrInputDocument> docs, int commitWithinMs), it will perform a commit within the time specified, and it seems to use default values for waitFlush and waitSearcher.

Is there a place where you can specify different values for waitFlush and waitSearcher, or, if you want to use different values, do you have to call StreamingUpdateSolrServer.add(Collection<SolrInputDocument> docs) and then call StreamingUpdateSolrServer.commit(waitFlush, waitSearcher) explicitly?
Thanks,
Mike
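In code, the choice being asked about is between these two patterns (a sketch, where server is a StreamingUpdateSolrServer and docs is a Collection<SolrInputDocument>):

    // let Solr commit on its own within 10 seconds of the add, with the default flags
    server.add(docs, 10000);

    // or add and then commit explicitly, choosing the flags yourself
    server.add(docs);
    server.commit(false, false);   // waitFlush=false, waitSearcher=false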
RE: waitFlush and waitSearcher with SolrServer.add(docs, commitWithinMs)
I am indexing some database contents using add(docs, commitWithinMs), and those add calls are taking over 80% of the time once the database begins returning results. I was wondering if setting waitSearcher to false would speed this up. Many of the calls take 1 to 6 seconds, with one outlier that took over 11 minutes.
Thanks,
Mike

-----Original Message-----
From: Mark Miller [mailto:markrmil...@gmail.com]
Sent: Wednesday, April 04, 2012 4:15 PM
To: solr-user@lucene.apache.org
Subject: Re: waitFlush and waitSearcher with SolrServer.add(docs, commitWithinMs)

On Apr 4, 2012, at 6:50 PM, Mike O'Leary wrote:

> If you index a set of documents with SolrJ and use
> StreamingUpdateSolrServer.add(Collection<SolrInputDocument> docs, int
> commitWithinMs), it will perform a commit within the time specified, and it
> seems to use default values for waitFlush and waitSearcher.
>
> Is there a place where you can specify different values for waitFlush
> and waitSearcher, or if you want to use different values do you have
> to call StreamingUpdateSolrServer.add(Collection<SolrInputDocument>
> docs) and then call StreamingUpdateSolrServer.commit(waitFlush, waitSearcher)
> explicitly?
> Thanks,
> Mike

waitFlush actually does nothing in recent versions of Solr. waitSearcher doesn't seem so important when the commit is not done explicitly by the user or a client.

- Mark Miller
lucidimagination.com
RE: waitFlush and waitSearcher with SolrServer.add(docs, commitWithinMs)
First of all, what I was seeing was different from what I thought I was seeing, because a few weeks ago I uncommented the autoCommit block in the solrconfig.xml file and didn't realize it until yesterday just before I went home, so that block was controlling the commits more than the add and commit calls that I was making. When I commented that block out again, the times for indexing with add(docs, commitWithinMs) and with add(docs) and commit(false, false) were very similar. Both of them were about 20 minutes faster (38 minutes instead of about an hour) than indexing with autoCommit set to commit after every 1,000 documents or fifteen minutes.

Is this the blog post you are talking about: http://www.searchworkings.org/blog/-/blogs/gimme-all-resources-you-have-i-can-use-them!/? It seems to be about the right topic.

I am using Solr 3.5. The feature matrix on one of the Lucid Imagination web pages says that DocumentsWriterPerThread is available in Solr 4.0 and LucidWorks 2.0. I assume that means LucidWorks Enterprise. Is that right?
Thanks,
Mike

-----Original Message-----
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: Thursday, April 05, 2012 2:45 PM
To: solr-user@lucene.apache.org
Subject: Re: waitFlush and waitSearcher with SolrServer.add(docs, commitWithinMs)

Solr version? I suspect your outlier is due to merging segments, if so this should have happened quite some time into the run. See Simon Willnauer's blog post on the DocumentsWriterPerThread (trunk) code.

What commitWithin time are you using?

Best
Erick

On Wed, Apr 4, 2012 at 7:50 PM, Mike O'Leary wrote:
> I am indexing some database contents using add(docs, commitWithinMs), and
> those add calls are taking over 80% of the time once the database begins
> returning results. I was wondering if setting waitSearcher to false would
> speed this up. Many of the calls take 1 to 6 seconds, with one outlier that
> took over 11 minutes.
> Thanks,
> Mike
>
> -----Original Message-----
> From: Mark Miller [mailto:markrmil...@gmail.com]
> Sent: Wednesday, April 04, 2012 4:15 PM
> To: solr-user@lucene.apache.org
> Subject: Re: waitFlush and waitSearcher with SolrServer.add(docs,
> commitWithinMs)
>
> On Apr 4, 2012, at 6:50 PM, Mike O'Leary wrote:
>
>> If you index a set of documents with SolrJ and use
>> StreamingUpdateSolrServer.add(Collection<SolrInputDocument> docs, int
>> commitWithinMs), it will perform a commit within the time specified, and it
>> seems to use default values for waitFlush and waitSearcher.
>>
>> Is there a place where you can specify different values for waitFlush
>> and waitSearcher, or if you want to use different values do you have
>> to call StreamingUpdateSolrServer.add(Collection<SolrInputDocument>
>> docs) and then call StreamingUpdateSolrServer.commit(waitFlush,
>> waitSearcher) explicitly?
>> Thanks,
>> Mike
>
> waitFlush actually does nothing in recent versions of Solr. waitSearcher
> doesn't seem so important when the commit is not done explicitly by the user
> or a client.
>
> - Mark Miller
> lucidimagination.com
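The autoCommit block referred to above, with the values described (commit after every 1,000 documents or every fifteen minutes), sits inside the updateHandler section of solrconfig.xml and would look like this:

    <autoCommit>
      <maxDocs>1000</maxDocs>
      <maxTime>900000</maxTime> <!-- fifteen minutes, in milliseconds -->
    </autoCommit>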
DocumentsWriterPerThread sample code?
Does anyone know of sample code that illustrates how to use the DocumentsWriterPerThread class in indexing?
Thanks,
Mike
Writing index files that have the right owner
I have been putting together an application that uses Quartz to run several indexing jobs in sequence using SolrJ and Tomcat on Windows. I would like the Quartz job to do the following:

1. Delete the index directories from the cores, so each indexing job starts fresh with empty indexes to populate.
2. Start the Tomcat server.
3. Run the indexing job.
4. Stop the Tomcat server.
5. Copy the index directories to an archive.

Steps 2-5 work fine, but I haven't been able to find a way to delete the index directories from within Java. I also can't delete them from a Windows command shell window: I get an error message that says "Access is denied". The reason for this is that the index directories and files have the owner "BUILTIN\Administrators". Although I am an administrator on this machine, the fact that these files have a different owner means that I can only delete them in a Windows command shell window if I start it with "Run as administrator".

I spent a bunch of time today trying every Java function and Windows shell command I could find that would let me change the owner of these files, grant my user account the capability to delete the files, and so on. Nothing I tried worked, likely because along with not having permission to delete the files, I also don't have permission to give myself permission to delete the files.

At a certain point I stopped wondering how to change the files' owner or permissions and started wondering why the files have "BUILTIN\Administrators" as owner, and the permissions associated with that owner, in the first place. Is there somewhere in the Solr or Tomcat configuration files, or in the SolrJ code, where I can set who the owner of files written to the index directories should be?
Thanks,
Mike
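On the "delete the index directories from within Java" step, the plain-Java approach is a recursive delete along these lines (a sketch; the class name is illustrative). It still runs into the "Access is denied" problem described above if the JVM's user doesn't have permission to delete the files, since File.delete simply returns false in that case:

    import java.io.File;

    public class IndexCleaner {
        // Recursively deletes a directory tree; returns false if anything could not be deleted.
        public static boolean deleteRecursively(File dir) {
            boolean ok = true;
            File[] children = dir.listFiles();
            if (children != null) {
                for (File child : children) {
                    ok &= deleteRecursively(child);
                }
            }
            return dir.delete() && ok;
        }
    }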
Problem with Solr not finding a class that is in lucene-analyzers.jar
I have been running Solr with Tomcat, and I recently wrote a Quartz program that starts and stops Tomcat, starts Solr indexing jobs, and does a few other things. When I start Tomcat programmatically in this way, Solr starts initializing, and when it hits the text_ws field type in schema.xml it throws an exception saying that it can't find the SynonymFilter class. text_ws refers to solr.SynonymFilterFactory, which must need to find lucene.SynonymFilter, and I am guessing it is the first Lucene class encountered while initializing the schema that isn't in lucene-core.jar.

I thought it would be easy to fix this by looking through the Solr config files for the place that specifies where to look for Lucene jar files, and checking to make sure that both the lucene-core and lucene-analyzers jar files are there. I see that there is a luceneMatchVersion line in the solrconfig.xml file that says LUCENE_36, but not a line that says to look for the Lucene jar files in a particular directory. Are the Lucene jar files packaged in the solr.war file? I also looked for directories that contain Lucene jar files within my Solr project, which is called tiudocumentsearch, and the one I found was tomcat/work/Catalina/localhost/tiudocumentsearch/WEB-INF/lib, but the lucene-core and lucene-analyzers jar files are both there.

So the two things I am asking for help in figuring out are how to indicate to Solr where the lucene-analyzers.jar file is, so it can find the SynonymFilter class during initialization, and why this exception isn't thrown when I start Tomcat for this Solr project in a command prompt window, but it occurs when I start Tomcat from a Java application. I am using Solr and Lucene 3.6.
Thank you for any help or suggestions you can provide,
Mike
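On the first question, one way to point a core at additional jar directories is the lib directive in solrconfig.xml, along these lines (the dir value is relative to the core's instance directory and is only an example here):

    <lib dir="../lib/" regex="lucene-analyzers-.*\.jar"/>

By default Solr also loads every jar it finds in a lib directory under the core's instance directory, so another option is to place the lucene-analyzers jar there.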