Images for the DataImportHandler page

2011-12-09 Thread Mike O'Leary
There is some very useful information on the 
http://wiki.apache.org/solr/DataImportHandler page about indexing database 
contents, but the page contains three images whose links are broken. The 
descriptions of those images make it sound like it would be quite handy to see them on the page. Could someone please fix the links so the images are displayed?
Thanks,
Mike


Identifying common text in documents

2011-12-24 Thread Mike O'Leary
I am looking for a way to identify blocks of text that occur in several 
documents in a corpus for a research project with electronic medical records. 
These include copied-and-pasted sections inserted into another document, text 
from a previous email in the corpus that is repeated in a follow-up email, text 
templates that get inserted into groups of documents, and occurrences of the 
same template more than once in the same document. Any of these duplicated text 
blocks may contain minor differences from one instance to another.

I read in a document called "What's new in Solr 1.4" that there has been 
support since 1.4 came out for duplicate text detection using the 
SignatureUpdateProcessor and TextProfileSignature classes. Can these be used to 
detect portions of documents that are alike or nearly alike, or are they 
intended to detect entire documents that are alike or nearly alike? Has 
additional support for duplicate detection been added to Solr since 1.4? It 
seems like some of the features of Solr and Lucene such as term positions and 
shingling could help in finding sections of matching or nearly matching text in 
documents. Does anyone have any experience in this area that they would be 
willing to share?
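
For reference, the deduplication support added in 1.4 is enabled through an update processor chain in solrconfig.xml, roughly like the sketch below (adapted from the Solr wiki; the signature field name and the list of fields to hash are assumptions). Note that the signature is computed once per document from the listed fields:

<updateRequestProcessorChain name="dedupe">
  <processor class="solr.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="signatureField">signature</str>
    <bool name="overwriteDupes">false</bool>
    <!-- fields that contribute to the fuzzy, per-document signature -->
    <str name="fields">note_text</str>
    <str name="signatureClass">org.apache.solr.update.processor.TextProfileSignature</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
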
Thanks,
Mike


Getting started with indexing a database

2012-01-09 Thread Mike O'Leary
I am trying to index the contents of a database for the first time, and only the primary key of the table represented by the top-level entity in my data-config.xml file is being indexed. The database I am starting with has three tables:

The table called docs has columns called doc_id, type and last_modified. The 
primary key is doc_id.
The table called codes has columns called id, doc_id, origin, type, code and 
last_modified. The primary key is id. doc_id is a foreign key to the doc_id 
column in the docs table.
The table called texts has columns called id, doc_id, origin, type, text and 
last_modified. The primary key is id. doc_id is a foreign key to the doc_id 
column in the docs table.

My data-config.xml file looks like this:

[data-config.xml contents stripped by the mailing list archive]
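
For readers following along, a data-config.xml for the three tables described above would typically look something like the sketch below (the JDBC URL, driver, and exact entity and field names are assumptions, not a reconstruction of the lost original):

<dataConfig>
  <dataSource type="JdbcDataSource"
              driver="com.microsoft.sqlserver.jdbc.SQLServerDriver"
              url="jdbc:sqlserver://dbhost;databaseName=records"
              user="user" password="password"/>
  <document>
    <entity name="doc" pk="doc_id"
            query="SELECT doc_id, type, last_modified FROM docs">
      <field column="doc_id" name="DOC_ID"/>
      <entity name="text"
              query="SELECT origin, type, text FROM texts WHERE doc_id = '${doc.doc_id}'">
        <field column="text" name="NOTE_TEXT"/>
      </entity>
      <entity name="code"
              query="SELECT origin, type, code FROM codes WHERE doc_id = '${doc.doc_id}'">
        <field column="code" name="code_value"/>
      </entity>
    </entity>
  </document>
</dataConfig>
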
I added these lines to the schema.xml file:

[schema.xml additions stripped by the mailing list archive; surviving fragments:]
...
DOC_ID
NOTE_TEXT

When I run the full-import operation, only the DOC_ID values are written to the 
index. When I run a program that dumps the index contents as an xml string, the 
output looks like this:

[index dump stripped by the mailing list archive]

Since this is new to me, I am sure that I have simply left something out or 
specified something the wrong way, but I haven't been able to spot what I have 
been doing wrong when I have gone over the configuration files that I am using. 
Can anyone help me figure out why the other database contents are not being 
indexed?
Thanks,
Mike



Setting up logging for a Solr project that isn't in tomcat/webapps/solr

2012-02-10 Thread Mike O'Leary
I set up a Solr project to run with Tomcat for indexing contents of a database by following a web tutorial that described how to put the project directory anywhere you want and then put a context file, named after the project (solr_db.xml in my case), in the tomcat/conf/Catalina/localhost directory with contents like this:

[context file contents stripped by the mailing list archive]
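
A typical context fragment of this kind looks roughly like the following (the paths come from the description below, but the exact attributes are assumptions):

<Context docBase="C:/projects/solr_apps/solr_db/solr.war" debug="0" crossContext="true">
  <Environment name="solr/home" type="java.lang.String"
               value="C:/projects/solr_apps/solr_db" override="true"/>
</Context>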

I got this working, and now I would like to create a logging.properties file 
for Solr only, as described in the Apache Solr Reference Guide distributed by 
Lucid. It says:

To change logging settings for Solr only, edit 
tomcat/webapps/solr/WEB-INF/classes/logging.properties. You will need to create 
the classes directory and the logging.properties file. You can set levels from 
FINEST to SEVERE for a class or an entire package. Here are a couple of 
examples:
org.apache.commons.digester.Digester.level = FINEST
org.apache.solr.level = WARNING

I think this explanation assumes that the Solr project is in 
tomcat/webapps/solr. I tried putting a logging.properties file in various 
locations where I hoped Tomcat would pick it up, but none of them worked. If I 
have a solr_db.xml file in tomcat/conf/Catalina/localhost that points to a Solr 
project in C:/projects/solr_apps/solr_db (that was created by copying the 
contents of the apache-solr-3.5.0/example/solr directory to 
C:/projects/solr_apps/solr_db and going from there), where is the right place 
to put a "Solr only" logging.properties file?
Thanks,
Mike


Recovering from database connection resets in DataimportHandler

2012-02-10 Thread Mike O'Leary
I am trying to use Solr's DataImportHandler to index a large number of database 
records in a SQL Server database that is owned and managed by a group we are 
collaborating with. The indexing jobs I have run so far, except for the initial 
very small test runs, have failed due to database connection resets. I have 
gotten indexing jobs to go further by using CachedSqlEntityProcessor and 
specifying responseBuffering=adaptive in the connection url, but I think in 
order to index that data I'm going to have to work out how to catch database 
connection reset exceptions and resubmit the queries that failed. Can anyone suggest a good way to approach this? Or have any of you encountered this 
problem and worked out a solution to it already?
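
For anyone hitting the same resets: the responseBuffering setting mentioned above goes on the JDBC url of the DIH dataSource, roughly like this (server name, database, and credentials are placeholders):

<dataSource type="JdbcDataSource"
            driver="com.microsoft.sqlserver.jdbc.SQLServerDriver"
            url="jdbc:sqlserver://dbhost;databaseName=records;responseBuffering=adaptive"
            user="user" password="password"/>
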
Thanks,
Mike


Using nested entities in FileDataSource import of xml file contents

2012-02-17 Thread Mike O'Leary
Can anybody help me understand the right way to define a data-config.xml file 
with nested entities for indexing the contents of an XML file?

I used this data-config.xml file to index a database containing sample patient 
records:

[data-config.xml contents stripped by the mailing list archive]

I would like to do the same thing with an XML file containing the same data as 
is in the database. That XML file looks like this:

[XML markup stripped by the mailing list archive; only the element text survived:]
  786.2
  786.2
  786.2
  786.2
  Seventeen year old with cough.
  Normal.
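
Judging from the xpath expressions quoted below and the surviving text, the stripped sample presumably had a shape along these lines (element and attribute names are inferred, not the original markup):

<docs>
  <doc id="1">
    <codes>
      <code origin="..." type="...">786.2</code>
      <code origin="..." type="...">786.2</code>
    </codes>
    <texts>
      <text origin="..." type="...">Seventeen year old with cough.</text>
      <text origin="..." type="...">Normal.</text>
    </texts>
  </doc>
</docs>
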
I tried using this data-config.xml file, in order to preserve the nested entity 
structure used with the database case:

[data-config.xml contents stripped by the mailing list archive]

This is wrong, and it fails to index any of the code and text blocks in the XML file. I'm sure that part of the problem must be that the xpath expressions such as "/docs/doc[@id='${doc.doc_id}']/texts/text/@origin" fail to match anything in the XML file, because when I try the same import without nested entities, using this data-config.xml file, the code and text blocks are also not indexed:

[data-config.xml contents stripped by the mailing list archive]

However, when I use this data-config.xml file, which doesn't use nested 
entities, all of the fields are included in the index:

[data-config.xml contents stripped by the mailing list archive]

but I don't think any correspondence is maintained between the code_origin, 
code_type and code_value field values and the note_origin, note_type and 
note_text field values that are grouped together in the input XML file.
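
For reference, a flat configuration of the kind described above, with all fields declared on a single XPathEntityProcessor entity, would look something like this sketch (the file path and exact xpaths are assumptions):

<dataConfig>
  <dataSource type="FileDataSource" encoding="UTF-8"/>
  <document>
    <entity name="doc"
            processor="XPathEntityProcessor"
            url="C:/data/sample_docs.xml"
            forEach="/docs/doc"
            stream="true">
      <field column="doc_id"      xpath="/docs/doc/@id"/>
      <field column="code_origin" xpath="/docs/doc/codes/code/@origin"/>
      <field column="code_type"   xpath="/docs/doc/codes/code/@type"/>
      <field column="code_value"  xpath="/docs/doc/codes/code"/>
      <field column="note_origin" xpath="/docs/doc/texts/text/@origin"/>
      <field column="note_type"   xpath="/docs/doc/texts/text/@type"/>
      <field column="note_text"   xpath="/docs/doc/texts/text"/>
    </entity>
  </document>
</dataConfig>

With multivalued fields in schema.xml, each document then carries parallel lists of code_* and note_* values, which is the loss of grouping described above.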

It has taken me a while to get this far, and obviously I don't have it right 
yet. Can anybody help me define a data-config.xml file with nested entities for 
indexing an XML file?
Thanks,
Mike


Do nested entities have a representation in Solr indexes?

2012-02-22 Thread Mike O'Leary
The data-config.xml file that I have for indexing database contents has nested 
entity nodes within a document node, and each of the entities contains field 
nodes. Lucene indexes consist of documents that contain fields. What about 
entities? If you change the way entities are structured in a data-config.xml 
file, in what way (if any) does it change how the contents are stored in the 
index. When I created the entities I am using, and defined the fields in one of 
the inner entities to be multivalued, I thought that the fields of that entity 
type would be grouped logically somehow in the index, but then I remembered 
that Lucene doesn't have a concept of sub-documents (that I know of), so each 
of the field values will be added to a list, and the extent of the logical 
grouping would be that the field values that were indexed together would be at 
the same position in their respective lists. Am I understanding this right, or 
do entities as defined in data-config.xml have some kind of representation in 
the index like document and field do?
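
To illustrate the "parallel lists" reading above with made-up values, a document whose child-entity fields are multivalued comes back from the index looking roughly like this, and the only grouping left is that values from the same child row share a position in each list:

<doc>
  <str name="DOC_ID">12345</str>
  <arr name="code_value"><str>786.2</str><str>486</str></arr>
  <arr name="code_type"><str>admit</str><str>discharge</str></arr>
  <arr name="code_origin"><str>clinic</str><str>ward</str></arr>
</doc>
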
Thanks,
Mike


RE: Recovering from database connection resets in DataimportHandler

2012-02-22 Thread Mike O'Leary
Could you point me to the most non-intimidating introduction to SolrJ that you 
know of? I have a passing familiarity with Javascript and, with few exceptions, 
I haven't developed software that has a graphical user interface of any kind 
in about 25 years. I like the idea of having finer control over data imported 
from a database though.
Thanks,
Mike

-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com] 
Sent: Monday, February 13, 2012 6:19 AM
To: solr-user@lucene.apache.org
Subject: Re: Recovering from database connection resets in DataimportHandler

I'd seriously consider using SolrJ and your favorite JDBC driver instead. It's 
actually quite easy to create one, although as always it may be a bit 
intimidating to get started. This allows you much finer control over error  
conditions than DIH does, so may be more suited to your needs.

Best
Erick

On Sat, Feb 11, 2012 at 2:40 AM, Mike O'Leary  wrote:
> I am trying to use Solr's DataImportHandler to index a large number of 
> database records in a SQL Server database that is owned and managed by a 
> group we are collaborating with. The indexing jobs I have run so far, except 
> for the initial very small test runs, have failed due to database connection 
> resets. I have gotten indexing jobs to go further by using 
> CachedSqlEntityProcessor and specifying responseBuffering=adaptive in the 
> connection url, but I think in order to index that data I'm going to have to 
> work out how to catch database connection reset exceptions and resubmit the 
> queries that failed. Can anyone can suggest a good way to approach this? Or 
> have any of you encountered this problem and worked out a solution to it 
> already?
> Thanks,
> Mike
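
For readers wondering what the SolrJ-plus-JDBC route amounts to, a minimal indexing loop might look like the sketch below (Solr 3.x-era API; the connection string, Solr URL, table, and field names are all assumptions):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class JdbcIndexer {
    public static void main(String[] args) throws Exception {
        // Hypothetical connection string and Solr URL; adjust for the real environment.
        Connection conn = DriverManager.getConnection(
                "jdbc:sqlserver://dbhost;databaseName=records;responseBuffering=adaptive",
                "user", "password");
        StreamingUpdateSolrServer solr =
                new StreamingUpdateSolrServer("http://localhost:8080/solr", 20, 2);

        Statement stmt = conn.createStatement();
        ResultSet rs = stmt.executeQuery("SELECT doc_id, text FROM texts");
        while (rs.next()) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("DOC_ID", rs.getString("doc_id"));
            doc.addField("NOTE_TEXT", rs.getString("text"));
            solr.add(doc);   // buffered and sent in batches by StreamingUpdateSolrServer
        }
        solr.commit();
        rs.close();
        stmt.close();
        conn.close();
    }
}

The point of doing it this way is that the try/catch around the JDBC calls is yours, so a connection reset can be caught and the query resubmitted instead of the whole import failing.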


Is there a way to write a DataImportHandler deltaQuery that compares contents still to be imported to contents in the index?

2012-02-22 Thread Mike O'Leary
I am working on indexing the contents of a database that I don't have 
permission to alter. In particular, the DataImportHandler examples that show 
how to specify a deltaQuery attribute value show database tables that have a 
last_modified column, and it compares these values with last_index_time values 
stored in the dataimport.properties file. The tables in the database I am 
working with don't have anything like a last_modified column. An indexing job I 
was running yesterday failed, and I would like to restart it so that it only 
imports the data that it hasn't already indexed. As a one-off, I could create a 
list of the keys of the database records that have been indexed and hack in 
something that reads that list as part of how it figures out what to index, but 
I was wondering if there is something built in that would allow me to do the 
same kind of comparison in a likely far more elegant way. What kinds of 
information do the deltaQuery attributes have access to, apart from the 
database tables, columns, etc., and do they have access to any information that 
would help me with what I want to do?
Thanks,
Mike

P.S. While we're on the subject of delta... attributes, can someone explain to 
me what the difference is between the deltaQuery and the deltaImportQuery 
attributes?
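
For context, the standard DIH wiki illustration of the two delta attributes is along the lines below: deltaQuery returns just the primary keys of rows changed since the last import, and deltaImportQuery is then run once per returned key to fetch the full row. It assumes exactly the kind of last_modified column these tables lack (the entity body is omitted here):

<entity name="item" pk="ID"
        query="SELECT * FROM item"
        deltaQuery="SELECT ID FROM item
                    WHERE last_modified &gt; '${dataimporter.last_index_time}'"
        deltaImportQuery="SELECT * FROM item WHERE ID = '${dataimporter.delta.ID}'">
  ...
</entity>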


RE: Indexing taking so much time to complete.

2012-02-25 Thread Mike O'Leary
What's your secret?

OK, that question is not the kind recommended in the UsingMailingLists 
suggestions, so I will write again soon with a description of my data and what 
I am trying to do, and ask more specific questions. And I don't mean to hijack 
the thread, but I am in the same boat as the poster.

I just started working with Solr less than two months ago, and after beginning 
with a completely naïve approach to indexing database contents with 
DataImportHandler and then making small adjustments to improve performance as I 
learned about them, I have gotten some smaller datasets to import in a 
reasonable amount of time, but the 60GB data set that I will need to index for 
the project I am working on would take over three days to import using the 
configuration that I have now. Obviously you're doing something different than 
I am...

What things would you say have made the biggest improvement in indexing 
performance with the 32GB data set that you mentioned? How long do you think it 
would take to index that same data set if you used Solr more or less out of the 
box with no attempts to improve its performance?
Thanks,
Mike

-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com] 
Sent: Saturday, February 25, 2012 2:51 PM
To: solr-user@lucene.apache.org
Subject: Re: Indexing taking so much time to complete.

You have to tell us a lot more about what you're trying to do. I can import 32G 
in about 20 minutes, so obviously you're doing something different than I am...

Perhaps you might review:
http://wiki.apache.org/solr/UsingMailingLists

Best
Erick

On Sat, Feb 25, 2012 at 12:00 AM, Suneel  wrote:
> Hi All,
>
> I am using Apache solr 3.1 and trying to caching 50 gb records but it 
> is taking more then 20 hours this is very painful to update records.
>
> 1. Is there any way to reduce caching time or this time is ok for 50 
> gb records ?.
>
> 2. What is the delta-import, this will be helpful for me cache only 
> updated record not rather then caching all records ?.
>
>
>
> Please help me in above mentioned question.
>
>
> Thanks & Regards,
>
> -
> Suneel Pandey
> Sr. Software Developer
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Indexing-taking-so-much-time-to-complete-tp3774464p3774464.html
> Sent from the Solr - User mailing list archive at Nabble.com.


Including an attribute value from a higher level entity when using DIH to index an XML file

2012-03-02 Thread Mike O'Leary
I have an XML file that I would like to index, that has a structure similar to 
this:

[XML markup stripped by the mailing list archive; surviving fragments:]
    [message text]
    ...

I would like to have the documents in the index correspond to the messages in 
the xml file, and have the user's [id-num] value stored as a field in each of 
the user's documents. I think this means that I have to define an entity for 
message that looks like this:

[entity definition stripped by the mailing list archive]

but I don't know where to put the field definition for the user id. It would 
look like

[field definition stripped by the mailing list archive]

I can't put it within the message entity, because it is defined with 
forEach="/data/user/message/" and the id field's xpath value is outside of the 
entity's scope. Putting the id field definition there causes a null pointer 
exception. I don't think I want to create a "user" entity that the "message" 
entity is nested inside of, or is there a way to do that and still have the 
index documents correspond to messages from the file? Are there one or more 
attributes or attribute values that I haven't run across in my searching 
that provide a way to do what I need to do?
Thanks,
Mike




RE: Including an attribute value from a higher level entity when using DIH to index an XML file

2012-03-12 Thread Mike O'Leary
I found an answer to my question, but it comes with a cost. With an XML file 
like this (this is simplified to remove extraneous elements and attributes):

[XML markup stripped by the mailing list archive; surviving fragments:]
    [message text]
    ...

I can index the user id as a field in documents that represent each of the 
user's messages with this data-config expression:

[data-config entity definition stripped by the mailing list archive]

I didn't realize that commonField would work for cases in which the previously 
encountered field is in an element that encompasses the other elements, but it 
does. The forEach value has to be "/data/user/message | /data/user" in order 
for the user id to be located, since it is not under /data/user/message.
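
Since the markup above was stripped by the archive, here is a sketch of the kind of configuration being described; the file path and column names are assumptions, and the essential parts are commonField="true" on the user id field and the union expression in forEach:

<dataConfig>
  <dataSource type="FileDataSource" encoding="UTF-8"/>
  <document>
    <entity name="message"
            processor="XPathEntityProcessor"
            url="C:/data/messages.xml"
            forEach="/data/user/message | /data/user"
            stream="true">
      <!-- commonField carries the last-seen user id forward onto each message document -->
      <field column="user_id" xpath="/data/user/@id" commonField="true"/>
      <field column="date"    xpath="/data/user/message/@date"/>
      <field column="text"    xpath="/data/user/message"/>
    </entity>
  </document>
</dataConfig>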

By specifying forEach="/data/user/message | /data/user" I am saying that each 
/data/user or /data/user/message element is a document in the index, but I 
don't really want /data/user elements to be treated this way. As luck would 
have it, those documents are filtered out, only because date and text are 
required fields, and they have not been assigned values yet when a document is 
created for a /data/user element, so an exception is thrown. I could live with 
this, but it's kind of ugly.

I don't see any other way of doing what I need to do with embedded XML elements 
though. I tried creating nested entities in the data-config file, but each one 
of them is required to have a url attribute, and I think that caused the input 
file to be read twice.

The only other possibility I could see from reading the DataImportHandler 
documentation was to specify an XSL file and change the XML file's structure so 
that the user id attribute is moved down to be an attribute of the message 
element. I'm not sure it's worth it to do something like that for what seems 
like a small problem, and I wonder how much it would slow down the importing of 
a large XML file.

Are there any other ways of handling cases like this, where an attribute of an 
outer element is to be included in an index document that corresponds to an 
element nested inside it?
Thanks,
Mike

-Original Message-
From: Mike O'Leary [mailto:tmole...@uw.edu] 
Sent: Friday, March 02, 2012 3:30 PM
To: Solr-User (solr-user@lucene.apache.org)
Subject: Including an attribute value from a higher level entity when using DIH 
to index an XML file

I have an XML file that I would like to index, that has a structure similar to 
this:

[XML markup stripped by the mailing list archive; surviving fragments:]
    [message text]
    ...

I would like to have the documents in the index correspond to the messages in 
the xml file, and have the user's [id-num] value stored as a field in each of 
the user's documents. I think this means that I have to define an entity for 
message that looks like this:

[entity definition stripped by the mailing list archive]

but I don't know where to put the field definition for the user id. It would 
look like

[field definition stripped by the mailing list archive]

I can't put it within the message entity, because it is defined with 
forEach="/data/user/message/" and the id field's xpath value is outside of the 
entity's scope. Putting the id field definition there causes a null pointer 
exception. I don't think I want to create a "user" entity that the "message" 
entity is nested inside of, or is there a way to do that and still have the 
index documents correspond to messages from the file? Are there one or more 
attributes or values of attribute that I haven't run across in my searching 
that provide a way to do what I need to do?
Thanks,
Mike




SolrJ updating indexed documents?

2012-04-02 Thread Mike O'Leary
I am working on a component for indexing documents from a database that 
contains medical records. The information is organized across several tables 
and I am supposed to index records for varying sizes of sets of patients for 
others to do IR experiments with. Each patient record has one or more main 
documents associated with it, and each main document has zero or more addenda 
associated with it. (The main documents and addenda are treated alike for the 
most part, except for a parent record field that is null for main documents and 
has the number of a main document for addenda. Addenda cannot have addenda.) 
Also, each main document has one or more diagnosis records. I am trying to 
figure out the best performing way to select all of the records for each 
patient, including the main documents, addenda and diagnoses.

I tried indexing sets of these records using DataImportHandler and nested 
Entity blocks in a way similar to the Full Import example on the 
http://wiki.apache.org/solr/DataImportHandler page, with a select for all 
patients and main records in a data set, and nested selects that get all of the 
addenda and all of the diagnoses for each patient, but it didn't run very fast 
and a database resource person who looked into it with me said that issuing a 
million SQL queries for addenda and a million queries for diagnoses, one each 
for the million patient documents in a typical set of 10,000 patients, was very 
inefficient, and I should look for a different way of getting the data.

I switched to using SolrJ, and I am trying to figure out which of two ways to 
use to index this data. One would be to use one large SQL statement to get all 
of the data for a patient set. The results would contain duplication due to the 
way tables are joined together that I would need to sort out in the Java code, 
but that is doable.

The other way would be to

1.   Get all of the main document data with one SQL query, create index 
documents with the data that they contain and store them in the index,

2.   Issue another SQL query that gets all of the addenda for all of the 
patients in the data set and an id number for each one that tells which main 
document an addendum belongs with, retrieve the main documents from the index, 
add the addenda fields to the document and put them back in the index

3.   Do the same with diagnosis data.

It would be great to be able to keep the main document data that is retrieved 
from the database in a hash table, update each of those objects with addenda 
and diagnoses, and write completely filled out documents to the index once, but 
I don't have enough memory available to do this for the patient sets I am 
working with now, and they want this indexing process to scale up to patient 
sets that are ten times as large and eventually much larger than that.

Essentially for the second approach I am wondering if a Lucene index can be 
made to serve as a hash table for storing intermediate results, and whether 
SolrJ has an API for retrieving individual index documents so they can be 
updated. Basically it would be shifting from iterating over SQL queries to 
iterating over Lucene index updates. If this way of doing things is also likely 
to be slow, or the SolrJ API doesn't provide a way to do this, or there are 
other problems with it, I can go with selecting all of the data in one large 
query and dealing with the duplication.
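
As a sketch of the second approach's step 2, SolrJ (3.x era) can fetch a previously indexed document by its key with a query and re-add it with the extra fields; the class names, URL, and field names below are assumptions, and re-adding a document with the same uniqueKey replaces it entirely, so every stored field has to be copied over:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrInputDocument;

public class AddendumMerger {
    public static void main(String[] args) throws Exception {
        SolrServer solr = new CommonsHttpSolrServer("http://localhost:8080/solr");

        // Fetch the previously indexed main document by its key.
        SolrQuery q = new SolrQuery("DOC_ID:12345");
        SolrDocument existing = solr.query(q).getResults().get(0);

        // Copy the stored fields into a new input document, then append the addendum.
        SolrInputDocument updated = new SolrInputDocument();
        for (String field : existing.getFieldNames()) {
            updated.addField(field, existing.getFieldValue(field));
        }
        updated.addField("addendum_text", "text from the addenda query");

        solr.add(updated);   // same uniqueKey, so this replaces the old document
        solr.commit();
    }
}

This only works for fields that are stored, and it costs a query plus a re-add per document, so it may not beat the single large join in practice.
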
Thanks,
Mike


waitFlush and waitSearcher with SolrServer.add(docs, commitWithinMs)

2012-04-04 Thread Mike O'Leary
If you index a set of documents with SolrJ and use
StreamingUpdateSolrServer.add(Collection<SolrInputDocument> docs, int commitWithinMs),
it will perform a commit within the time specified, and it seems to use default 
values for waitFlush and waitSearcher.

Is there a place where you can specify different values for waitFlush and 
waitSearcher, or if you want to use different values do you have to call 
StreamingUpdateSolrServer.add(Collection<SolrInputDocument> docs)
and then call
StreamingUpdateSolrServer.commit(waitFlush, waitSearcher)
explicitly?
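
In code, the two alternatives being compared look like this (a sketch; server is a StreamingUpdateSolrServer and docs a Collection<SolrInputDocument>):

// Alternative 1: hand the commit deadline to the server along with the add;
// waitFlush/waitSearcher are whatever the eventual commit defaults to.
server.add(docs, 10000);   // commit within 10 seconds

// Alternative 2: add without a deadline, then commit explicitly
// with the waitFlush/waitSearcher values you want.
server.add(docs);
server.commit(false, false);   // commit(waitFlush, waitSearcher)
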
Thanks,
Mike


RE: waitFlush and waitSearcher with SolrServer.add(docs, commitWithinMs)

2012-04-04 Thread Mike O'Leary
I am indexing some database contents using add(docs, commitWithinMs), and those 
add calls are taking over 80% of the time once the database begins returning 
results. I was wondering if setting waitSearcher to false would speed this up. 
Many of the calls take 1 to 6 seconds, with one outlier that took over 11 
minutes.
Thanks,
Mike

-Original Message-
From: Mark Miller [mailto:markrmil...@gmail.com] 
Sent: Wednesday, April 04, 2012 4:15 PM
To: solr-user@lucene.apache.org
Subject: Re: waitFlush and waitSearcher with SolrServer.add(docs, 
commitWithinMs)


On Apr 4, 2012, at 6:50 PM, Mike O'Leary wrote:

> If you index a set of documents with SolrJ and use 
> StreamingUpdateSolrServer.add(Collection docs, int 
> commitWithinMs), it will perform a commit within the time specified, and it 
> seems to use default values for waitFlush and waitSearcher.
> 
> Is there a place where you can specify different values for waitFlush 
> and waitSearcher, or if you want to use different values do you have 
> to call StreamingUpdateSolrServer.add(Collection 
> docs) and then call StreamingUpdateSolrServer.commit(waitFlush, waitSearcher) 
> explicitly?
> Thanks,
> Mike


waitFlush actually does nothing in recent versions of Solr. waitSearcher 
doesn't seem so important when the commit is not done explicitly by the user or 
a client.

- Mark Miller
lucidimagination.com


RE: waitFlush and waitSearcher with SolrServer.add(docs, commitWithinMs)

2012-04-05 Thread Mike O'Leary
First of all, what I was seeing was different from what I thought I was seeing because a few weeks ago I uncommented the <autoCommit> block in the solrconfig.xml file and I didn't realize it until yesterday just before I went home, so that was controlling the commits more than the add and commit calls that I was making. When I commented that block out again, the times for indexing with add(docs, commitWithinMs) and with add(docs) and commit(false, false) were very similar. Both of them were about 20 minutes faster (38 minutes instead of about an hour) than indexing with <autoCommit> set to commit after every 1,000 documents or fifteen minutes.
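
For reference, the solrconfig.xml block in question, set to commit after every 1,000 documents or fifteen minutes, presumably looked something like this:

<autoCommit>
  <maxDocs>1000</maxDocs>
  <maxTime>900000</maxTime>   <!-- fifteen minutes, in milliseconds -->
</autoCommit>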

Is this the blog post you are talking about: 
http://www.searchworkings.org/blog/-/blogs/gimme-all-resources-you-have-i-can-use-them!/?
 It seems to be about the right topic.

I am using Solr 3.5. The feature matrix on one of the Lucid Imagination web 
pages says that DocumentWriterPerThread is available in Solr 4.0 and LucidWorks 
2.0. I assume that means LucidWorks Enterprise. Is that right?
Thanks,
Mike

-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com] 
Sent: Thursday, April 05, 2012 2:45 PM
To: solr-user@lucene.apache.org
Subject: Re: waitFlush and waitSearcher with SolrServer.add(docs, 
commitWithinMs)

Solr version? I suspect your outlier is due to merging segments, if so this 
should have happened quite some time into the run. See Simon Wilnauer's blog 
post on DocumenWriterPerThread (trunk) code.

What commitWithin time are you using?


Best
Erick

On Wed, Apr 4, 2012 at 7:50 PM, Mike O'Leary  wrote:
> I am indexing some database contents using add(docs, commitWithinMs), and 
> those add calls are taking over 80% of the time once the database begins 
> returning results. I was wondering if setting waitSearcher to false would 
> speed this up. Many of the calls take 1 to 6 seconds, with one outlier that 
> took over 11 minutes.
> Thanks,
> Mike
>
> -Original Message-
> From: Mark Miller [mailto:markrmil...@gmail.com]
> Sent: Wednesday, April 04, 2012 4:15 PM
> To: solr-user@lucene.apache.org
> Subject: Re: waitFlush and waitSearcher with SolrServer.add(docs, 
> commitWithinMs)
>
>
> On Apr 4, 2012, at 6:50 PM, Mike O'Leary wrote:
>
>> If you index a set of documents with SolrJ and use 
>> StreamingUpdateSolrServer.add(Collection docs, int 
>> commitWithinMs), it will perform a commit within the time specified, and it 
>> seems to use default values for waitFlush and waitSearcher.
>>
>> Is there a place where you can specify different values for waitFlush 
>> and waitSearcher, or if you want to use different values do you have 
>> to call StreamingUpdateSolrServer.add(Collection
>> docs) and then call StreamingUpdateSolrServer.commit(waitFlush, 
>> waitSearcher) explicitly?
>> Thanks,
>> Mike
>
>
> waitFlush actually does nothing in recent versions of Solr. waitSearcher 
> doesn't seem so important when the commit is not done explicitly by the user 
> or a client.
>
> - Mark Miller
> lucidimagination.com
>
>
>
>
>
>
>
>
>
>
>


DocumentWriterPerThread sample code?

2012-04-16 Thread Mike O'Leary
Does anyone know of sample code that illustrates how to use the 
DocumentWriterPerThread class in indexing?
Thanks,
Mike


Writing index files that have the right owner

2012-06-15 Thread Mike O'Leary
I have been putting together an application using Quartz to run several 
indexing jobs in sequence using SolrJ and Tomcat on Windows. I would like the 
Quartz job to do the following:

1.   Delete index directories from the cores so each indexing job starts 
fresh with empty indexes to populate.

2.   Start the Tomcat server.

3.   Run the indexing job.

4.   Stop the Tomcat server.

5.   Copy the index directories to an archive.

Steps 2-5 work fine, but I haven't been able to find a way to delete the index 
directories from within Java. I also can't delete them from a Windows command 
shell window: I get an error message that says "Access is denied". The reason 
for this is that the index directories and files have the owner 
"BUILTIN\Administrators". Although I am an administrator on this machine, the 
fact that these files have a different owner means that I can only delete them 
in a Windows command shell window if I start it with "Run as administrator". I 
spent a bunch of time today trying every Java function and Windows shell 
command I could find that would let me change the owner of these files, grant 
my user account the capability to delete the files, etc. Nothing I tried 
worked, likely because along with not having permission to delete the files, I 
also don't have permission to give myself permission to delete the files.

At a certain point I stopped wondering how to change the files owner or 
permissions and started wondering why the files have "BUILTIN\Administrators" 
as owner, and the permissions associated with that owner, in the first place. 
Is there somewhere in the Solr or Tomcat configuration files, or in the SolrJ 
code, where I can set who the owner of files written to the index directories 
should be?
Thanks,
Mike


Problem with Solr not finding a class that is in lucene-analyzers.jar

2012-07-10 Thread Mike O'Leary
I have been running Solr with Tomcat, and I recently wrote a Quartz program 
that starts and stops Tomcat, starts Solr indexing jobs, and does a few other 
things. When I start Tomcat programmatically in this way, Solr starts 
initializing, and when it hits the text_ws field type in schema.xml, it throws 
an exception saying that it can't find the SynonymFilter class. text_ws refers 
to solr.SynonymFilterFactory, which must need to find lucene.SynonymFilter, and 
I am guessing it is the first Lucene class encountered while initializing the 
schema that isn't in lucene-core.jar.

I thought it would be easy to fix this by looking through the solr config files 
for the location where it specifies where it looks for Lucene jar files, and 
check to make sure that both the lucene-core and lucene-analyzers jar files are 
there. I see where there is a line in the solrconfig.xml file that says <luceneMatchVersion>LUCENE_36</luceneMatchVersion>, but not one that says to 
look for the Lucene jar files in a particular directory. Are the Lucene jar 
files packaged in the solr.war file?
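
For reference, solrconfig.xml can add jar directories to a core's classpath with lib directives along these lines (the paths here are assumptions, not from this setup):

<!-- load every jar in a directory, relative to the core's instanceDir -->
<lib dir="../../lib"/>
<!-- or only the jars matching a pattern -->
<lib dir="C:/path/to/extra/lib" regex="lucene-analyzers-.*\.jar"/>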

I also looked for directories that contain Lucene jar files within my Solr 
project, which is called tiudocumentsearch, and the one I found was in 
tomcat/work/Catalina/localhost/tiudocumentsearch/WEB-INF/lib, but the 
lucene-core and lucene-analyzers jar files are both there.

So the two things I am asking for help in figuring out are how to indicate to 
Solr where the lucene-analyzers.jar file is, so it can find the SynonymFilter 
class during initialization, and why this exception isn't thrown when I start 
Tomcat for this Solr project in a command prompt window, but it occurs when I 
start Tomcat from a Java application. I am using Solr and Lucene 3.6.
Thank you for any help or suggestions you can provide,
Mike