Re: DataImportHandlerException for custom DIH Transformer
Resurrecting an old thread. I faced the exact same problem as Tommy, and the jar was in {solr.home}/lib as Noble had suggested.

My custom transformer overrides the following method, as per the specification of the abstract Transformer class:

    public Object transformRow(Map row, Context context);

But in the code (EntityProcessorWrapper.java), I see the following line:

    final Method meth = clazz.getMethod(TRANSFORM_ROW, Map.class);

This doesn't match the method signature in Transformer. I think it should be:

    final Method meth = clazz.getMethod(TRANSFORM_ROW, Map.class, Context.class);

I have verified that adding a one-argument method transformRow(Map row) works. Am I missing something?

--shashi

2010/2/8 Noble Paul നോബിള് नोब्ळ्

> On Mon, Feb 8, 2010 at 9:13 AM, Tommy Chheng wrote:
> > I'm having trouble making a custom DIH transformer in Solr 1.4.
> >
> > I compiled the "General TrimTransformer" into a jar (just copy/paste of the
> > sample code from http://wiki.apache.org/solr/DIHCustomTransformer).
> > I placed the jar along with the dataimporthandler jar in solr/lib (same
> > directory as the jetty jar).
>
> Do not keep it in solr/lib, it won't work. Keep it in {solr.home}/lib.
>
> > Then I added this to my DIH data-config.xml file:
> > transformer="DateFormatTransformer, RegexTransformer,
> > com.chheng.dih.transformers.TrimTransformer"
> >
> > Now I get this exception when I try running the import:
> > org.apache.solr.handler.dataimport.DataImportHandlerException:
> > java.lang.NoSuchMethodException:
> > com.chheng.dih.transformers.TrimTransformer.transformRow(java.util.Map)
> >     at
> > org.apache.solr.handler.dataimport.EntityProcessorWrapper.loadTransformers(EntityProcessorWrapper.java:120)
> >
> > I noticed the exception lists TrimTransformer.transformRow(java.util.Map),
> > but the abstract Transformer class defines a two-parameter method:
> > transformRow(Map row, Context context)?
> >
> > --
> > Tommy Chheng
> > Programmer and UC Irvine Graduate Student
> > Twitter @tommychheng
> > http://tommy.chheng.com
>
> --
> -
> Noble Paul | Systems Architect | AOL | http://aol.com
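(A side note for readers: the reflective lookup above means a transformer that does not extend the abstract Transformer class must expose the one-argument form. A minimal sketch, assuming a hypothetical package and class name:)

    package com.example.dih;

    import java.util.Map;

    public class TrimTransformer {
        // No Context parameter: the erasure of this method matches
        // clazz.getMethod(TRANSFORM_ROW, Map.class) in EntityProcessorWrapper.
        public Object transformRow(Map<String, Object> row) {
            for (Map.Entry<String, Object> entry : row.entrySet()) {
                Object value = entry.getValue();
                if (value instanceof String) {
                    // Trim every String column of the row in place.
                    entry.setValue(((String) value).trim());
                }
            }
            return row;
        }
    }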
Concurrent DB updates and delta import misses few records
Hi,

I'm using DIH to index records from a database. After every update on the (MySQL) DB, Solr DIH is invoked for a delta import. In my tests, I have observed that if DB updates and a DIH import happen concurrently, the import misses a few records.

Here is how it happens. The table has a column 'lastUpdated', which defaults to the current timestamp. Many records are added to the database in a single transaction that takes several seconds. For example, if 10,000 rows are being inserted, the rows may get timestamp values from '2010-09-20 18:21:20' to '2010-09-20 18:21:26'. These rows become visible only after the transaction is committed. That happens at, say, '2010-09-20 18:21:30'.

If a Solr import gets triggered at '18:21:29', it will use the timestamp of the last import for the delta query. This import will not see the records added in the aforementioned transaction, as the transaction was not committed at that instant. After this import, dataimport.properties will have the last index time as '18:21:29'. The next import will not be able to get all the rows of the previously mentioned transaction, as some of those rows have timestamps earlier than '18:21:29'.

While I am testing extreme conditions, there is a real possibility of missing out on some data.

I could not find any solution in the Solr framework to handle this. The table has an auto-increment key, and all updates are deletes followed by inserts. So, having a last_indexed_id would have helped, where last_indexed_id is the max value of id fetched in that import. The delta query would then become "SELECT id ... WHERE id > last_indexed_id". I suppose Solr does not have any provision like this.

Two options I could think of are:
(a) Ensure at the application level that DB updates and DIH import requests never run concurrently.
(b) Use exclusive locking during DB updates.

What is the best way to address this problem?

Thank you,

--shashi
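(A minimal sketch of the id-based bookkeeping described above, assuming a table named docs with an auto-increment column id; the table and column names are illustrative:)

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;

    public class IdBasedDelta {
        // Returns the new last_indexed_id; the caller persists it only after
        // the import has succeeded, so a failed run is simply retried.
        public static long importDelta(Connection conn, long lastIndexedId)
                throws SQLException {
            long maxId = lastIndexedId;
            PreparedStatement ps = conn.prepareStatement(
                    "SELECT id, content FROM docs WHERE id > ? ORDER BY id");
            ps.setLong(1, lastIndexedId);
            ResultSet rs = ps.executeQuery();
            while (rs.next()) {
                long id = rs.getLong("id");
                // ... push rs.getString("content") to Solr here ...
                maxId = id;  // rows arrive in id order
            }
            rs.close();
            ps.close();
            return maxId;
        }
    }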
Re: Concurrent DB updates and delta import misses few records
Thanks for the pointer, Shawn. It definitely is useful.

I am wondering if you could retrieve minDid from Solr rather than storing it externally. The max id from the Solr index and the max id from the DB should define the lower and upper thresholds, respectively, of the delta range. Am I missing something?

--shashi

On Wed, Sep 22, 2010 at 6:47 PM, Shawn Heisey wrote:
> Shashi,
>
> I was not solving the same problem, but perhaps you can adapt my solution
> to yours. My main problem was that I don't have a modified date in my
> database, and due to the size of the table, it is impractical to add one.
> Instead, I chose to track the database primary key (a simple autoincrement)
> outside of Solr and pass min/max values into DIH for it to use in the SELECT
> statement. You can see a simplified version of my entity here, with a URL
> showing how to send the parameters in via the dataimport GET:
>
> http://www.mail-archive.com/solr-user@lucene.apache.org/msg40466.html
>
> The update script that runs every two minutes gets MAX(did) from the
> database, retrieves the minDid from a file on an NFS share, and runs a
> delta-import with those two values. When the import is reported successful,
> it writes the maxDid value to the minDid file on the network share for the
> next run. If the import fails, it sends an alarm and doesn't update the
> minDid.
>
> Shawn
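(A rough sketch of the update-script flow Shawn describes; the minDid/maxDid names come from his mail, while the file path, URL, and success check are assumptions. DIH's delta-import runs asynchronously, so a real script would poll command=status before trusting the result:)

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.FileWriter;
    import java.io.InputStreamReader;
    import java.net.URL;

    public class DeltaImportRunner {
        // Hypothetical bookkeeping file on the shared NFS mount.
        private static final String MIN_DID_FILE = "/mnt/nfs/solr/minDid.txt";
        private static final String DIH_URL = "http://localhost:8983/solr/dataimport";

        public static void main(String[] args) throws Exception {
            long minDid = readLong(MIN_DID_FILE);
            long maxDid = Long.parseLong(args[0]); // MAX(did), fetched from the DB by the caller

            // The extra parameters surface in data-config.xml as
            // ${dataimporter.request.minDid} and ${dataimporter.request.maxDid}.
            URL url = new URL(DIH_URL + "?command=delta-import&minDid=" + minDid
                    + "&maxDid=" + maxDid);
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(url.openStream(), "UTF-8"));
            StringBuilder response = new StringBuilder();
            for (String line; (line = in.readLine()) != null; ) {
                response.append(line);
            }
            in.close();

            // delta-import returns immediately; a production script would poll
            // ?command=status until idle. The check below is only a stand-in.
            if (response.indexOf("Exception") < 0) {
                FileWriter out = new FileWriter(MIN_DID_FILE);
                out.write(Long.toString(maxDid));
                out.close();
            }
        }

        private static long readLong(String path) throws Exception {
            BufferedReader r = new BufferedReader(new FileReader(path));
            try {
                return Long.parseLong(r.readLine().trim());
            } finally {
                r.close();
            }
        }
    }

To Shashi's follow-up idea, readLong() could instead ask Solr itself for the current maximum id, e.g. a query sorted descending on the id field with rows=1, assuming that field is indexed for sorting.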
Retrieving a field from all result documents & a couple more queries
Hi,

I am familiar with Lucene and am trying out Solr.

I have an index which was created outside Solr. The index is fairly simple, with two fields - document_id & content. The query result needs to return all the document IDs, and the results need not be ordered by the score. For this, in Lucene, I use a custom hit collector with the search to get results quickly. The index has a few million documents, and queries returning hundreds of thousands of documents are not uncommon. So, speed is crucial here.

Since retrieving the document_id for each document is slow, I am using the FieldCache to store the values of document_id. For all the results collected (in a bitset) by the hit collector, the document_id field is retrieved from the FieldCache.

1. How can I effectively disable scoring? I have read that ConstantScoreQuery is quite fast, but from the code, I see that it is used only for wildcard queries. How can I use ConstantScoreQuery for all the queries (boolean, term, phrase, ...)? Also, is ConstantScoreQuery as fast as a custom hit collector?

2. How can Solr take advantage of the FieldCache while returning the field document_id? The documentation says the FieldCache can be explicitly auto-warmed with Solr. If the FieldCache is available and initialized at the beginning, will Solr look into the cache to retrieve the fields to be returned?

3. If there is an additional field for stemmed_content on which search needs to use a different analyzer, I suppose that could be specified by the fieldType attribute in the schema.

Thank you,

--shashi
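(On question 1, a minimal sketch against the Lucene 2.x API that Solr of this era ships: constant scoring is not limited to wildcard queries, since any query can be wrapped in a filter:)

    import org.apache.lucene.search.ConstantScoreQuery;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.QueryWrapperFilter;

    public class ConstantScore {
        // Wraps an arbitrary query (boolean, term, phrase, ...) so that every
        // hit receives the same constant score instead of a computed one.
        public static Query wrap(Query original) {
            return new ConstantScoreQuery(new QueryWrapperFilter(original));
        }
    }

Matching still runs in full; only the per-document score computation is bypassed, so this is close to, but not automatically as cheap as, a bare HitCollector.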
Re: Retrieving a field from all result documents & a couple more queries
Thanks, Abhay.

Can someone please throw light on how to disable scoring?

--shashi

On Wed, Sep 16, 2009 at 11:55 AM, abhay kumar wrote:
> Hi,
>
> 1) Solr has various types of caches, and we can specify how many documents a
> cache holds at a time. For example, if queryResultWindowSize=50, then 50
> results are cached in the queryResult cache; when a user requests results
> beyond those 50 documents, a new request is sent to the server, which
> retrieves the next 50 results into the cache.
> http://wiki.apache.org/solr/SolrCaching
> Yes, Solr looks into the cache to retrieve the fields to be returned.
>
> 2) Yes, we can have different tokenizers or filters for index & search. We
> need not create a different fieldtype; we configure the index and query
> analyzer sections of the same fieldtype (datatype) differently, e.g.:
>
> <fieldType name="text" class="solr.TextField" positionIncrementGap="100"
>            stored="false" multiValued="true">
>   <analyzer type="index">
>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>     <filter class="solr.StopFilterFactory" ignoreCase="true"
>             words="stopwords.txt"/>
>   </analyzer>
>   <analyzer type="query">
>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>   </analyzer>
> </fieldType>
>
> Regards,
> Abhay
Re: Retrieving a field from all result documents & a couple more queries
No, I don't wish to put a custom Similarity. Rather, I want an equivalent of HitCollector where I can bypass the scoring altogether. And I prefer to do it by changing the configuration.

--shashi

On Wed, Sep 16, 2009 at 6:36 PM, rajan chandi wrote:
> You might be talking about modifying the Similarity object to change the
> scoring formula in Lucene!
>
> $searcher->setSimilarity($similarity);
> $writer->setSimilarity($similarity);
>
> This can very well be done in Solr, as SolrIndexWriter inherits from the
> Lucene IndexWriter class.
> You might want to download the Solr source code and take a look at
> SolrIndexWriter to begin with!
>
> It's in the package org.apache.solr.update
>
> Thanks
> Rajan
Re: Retrieving a field from all result documents & a couple more queries
Hoss,

As I mentioned previously, I prefer to do this with as little Java code as possible. That's my motivation for taking a look at Solr.

Here is the code snippet:

    final OpenBitSet resultBitset = new OpenBitSet(this.searcher.maxDoc());
    this.searcher.search(query, new HitCollector() {
        @Override
        public void collect(int docID, float score) {
            resultBitset.set(docID);
        }
    });

Then I retrieve the field values from the FieldCache and look up the results present in resultBitset:

    int[] docIDs = FieldCache.DEFAULT.getInts(this.luceneIndex.reader, FIELD_DOCUMENT_ID);

I need to do this as I need all the matching results, but the order is not important (for this search). In the index, the content field has a term vector, which I can't drop. There are other types of searches where relevance ranking is required.

Can I achieve the same with Solr?

Thanks,

--shashi

On Fri, Sep 18, 2009 at 3:21 AM, Chris Hostetter wrote:
>
> : You will need to get SolrIndexSearcher.java and modify following:-
> :
> : public static final int GET_SCORES = 0x01;
>
> No. Do not do that. There is no reason for anyone to EVER modify that
> line of code. Absolutely NONE.
>
> If you've made that change to your version of Solr, please start a new
> thread on solr-user explaining your goal and what things you tried before
> ultimately making that change, because I guarantee you that if you are
> willing to modify Java files to change that line, there will be a more
> general-purpose, reusable way to solve your goal besides that (which won't
> silently break a lot of other functionality).
>
> : > No, I don't wish to put a custom Similarity. Rather, I want an
> : > equivalent of HitCollector where I can bypass the scoring altogether.
> : > And I prefer to do it by changing the configuration.
>
> ...there is no pure-configuration way to obtain the same logic you could
> get from a custom HitCollector. You haven't elaborated on what exactly
> your HitCollector looked like, but so far you've mentioned that it
> ignored the scores and used the FieldCache to get a field value without
> dealing with stored fields -- you can achieve something roughly
> functionally similar by writing a custom RequestHandler that uses
> SolrIndexSearcher.getDocSet (which skips scoring and sorting) and then
> iterates over that DocSet and fetches the values you want from the
> FieldCache.
>
> Or you could write a RequestHandler that uses your HitCollector as is --
> but then you aren't really leveraging any value from Solr at all. The
> previous suggestion has the value-add of utilizing Solr's filterCache for
> frequent queries (which can be really handy if your queries can be
> easily broken apart into pieces and dealt with using DocSet
> union/intersection operations -- like q/fq are dealt with in
> SearchHandler).
>
> -Hoss
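(For readers landing on this thread: below is a rough, untested sketch of the getDocSet-plus-FieldCache handler Hoss outlines, written against Solr 1.x-era APIs; the handler name, the document_id field, and the response key are illustrative assumptions, not code from this thread:)

    import java.util.ArrayList;
    import java.util.List;

    import org.apache.lucene.search.FieldCache;
    import org.apache.lucene.search.Query;
    import org.apache.solr.common.params.CommonParams;
    import org.apache.solr.handler.RequestHandlerBase;
    import org.apache.solr.request.SolrQueryRequest;
    import org.apache.solr.request.SolrQueryResponse;
    import org.apache.solr.search.DocIterator;
    import org.apache.solr.search.DocSet;
    import org.apache.solr.search.QueryParsing;
    import org.apache.solr.search.SolrIndexSearcher;

    public class DocumentIdHandler extends RequestHandlerBase {

        @Override
        public void handleRequestBody(SolrQueryRequest req, SolrQueryResponse rsp)
                throws Exception {
            SolrIndexSearcher searcher = req.getSearcher();
            Query q = QueryParsing.parseQuery(req.getParams().get(CommonParams.Q),
                    req.getSchema());

            // getDocSet skips scoring and sorting, and consults the filterCache.
            DocSet matches = searcher.getDocSet(q);

            // One FieldCache array lookup; no stored-field access per document.
            int[] ids = FieldCache.DEFAULT.getInts(searcher.getReader(), "document_id");

            List<Integer> out = new ArrayList<Integer>(matches.size());
            for (DocIterator it = matches.iterator(); it.hasNext(); ) {
                out.add(ids[it.nextDoc()]);
            }
            rsp.add("document_ids", out);
        }

        @Override
        public String getDescription() { return "Returns document_id for all matches"; }
        @Override
        public String getSource() { return "$URL$"; }
        @Override
        public String getSourceId() { return "$Id$"; }
        @Override
        public String getVersion() { return "$Revision$"; }
    }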