Re: DataImportHandlerException for custom DIH Transformer

2010-09-08 Thread Shashikant Kore
Resurrecting an old thread.

I faced the exact problem Tommy did, and the jar was in {solr.home}/lib as
Noble had suggested.

My custom transformer overrides the following method, as per the specification
of the Transformer class.

  public Object transformRow(Map<String, Object> row, Context context);

But, in the code (EntityProcessorWrapper.java), I see the following line.

  final Method meth = clazz.getMethod(TRANSFORM_ROW, Map.class);

This doesn't match the method signature in Transformer. I think this should
be

  final Method meth = clazz.getMethod(TRANSFORM_ROW, Map.class, Context.class);

I have verified that adding a one-argument method
transformRow(Map<String, Object> row) works.

Am I missing something?
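
For anyone hitting the same wall, below is a sketch of the workaround I ended
up with: a transformer (modeled loosely on the wiki's TrimTransformer sample,
with illustrative trimming logic) that also exposes a one-argument overload,
so the reflective getMethod(TRANSFORM_ROW, Map.class) lookup succeeds and
simply delegates to the real two-argument method.

  import java.util.Map;

  import org.apache.solr.handler.dataimport.Context;
  import org.apache.solr.handler.dataimport.Transformer;

  public class TrimTransformer extends Transformer {

      @Override
      public Object transformRow(Map<String, Object> row, Context context) {
          // Trim every String value in the row; the context is not used.
          for (Map.Entry<String, Object> entry : row.entrySet()) {
              if (entry.getValue() instanceof String) {
                  entry.setValue(((String) entry.getValue()).trim());
              }
          }
          return row;
      }

      // One-argument overload matching the reflective lookup in
      // EntityProcessorWrapper; delegates with a null context, which is
      // safe here because the context is never touched.
      public Object transformRow(Map<String, Object> row) {
          return transformRow(row, null);
      }
  }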

--shashi

2010/2/8 Noble Paul നോബിള്‍ नोब्ळ् 

> On Mon, Feb 8, 2010 at 9:13 AM, Tommy Chheng wrote:
> >  I'm having trouble making a custom DIH transformer in solr 1.4.
> >
> > I compiled the "General TrimTransformer" into a jar. (just copy/paste
> sample
> > code from http://wiki.apache.org/solr/DIHCustomTransformer)
> > I placed the jar along with the dataimporthandler jar in solr/lib (same
> > directory as the jetty jar)
>
> do not keep it in solr/lib, it won't work. Keep it in {solr.home}/lib
> >
> > Then I added to my DIH data-config.xml file:
> > transformer="DateFormatTransformer, RegexTransformer,
> > com.chheng.dih.transformers.TrimTransformer"
> >
> > Now I get this exception when I try running the import.
> > org.apache.solr.handler.dataimport.DataImportHandlerException:
> > java.lang.NoSuchMethodException:
> > com.chheng.dih.transformers.TrimTransformer.transformRow(java.util.Map)
> >at
> >
> org.apache.solr.handler.dataimport.EntityProcessorWrapper.loadTransformers(EntityProcessorWrapper.java:120)
> >
> > I noticed the exception lists TrimTransformer.transformRow(java.util.Map)
> > but the abstract Transformer class defines a two-parameter method:
> > transformRow(Map<String, Object> row, Context context)?
> >
> >
> > --
> > Tommy Chheng
> > Programmer and UC Irvine Graduate Student
> > Twitter @tommychheng
> > http://tommy.chheng.com
> >
> >
>
> --
> -
> Noble Paul | Systems Architect| AOL | http://aol.com
>


Concurrent DB updates and delta import misses a few records

2010-09-22 Thread Shashikant Kore
Hi,

I'm using DIH to index records from a database. After every update on the
(MySQL) DB, Solr DIH is invoked for a delta import. In my tests, I have
observed that if DB updates and a DIH import happen concurrently, the import
misses a few records.

Here is how it happens.

The table has a column 'lastUpdated' which has a default value of the current
timestamp. Many records are added to the database in a single transaction that
takes several seconds. For example, if 10,000 rows are being inserted, the
rows may get timestamp values from '2010-09-20 18:21:20' to '2010-09-20
18:21:26'. These rows become visible only after the transaction is committed.
That happens at, say, '2010-09-20 18:21:30'.

If a Solr import gets triggered at '18:21:29', it will use the timestamp of
the last import for the delta query. This import will not see the records
added in the aforementioned transaction, as the transaction was not committed
at that instant. After this import, dataimport.properties will have a last
index time of '18:21:29'. The next import will not be able to get all the
rows of the previously referred transaction, as some of the rows have
timestamps earlier than '18:21:29'.

While I am testing extreme conditions, there is a possibility of missing out
on some data.

I could not find any solution in the Solr framework to handle this. The table
has an auto-increment key, and all updates are deletes followed by inserts.
So, having a last_indexed_id would have helped, where last_indexed_id is the
max value of id fetched in that import. The delta query would then become
"SELECT id FROM <table> WHERE id > last_indexed_id". I suppose Solr does not
have any provision like this.
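
Conceptually, what I want is an entity where the watermark is passed in from
outside. A sketch of what I mean, using DIH's ${dataimporter.request.*}
placeholders for request parameters (the table, column, and parameter names
here are made up):

  <entity name="doc" pk="id"
          query="SELECT id, content FROM docs"
          deltaQuery="SELECT id FROM docs
                      WHERE id &gt; '${dataimporter.request.lastIndexedId}'"
          deltaImportQuery="SELECT id, content FROM docs
                            WHERE id = '${dataimporter.delta.id}'"/>

The caller would pass the watermark on the URL, e.g.
/dataimport?command=delta-import&lastIndexedId=12345, and would be
responsible for persisting it between runs.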

Two options I could think of are:
(a) Ensure at the application level that DB updates and DIH import requests
never run concurrently.
(b) Use exclusive locking during DB updates.

What is the best way to address this problem?

Thank you,

--shashi


Re: Concurrent DB updates and delta import misses a few records

2010-09-22 Thread Shashikant Kore
Thanks for the pointer, Shawn. It is definitely useful.

I am wondering if you could retrieve minDid from Solr rather than storing it
externally. The max id from the Solr index and the max id from the DB should
define the lower and upper bounds, respectively, of the delta range. Am I
missing something?
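
For example, a query along these lines (assuming the id field is indexed and
named did, as in your setup) should return the current max from the index:

  http://localhost:8983/solr/select?q=*:*&rows=1&fl=did&sort=did+desc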

--shashi

On Wed, Sep 22, 2010 at 6:47 PM, Shawn Heisey  wrote:

>  On 9/22/2010 1:39 AM, Shashikant Kore wrote:
>
>> Hi,
>>
>> I'm using DIH to index records from a database. After every update on the
>> (MySQL) DB, Solr DIH is invoked for a delta import. In my tests, I have
>> observed that if DB updates and a DIH import happen concurrently, the
>> import misses a few records.
>>
>> Here is how it happens.
>>
>> The table has a column 'lastUpdated' which has a default value of the
>> current timestamp. Many records are added to the database in a single
>> transaction that takes several seconds. For example, if 10,000 rows are
>> being inserted, the rows may get timestamp values from '2010-09-20
>> 18:21:20' to '2010-09-20 18:21:26'. These rows become visible only after
>> the transaction is committed. That happens at, say, '2010-09-20 18:21:30'.
>>
>> If a Solr import gets triggered at '18:21:29', it will use the timestamp
>> of the last import for the delta query. This import will not see the
>> records added in the aforementioned transaction, as the transaction was
>> not committed at that instant. After this import, dataimport.properties
>> will have a last index time of '18:21:29'. The next import will not be
>> able to get all the rows of the previously referred transaction, as some
>> of the rows have timestamps earlier than '18:21:29'.
>>
>> While I am testing extreme conditions, there is a possibility of missing
>> out on some data.
>>
>> I could not find any solution in the Solr framework to handle this. The
>> table has an auto-increment key, and all updates are deletes followed by
>> inserts. So, having a last_indexed_id would have helped, where
>> last_indexed_id is the max value of id fetched in that import. The delta
>> query would then become "SELECT id FROM <table> WHERE id >
>> last_indexed_id". I suppose Solr does not have any provision like this.
>>
>> Two options I could think of are:
>> (a) Ensure at the application level that DB updates and DIH import
>> requests never run concurrently.
>> (b) Use exclusive locking during DB updates.
>>
>> What is the best way to address this problem?
>>
>
> Shashi,
>
> I was not solving the same problem, but perhaps you can adapt my solution
> to yours.  My main problem was that I don't have a modified date in my
> database, and due to the size of the table, it is impractical to add one.
>  Instead, I chose to track the database primary key (a simple autoincrement)
> outside of Solr and pass min/max values into DIH for it to use in the SELECT
> statement.  You can see a simplified version of my entity here, with a URL
> showing how to send the parameters in via the dataimport GET:
>
> http://www.mail-archive.com/solr-user@lucene.apache.org/msg40466.html
>
> The update script that runs every two minutes gets MAX(did) from the
> database, retrieves the minDid from a file on an NFS share, and runs a
> delta-import with those two values.  When the import is reported successful,
> it writes the maxDid value to the minDid file on the network share for the
> next run.  If the import fails, it sends an alarm and doesn't update the
> minDid.
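>
> A stripped-down sketch of the entity (the real one is in the linked
> message; the table and column names here are placeholders):
>
>   <entity name="doc" pk="did"
>           deltaQuery="SELECT MAX(did) AS did FROM docs"
>           deltaImportQuery="SELECT did, content FROM docs
>                             WHERE did &gt; ${dataimporter.request.minDid}
>                             AND did &lt;= ${dataimporter.request.maxDid}"/>
>
> The deltaQuery only needs to return one row so that deltaImportQuery runs
> once; the actual range comes from the minDid/maxDid parameters passed on
> the dataimport URL.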
>
> Shawn
>
>


Retrieving a field from all result documents & a couple more queries

2009-09-15 Thread Shashikant Kore
Hi,

I am familiar with Lucene and trying out Solr.

I have an index which was created outside Solr. The index is fairly simple,
with two fields - document_id & content. The query result needs to return all
the document IDs. The result need not be ordered by score. For this, in
Lucene, I use a custom hit collector with search to get results quickly. The
index has a few million documents, and queries returning hundreds of
thousands of documents are not uncommon. So, speed is crucial here.

Since retrieving the stored document_id for each document is slow, I am using
FieldCache to hold the values of document_id. For all the results collected
(in a bitset) with the hit collector, the document_id field is retrieved from
the FieldCache.

1. How can I effectively disable scoring? I have read that
ConstantScoreQuery is quite fast, but from the code, I see that it is
used only for wildcard queries. How can I use ConstantScoreQuery for
all the queries (boolean, term, phrase, ..)?  Also, is
ConstantScoreQuery as fast as a custom hit collector?

2. How can Solr take advantage of the FieldCache while returning the
field document_id? The documentation says the FieldCache can be
explicitly autowarmed in Solr. If the FieldCache is available and
initialized at the beginning, will Solr look into the cache to
retrieve the fields to be returned?

3. If there is an additional field for stemmed_content on which search
needs to use a different analyzer, I suppose that could be specified by
the fieldType attribute in the schema.
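
Regarding question 1, in Lucene terms what I have in mind is roughly this
(Lucene 2.4-era API, using QueryWrapperFilter; 'parser' is assumed to be a
QueryParser):

  // Wrap an arbitrary query so that every hit gets a constant score,
  // skipping the per-document score computation.
  Query base = parser.parse("some query");
  Query constantScored = new ConstantScoreQuery(new QueryWrapperFilter(base));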

Thank you,

--shashi


Re: Retrieving a field from all result documents & a couple more queries

2009-09-16 Thread Shashikant Kore
Thanks, Abhay.

Can someone please throw light on how to disable scoring?

--shashi

On Wed, Sep 16, 2009 at 11:55 AM, abhay kumar  wrote:
> Hi,
>
> 1) Solr has various types of caches. We can specify how many documents a
> cache can hold at a time.
>       e.g. if windowSize=50, 50 results will be cached in the queryResult
>       cache. If the user requests results beyond the first 50, a new
>       request will be sent to the server, and the server will retrieve the
>       next 50 results into the cache.
>       http://wiki.apache.org/solr/SolrCaching
>       Yes, Solr looks into the cache to retrieve the fields to be returned.
>
> 2) Yes, we can have different tokenizers or filters for index & search. We
> need not create a different fieldtype. We need to configure the index &
> search analyzer sections of the same fieldtype (datatype) differently.
>
>   e.g.
>
>     <fieldType name="text" class="solr.TextField"
>                positionIncrementGap="100" stored="false" multiValued="true">
>       <analyzer type="index">
>         <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>         <filter class="solr.StopFilterFactory" ignoreCase="true"
>                 words="stopwords.txt"/>
>         <filter class="solr.LowerCaseFilterFactory"/>
>       </analyzer>
>       <analyzer type="query">
>         <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>         <filter class="solr.LowerCaseFilterFactory"/>
>       </analyzer>
>     </fieldType>
>
>
>
> Regards,
> Abhay
>
> On Tue, Sep 15, 2009 at 6:41 PM, Shashikant Kore wrote:
>
>> Hi,
>>
>> I am familiar with Lucene and trying out Solr.
>>
>> I have an index which was created outside Solr. The index is fairly
>> simple, with two fields - document_id & content. The query result needs
>> to return all the document IDs. The result need not be ordered by score.
>> For this, in Lucene, I use a custom hit collector with search to get
>> results quickly. The index has a few million documents, and queries
>> returning hundreds of thousands of documents are not uncommon. So, speed
>> is crucial here.
>>
>> Since retrieving the stored document_id for each document is slow, I am
>> using FieldCache to hold the values of document_id. For all the results
>> collected (in a bitset) with the hit collector, the document_id field is
>> retrieved from the FieldCache.
>>
>> 1. How can I effectively disable scoring? I have read that
>> ConstantScoreQuery is quite fast, but from the code, I see that it is
>> used only for wildcard queries. How can I use ConstantScoreQuery for
>> all the queries (boolean, term, phrase, ..)?  Also, is
>> ConstantScoreQuery as fast as a custom hit collector?
>>
>> 2. How can Solr take advantage of the FieldCache while returning the
>> field document_id? The documentation says the FieldCache can be
>> explicitly autowarmed in Solr. If the FieldCache is available and
>> initialized at the beginning, will Solr look into the cache to
>> retrieve the fields to be returned?
>>
>> 3. If there is an additional field for stemmed_content on which search
>> needs to use a different analyzer, I suppose that could be specified by
>> the fieldType attribute in the schema.
>>
>> Thank you,
>>
>> --shashi
>>
>


Re: Retrieving a field from all result documents & a couple more queries

2009-09-16 Thread Shashikant Kore
No, I don't wish to put a custom Similarity.  Rather, I want an
equivalent of HitCollector where I can bypass the scoring altogether.
And I prefer to do it by changing the configuration.

--shashi

On Wed, Sep 16, 2009 at 6:36 PM, rajan chandi  wrote:
> You might be talking about modifying the Similarity object to modify the
> scoring formula in Lucene!
>
>   searcher.setSimilarity(similarity);
>   writer.setSimilarity(similarity);
>
>
> This can very well be done in Solr, as SolrIndexWriter inherits from
> Lucene's IndexWriter class.
> You might want to download the Solr source code and take a look at
> SolrIndexWriter to begin with!
>
> It's in the package - org.apache.solr.update
>
> Thanks
> Rajan
>
> On Wed, Sep 16, 2009 at 5:42 PM, Shashikant Kore wrote:
>
>> Thanks, Abhay.
>>
>> Can someone please throw light on how to disable scoring?
>>
>> --shashi
>>
>> On Wed, Sep 16, 2009 at 11:55 AM, abhay kumar  wrote:
>> > Hi,
>> >
>> > 1) Solr has various types of caches. We can specify how many documents a
>> > cache can hold at a time.
>> >       e.g. if windowSize=50, 50 results will be cached in the queryResult
>> >       cache. If the user requests results beyond the first 50, a new
>> >       request will be sent to the server, and the server will retrieve
>> >       the next 50 results into the cache.
>> >       http://wiki.apache.org/solr/SolrCaching
>> >       Yes, Solr looks into the cache to retrieve the fields to be
>> >       returned.
>> >
>> > 2) Yes, we can have different tokenizers or filters for index & search.
>> > We need not create a different fieldtype. We need to configure the
>> > index & search analyzer sections of the same fieldtype (datatype)
>> > differently.
>> >
>> >   e.g.
>> >
>> >     <fieldType name="text" class="solr.TextField"
>> >                positionIncrementGap="100" stored="false"
>> >                multiValued="true">
>> >       <analyzer type="index">
>> >         <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>> >         <filter class="solr.StopFilterFactory" ignoreCase="true"
>> >                 words="stopwords.txt"/>
>> >         <filter class="solr.LowerCaseFilterFactory"/>
>> >       </analyzer>
>> >       <analyzer type="query">
>> >         <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>> >         <filter class="solr.LowerCaseFilterFactory"/>
>> >       </analyzer>
>> >     </fieldType>
>> >
>> >
>> >
>> > Regards,
>> > Abhay
>> >
>> > On Tue, Sep 15, 2009 at 6:41 PM, Shashikant Kore wrote:
>> >
>> >> Hi,
>> >>
>> >> I am familiar with Lucene and trying out Solr.
>> >>
>> >> I have an index which was created outside Solr. The index is fairly
>> >> simple, with two fields - document_id & content. The query result
>> >> needs to return all the document IDs. The result need not be ordered
>> >> by score. For this, in Lucene, I use a custom hit collector with
>> >> search to get results quickly. The index has a few million documents,
>> >> and queries returning hundreds of thousands of documents are not
>> >> uncommon. So, speed is crucial here.
>> >>
>> >> Since retrieving the stored document_id for each document is slow, I
>> >> am using FieldCache to hold the values of document_id. For all the
>> >> results collected (in a bitset) with the hit collector, the
>> >> document_id field is retrieved from the FieldCache.
>> >>
>> >> 1. How can I effectively disable scoring? I have read that
>> >> ConstantScoreQuery is quite fast, but from the code, I see that it is
>> >> used only for wildcard queries. How can I use ConstantScoreQuery for
>> >> all the queries (boolean, term, phrase, ..)?  Also, is
>> >> ConstantScoreQuery as fast as a custom hit collector?
>> >>
>> >> 2. How can Solr take advantage of the FieldCache while returning the
>> >> field document_id? The documentation says the FieldCache can be
>> >> explicitly autowarmed in Solr. If the FieldCache is available and
>> >> initialized at the beginning, will Solr look into the cache to
>> >> retrieve the fields to be returned?
>> >>
>> >> 3. If there is an additional field for stemmed_content on which
>> >> search needs to use a different analyzer, I suppose that could be
>> >> specified by the fieldType attribute in the schema.
>> >>
>> >> Thank you,
>> >>
>> >> --shashi
>> >>
>> >
>>
>


Re: Retrieving a field from all result documents & a couple more queries

2009-09-17 Thread Shashikant Kore
Hoss,

As I mentioned previously, I prefer to do this with as little Java code as
possible. That's the motivation for me to take a look at Solr.

Here is the code snippet.

  // resultBitset must be final to be visible inside the anonymous collector.
  final OpenBitSet resultBitset = new OpenBitSet(this.searcher.maxDoc());

  this.searcher.search(query, new HitCollector() {
      @Override
      public void collect(int docID, float score) {
          // Record the match; the score is ignored entirely.
          resultBitset.set(docID);
      }
  });

Then I retrieve the stored field and look up the results present in
the resultBitset.

  int[] docIDs = FieldCache.DEFAULT.getInts(this.luceneIndex.reader,
                                            FIELD_DOCUMENT_ID);

I need to do this because I need all the matching results, but order is not
important (for this search). In the index, the content field has a term
vector with it, which I can't drop. There are other types of searches where
relevance ranking is required.

Can I achieve the same with Solr?

Thanks,

--shashi

On Fri, Sep 18, 2009 at 3:21 AM, Chris Hostetter wrote:
>
> : You will need to get SolrIndexSearcher.java and modify following:-
> :
> : public static final int GET_SCORES             =       0x01;
>
> No.  Do not do that.  There is no reason for anyone to EVER modify that
> line of code. Absolutely NONE.
>
> If you've made that change to your version of Solr, please start a new
> thread on solr-user explaining your goal, and what things you tried before
> ultimately making that change, because I guarantee you that if you are
> willing to modify java files to change that line, there will be a more
> general-purpose, reusable way to solve your goal besides that (which won't
> silently break a lot of other functionality)
>
> : > No, I don't wish to put a custom Similarity.  Rather, I want an
> : > equivalent of HitCollector where I can bypass the scoring altogether.
> : > And I prefer to do it by changing the configuration.
>
> ...there is no pure-configuration way to obtain the same logic you could
> get from a custom HitCollector.  You haven't elaborated on what exactly
> your HitCollector looked like, but so far you've mentioned that it
> ignored the scores, and used the FieldCache to get a field value w/o
> dealing with stored fields -- you can achieve something roughly
> functionally similar by writing a custom RequestHandler that uses
> SolrIndexSearcher.getDocSet (which skips scoring and sorting) and then
> iterating over that DocSet and fetching the values you want from the
> FieldCache.
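>
> A rough sketch of that kind of handler (Solr 1.4-era APIs; the handler
> class, field name, and response key are only examples, not tested code):
>
>   import java.util.ArrayList;
>   import java.util.List;
>
>   import org.apache.lucene.search.FieldCache;
>   import org.apache.lucene.search.Query;
>   import org.apache.solr.handler.RequestHandlerBase;
>   import org.apache.solr.request.SolrQueryRequest;
>   import org.apache.solr.request.SolrQueryResponse;
>   import org.apache.solr.search.DocIterator;
>   import org.apache.solr.search.DocSet;
>   import org.apache.solr.search.QueryParsing;
>   import org.apache.solr.search.SolrIndexSearcher;
>
>   public class DocIdSetHandler extends RequestHandlerBase {
>
>       @Override
>       public void handleRequestBody(SolrQueryRequest req,
>                                     SolrQueryResponse rsp) throws Exception {
>           SolrIndexSearcher searcher = req.getSearcher();
>           Query q = QueryParsing.parseQuery(req.getParams().get("q"),
>                                             req.getSchema());
>           // getDocSet skips scoring and sorting, and goes through the
>           // filterCache for repeated queries.
>           DocSet matches = searcher.getDocSet(q);
>           // Pull the external ids straight from the FieldCache instead
>           // of loading stored fields document by document.
>           int[] ids = FieldCache.DEFAULT.getInts(searcher.getReader(),
>                                                  "document_id");
>           List<Integer> out = new ArrayList<Integer>(matches.size());
>           for (DocIterator it = matches.iterator(); it.hasNext(); ) {
>               out.add(ids[it.nextDoc()]);
>           }
>           rsp.add("document_ids", out);
>       }
>
>       @Override
>       public String getDescription() { return "unscored id collector"; }
>       @Override
>       public String getSourceId() { return ""; }
>       @Override
>       public String getSource() { return ""; }
>       @Override
>       public String getVersion() { return ""; }
>   }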
>
> or you could write a RequestHandler that uses your HitCollector as-is --
> but then you aren't really leveraging any value from Solr at all; the
> previous suggestion has the value-add of utilizing Solr's filterCache for
> frequent queries (which can be really handy if your queries can be
> easily broken apart into pieces and dealt with using DocSet
> union/intersection operations -- like q/fq are dealt with in
> SearchHandler)
>
>
> -Hoss
>
>