Relevancy and random sorting
Hello all, Recently I've been trying to tweak some aspects of relevancy in a listing project. I need to give a higher score to newer documents and also boost documents based on a boolean field that indicates the listing has pictures. On top of that, in some situations we need random sorting for the records while still preserving the ranking. I tried to combine some techniques described in the Solr Relevancy FAQ wiki, but when I add the random sorting, the ranking gets messy (as expected). This works well: http://localhost:18979/solr/select/?start=0&rows=15&q={!boost%20b=recip(ms(NOW/HOUR,date_updated),3.16e-11,1,1)}active%3a%22true%22+AND+featured%3a%22false%22+_val_:%22haspicture%22&fl=*,score This does not work; it gives a random order on top of what is already ranked: http://localhost:18979/solr/select/?start=0&rows=15&q={!boost%20b=recip(ms(NOW/HOUR,date_updated),3.16e-11,1,1)}active%3a%22true%22+AND+featured%3a%22false%22+_val_:%22haspicture%22&fl=*,score&sort=random_1+desc The only way I see is to create another field in the schema containing a random value and use it to boost the document the same way that was done with the boolean field. Has anyone tried something like this before and knows a way to get it working? Thanks, Alexandre
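For reference, a minimal sketch of the boost-instead-of-sort idea described above, assuming the usual random_* dynamic field from the example schema is available and that RandomSortField can be referenced as a function source in the Solr version at hand (the 0.1 weight is purely illustrative, and URL-encoding is omitted for readability). The random value is folded into the {!boost} function so it perturbs the score rather than replacing the ranking:

http://localhost:18979/solr/select/?start=0&rows=15&q={!boost b=sum(recip(ms(NOW/HOUR,date_updated),3.16e-11,1,1),product(0.1,random_1))}active:"true" AND featured:"false" _val_:"haspicture"&fl=*,score

If the random field cannot be used as a function in that version, a custom ValueSourceParser (as suggested in the reply below) is the alternative.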
Re: Relevancy and random sorting
Erick, Probably I really written something silly. You are right on either sorting by field or ranking. I just need to change the ranking to shift things around as you said. To clarify the use case: We have a listing aggregator that gets product listings from a lot of different sites and since they are added in batches, sometimes you see a lot of pages from the same source (site). We are working on some changes to shift things around and reduce this "blocking" effect, so we can present mixed sources on the result pages. I guess I will start with the document random field and later try to develop a custom plugin to make things better. Thanks for the pointers. Regards, Alexandre On Wed, Jan 11, 2012 at 1:58 PM, Erick Erickson wrote: > I really don't understand what this means: > "random sorting for the records but also preserving the ranking" > > Either you're sorting on rank or you're not. If you mean you're > trying to shift things around just a little bit, *mostly* respecting > relevance then I guess you can do what you're thinking. > > You could create your own function query to do the boosting, see: > http://wiki.apache.org/solr/SolrPlugins#ValueSourceParser > > which would keep you from having to re-index your data to get > a different "randomness". > > You could also consider external file fields, but I think your > own function query would be cleaner. I don't think math.random > is a supported function OOB > > Best > Erick > > > On Wed, Jan 11, 2012 at 8:29 AM, Alexandre Rocco > wrote: > > Hello all, > > > > Recently i've been trying to tweak some aspects of relevancy in one > listing > > project. > > I need to give a higher score to newer documents and also boost the > > document based on a boolean field that indicates the listing has > pictures. > > On top of that, in some situations we need a random sorting for the > records > > but also preserving the ranking. > > > > I tried to combine some techniques described in the Solr Relevancy FAQ > > wiki, but when I add the random sorting, the ranking gets messy (as > > expected). > > > > This works well: > > > http://localhost:18979/solr/select/?start=0&rows=15&q={!boost%20b=recip(ms(NOW/HOUR,date_updated),3.16e-11,1,1)}active%3a%22true%22+AND+featured%3a%22false%22+_val_:%haspicture%22&fl=*,score > > > > This does not work, gives a random order on what is already ranked > > > http://localhost:18979/solr/select/?start=0&rows=15&q={!boost%20b=recip(ms(NOW/HOUR,date_updated),3.16e-11,1,1)}active%3a%22true%22+AND+featured%3a%22false%22+_val_:%haspicture%22&fl=*,score&sort=random_1+desc > > > > The only way I see is to create another field on the schema containing a > > random value and use it to boost the document the same way that was tone > on > > the boolean field. > > Anyone tried something like this before and knows some way to get it > > working? > > > > Thanks, > > Alexandre >
Re: Relevancy and random sorting
Erick, This document already has a field that indicates the source (site). The issue we are trying to solve is when we list all documents without any specific criteria. Since we bring the most recent ones and the ones that contains images, we end up having a lot of listings from a single site, since the documents are indexed in batches from the same site. At some point we have several documents from the same site in the same date/time and having images. I'm trying to give some random aspect to this search so other documents can also appear in between that big dataset from the same source. Does the grouping help to achieve this? Alexandre On Thu, Jan 12, 2012 at 12:31 AM, Erick Erickson wrote: > Alexandre: > > Have you thought about grouping? If you can analyze the incoming > documents and include a field such that "similar" documents map > to the same value, than group on that value you'll get output that > isn't dominated by repeated copies of the "similar" documents. It > depends, though, on being able to do a suitable mapping. > > In your case, could the mapping just be the site from which you > got the data? > > Best > Erick > > On Wed, Jan 11, 2012 at 1:58 PM, Alexandre Rocco > wrote: > > Erick, > > > > Probably I really written something silly. You are right on either > sorting > > by field or ranking. > > I just need to change the ranking to shift things around as you said. > > > > To clarify the use case: > > We have a listing aggregator that gets product listings from a lot of > > different sites and since they are added in batches, sometimes you see a > > lot of pages from the same source (site). We are working on some changes > to > > shift things around and reduce this "blocking" effect, so we can present > > mixed sources on the result pages. > > > > I guess I will start with the document random field and later try to > > develop a custom plugin to make things better. > > > > Thanks for the pointers. > > > > Regards, > > Alexandre > > > > On Wed, Jan 11, 2012 at 1:58 PM, Erick Erickson >wrote: > > > >> I really don't understand what this means: > >> "random sorting for the records but also preserving the ranking" > >> > >> Either you're sorting on rank or you're not. If you mean you're > >> trying to shift things around just a little bit, *mostly* respecting > >> relevance then I guess you can do what you're thinking. > >> > >> You could create your own function query to do the boosting, see: > >> http://wiki.apache.org/solr/SolrPlugins#ValueSourceParser > >> > >> which would keep you from having to re-index your data to get > >> a different "randomness". > >> > >> You could also consider external file fields, but I think your > >> own function query would be cleaner. I don't think math.random > >> is a supported function OOB > >> > >> Best > >> Erick > >> > >> > >> On Wed, Jan 11, 2012 at 8:29 AM, Alexandre Rocco > >> wrote: > >> > Hello all, > >> > > >> > Recently i've been trying to tweak some aspects of relevancy in one > >> listing > >> > project. > >> > I need to give a higher score to newer documents and also boost the > >> > document based on a boolean field that indicates the listing has > >> pictures. > >> > On top of that, in some situations we need a random sorting for the > >> records > >> > but also preserving the ranking. > >> > > >> > I tried to combine some techniques described in the Solr Relevancy FAQ > >> > wiki, but when I add the random sorting, the ranking gets messy (as > >> > expected). 
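A rough illustration of the grouping suggestion, assuming the source site is kept in a field named "site" (the field name, limit, and URL are assumptions, not taken from the original messages). Result grouping caps how many documents per site come back, and, where the version supports it, group.main=true (or group.format=simple) flattens the groups into an ordinary result list:

http://localhost:18979/solr/select?q=*:*&group=true&group.field=site&group.limit=3&group.main=true&fl=*,score

Documents still come back in relevance order, but no single site can contribute more than group.limit documents to the page, which directly addresses the "blocking" effect of batched indexing.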
> >> > > >> > This works well: > >> > > >> > http://localhost:18979/solr/select/?start=0&rows=15&q={!boost%20b=recip(ms(NOW/HOUR,date_updated),3.16e-11,1,1)}active%3a%22true%22+AND+featured%3a%22false%22+_val_:%haspicture%22&fl=*,score > >> > > >> > This does not work, gives a random order on what is already ranked > >> > > >> > http://localhost:18979/solr/select/?start=0&rows=15&q={!boost%20b=recip(ms(NOW/HOUR,date_updated),3.16e-11,1,1)}active%3a%22true%22+AND+featured%3a%22false%22+_val_:%haspicture%22&fl=*,score&sort=random_1+desc > >> > > >> > The only way I see is to create another field on the schema > containing a > >> > random value and use it to boost the document the same way that was > tone > >> on > >> > the boolean field. > >> > Anyone tried something like this before and knows some way to get it > >> > working? > >> > > >> > Thanks, > >> > Alexandre > >> >
Re: Relevancy and random sorting
Michael, We are using the random sorting in combination with date and other fields but I am trying to change this to affect the ranking instead of sorting directly. That way we can also use other useful tweaks on the rank itself. Alexandre On Thu, Jan 12, 2012 at 11:46 AM, Michael Kuhlmann wrote: > Does the random sort function help you here? > > http://lucene.apache.org/solr/**api/org/apache/solr/schema/** > RandomSortField.html<http://lucene.apache.org/solr/api/org/apache/solr/schema/RandomSortField.html> > > However, you will get some very old listings then, if it's okay for you. > > -Kuli > > Am 12.01.2012 14:38, schrieb Alexandre Rocco: > > Erick, >> >> This document already has a field that indicates the source (site). >> The issue we are trying to solve is when we list all documents without any >> specific criteria. Since we bring the most recent ones and the ones that >> contains images, we end up having a lot of listings from a single site, >> since the documents are indexed in batches from the same site. At some >> point we have several documents from the same site in the same date/time >> and having images. I'm trying to give some random aspect to this search so >> other documents can also appear in between that big dataset from the same >> source. >> Does the grouping help to achieve this? >> >> Alexandre >> >> On Thu, Jan 12, 2012 at 12:31 AM, Erick Erickson> com >wrote: >> >> Alexandre: >>> >>> Have you thought about grouping? If you can analyze the incoming >>> documents and include a field such that "similar" documents map >>> to the same value, than group on that value you'll get output that >>> isn't dominated by repeated copies of the "similar" documents. It >>> depends, though, on being able to do a suitable mapping. >>> >>> In your case, could the mapping just be the site from which you >>> got the data? >>> >>> Best >>> Erick >>> >>> On Wed, Jan 11, 2012 at 1:58 PM, Alexandre Rocco >>> wrote: >>> >>>> Erick, >>>> >>>> Probably I really written something silly. You are right on either >>>> >>> sorting >>> >>>> by field or ranking. >>>> I just need to change the ranking to shift things around as you said. >>>> >>>> To clarify the use case: >>>> We have a listing aggregator that gets product listings from a lot of >>>> different sites and since they are added in batches, sometimes you see a >>>> lot of pages from the same source (site). We are working on some changes >>>> >>> to >>> >>>> shift things around and reduce this "blocking" effect, so we can present >>>> mixed sources on the result pages. >>>> >>>> I guess I will start with the document random field and later try to >>>> develop a custom plugin to make things better. >>>> >>>> Thanks for the pointers. >>>> >>>> Regards, >>>> Alexandre >>>> >>>> On Wed, Jan 11, 2012 at 1:58 PM, Erick Erickson>>> com >>>> wrote: >>>> >>>> I really don't understand what this means: >>>>> "random sorting for the records but also preserving the ranking" >>>>> >>>>> Either you're sorting on rank or you're not. If you mean you're >>>>> trying to shift things around just a little bit, *mostly* respecting >>>>> relevance then I guess you can do what you're thinking. >>>>> >>>>> You could create your own function query to do the boosting, see: >>>>> http://wiki.apache.org/solr/**SolrPlugins#ValueSourceParser<http://wiki.apache.org/solr/SolrPlugins#ValueSourceParser> >>>>> >>>>> which would keep you from having to re-index your data to get >>>>> a different "randomness". 
>>>>> >>>>> You could also consider external file fields, but I think your >>>>> own function query would be cleaner. I don't think math.random >>>>> is a supported function OOB >>>>> >>>>> Best >>>>> Erick >>>>> >>>>> >>>>> On Wed, Jan 11, 2012 at 8:29 AM, Alexandre Rocco >>>>> wrote: >>>>> >>>>>> Hello all, ...
Jetty returning HTTP error code 413
Guys, We are facing an issue executing a very large query (~4000 bytes in the URL) in Solr. When we execute the query, Solr (probably Jetty) returns an HTTP 413 error (FULL HEAD). I guess this is related to the very big query being executed, and currently we can't make it shorter. Is there any configuration that needs to be tweaked on Jetty or another component to make this query work? Any advice is really appreciated. Thanks! Alexandre Rocco
Re: Jetty returning HTTP error code 413
Hi didier, I have updated my etc/jetty.xml, doubling the headerBufferSize to: 16384 But the error persists. Do you know if there is any other config that should be updated so this setting works? Also, is there any way to check whether Jetty is using this config from inside the Solr admin pages? I know that we can check the Java properties, but I haven't found any way to locate the Jetty config there. Thanks! Alexandre On Wed, Aug 18, 2010 at 4:58 PM, didier deshommes wrote: > Hi Alexandre, > Have you tried setting a higher headerBufferSize? Look in > etc/jetty.xml and search for 'headerBufferSize'; I think it controls > the size of the url. By default it is 8192. > > didier > > On Wed, Aug 18, 2010 at 2:43 PM, Alexandre Rocco > wrote: > > Guys, > > > > We are facing an issue executing very large query (~4000 bytes in the > URL) > > in Solr. > > When we execute the query, Solr (probably Jetty) returns a HTTP 413 error > > (FULL HEAD). > > > > I guess that this is related to the very big query being executed, and > > currently we can't make it short. > > Is there any configuration that need to be tweaked on Jetty or other > > component to make this query work? > > > > Any advice is really appreciated. > > > > Thanks! > > Alexandre Rocco > > >
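For context, a sketch of where headerBufferSize lives in the etc/jetty.xml shipped with the Solr example of that era (the connector class and port may differ per install, so treat this as illustrative rather than an exact copy of the file discussed above):

<Call name="addConnector">
  <Arg>
    <New class="org.mortbay.jetty.bio.SocketConnector">
      <Set name="port"><SystemProperty name="jetty.port" default="8983"/></Set>
      <Set name="headerBufferSize">16384</Set>
    </New>
  </Arg>
</Call>

The setting only takes effect on the connector that actually serves the requests, and Jetty has to be restarted for the change to be picked up.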
Re: Jetty returning HTTP error code 413
Hi diddier, Nevermind. I figured it out. There was some miscommunication between me and our IT guy. Thanks for helping. It's fixed now. Alexandre On Thu, Aug 19, 2010 at 9:59 AM, Alexandre Rocco wrote: > Hi diddier, > > I have updated my etc/jetty.xml and updated my headerBufferSize to 2x as: > 16384 > > But the error persists. Do you know if there is any other config that > should be updated so this setting works? > Also, is there any way to check if jetty is use this config inside Solr > admin pages? I know that we can check the Java properties but I haven't > found any way to locate the jetty config there. > > Thanks! > Alexandre > > On Wed, Aug 18, 2010 at 4:58 PM, didier deshommes wrote: > >> Hi Alexandre, >> Have you tried setting a higher headerBufferSize? Look in >> etc/jetty.xml and search for 'headerBufferSize'; I think it controls >> the size of the url. By default it is 8192. >> >> didier >> >> On Wed, Aug 18, 2010 at 2:43 PM, Alexandre Rocco >> wrote: >> > Guys, >> > >> > We are facing an issue executing very large query (~4000 bytes in the >> URL) >> > in Solr. >> > When we execute the query, Solr (probably Jetty) returns a HTTP 413 >> error >> > (FULL HEAD). >> > >> > I guess that this is related to the very big query being executed, and >> > currently we can't make it short. >> > Is there any configuration that need to be tweaked on Jetty or other >> > component to make this query work? >> > >> > Any advice is really appreciated. >> > >> > Thanks! >> > Alexandre Rocco >> > >> > >
Faceting and first letter of fields
Guys, We have a website running Solr to index books, and we use a facet to filter books by author. After some time, we noticed that this facet is very large and we need another feature to help users find the information. Our product team asked for a page that shows all authors by their initial letter, so we can break this query up more easily. Is it a feasible solution to create another field containing only the initial letter of the author? Using this approach we would be able to filter the authors on this newly created field. Do you think there will be any performance penalty in creating a couple of fields with the initial letter of these other fields (author, publisher)? I guess this approach is much easier than the other solutions we came up with. Am I missing any alternatives? Thanks, Alexandre
Re: Faceting and first letter of fields
Thank you for both responses. Another question I have is where this "first letter" processing is best done. I am considering updating my data import handler to execute a script that extracts the first letter from the author field. I saw another thread where someone mentioned using a field analyzer to extract the letter with a regex. Which one is the better option? Thanks! Alexandre On Thu, Oct 14, 2010 at 4:46 PM, Yonik Seeley wrote: > On Thu, Oct 14, 2010 at 3:42 PM, Jonathan Rochkind > wrote: > > I believe that should work fine in Solr 1.4.1. Creating a field with > just > > first letter of author is definitely the right (possibly only) way to > allow > > facetting on first letter of author's name. > > > > I have very voluminous facets (few facet values, many docs in each value) > > like that in my app too, works fine. > > > > I get confused over the different facetting methods available in 1.4.1, > and > > exactly when each is called for. If you see initial problems, you could > try > > switching the facet.method and see what happens. > > Right - for faceting on first letter, you should probably use > facet.method=enum > since there will only be 26 values (assuming english/western languages). > > In the future, I'm hoping we can come up with a smarter way to pick > the facet.method if it's not supplied. The new flex API in 4.0-dev > should help out here. > > -Yonik > http://www.lucidimagination.com >
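If the analyzer route is chosen, a minimal schema.xml sketch of a type that keeps only the first letter (field and type names are illustrative; a copyField feeds it from author so the source field stays untouched):

<fieldType name="first_letter" class="solr.TextField" sortMissingLast="true">
  <analyzer>
    <!-- keep the whole value as one token, lowercase it, then keep only the first character -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="^(.).*$" replacement="$1"/>
  </analyzer>
</fieldType>

<field name="author_first_letter" type="first_letter" indexed="true" stored="false"/>
<copyField source="author" dest="author_first_letter"/>

Faceting then becomes facet.field=author_first_letter, with facet.method=enum as Yonik suggests, and no extra scripting is needed in the data import handler.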
Slave index size growing fast
Hello, We have a Solr index that averages 1.19 GB in size. After configuring replication, the index size on the slave machine is growing exponentially. Currently we have a slave that is 323.44 GB in size. Is there anything that could cause this behavior? The current replication config is below. Master: commit startup startup elevate.xml,protwords.txt,schema.xml,spellings.txt,stopwords.txt,synonyms.txt Slave: http://master:8984/solr/Index/replication Any pointers will be useful. Thanks, Alexandre
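The configuration above lost its XML markup in the archive; for reference, a master/slave ReplicationHandler setup of this shape usually looks roughly like the following in solrconfig.xml (the poll interval and the exact set of replicateAfter entries are assumptions; the confFiles list and masterUrl are taken from the message):

<!-- master -->
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="master">
    <str name="replicateAfter">commit</str>
    <str name="replicateAfter">startup</str>
    <str name="confFiles">elevate.xml,protwords.txt,schema.xml,spellings.txt,stopwords.txt,synonyms.txt</str>
  </lst>
</requestHandler>

<!-- slave -->
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="slave">
    <str name="masterUrl">http://master:8984/solr/Index/replication</str>
    <str name="pollInterval">00:00:60</str>
  </lst>
</requestHandler>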
Re: Slave index size growing fast
Erick, We're using Solr 3.3 on Linux (CentOS 5.6). The /data dir on master is actually 1.2G. I haven't tried to recreate the index yet. Since it's a production environment, I guess that I can stop replication and indexing and then recreate the master index to see if it makes any difference. Also just noticed another thread here named "Simple Slave Replication Question" that tells that it could be a problem if I'm seeing an /data/index with an timestamp on the slave node. Is this info relevant to this issue? Thanks, Alexandre On Fri, Mar 23, 2012 at 11:48 AM, Erick Erickson wrote: > What version of Solr and what operating system? > > But regardless, this shouldn't be happening. Indexes can > temporarily double in size, but any extras should be > cleaned up relatively soon. > > On the master, what's the total size of the /data directory? > I'm a little suspicious of the on your master, but I > don't think that's the root of your problem > > Are you recreating the index on the master (by deleting the > index directory and starting over)? > > This is unusual, and I suspect it's something odd in your configuration, > but I confess I'm at a loss as to what. > > Best > Erick > > On Fri, Mar 23, 2012 at 10:28 AM, Alexandre Rocco > wrote: > > Hello, > > > > We have a Solr index that has an average of 1.19 GB in size. > > After configuring the replication, the slave machine is growing the index > > size expoentially. > > Currently we have an slave with 323.44 GB in size. > > Is there anything that could cause this behavior? > > The current replication config is below. > > > > Master: > > > > > > commit > > startup > > startup > > > > > elevate.xml,protwords.txt,schema.xml,spellings.txt,stopwords.txt,synonyms.txt > > > > > > > > > > Slave: > > > > > > http://master:8984/solr/Index/replication > > > > > > > > Any pointers will be useful. > > > > Thanks, > > Alexandre >
Re: Slave index size growing fast
Erick, The master /data dir contains only an index dir with a bunch of files. In the slave, the /data dir contains an index.20110926152410 dir with a lot more files than the master. That is quite strange for me. I guess that the config is right, since we have another slave that is running fine with the same config. The best bet would be clean up this messed slave and try to sync it again and see what happens. Thanks On Fri, Mar 23, 2012 at 12:25 PM, Erick Erickson wrote: > not really, unless perhaps you're issuing commits or optimizes > on the _slave_ (which you should NOT do). > > Replication happens based on the version of the index on the master. > True, it starts out as a timestamp, but then successive versions > just have that number incremented. The version number > in the index on the slave is compared against the one on the master, > but the actual time (on the slave or master) is irrelevant. This is > explicitly to avoid problems with time synching across > machines/timezones/whataver > > It would be instructive to look at the admin/info page to see what > the index version is on the master and slave. > > But, if you optimize or commit (I think) on the _slave_, you might > change the timestamp and mess things up (although I'm reaching > here, I don't know this for certain). > > What's the index look like on the slave as compared to the master? > Are there just a bunch of files on the slave? Or a bunch of directories? > > Instead of re-indexing on the master, you could try to bring down the > slave, blow away the entire index and start it back up. Since this is a > production system, I'd only try this if I had more than one slave. Although > you could bring up a new slave and attach it to the master and see > what happens there. You wouldn't affect production if you didn't point > incoming requests at it... > > Best > Erick > > On Fri, Mar 23, 2012 at 11:03 AM, Alexandre Rocco > wrote: > > Erick, > > > > We're using Solr 3.3 on Linux (CentOS 5.6). > > The /data dir on master is actually 1.2G. > > > > I haven't tried to recreate the index yet. Since it's a production > > environment, > > I guess that I can stop replication and indexing and then recreate the > > master index to see if it makes any difference. > > > > Also just noticed another thread here named "Simple Slave Replication > > Question" that tells that it could be a problem if I'm seeing an > > /data/index with an timestamp on the slave node. > > Is this info relevant to this issue? > > > > Thanks, > > Alexandre > > > > On Fri, Mar 23, 2012 at 11:48 AM, Erick Erickson < > erickerick...@gmail.com>wrote: > > > >> What version of Solr and what operating system? > >> > >> But regardless, this shouldn't be happening. Indexes can > >> temporarily double in size, but any extras should be > >> cleaned up relatively soon. > >> > >> On the master, what's the total size of the /data directory? > >> I'm a little suspicious of the on your master, but I > >> don't think that's the root of your problem > >> > >> Are you recreating the index on the master (by deleting the > >> index directory and starting over)? > >> > >> This is unusual, and I suspect it's something odd in your configuration, > >> but I confess I'm at a loss as to what. > >> > >> Best > >> Erick > >> > >> On Fri, Mar 23, 2012 at 10:28 AM, Alexandre Rocco > >> wrote: > >> > Hello, > >> > > >> > We have a Solr index that has an average of 1.19 GB in size. > >> > After configuring the replication, the slave machine is growing the > index > >> > size expoentially. 
> >> > Currently we have an slave with 323.44 GB in size. > >> > Is there anything that could cause this behavior? > >> > The current replication config is below. > >> > > >> > Master: > >> > > >> > > >> > commit > >> > startup > >> > startup > >> > > >> > > >> > elevate.xml,protwords.txt,schema.xml,spellings.txt,stopwords.txt,synonyms.txt > >> > > >> > > >> > > >> > > >> > Slave: > >> > > >> > > >> > http://master:8984/solr/Index/replication > >> > > >> > > >> > > >> > Any pointers will be useful. > >> > > >> > Thanks, > >> > Alexandre > >> >
Re: Slave index size growing fast
Tomás, The 300+GB size is only inside the index.20110926152410 dir. Inside there are a lot of files. I am almost conviced that something is messed up like someone commited on this slave machine. Thanks 2012/3/23 Tomás Fernández Löbbe > Alexandre, additionally to what Erick said, you may want to check in the > slave if what's 300+GB is the "data" directory or the "index." > directory. > > On Fri, Mar 23, 2012 at 12:25 PM, Erick Erickson >wrote: > > > not really, unless perhaps you're issuing commits or optimizes > > on the _slave_ (which you should NOT do). > > > > Replication happens based on the version of the index on the master. > > True, it starts out as a timestamp, but then successive versions > > just have that number incremented. The version number > > in the index on the slave is compared against the one on the master, > > but the actual time (on the slave or master) is irrelevant. This is > > explicitly to avoid problems with time synching across > > machines/timezones/whataver > > > > It would be instructive to look at the admin/info page to see what > > the index version is on the master and slave. > > > > But, if you optimize or commit (I think) on the _slave_, you might > > change the timestamp and mess things up (although I'm reaching > > here, I don't know this for certain). > > > > What's the index look like on the slave as compared to the master? > > Are there just a bunch of files on the slave? Or a bunch of directories? > > > > Instead of re-indexing on the master, you could try to bring down the > > slave, blow away the entire index and start it back up. Since this is a > > production system, I'd only try this if I had more than one slave. > Although > > you could bring up a new slave and attach it to the master and see > > what happens there. You wouldn't affect production if you didn't point > > incoming requests at it... > > > > Best > > Erick > > > > On Fri, Mar 23, 2012 at 11:03 AM, Alexandre Rocco > > wrote: > > > Erick, > > > > > > We're using Solr 3.3 on Linux (CentOS 5.6). > > > The /data dir on master is actually 1.2G. > > > > > > I haven't tried to recreate the index yet. Since it's a production > > > environment, > > > I guess that I can stop replication and indexing and then recreate the > > > master index to see if it makes any difference. > > > > > > Also just noticed another thread here named "Simple Slave Replication > > > Question" that tells that it could be a problem if I'm seeing an > > > /data/index with an timestamp on the slave node. > > > Is this info relevant to this issue? > > > > > > Thanks, > > > Alexandre > > > > > > On Fri, Mar 23, 2012 at 11:48 AM, Erick Erickson < > > erickerick...@gmail.com>wrote: > > > > > >> What version of Solr and what operating system? > > >> > > >> But regardless, this shouldn't be happening. Indexes can > > >> temporarily double in size, but any extras should be > > >> cleaned up relatively soon. > > >> > > >> On the master, what's the total size of the /data > directory? > > >> I'm a little suspicious of the on your master, but I > > >> don't think that's the root of your problem > > >> > > >> Are you recreating the index on the master (by deleting the > > >> index directory and starting over)? > > >> > > >> This is unusual, and I suspect it's something odd in your > configuration, > > >> but I confess I'm at a loss as to what. 
> > >> > > >> Best > > >> Erick > > >> > > >> On Fri, Mar 23, 2012 at 10:28 AM, Alexandre Rocco > > >> wrote: > > >> > Hello, > > >> > > > >> > We have a Solr index that has an average of 1.19 GB in size. > > >> > After configuring the replication, the slave machine is growing the > > index > > >> > size expoentially. > > >> > Currently we have an slave with 323.44 GB in size. > > >> > Is there anything that could cause this behavior? > > >> > The current replication config is below. > > >> > > > >> > Master: > > >> > > > >> > > > >> > commit > > >> > startup > > >> > startup > > >> > > > >> > > > >> > > > elevate.xml,protwords.txt,schema.xml,spellings.txt,stopwords.txt,synonyms.txt > > >> > > > >> > > > >> > > > >> > > > >> > Slave: > > >> > > > >> > > > >> > http://master:8984/solr/Index/replication > > > >> > > > >> > > > >> > > > >> > Any pointers will be useful. > > >> > > > >> > Thanks, > > >> > Alexandre > > >> > > >
Re: Slave index size growing fast
Erick, I haven't changed the maxCommitsToKeep yet. We stopped the slave that had issues and removed the data dir as you pointed and afer starting it, everything started working as normal. I guess that at some point someone commited on the slave or even copied the master files over and made this mess. Will check on the internal docs to prevent this from happening again. Thanks for explaining the whole concept, will be useful to understand the whole process. Best, Alexandre On Fri, Mar 23, 2012 at 4:05 PM, Erick Erickson wrote: > Alexandre: > > Have you changed anything like on your slave? > And do you have more than one slave? If you do, have you considered > just blowing away the entire .../data directory on the slave and letting > it re-start from scratch? I'd take the slave out of service for the > duration of this operation, or do it when you are OK with some number of > requests going to an empty index > > Because having an index. directory indicates that sometime > someone forced the slave to get out of sync, possibly as you say by > doing a commit. Or sending docs to it to be indexed or some such. Starting > the slave over should fix that if it's the root of your problem. > > Note a curious thing about the . When you start indexing, the > index version is a timestamp. However, from that point on when the index > changes, the version number is just incremented (not made the current > time). This is to avoid problems with masters and slaves having different > times. But a consequence of that is if your slave somehow gets an index > that's newer, the replication process does the best it can to not delete > indexes that are out of sync with the master and saves them away. This > might be what you're seeing. > > I'm grasping at straws a bit here, but this seems possible. > > Best > Erick > > On Fri, Mar 23, 2012 at 1:16 PM, Alexandre Rocco > wrote: > > Tomás, > > > > The 300+GB size is only inside the index.20110926152410 dir. Inside there > > are a lot of files. > > I am almost conviced that something is messed up like someone commited on > > this slave machine. > > > > Thanks > > > > 2012/3/23 Tomás Fernández Löbbe > > > >> Alexandre, additionally to what Erick said, you may want to check in the > >> slave if what's 300+GB is the "data" directory or the > "index." > >> directory. > >> > >> On Fri, Mar 23, 2012 at 12:25 PM, Erick Erickson < > erickerick...@gmail.com > >> >wrote: > >> > >> > not really, unless perhaps you're issuing commits or optimizes > >> > on the _slave_ (which you should NOT do). > >> > > >> > Replication happens based on the version of the index on the master. > >> > True, it starts out as a timestamp, but then successive versions > >> > just have that number incremented. The version number > >> > in the index on the slave is compared against the one on the master, > >> > but the actual time (on the slave or master) is irrelevant. This is > >> > explicitly to avoid problems with time synching across > >> > machines/timezones/whataver > >> > > >> > It would be instructive to look at the admin/info page to see what > >> > the index version is on the master and slave. > >> > > >> > But, if you optimize or commit (I think) on the _slave_, you might > >> > change the timestamp and mess things up (although I'm reaching > >> > here, I don't know this for certain). > >> > > >> > What's the index look like on the slave as compared to the master? > >> > Are there just a bunch of files on the slave? Or a bunch of > directories? 
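For reference on the setting mentioned above (the tag name was stripped by the archive), maxCommitsToKeep is configured in the <deletionPolicy> section of solrconfig.xml; a sketch with the values commonly shipped in the example config (illustrative, not copied from this installation):

<deletionPolicy class="solr.SolrDeletionPolicy">
  <!-- keep only the most recent commit point; old ones (and their files) are cleaned up -->
  <str name="maxCommitsToKeep">1</str>
  <str name="maxOptimizedCommitsToKeep">0</str>
</deletionPolicy>

Keeping these at their defaults means old commit points, and the index files they reference, are removed as new commits land, which is why a steadily growing slave usually points at something else (such as commits issued on the slave).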
> >> > > >> > Instead of re-indexing on the master, you could try to bring down the > >> > slave, blow away the entire index and start it back up. Since this is > a > >> > production system, I'd only try this if I had more than one slave. > >> Although > >> > you could bring up a new slave and attach it to the master and see > >> > what happens there. You wouldn't affect production if you didn't point > >> > incoming requests at it... > >> > > >> > Best > >> > Erick > >> > > >> > On Fri, Mar 23, 2012 at 11:03 AM, Alexandre Rocco > >> > wrote: > >> > > Erick, > >> > > > >> > > We're using Solr 3.3 on Linux (CentOS 5.6). > >> > > The /data dir on master is actually 1.2G.
bbox query and range queries
Hello, I'm trying to perform some queries on a location field in the index. The requirement is to search listings inside a pair of coordinates, like a bounding box. Taking a look at the wiki, I noticed that there is the option to use the bbox query, but it does not create a rectangular box to find the docs. Also, since the LatLon field is searchable by range, it's possible to use a range query instead. I'm trying to search inside a pair of coordinates (the top-left corner and bottom-right corner) and no result is found. The query I'm trying is something like: http://localhost:8984/solr/select?wt=json&indent=true&fl=local,*&q=*:*&fq=local:[-23.6674,-46.7314 TO -23.6705,-46.7274] Is there any other way to find docs inside a rectangular bounding box? Thanks Alexandre
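For comparison, a sketch of the point-and-distance filters that solr.LatLonType supports out of the box (the field name is taken from the query above; the pt and d values are purely illustrative). bbox filters on a square computed around a center point and geofilt on a true radius, so neither takes two opposite corners directly, which is why the range-query form is the usual way to express an arbitrary rectangle:

http://localhost:8984/solr/select?q=*:*&fq={!bbox sfield=local pt=-23.669,-46.729 d=5}
http://localhost:8984/solr/select?q=*:*&fq={!geofilt sfield=local pt=-23.669,-46.729 d=5}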
Re: bbox query and range queries
Erick, My location field is defined as in the example project, and there is also the dynamic field that stores the split coordinates (the field definitions were stripped by the mail archive). The relevant parts of the response with debugQuery=on (markup and the timing section stripped as well) are: status 0, QTime 1, rawquerystring and querystring *:*, parsedquery MatchAllDocsQuery(*:*), QParser LuceneQParser, filter_queries: local:[-23.6674,-46.7314 TO -23.6705,-46.7274], parsed_filter_queries: +local_0_coordinate:[-23.6674 TO -23.6705] +local_1_coordinate:[-46.7314 TO -46.7274] I tried to get some docs that contain the coordinates and then created a rectangle around one doc to see whether it is returned within these ranges. Don't know if this is the best way to test it, but it's quite easy. Best, Alexandre On Thu, Mar 29, 2012 at 2:57 PM, Erick Erickson wrote: > What are your results? Can you show us the field definition for "local" > and the results of adding &debugQuery=on? > > Because this should work as far as I can tell. > > Best > Erick > > On Thu, Mar 29, 2012 at 11:04 AM, Alexandre Rocco > wrote: > > Hello, > > > > I'm trying to perform some queries on a location field on the index. > > The requirement is to search listings inside a pair of coordinates, like > a > > bounding box. > > > > Taking a look on the wiki, I noticed that there is the option to use the > > bbox query but in does not create a retangular shaped box to find the > docs. > > Also since the LatLon field is searchable by range, it's possible to use > a > > range query to find. > > > > I'm trying to search inside a pair of coordinates (the top left corner > and > > bottom right corner) and no result is found. > > > > The query i'm trying is something like: > > http://localhost:8984/solr/select?wt=json&indent=true&fl=local,*&q=*:*&fq=local:[-23.6674,-46.7314TO > > -23.6705,-46.7274] > > > > Is there any other way to find docs inside a rectangular bounding box? > > > > Thanks > > Alexandre >
Re: bbox query and range queries
Erick, Just checked on the separate fields and everything looks fine. One thing that I'm not completely sure is if this query I tried to perform is correct. One sample document looks like this: 200 -23.6696784,-46.7290193 -23.6696784 -46.7290193 So, to find for this document I tried to create a virtual rectangle that would be queried using the range query I described: http://localhost:8984/solr/select?q=*:*&fq=local:[-23.6677,-46.7315 TO -23.6709,-46.7261] You see that in the first coordinate I used a smaller value (got it from map) that is on the top left corner of the area of the doc. The other coordinate is on the bottom right corner, and it's bigger than the doc local field. When I split the query in 2 parts, the first part (local_1_coordinate:[-46.7315 TO -46.7261]) returns results but the other part (local_0_coordinate:[-23.6709 TO -23.6677]) doesn't match any docs. I am guessing that my query is wrong. The typical use case is to take the bounds of part of an map, that is represented by these top left and bottom right coordinates and find the docs inside this area. Does this range query accomplish this kind of scenario? Any pointers are appreciated. Best, Alexandre On Thu, Mar 29, 2012 at 3:54 PM, Erick Erickson wrote: > This all looks fine, so the next question is whether or not your > documents have the value you think. > > +local_0_coordinate:[-23.6674 TO -23.6705] +local_1_coordinate:[-46.7314 TO > -46.7274] > is the actual translated filter. > > So I'd check the actual documents in the index to see if you have a single > document with local_0 and local_1 that fits the above. You should be able > to > use the TermsComponent: http://wiki.apache.org/solr/TermsComponent > to look. Or switch to stored="true" and look at search results for > documents you think should match, just to see the raw value Who knows? > It could be something as silly as you have your lat/lon backwards somehow, > I've > spent _days_ having problems like that ... > > Best > Erick > > On Thu, Mar 29, 2012 at 2:34 PM, Alexandre Rocco > wrote: > > Erick, > > > > My location field is defined like in the example project: > > > > > > Also, there is the dynamic that stores the splitted coordinates: > > > stored="false" multiValued="false"/> > > > > The response XML with debugQuery=on is looking like this: > > > > > > 0 > > 1 > > > > > > > > *:* > > *:* > > MatchAllDocsQuery(*:*) > > *:* > > > > LuceneQParser > > > > local:[-23.6674,-46.7314 TO -23.6705,-46.7274] > > > > > > > > +local_0_coordinate:[-23.6674 TO -23.6705] +local_1_coordinate:[-46.7314 > TO > > -46.7274] > > > > > > > > 1.0 > > > > 0.0 > > > > 0.0 > > > > > > 0.0 > > > > > > 0.0 > > > > > > 0.0 > > > > > > 0.0 > > > > > > 0.0 > > > > > > > > 1.0 > > > > 1.0 > > > > > > 0.0 > > > > > > 0.0 > > > > > > 0.0 > > > > > > 0.0 > > > > > > 0.0 > > > > > > > > > > > > > > I tried to get some docs that contains the coordinates and then created a > > retangle around that doc to see it is returned between these ranges. > > Don't know if this is the best way to test it, but it's quite easy. > > > > Best, > > Alexandre > > > > On Thu, Mar 29, 2012 at 2:57 PM, Erick Erickson >wrote: > > > >> What are your results? Can you show us the field definition for "local" > >> and the results of adding &debugQuery=on? > >> > >> Because this should work as far as I can tell. > >> > >> Best > >> Erick > >> > >> On Thu, Mar 29, 2012 at 11:04 AM, Alexandre Rocco > >> wrote: > >> > Hello, > >> > > >> > I'm trying to perform some queries on a location field on the index. 
> >> > The requirement is to search listings inside a pair of coordinates, > like > >> a > >> > bounding box. > >> > > >> > Taking a look on the wiki, I noticed that there is the option to use > the > >> > bbox query but in does not create a retangular shaped box to find the > >> docs. > >> > Also since the LatLon field is searchable by range, it's possible to > use > >> a > >> > range query to find. > >> > > >> > I'm trying to search inside a pair of coordinates (the top left corner > >> and > >> > bottom right corner) and no result is found. > >> > > >> > The query i'm trying is something like: > >> > > >> > http://localhost:8984/solr/select?wt=json&indent=true&fl=local,*&q=*:*&fq=local:[-23.6674,-46.7314TO > >> > -23.6705,-46.7274] > >> > > >> > Is there any other way to find docs inside a rectangular bounding box? > >> > > >> > Thanks > >> > Alexandre > >> >
Re: bbox query and range queries
Yonik, Thanks for the heads-up. That one worked. Just trying to wrap my head around how it would work in a real case. To test this I just took the coordinates from Google Maps and searched within the pair of coordinates exactly as I got them. Should I always check which values are the lower and upper bounds before assembling the query? I know this one is off-topic, just curious. Thanks Alexandre On Thu, Mar 29, 2012 at 7:26 PM, Yonik Seeley wrote: > On Thu, Mar 29, 2012 at 6:20 PM, Alexandre Rocco > wrote: > > http://localhost:8984/solr/select?q=*:*&fq=local:[-23.6677,-46.7315 TO > > -23.6709,-46.7261] > > Range queries always need to be [lower_bound TO upper_bound] > Try > http://localhost:8984/solr/select?q=*:*&fq=local:[-23.6709,-46.7315 TO > -23.6677,-46.7261] > > -Yonik > lucenerevolution.com - Lucene/Solr Open Source Search Conference. > Boston May 7-10 >
DataImportHandler in Solr 4.0
Hi guys, I'm having some issues when trying to use the DataImportHandler on Solr 4.0. I've downloaded the latest nightly build of Solr 4.0 and configured the solrconfig.xml file (in the example folder) the usual way, like this: data-config.xml At this point I noticed that the DIH jar was not being loaded correctly, causing exceptions like: Error loading class 'org.apache.solr.handler.dataimport.DataImportHandler' and java.lang.ClassNotFoundException: org.apache.solr.handler.dataimport.DataImportHandler Do I need to build it separately to get DIH running on Solr 4.0? Thanks! Alexandre
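The handler registration above lost its markup in the archive; the usual form in solrconfig.xml, reconstructed from the class name and config file visible in the message (so treat the surrounding structure as a sketch), is:

<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">data-config.xml</str>
  </lst>
</requestHandler>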
Re: DataImportHandler in Solr 4.0
I got it working by building the DIH from the contrib folder and made a change on the lib statements to map the folder that contains the .jar files. Thanks! Alexandre On Wed, Feb 23, 2011 at 8:55 PM, Smiley, David W. wrote: > The DIH is no longer supplied embedded in the Solr war file. You need to > get it on the classpath somehow. You could add another solrconfig.xml to resolve this. > > ~ David Smiley > Author: http://www.packtpub.com/solr-1-4-enterprise-search-server/ > > On Feb 23, 2011, at 4:11 PM, Alexandre Rocco wrote: > > > Hi guys, > > > > I'm having some issues when trying to use the DataImportHandler on Solr > 4.0. > > > > I've downloaded the latest nightly build of Solr 4.0 and configured > normally > > (on the example folder) solrconfig.xml file like this: > > > > > class="org.apache.solr.handler.dataimport.DataImportHandler"> > > > > data-config.xml > > > > > > > > At this point I noticed that the DIH jar was not being loaded correctly > > causing exceptions like: > > Error loading class > 'org.apache.solr.handler.dataimport.DataImportHandler' > > and > > java.lang.ClassNotFoundException: > > org.apache.solr.handler.dataimport.DataImportHandler > > > > Do I need to build to get DIH running on Solr 4.0? > > > > Thanks! > > Alexandre > > > > > > > > >
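For reference, a sketch of the <lib> entries in solrconfig.xml that pick up the DIH jars once they are built under contrib (the relative paths follow the layout of the example distribution and may need adjusting for a given nightly build):

<!-- the built DIH jar from dist/, plus any extra libraries under contrib -->
<lib dir="../../dist/" regex="apache-solr-dataimporthandler-.*\.jar" />
<lib dir="../../contrib/dataimporthandler/lib/" regex=".*\.jar" />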
Distances in spatial search (Solr 4.0)
Hi guys, We are implementing a separate index on our website, that will be dedicated to spatial search. I've downloaded a build of Solr 4.0 to try the spatial features and got the geodist working really fast. We now have 2 other features that will be needed on this project: 1. Returning the distance from the reference point to the search hit (in kilometers) 2. Sorting by the distance. On item 2, the wiki doc points that a distance function can be used but I was not able to find good info on how to accomplish it. Also, returning the distance (item 1) is noted as currently being in development and there is some workaround to get it. Anyone had experience with the spatial feature and could help with some pointers on how to achieve it? Thanks, Alexandre
Re: Distances in spatial search (Solr 4.0)
Hi Bill, I was using a different approach to sort by the distance with the dist() function, since geodist() is not documented on the wiki ( http://wiki.apache.org/solr/FunctionQuery) Tried something like: &sort=dist(2, 45.15,-93.85, lat, lng) asc I made some tests with geodist() function as you pointed and got different results. Is it safe to assume that geodist() is the correct way of doing it? Also, can you clear up how can I see the distance using the "_Val_" as you told? Thanks! Alexandre On Tue, Mar 1, 2011 at 12:03 AM, Bill Bell wrote: > Use sort with geodist() to sort by distance. > > Getting the distance returned us documented on the wiki if you are not > using score. see reference to _Val_ > > Bill Bell > Sent from mobile > > > On Feb 28, 2011, at 7:54 AM, Alexandre Rocco wrote: > > > Hi guys, > > > > We are implementing a separate index on our website, that will be > dedicated > > to spatial search. > > I've downloaded a build of Solr 4.0 to try the spatial features and got > the > > geodist working really fast. > > > > We now have 2 other features that will be needed on this project: > > 1. Returning the distance from the reference point to the search hit (in > > kilometers) > > 2. Sorting by the distance. > > > > On item 2, the wiki doc points that a distance function can be used but I > > was not able to find good info on how to accomplish it. > > Also, returning the distance (item 1) is noted as currently being in > > development and there is some workaround to get it. > > > > Anyone had experience with the spatial feature and could help with some > > pointers on how to achieve it? > > > > Thanks, > > Alexandre >
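A sketch of the request parameters that cover both needs on the 4.0 builds of that period (the sfield and pt values here are assumptions for illustration). geodist() with no arguments reads the sfield and pt parameters and returns kilometers, so it can drive the sort directly; the "_val_"/score workaround mentioned above means putting geodist() into the main query as a function so the score column carries the distance:

&sfield=store&pt=45.15,-93.85&sort=geodist() asc
q={!func}geodist()&sfield=store&pt=45.15,-93.85&fl=*,score&sort=score asc

In the second form each document's score is its distance in km from the pt reference point; an fq can be added alongside it to keep the actual search criteria.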
DIH import and postImportDeleteQuery
Guys, I am facing a situation in one of our projects that I need to perform a cleanup to remove some documents after we perform an update via DIH. The big issue right now comes from the fact that when we call the DIH with clean=false, the postImportDeleteQuery is not executed. My setup is currently arranged like this: - A SQL Server stored procedure that receives a parameter (specified in the URL) and returns the records to be indexed - The procedure is able to return all the records (for a full-import) or only the updated records (for a delta-import) - This procedure returns valid and deleted records, from this point comes the need to run a postImportDeleteQuery to remove the deleted ones. Everything works fine when I run a full-import, I am running always with clean=true, and then the whole index is rebuilt. When I need to do an incremental update, the records are updated correctly, but the command to delete the other records is not executed. I've tried several combinations, with different results: - Running full-import with clean=false: the records are updated but the ones that needs to be deleted stays on the index - Running delta-import with clean=false: the records are updated but the ones that needs to be deleted stays on the index - Running delta-import with clean=true: all records are deleted from the index and then only the records returned by the procedure are on the index, except the deleted ones. I don't see any way to achieve my goal, without changing the process that I do to obtain the data. Since this is a very complex stored procedure, with tons of joins and custom processing, I am trying everything to avoid messing with it. See below a copy of my data-config.xml file. I made it simpler omitting all the fields, since it's out of scope of the issue: Any ideas or pointers that might help on this one? Many thanks, Alexandre
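The data-config.xml at the end of the message lost its markup in the archive; from the attributes that survive in the quoted copies later in the thread, its shape was roughly as follows (connection details are as posted, entity names are placeholders, and the field mappings were omitted by the author):

<dataConfig>
  <dataSource driver="com.microsoft.sqlserver.jdbc.SQLServerDriver"
              url="jdbc:sqlserver://myserver;databaseName=mydb;user=username;password=password;responseBuffering=adaptive;" />
  <document>
    <entity name="entity1" pk="entityid" transformer="RegexTransformer"
            query="EXEC some_stored_procedure ${dataimporter.request.someid}"
            preImportDeleteQuery="status:1" postImportDeleteQuery="status:1">
      <!-- field mappings omitted -->
    </entity>
    <entity name="entity2" pk="entityid" transformer="RegexTransformer"
            query="EXEC someother_stored_procedure ${dataimporter.request.someotherid}"
            preImportDeleteQuery="status:1" postImportDeleteQuery="status:1">
      <!-- field mappings omitted -->
    </entity>
  </document>
</dataConfig>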
Re: DIH import and postImportDeleteQuery
Hi Ephraim, Thank you so much for the input. I was able to find your thread on the archives and got your solution to work. In fact, when using $deleteDocById and $skipDoc it worked like a charm. This feature is very useful, it's a shame it's not properly documented. The only downside is the one you mentioned that the stats are not updated, so if I update 13 documents and delete 2, DIH would tell me that only 13 documents were processed. This is bad in my case because I check the end result to generate an error e-mail if needed. You also mentioned that if the query contains only deletion records, a commit would not be automatically executed and it would be necessary to commit manually. How can I commit manually via DIH? I was not able to find any references on the documentation. Thanks! Alexandre On Wed, May 25, 2011 at 5:14 AM, Ephraim Ofir wrote: > Search the list for my post "DIH - deleting documents, high performance > (delta) imports, and passing parameters" which shows my solution a > similar problem. > > Ephraim Ofir > > -----Original Message- > From: Alexandre Rocco [mailto:alel...@gmail.com] > Sent: Tuesday, May 24, 2011 11:24 PM > To: solr-user@lucene.apache.org > Subject: DIH import and postImportDeleteQuery > > Guys, > > I am facing a situation in one of our projects that I need to perform a > cleanup to remove some documents after we perform an update via DIH. > The big issue right now comes from the fact that when we call the DIH > with > clean=false, the postImportDeleteQuery is not executed. > > My setup is currently arranged like this: > - A SQL Server stored procedure that receives a parameter (specified in > the > URL) and returns the records to be indexed > - The procedure is able to return all the records (for a full-import) or > only the updated records (for a delta-import) > - This procedure returns valid and deleted records, from this point > comes > the need to run a postImportDeleteQuery to remove the deleted ones. > > Everything works fine when I run a full-import, I am running always with > clean=true, and then the whole index is rebuilt. > When I need to do an incremental update, the records are updated > correctly, > but the command to delete the other records is not executed. > > I've tried several combinations, with different results: > - Running full-import with clean=false: the records are updated but the > ones > that needs to be deleted stays on the index > - Running delta-import with clean=false: the records are updated but the > ones that needs to be deleted stays on the index > - Running delta-import with clean=true: all records are deleted from the > index and then only the records returned by the procedure are on the > index, > except the deleted ones. > > I don't see any way to achieve my goal, without changing the process > that I > do to obtain the data. > Since this is a very complex stored procedure, with tons of joins and > custom > processing, I am trying everything to avoid messing with it. > > See below a copy of my data-config.xml file. 
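For readers landing here later, a sketch of the special-command approach that worked (the status flag and SQL are illustrative; the pseudo-column names are the ones DIH recognizes). The import query, or a SELECT wrapping the stored procedure, returns $deleteDocById and $skipDoc for rows flagged as deleted, so DIH deletes those documents by id and skips adding them:

<entity name="listing" pk="entityid"
        query="SELECT *,
                      CASE WHEN status = 1 THEN entityid END AS [$deleteDocById],
                      CASE WHEN status = 1 THEN 'true'   END AS [$skipDoc]
               FROM dbo.some_view">
</entity>

As for committing manually when a batch contains only deletions, a plain commit can always be issued against the update handler, e.g. http://localhost:8983/solr/update?commit=true.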
I made it simpler omitting > all > the fields, since it's out of scope of the issue: > > > driver="com.microsoft.sqlserver.jdbc.SQLServerDriver" > url="jdbc:sqlserver://myserver;databaseName=mydb;user=username;password= > password;responseBuffering=adaptive;" > > /> > > pk="entityid" > transformer="RegexTransformer" > query="EXEC some_stored_procedure ${dataimporter.request.someid}" > preImportDeleteQuery="status:1" postImportDeleteQuery="status:1" > > > > > > > > pk="entityid" > transformer="RegexTransformer" > query="EXEC someother_stored_procedure > ${dataimporter.request.someotherid}" > preImportDeleteQuery="status:1" postImportDeleteQuery="status:1" > > > > > > > > > > Any ideas or pointers that might help on this one? > > Many thanks, > Alexandre >
Re: DIH import and postImportDeleteQuery
Hi James, Thanks for the heads up! I am currently on version 1.4.1, so I can apply this patch and see if it works. Just need to assess if it's best to apply the patch or to check on the backend system to see if only delete requests were generated and then do not call DIH. Previously, I found another open issue, created from Ephraim: https://issues.apache.org/jira/browse/SOLR-2104 It's the same issue, but it hasn't had any updates yet. Regards, Alexandre On Wed, May 25, 2011 at 3:17 PM, Dyer, James wrote: > The "failure to commit" bug with $deleteDocById can be fixed by applying > patch SOLR-2492. This patch also partially fixes the "no updated stats" bug > in that it increments 1 for every call to $deleteDocById and > $deleteDocByQuery. Note that this might result in inaccurate counts if the > id given with $deleteDocById doesn't exist or is duplicated. Obviously this > is not a complete fix for stats using $deleteDocByQuery as this command > would normally be used to delete >1 doc at a time. > > The patch is for Trunk but it might work with 3.1 also. If not, it likely > only needs minor tweaking. > > The jira ticket is here: https://issues.apache.org/jira/browse/SOLR-2492 > > James Dyer > E-Commerce Systems > Ingram Content Group > (615) 213-4311 > > > -Original Message- > From: Alexandre Rocco [mailto:alel...@gmail.com] > Sent: Wednesday, May 25, 2011 12:54 PM > To: solr-user@lucene.apache.org > Subject: Re: DIH import and postImportDeleteQuery > > Hi Ephraim, > > Thank you so much for the input. > I was able to find your thread on the archives and got your solution to > work. > > In fact, when using $deleteDocById and $skipDoc it worked like a charm. > This > feature is very useful, it's a shame it's not properly documented. > The only downside is the one you mentioned that the stats are not updated, > so if I update 13 documents and delete 2, DIH would tell me that only 13 > documents were processed. This is bad in my case because I check the end > result to generate an error e-mail if needed. > > You also mentioned that if the query contains only deletion records, a > commit would not be automatically executed and it would be necessary to > commit manually. > > How can I commit manually via DIH? I was not able to find any references on > the documentation. > > Thanks! > Alexandre > > On Wed, May 25, 2011 at 5:14 AM, Ephraim Ofir wrote: > > > Search the list for my post "DIH - deleting documents, high performance > > (delta) imports, and passing parameters" which shows my solution a > > similar problem. > > > > Ephraim Ofir > > > > -Original Message- > > From: Alexandre Rocco [mailto:alel...@gmail.com] > > Sent: Tuesday, May 24, 2011 11:24 PM > > To: solr-user@lucene.apache.org > > Subject: DIH import and postImportDeleteQuery > > > > Guys, > > > > I am facing a situation in one of our projects that I need to perform a > > cleanup to remove some documents after we perform an update via DIH. > > The big issue right now comes from the fact that when we call the DIH > > with > > clean=false, the postImportDeleteQuery is not executed. > > > > My setup is currently arranged like this: > > - A SQL Server stored procedure that receives a parameter (specified in > > the > > URL) and returns the records to be indexed > > - The procedure is able to return all the records (for a full-import) or > > only the updated records (for a delta-import) > > - This procedure returns valid and deleted records, from this point > > comes > > the need to run a postImportDeleteQuery to remove the deleted ones. 
> > > > Everything works fine when I run a full-import, I am running always with > > clean=true, and then the whole index is rebuilt. > > When I need to do an incremental update, the records are updated > > correctly, > > but the command to delete the other records is not executed. > > > > I've tried several combinations, with different results: > > - Running full-import with clean=false: the records are updated but the > > ones > > that needs to be deleted stays on the index > > - Running delta-import with clean=false: the records are updated but the > > ones that needs to be deleted stays on the index > > - Running delta-import with clean=true: all records are deleted from the > > index and then only the records returned by the procedure are on the > > index, > > except the deleted ones. > > > > I don't see any way to achieve my goal, without changing the process >
Storing RandomSortField
Hi guys, Is there any way to make a RandomSortField be stored? I'm trying to do it for debugging purposes; my intention is to take a look at the values in order to understand the sorting that is being applied to the results. I tried to make it a stored field, and also tried to create another text field copying the value from the random field. Neither of the approaches worked. Is there any restriction on this kind of field that prevents it from being displayed in the results? Thanks, Alexandre
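For context, a sketch of how RandomSortField is normally declared (taken from the example schema; the field names are illustrative):

<fieldType name="random" class="solr.RandomSortField" indexed="true" />
<dynamicField name="random_*" type="random" />

The sort values are computed at query time from a hash of the field name (which acts as the seed) and the index version, so nothing is ever written to the index for these fields; that is why stored="true" and copyField have no effect. Requesting a different field name (random_1, random_2, ...) is what changes the ordering.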
Re: Storing RandomSortField
Leonardo, I was able to use the feature with a dynamic field as pointed in the documentation. So, I was just curious to take a peek at the values that are generated, even when the field is not dynamic, so I tried to figure out a way to do so. Maybe some output when the debug query is enabled would be useful, but it seems it's not implemented yet. I will try to take a look at the classes and see what can I do about it. Thanks! On Wed, May 19, 2010 at 5:34 AM, Leonardo Menezes < leonardo.menez...@googlemail.com> wrote: > Hey, > for random sorting, random values are generated in runtime using the seed > you passed as one of the parameters to generate the value, among other > things. this way, if the value you use as seed is the same in different > request, the sorting order should be the same. you could also, for debbuing > purposes, edit the random sort field class and put some traces in there, so > it could print the id of the document and the value generated for example. > but the values wont be stored on the idx. > > cheers > > On Wed, May 19, 2010 at 10:00 AM, Marco Martinez < > mmarti...@paradigmatecnologico.com> wrote: > > > Hi Alexandre, > > > > I am not totally sure about this, but the random sort field its only used > > to > > do a random sort on your searchs, and you will to pass differents values > to > > have differents sorts, so this only applies in the searchs, so no value > is > > indexed. You will find more information here: > > > > > http://lucene.apache.org/solr/api/org/apache/solr/schema/RandomSortField.html > > > > Marco Martínez Bautista > > http://www.paradigmatecnologico.com > > Avenida de Europa, 26. Ática 5. 3ª Planta > > 28224 Pozuelo de Alarcón > > Tel.: 91 352 59 42 > > > > > > 2010/5/18 Alexandre Rocco > > > > > Hi guys, > > > > > > Is there any way to mak a RandomSortField be stored? > > > I'm trying to do it for debugging purposes, > > > My intention is to take a look at the values that are stored there to > > > determine the sorting that is being applied to the results. > > > > > > I tried to make it a stored field as: > > > > > > > > > And also tried to create another text field, copying the result from > the > > > random field like this: > > > stored="true"/> > > > > > > > > > Neither of the approaches worked. > > > Is there any restriction on this kind of field that prevents it from > > being > > > displayed in the results? > > > > > > Thanks, > > > Alexandre > > > > > >