Re: really slow performance when trying to get facet.field

2012-01-18 Thread Dmitry Kan
Sounds good! So the take-away lesson here is to remember cache pre-warming.
And of course keep track of RAM allocation :)
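
For illustration: the XML tags of Daniel's listener config below were stripped by the
list archive, but the warm-up query he describes (faceting on the content field across
all documents) amounts to something like the following when fired from a SolrJ 3.x
client. Host, handler, and any facet parameters beyond q, facet and facet.field are
assumptions, since the exact values in his config are not recoverable here.

[code]
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class FacetWarmup {
    public static void main(String[] args) throws Exception {
        // Hypothetical shard URL; adjust to your setup.
        CommonsHttpSolrServer server =
                new CommonsHttpSolrServer("http://localhost:8983/solr");

        // The same kind of query the firstSearcher/newSearcher listeners fire:
        // match all documents and facet on the "content" field, so the relevant
        // caches are populated before the first real user query arrives.
        SolrQuery q = new SolrQuery("*:*");
        q.setFacet(true);
        q.addFacetField("content");
        q.setFacetMinCount(1);   // assumed; the extra params in Daniel's config are not recoverable
        q.setRows(0);            // only the warming side effect matters here

        QueryResponse rsp = server.query(q);
        System.out.println("facet values returned: "
                + rsp.getFacetField("content").getValueCount());
    }
}
[/code]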

On Tue, Jan 17, 2012 at 11:23 PM, Daniel Bruegge <
daniel.brue...@googlemail.com> wrote:

> Ok, I have now changed the static warming in the solrconfig.xml using
> first- and newSearcher.
> "Content" is my field to facet on. Now the commits take longer, which is OK
> for me, but the searches are much faster now. I also reduced the
> number of documents on my shards to 15 million per shard, so the index is about
> 3.5G, which I hope also fits in memory.
>
>
>  
> 
>*:*
>true
>content
>1
>1
>
>  
>
>
>  
>
> *:*
>true
>content
>1
>1
>
>  
>
>
>
> On Tue, Jan 17, 2012 at 2:36 PM, Daniel Bruegge <
> daniel.brue...@googlemail.com> wrote:
>
> > Evictions are 0 for all cache types.
> >
> > Your server max heap space with 12G is pretty huge. Which is good I
> think.
> > The CPU on my server is a 8-Core Intel i7 965.
> >
> > Commit frequency is low, because shards are added and old shards exist
> > for historical reasons. Old shards will then be cleaned up after a couple
> > of months.
> >
> > I will try to add a maximum of 15 million documents per shard and see what
> > happens.
> >
> > The thing is that I will add more shards over time, so that I can handle
> > maybe 500-800 million documents. Maybe more. It depends.
> >
> > On Tue, Jan 17, 2012 at 2:14 PM, Dmitry Kan 
> wrote:
> >
> >> Hi Daniel,
> >>
> >> My index is 6,5G. I'm sure it can be bigger. facet.limit we ask for is
> >> beyond 100 thousand. It is sub-second speed. I run it with -Xms1024m
> >> -Xmx12000m under tomcat, it currently takes 5,4G of RAM. Amount of docs
> is
> >> over 6,5 million.
> >>
> >> Do you see any evictions in your caches? What kind of server is it, in
> >> terms of CPU and OS? How often do you commit to the index?
> >>
> >> Dmitry
> >>
> >> On Tue, Jan 17, 2012 at 3:01 PM, Daniel Bruegge <
> >> daniel.brue...@googlemail.com> wrote:
> >>
> >> > Hi Dmitry,
> >> >
> >> > I had everything on one Solr instance before, but this got too heavy
> >> > and I had the same issue there: the first facet query was really slow.
> >> >
> >> > When querying the facet:
> >> > - facet.limit = 100
> >> >
> >> > Cache settings are like this:
> >> >
> >> > >> > size="16384"
> >> > initialSize="4096"
> >> > autowarmCount="4096"/>
> >> >
> >> > >> > size="512"
> >> > initialSize="512"
> >> > autowarmCount="0"/>
> >> >
> >> > >> >   size="512"
> >> >   initialSize="512"
> >> >   autowarmCount="0"/>
> >> >
> >> > How big was your index? Did it fit into the RAM which you gave the
> Solr
> >> > instance?
> >> >
> >> > Thanks
> >> >
> >> >
> >> > On Tue, Jan 17, 2012 at 1:56 PM, Dmitry Kan 
> >> wrote:
> >> >
> >> > > I had a similar problem for a similar task. And in my case merging
> the
> >> > > results from two shards turned out to be a culprit. If you can
> >> logically
> >> > > store your data just in one shard, your faceting should become
> faster.
> >> > Size
> >> > > wise it should not be a problem for SOLR.
> >> > >
> >> > > Also, you didn't say anything about the facet.limit value, cache
> >> > > parameters, usage of filter queries. Some of these can be
> >> interconnected.
> >> > >
> >> > > Dmitry
> >> > >
> >> > > On Tue, Jan 17, 2012 at 2:49 PM, Daniel Bruegge <
> >> > > daniel.brue...@googlemail.com> wrote:
> >> > >
> >> > > > Hi,
> >> > > >
> >> > > > I have 2 Solr-shards. One is filled with approx. 25mio documents
> >> (local
> >> > > > index 6GB), the other with 10mio documents (2.7GB size).
> >> > > > I am trying to create some kind of 'word cloud' to see the
> >> frequency of
> >> > > > words for a *text_general *field.
> >> > > > For this I am currently using a facet over this field and I am
> also
> >> > > > restricting the documents by using some other filters in the
> query.
> >> > > >
> >> > > > The performance is really bad for the first call and then pretty
> >> fast
> >> > for
> >> > > > the following calls.
> >> > > >
> >> > > > The maximum Java heap size is 3G for each shard. Both shards are
> >> > running
> >> > > on
> >> > > > the same physical server which has 12G RAM.
> >> > > >
> >> > > > Question: Should I reduce the documents in one shard, so that the
> >> index
> >> > > is
> >> > > > equal or less the Java Heap size for this shard? Or is
> >> > > > there another method to avoid this slow calls?
> >> > > >
> >> > > > Thank you
> >> > > >
> >> > > > Daniel
> >> > > >
> >> > >
> >> > >
> >> > >
> >> > > --
> >> > > Regards,
> >> > >
> >> > > Dmitry Kan
> >> > >
> >> >
> >>
> >>
> >>
> >> --
> >> Regards,
> >>
> >> Dmitry Kan
> >>
> >
> >
>



-- 
Regards,

Dmitry Kan


Re: Question on Reverse Indexing

2012-01-18 Thread Dmitry Kan
Just to play it safe here, can you double-check that the reversing is no longer
happening by issuing a query through the admin analysis page?

Dmitry

On Wed, Jan 18, 2012 at 4:23 AM, Shyam Bhaskaran <
shyam.bhaska...@synopsys.com> wrote:

> Hi Francois,
>
> I understand that disabling of ReversedWildcardFilterFactory has improved
> the performance.
>
> But I am puzzled over how the leading wild card search like *lock is
> working even though I have now disabled the ReversedWildcardFilterFactory
> and the indexes have been created without ReversedWildcardFilter ?
>
> How does reverse indexing work even after disabling
> ReversedWildcardFilterFactory?
>
> Can anyone explain me how this feature is working.
>
> -Shyam
>
> -Original Message-
> From: François Schiettecatte [mailto:fschietteca...@gmail.com]
> Sent: Wednesday, January 18, 2012 7:49 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Question on Reverse Indexing
>
> Using ReversedWildcardFilterFactory will double the size of your
> dictionary (more or less), maybe the drop in performance that you are
> seeing is a result of that?
>
> François
>
> On Jan 17, 2012, at 9:01 PM, Shyam Bhaskaran wrote:
>
> > Hi,
> >
> > For reverse indexing we are using the ReversedWildcardFilterFactory on
> Solr 4.0
> >
> >
> >  >
> > maxPosAsterisk="3" maxPosQuestion="2" maxFractionAsterisk="0.33"/>
> >
> >
> > ReversedWildcardFilterFactory was helping us to perform leading wild
> card searches like *lock.
> >
> > But it was observed that the performance of the searches was not good
> after introducing ReversedWildcardFilterFactory filter.
> >
> > Hence we disabled ReversedWildcardFilterFactory filter and re-created
> the indexes and this time we found the performance of Solr query to be
> faster.
> >
> > But surprisingly it is observed that leading wild card searches were
> still working inspite of disabling the ReversedWildcardFilterFactory filter.
> >
> >
> > This behavior is puzzling everyone and wanted to know how this behavior
> of reverse indexing works?
> >
> > Can anyone share with me on this Solr behavior.
> >
> > -Shyam
> >
>
>


-- 
Regards,

Dmitry Kan


Re: How to return the distance geo distance on solr 3.5 with bbox filtering

2012-01-18 Thread Mikhail Khludnev
Maxim,

Which version of Solr are you using?
Why doesn't the second approach at the link work for you?
Just move q=trafficRouteId:235 into fq=, because it is essentially a filter, and
use geodist() as the function query:
&sort=score%20asc&q={!func}geodist()
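
Putting that together for Solr 3.5, a sketch via SolrJ (the loc and trafficRouteId
field names and the coordinates are taken from Maxim's query; the host and the id
field are assumptions). The trick is that with geodist() as the main function query,
each document's score is its distance, so asking for fl=*,score effectively returns
the distance:

[code]
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class GeodistAsScore {
    public static void main(String[] args) throws Exception {
        CommonsHttpSolrServer server =
                new CommonsHttpSolrServer("http://localhost:8983/solr");

        SolrQuery q = new SolrQuery();
        // geodist() as the main (function) query: each document's score is then
        // its distance in km from the given point.
        q.setQuery("{!func}geodist()");
        // keep the cheap bbox filter, and move the route restriction into a filter query
        q.addFilterQuery("{!bbox}", "trafficRouteId:235");
        q.set("sfield", "loc");
        q.set("pt", "39.738548,-73.130322");
        q.set("d", "100");
        q.set("fl", "*,score");                        // score now carries the distance
        q.setSortField("score", SolrQuery.ORDER.asc);  // nearest first

        QueryResponse rsp = server.query(q);
        for (SolrDocument doc : rsp.getResults()) {
            // "id" is an assumed unique key field
            System.out.println(doc.getFieldValue("id") + " -> " + doc.getFieldValue("score"));
        }
    }
}
[/code]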

What do you get in that case? Please provide logs, any exceptions, and the debug
response.

Thanks


On Tue, Jan 17, 2012 at 10:06 PM, Maxim Veksler  wrote:

> Hello,
>
> I'm querying with bbox which should be faster then geodist, my queries are
> looking like this:
>
> http://localhost:8983/solr/select?indent=true&fq={!bbox}&sfield=loc&pt=39.738548,-73.130322&d=100&sort=geodist()%20asc&q=trafficRouteId:235
>
> the trouble is, that with bbox solr does not return the distance of each
> document, I couldn't get it to work even with tips from
> http://wiki.apache.org/solr/SpatialSearch#Returning_the_distance
>
> Something I'm missing ?
>



-- 
Sincerely yours
Mikhail Khludnev
Lucid Certified
Apache Lucene/Solr Developer
Grid Dynamics


 


RE: Question on Reverse Indexing

2012-01-18 Thread Shyam Bhaskaran
Dimitry,

Using http://localhost:7070/solr/docs/admin/analysis.jsp I passed the query *lock
and did not see ReversedWildcardFilterFactory applied at index time, or any other
filter that could do the reversing.

-Shyam

-Original Message-
From: Dmitry Kan [mailto:dmitry@gmail.com] 
Sent: Wednesday, January 18, 2012 2:26 PM
To: solr-user@lucene.apache.org
Subject: Re: Question on Reverse Indexing

Just to play safe here, can you double check that the reversing is not any
more the case by issuing a query through the admin analysis page?

Dmitry

On Wed, Jan 18, 2012 at 4:23 AM, Shyam Bhaskaran <
shyam.bhaska...@synopsys.com> wrote:

> Hi Francois,
>
> I understand that disabling of ReversedWildcardFilterFactory has improved
> the performance.
>
> But I am puzzled over how the leading wild card search like *lock is
> working even though I have now disabled the ReversedWildcardFilterFactory
> and the indexes have been created without ReversedWildcardFilter ?
>
> How does reverse indexing work even after disabling
> ReversedWildcardFilterFactory?
>
> Can anyone explain me how this feature is working.
>
> -Shyam
>
> -Original Message-
> From: François Schiettecatte [mailto:fschietteca...@gmail.com]
> Sent: Wednesday, January 18, 2012 7:49 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Question on Reverse Indexing
>
> Using ReversedWildcardFilterFactory will double the size of your
> dictionary (more or less), maybe the drop in performance that you are
> seeing is a result of that?
>
> François
>
> On Jan 17, 2012, at 9:01 PM, Shyam Bhaskaran wrote:
>
> > Hi,
> >
> > For reverse indexing we are using the ReversedWildcardFilterFactory on
> Solr 4.0
> >
> >
> >  >
> > maxPosAsterisk="3" maxPosQuestion="2" maxFractionAsterisk="0.33"/>
> >
> >
> > ReversedWildcardFilterFactory was helping us to perform leading wild
> card searches like *lock.
> >
> > But it was observed that the performance of the searches was not good
> after introducing ReversedWildcardFilterFactory filter.
> >
> > Hence we disabled ReversedWildcardFilterFactory filter and re-created
> the indexes and this time we found the performance of Solr query to be
> faster.
> >
> > But surprisingly it is observed that leading wild card searches were
> still working inspite of disabling the ReversedWildcardFilterFactory filter.
> >
> >
> > This behavior is puzzling everyone and wanted to know how this behavior
> of reverse indexing works?
> >
> > Can anyone share with me on this Solr behavior.
> >
> > -Shyam
> >
>
>


-- 
Regards,

Dmitry Kan


Re: Question on Reverse Indexing

2012-01-18 Thread Dmitry Kan
OK. I'm not sure what your system architecture is there, but could your queries be
staying cached in some server cache even after you have re-indexed your data?
The way index-level leading wildcards work (reading Solr 3.4 code, but this seems
to have been true since roughly 1.4) is that the following check is done against
the analysis chain:
[code src=SolrQueryParser.java]
boolean allow = false;
...
// if any field type's index analyzer contains a ReversedWildcardFilterFactory,
// leading wildcards are enabled for the whole query parser
if (factory instanceof ReversedWildcardFilterFactory) {
  allow = true;
  ...
}
...
if (allow) {
  setAllowLeadingWildcard(true);
}
[/code]

So practically, what you described can happen if ReversedWildcardFilterFactory is
still mentioned in the schema of one of your shards.
A weird question, but have you re-indexed your data into a clean index or on
top of the existing one?

On Wed, Jan 18, 2012 at 12:35 PM, Shyam Bhaskaran <
shyam.bhaska...@synopsys.com> wrote:

> Dimitry,
>
> Using http://localhost:7070/solr/docs/admin/analysis.jsp passed the query
> *lock and did not find ReversedWildcardFilterFactory to the indexer or any
> other filters that could do the reversing.
>
> -Shyam
>
> -Original Message-
> From: Dmitry Kan [mailto:dmitry@gmail.com]
> Sent: Wednesday, January 18, 2012 2:26 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Question on Reverse Indexing
>
> Just to play safe here, can you double check that the reversing is not any
> more the case by issuing a query through the admin analysis page?
>
> Dmitry
>
> On Wed, Jan 18, 2012 at 4:23 AM, Shyam Bhaskaran <
> shyam.bhaska...@synopsys.com> wrote:
>
> > Hi Francois,
> >
> > I understand that disabling of ReversedWildcardFilterFactory has improved
> > the performance.
> >
> > But I am puzzled over how the leading wild card search like *lock is
> > working even though I have now disabled the ReversedWildcardFilterFactory
> > and the indexes have been created without ReversedWildcardFilter ?
> >
> > How does reverse indexing work even after disabling
> > ReversedWildcardFilterFactory?
> >
> > Can anyone explain me how this feature is working.
> >
> > -Shyam
> >
> > -Original Message-
> > From: François Schiettecatte [mailto:fschietteca...@gmail.com]
> > Sent: Wednesday, January 18, 2012 7:49 AM
> > To: solr-user@lucene.apache.org
> > Subject: Re: Question on Reverse Indexing
> >
> > Using ReversedWildcardFilterFactory will double the size of your
> > dictionary (more or less), maybe the drop in performance that you are
> > seeing is a result of that?
> >
> > François
> >
> > On Jan 17, 2012, at 9:01 PM, Shyam Bhaskaran wrote:
> >
> > > Hi,
> > >
> > > For reverse indexing we are using the ReversedWildcardFilterFactory on
> > Solr 4.0
> > >
> > >
> > >  > >
> > > maxPosAsterisk="3" maxPosQuestion="2" maxFractionAsterisk="0.33"/>
> > >
> > >
> > > ReversedWildcardFilterFactory was helping us to perform leading wild
> > card searches like *lock.
> > >
> > > But it was observed that the performance of the searches was not good
> > after introducing ReversedWildcardFilterFactory filter.
> > >
> > > Hence we disabled ReversedWildcardFilterFactory filter and re-created
> > the indexes and this time we found the performance of Solr query to be
> > faster.
> > >
> > > But surprisingly it is observed that leading wild card searches were
> > still working inspite of disabling the ReversedWildcardFilterFactory
> filter.
> > >
> > >
> > > This behavior is puzzling everyone and wanted to know how this behavior
> > of reverse indexing works?
> > >
> > > Can anyone share with me on this Solr behavior.
> > >
> > > -Shyam
> > >
> >
> >
>
>
> --
> Regards,
>
> Dmitry Kan
>



-- 
Regards,

Dmitry Kan


RE: Question on Reverse Indexing

2012-01-18 Thread Shyam Bhaskaran
Dimitry,

We are using Solr 4.0. To rule out server caching issues I have restarted our
Tomcat web server after performing a re-index.

For reverse indexing we have defined a fieldType "text_rev", and this fieldType
is used for the fields in question.

[The text_rev fieldType definition was stripped by the list archive; parts of
the analyzer chain survive in Dmitry's later quote of this message.]

But when we found that ReversedWildcardFilterFactory was adding a performance
burden, we removed the ReversedWildcardFilterFactory filter

and the whole collection was re-indexed.

But even after removing ReversedWildcardFilterFactory, a leading wildcard
search like *lock still works.

-Shyam

-Original Message-
From: Dmitry Kan [mailto:dmitry@gmail.com] 
Sent: Wednesday, January 18, 2012 4:26 PM
To: solr-user@lucene.apache.org
Subject: Re: Question on Reverse Indexing

OK. Not sure what is your system architecture there, but could your queries
stay cached in some server caches even after you have re-indexed your data?
The way the index level leading wildcard works (reading SOLR 3.4 code, but
seems to be true circa 1.4) is that the following check is done for the
analysis chain:

[code src=SolrQueryParser.java]
boolean allow = false;
...
  if (factory instanceof ReversedWildcardFilterFactory) {
allow = true;
...
  }
...
if (allow) {
  setAllowLeadingWildcard(true);
}
[/code]

so practically what you described can happen if
the ReversedWildcardFilterFactory is still mentioned in one of your shards.
A weird question, but have you reindexed your data to a clean index or on
top of the existing one?

On Wed, Jan 18, 2012 at 12:35 PM, Shyam Bhaskaran <
shyam.bhaska...@synopsys.com> wrote:

> Dimitry,
>
> Using http://localhost:7070/solr/docs/admin/analysis.jsp passed the query
> *lock and did not find ReversedWildcardFilterFactory to the indexer or any
> other filters that could do the reversing.
>
> -Shyam
>
> -Original Message-
> From: Dmitry Kan [mailto:dmitry@gmail.com]
> Sent: Wednesday, January 18, 2012 2:26 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Question on Reverse Indexing
>
> Just to play safe here, can you double check that the reversing is not any
> more the case by issuing a query through the admin analysis page?
>
> Dmitry
>
> On Wed, Jan 18, 2012 at 4:23 AM, Shyam Bhaskaran <
> shyam.bhaska...@synopsys.com> wrote:
>
> > Hi Francois,
> >
> > I understand that disabling of ReversedWildcardFilterFactory has improved
> > the performance.
> >
> > But I am puzzled over how the leading wild card search like *lock is
> > working even though I have now disabled the ReversedWildcardFilterFactory
> > and the indexes have been created without ReversedWildcardFilter ?
> >
> > How does reverse indexing work even after disabling
> > ReversedWildcardFilterFactory?
> >
> > Can anyone explain me how this feature is working.
> >
> > -Shyam
> >
> > -Original Message-
> > From: François Schiettecatte [mailto:fschietteca...@gmail.com]
> > Sent: Wednesday, January 18, 2012 7:49 AM
> > To: solr-user@lucene.apache.org
> > Subject: Re: Question on Reverse Indexing
> >
> > Using ReversedWildcardFilterFactory will double the size of your
> > dictionary (more or less), maybe the drop in performance that you are
> > seeing is a result of that?
> >
> > François
> >
> > On Jan 17, 2012, at 9:01 PM, Shyam Bhaskaran wrote:
> >
> > > Hi,
> > >
> > > For reverse indexing we are using the ReversedWildcardFilterFactory on
> > Solr 4.0
> > >
> > >
> > >  > >
> > > maxPosAsterisk="3" maxPosQuestion="2" maxFractionAsterisk="0.33"/>
> > >
> > >
> > > ReversedWildcardFilterFactory was helping us to perform leading wild
> > card searches like *lock.
> > >
> > > But it was observed that the performance of the searches was not good
> > after introducing ReversedWildcardFilterFactory filter.
> > >
> > > Hence we disabled ReversedWildcardFilterFactory filter and re-created
> > the indexes and this time we found the performance of Solr query to be
> > faster.
> > >
> > > But surprisingly it is observed that leading wild card searches were
> > still working inspite of disabling the ReversedWildcardFilterFactory
> filter.
> > >
> > >
> > > This behavior is puzzling everyone and wanted to know how this behavior
> > of reverse indexing works?
> > >
> > > Can anyone share with me on this Solr behavior.
> > >
> > > -Shyam
> > >
> >
> >
>
>
> --
> Regards,
>
> Dmitry Kan
>



-- 
Regards,

Dmitry Kan


Re: Question on Reverse Indexing

2012-01-18 Thread Dmitry Kan
Shyam,

You still didn't say whether you started re-indexing from a clean index,
i.e. whether you removed all the data prior to re-indexing.
You can use Luke (http://code.google.com/p/luke/) to check the contents
of your text field and see whether it still contains reversed sequences.
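
Another rough way to check, sketched against the stable Lucene 3.x API (Shyam is on
4.0/trunk, whose flex API differs, so Luke may be easier there): scan the terms of the
suspect field and look for the \u0001 marker character, which, as I understand it,
ReversedWildcardFilter (with withOriginal="true") prepends to the reversed copy of
each token. The field name is an assumption.

[code]
import java.io.File;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermEnum;
import org.apache.lucene.store.FSDirectory;

public class ReversedTermCheck {
    public static void main(String[] args) throws Exception {
        String field = "text";   // the field you suspect; adjust to your schema
        IndexReader reader = IndexReader.open(FSDirectory.open(new File(args[0])));
        try {
            TermEnum te = reader.terms(new Term(field, ""));
            int reversed = 0;
            try {
                do {
                    Term t = te.term();
                    if (t == null || !t.field().equals(field)) break;
                    // Reversed tokens are marked with a leading \u0001; a cleanly
                    // rebuilt index without the filter should contain none of these.
                    if (t.text().length() > 0 && t.text().charAt(0) == '\u0001') {
                        if (++reversed <= 10) {
                            System.out.println("reversed term: " + t.text());
                        }
                    }
                } while (te.next());
            } finally {
                te.close();
            }
            System.out.println("reversed terms found: " + reversed);
        } finally {
            reader.close();
        }
    }
}
[/code]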

On Wed, Jan 18, 2012 at 1:09 PM, Shyam Bhaskaran <
shyam.bhaska...@synopsys.com> wrote:

> Dimitry,
>
> We are using Solr 4.0. To confirm server caching issues I have restarted
> our tomcat webserver after performing a re-index.
>
> For reverseIndexing we have defined a fieldType "text_rev" and this
> fieldyType was used against the fields.
>
>   omitNorms="true">
> 
> class="com.es.solr.backend.analysis.standard.SolvNetTokenizerFactory"/>
> words="stopwords.txt" ignoreCase="true"/>
> class="com.es.solr.backend.analysis.standard.SolvNetFilterFactory"/>
> synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
> class="com.es.solr.backend.analysis.standard.SpecialCharSynonymFilterFactory"/>
>
>  withOriginal="true"
>maxPosAsterisk="3" maxPosQuestion="2"
> maxFractionAsterisk="0.33"/>
>  
> 
> class="com.es.solr.backend.analysis.standard.SolvNetTokenizerFactory"/>
> words="stopwords.txt" ignoreCase="true"/>
> class="com.es.solr.backend.analysis.standard.SolvNetFilterFactory"/>
>
> words="stopwords.txt" ignoreCase="true"/>
> 
>  
>
> But when it was found that ReversedWildcardFilterFactory is adding
> performance burden we removed the ReversedWildcardFilterFactory filter
>  withOriginal="true"
>maxPosAsterisk="3" maxPosQuestion="2"
> maxFractionAsterisk="0.33"/>
> and the whole collection was re-indexed.
>
> But even after removing the ReversedWildcardFilterFactory leading wild
> card search like *lock is working.
>
> -Shyam
>
> -Original Message-
> From: Dmitry Kan [mailto:dmitry@gmail.com]
> Sent: Wednesday, January 18, 2012 4:26 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Question on Reverse Indexing
>
> OK. Not sure what is your system architecture there, but could your queries
> stay cached in some server caches even after you have re-indexed your data?
> The way the index level leading wildcard works (reading SOLR 3.4 code, but
> seems to be true circa 1.4) is that the following check is done for the
> analysis chain:
>
> [code src=SolrQueryParser.java]
> boolean allow = false;
> ...
>  if (factory instanceof ReversedWildcardFilterFactory) {
>allow = true;
>...
>  }
> ...
>if (allow) {
>  setAllowLeadingWildcard(true);
>}
> [/code]
>
> so practically what you described can happen if
> the ReversedWildcardFilterFactory is still mentioned in one of your shards.
> A weird question, but have you reindexed your data to a clean index or on
> top of the existing one?
>
> On Wed, Jan 18, 2012 at 12:35 PM, Shyam Bhaskaran <
> shyam.bhaska...@synopsys.com> wrote:
>
> > Dimitry,
> >
> > Using http://localhost:7070/solr/docs/admin/analysis.jsp passed the
> query
> > *lock and did not find ReversedWildcardFilterFactory to the indexer or
> any
> > other filters that could do the reversing.
> >
> > -Shyam
> >
> > -Original Message-
> > From: Dmitry Kan [mailto:dmitry@gmail.com]
> > Sent: Wednesday, January 18, 2012 2:26 PM
> > To: solr-user@lucene.apache.org
> > Subject: Re: Question on Reverse Indexing
> >
> > Just to play safe here, can you double check that the reversing is not
> any
> > more the case by issuing a query through the admin analysis page?
> >
> > Dmitry
> >
> > On Wed, Jan 18, 2012 at 4:23 AM, Shyam Bhaskaran <
> > shyam.bhaska...@synopsys.com> wrote:
> >
> > > Hi Francois,
> > >
> > > I understand that disabling of ReversedWildcardFilterFactory has
> improved
> > > the performance.
> > >
> > > But I am puzzled over how the leading wild card search like *lock is
> > > working even though I have now disabled the
> ReversedWildcardFilterFactory
> > > and the indexes have been created without ReversedWildcardFilter ?
> > >
> > > How does reverse indexing work even after disabling
> > > ReversedWildcardFilterFactory?
> > >
> > > Can anyone explain me how this feature is working.
> > >
> > > -Shyam
> > >
> > > -Original Message-
> > > From: François Schiettecatte [mailto:fschietteca...@gmail.com]
> > > Sent: Wednesday, January 18, 2012 7:49 AM
> > > To: solr-user@lucene.apache.org
> > > Subject: Re: Question on Reverse Indexing
> > >
> > > Using ReversedWildcardFilterFactory will double the size of your
> > > dictionary (more or less), maybe the drop in performance that you are
> > > seeing is a result of that?
> > >
> > > François
> > >
> > > On Jan 17, 2012, at 9:01 PM, Shyam Bhaskaran wrote:
> > >
> > > > Hi,
> > > >
> > > > For reverse indexing we are using the ReversedWildca

Solrj use wrong queryResponseWriter

2012-01-18 Thread tschiela
Hello,

I've run into a puzzling problem here. I use SolrJ 3.5 to query my Solr 3.5
server.

I set the QueryResponseWriter to xml both in my code and in solrconfig.xml;
in code I use this.server.setParser(new XMLResponseParser());

After I query Solr I want to output the QueryResponse:
String xml = solrs.getQueryResponse().toString();
response.setContentType("text/xml");
response.setContentLength(xml.length());

output = response.getOutputStream();
output.write(xml.getBytes());
output.flush();
output.close();

Everything runs fine, but I get the search result in a JSON-like format. What am I
doing wrong?

Greets 
Thomas

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solrj-use-wrong-queryResponseWriter-tp3668974p3668974.html
Sent from the Solr - User mailing list archive at Nabble.com.
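
For context, a minimal sketch of what is going on (host, core and the id field are
assumptions): QueryResponse.toString() is just a dump of SolrJ's internal NamedList,
which happens to look like JSON, and is unrelated to the configured response writer
or parser. Either work with the parsed objects, or fetch the raw XML yourself:

[code]
import java.io.InputStream;
import java.net.URL;

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class ResponseOutputDemo {
    public static void main(String[] args) throws Exception {
        CommonsHttpSolrServer server =
                new CommonsHttpSolrServer("http://localhost:8983/solr");

        QueryResponse rsp = server.query(new SolrQuery("*:*"));

        // toString() is only a debug dump of SolrJ's internal NamedList; it looks
        // JSON-ish regardless of the configured response writer or parser.
        System.out.println(rsp.toString());

        // Either work with the parsed objects ("id" is an assumed field) ...
        for (SolrDocument doc : rsp.getResults()) {
            System.out.println(doc.getFieldValue("id"));
        }

        // ... or, if the raw XML body is what must be written to the servlet
        // response, fetch it directly with wt=xml and stream it through,
        // bypassing SolrJ's parsing entirely.
        URL url = new URL("http://localhost:8983/solr/select?q=*:*&wt=xml");
        InputStream in = url.openStream();
        // copy 'in' to the HttpServletResponse output stream here
        in.close();
    }
}
[/code]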


Re: How to return the distance geo distance on solr 3.5 with bbox filtering

2012-01-18 Thread Maxim Veksler
Hello Mikhail,

Please see reply inline.

On Wed, Jan 18, 2012 at 11:00 AM, Mikhail Khludnev <
mkhlud...@griddynamics.com> wrote:

> Maxim,
>
> Which version of Solr you are using?
>

As mentioned in the title, I'm using Solr 3.5.


> Why the second approach at the link doesn't work for you?
> just move q=trafficRouteId:235<
> http://localhost:8983/solr/select?indent=true&fq=%7B%21bbox%7D&sfield=loc&pt=39.738548,-73.130322&d=100&sort=geodist%28%29%20asc&q=trafficRouteId:235
> >to
> fq=, because it's pretty a filter, and use geodist() as a function
> query. &sort=score%20asc&q={!func}geodist()<
> http://localhost:8983/solr/select?indent=true&fl=name,store&sfield=store&pt=45.15,-93.85&sort=score%20asc&q=%7B%21func%7Dgeodist%28%29
> >
>
>
I'm not sure I'm following. I'm trying to use bbox instead of geodist for the
filtering.

I use the fq parameters to define the bbox filtering.
I also need to query by another parameter (trafficRouteId).

Optimally I would be happy to get the distance calculation back from Solr,
but that doesn't seem to work in any form of the query I tried.

Being new to the Solr query language I'm not sure how to form the search terms
to combine all of this with the score.



> what do you get on this case? pls provide, logs, exception, and debug
> response.
>
> Thanks
>
>
> On Tue, Jan 17, 2012 at 10:06 PM, Maxim Veksler 
> wrote:
>
> > Hello,
> >
> > I'm querying with bbox which should be faster then geodist, my queries
> are
> > looking like this:
> >
> >
> http://localhost:8983/solr/select?indent=true&fq={!bbox}&sfield=loc&pt=39.738548,-73.130322&d=100&sort=geodist()%20asc&q=trafficRouteId:235
> <
> http://localhost:8983/solr/select?indent=true&fq=%7B%21bbox%7D&sfield=loc&pt=39.738548,-73.130322&d=100&sort=geodist%28%29%20asc&q=trafficRouteId:235
> >
> >
> > the trouble is, that with bbox solr does not return the distance of each
> > document, I couldn't get it to work even with tips from
> > http://wiki.apache.org/solr/SpatialSearch#Returning_the_distance
> >
> > Something I'm missing ?
> >
>
>
>
> --
> Sincerely yours
> Mikhail Khludnev
> Lucid Certified
> Apache Lucene/Solr Developer
> Grid Dynamics
>
> 
>  
>


RE: Improving Solr Spell Checker Results

2012-01-18 Thread O. Klein

Dyer, James wrote
> 
> David,
> 
> The spellchecker normally won't give suggestions for any term in your
> index.  So even if "wever" is misspelled in context, if it exists in the
> index the spell checker will not try correcting it.  There are 3
> workarounds:
> 1. Use the patch included with SOLR-2585 (this is for Trunk/4.x only). 
> See https://issues.apache.org/jira/browse/SOLR-2585
> 

When using trunk and DirectSolrSpellChecker I do get suggestions for terms
that are in the index. Lowering the thresholdTokenFrequency to 0.001 in my
case is giving me very good suggestions even if documents with the
misspelled word in them were found.

This combined with maxCollationTries (with all terms required) is giving
some sort of context sensitive suggestions.

Is this correct or is there something I'm missing?


--
View this message in context: 
http://lucene.472066.n3.nabble.com/Improving-Solr-Spell-Checker-Results-tp3658411p3669186.html
Sent from the Solr - User mailing list archive at Nabble.com.


Different mm for spellcheckquery

2012-01-18 Thread O. Klein
What is the best way to search with a mm of 0%, but use a mm of 100% on the
spellcheck query so maxCollationTries gives the best results?

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Different-mm-for-spellcheckquery-tp3669200p3669200.html
Sent from the Solr - User mailing list archive at Nabble.com.
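
A sketch of how this might be wired up from SolrJ on trunk: run the user-facing
query with mm=0%, and let the spellcheck collation tries validate candidate
collations with a strict mm. The spellcheck.collateParam.* override used here for mm
is an assumption to verify against your build; the other parameters, the host, and
the example query string are standard or hypothetical.

[code]
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.client.solrj.response.SpellCheckResponse;

public class CollationMmDemo {
    public static void main(String[] args) throws Exception {
        // Hypothetical host/core; assumes the spellcheck component is attached
        // to the default search handler.
        CommonsHttpSolrServer server =
                new CommonsHttpSolrServer("http://localhost:8983/solr");

        SolrQuery q = new SolrQuery("samsnug galxy");   // hypothetical user query
        q.set("defType", "edismax");
        q.set("mm", "0%");                    // lenient mm for the user-facing results

        q.set("spellcheck", true);
        q.set("spellcheck.count", 5);
        q.set("spellcheck.collate", true);
        q.set("spellcheck.maxCollationTries", 5);
        // Assumption to verify: the collateParam.* prefix lets the internal
        // collation test queries run with different params than the main query,
        // e.g. a strict mm so only collations matching all terms survive.
        q.set("spellcheck.collateParam.mm", "100%");

        QueryResponse rsp = server.query(q);
        SpellCheckResponse sc = rsp.getSpellCheckResponse();
        if (sc != null && sc.getCollatedResult() != null) {
            System.out.println("Did you mean: " + sc.getCollatedResult());
        }
    }
}
[/code]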


Re: Grouping results after Sorting or vice-versa

2012-01-18 Thread Vijayaragavan
Thanks Tomás and Juan...

I got the expected results when i updated solr to v3.5.0

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Grouping-results-after-Sorting-or-vice-versa-tp3615957p3669299.html
Sent from the Solr - User mailing list archive at Nabble.com.


"index-time" over boosted

2012-01-18 Thread remi tassing
Hello all,

I've come across a problem where newly indexed pages almost always come
first, even when the term frequency is relatively low.

I read the posts below on "fieldNorm" and "omitNorms" but setting
"omitNorms=true" doesn't change anything for me on the calculation of
fieldNorm.

e.g.:
0.12333163 = (MATCH) weight(content:"mobil broadband" in 1004), product of:
  1.0 = queryWeight(content:"mobil broadband"), product of:
    6.3145795 = idf(content: mobil=4922 broadband=2290)
    0.15836367 = queryNorm
  0.12333163 = fieldWeight(content:"mobil broadband" in 1004), product of:
    1.0 = tf(phraseFreq=1.0)
    6.3145795 = idf(content: mobil=4922 broadband=2290)
    0.01953125 = fieldNorm(field=content, doc=1004)

These values are the same regardless of omitNorms's value.

Any idea what might be the problem?

[1]
http://lucene.472066.n3.nabble.com/QueryNorm-and-FieldNorm-td1992964.html
[2]
http://lucene.472066.n3.nabble.com/Question-about-fieldNorms-td504500.html


ReversedWildcardFilterFactory Question

2012-01-18 Thread Jamie Johnson
I'm trying to determine when it is appropriate to use
solr.ReversedWildcardFilterFactory. Specifically, if I have a field
content of type text (from the default schema) which I want to be able to
search with leading wildcards, do I need to index this information into
both a text field and a text_rev field, or is it sufficient to just
index it into a text_rev field?  I *think* it only
needs to be in text_rev, but I want to make sure before I go mucking
with my schema.


Size of fields from one document (monitoring, debugging)

2012-01-18 Thread Vadim Kisselmann
Hello folks,

is it possible to find out the size (in KB) of specific fields of
one document? Perhaps with Luke or Lucid Gaze?
My case:
docs in my old index (Solr 1.4) take 3-4KB each.
In my new index (Solr 4.0 trunk) it is about 15KB per doc.
I changed only 2 things in my schema.xml: I added
ReversedWildcardFilterFactory (at index time) and one field (LatLonType,
stored and indexed).
My content is more or less the same.
I would like to debug this so I can refactor my schema.xml.

The newest Luke version (3.5) doesn't work with Solr 4.0 from trunk, so
I can't test it.

Cheers
Vadim
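
If it helps, a rough way to eyeball the stored-field portion of a single document,
sketched against the Lucene 3.x API (on a 4.0/trunk index the class names differ,
e.g. IndexableField instead of Fieldable). This only measures stored values; the
per-document cost of postings, positions and norms is not captured, and the doc id
to inspect is an assumption.

[code]
import java.io.File;
import java.util.List;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Fieldable;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.store.FSDirectory;

public class StoredFieldSizes {
    public static void main(String[] args) throws Exception {
        // args[0] = path to the index directory (e.g. data/index)
        IndexReader reader = IndexReader.open(FSDirectory.open(new File(args[0])));
        try {
            int docId = 0;   // pick any non-deleted doc id to inspect
            Document doc = reader.document(docId);
            List<Fieldable> fields = doc.getFields();
            for (Fieldable f : fields) {
                String v = f.stringValue();
                int bytes = (v == null) ? 0 : v.getBytes("UTF-8").length;
                System.out.println(f.name() + ": ~" + bytes + " bytes stored");
            }
        } finally {
            reader.close();
        }
    }
}
[/code]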


Re: ReversedWildcardFilterFactory Question

2012-01-18 Thread Dmitry Kan
You can store both the non-reversed and reversed terms in one field, so
that you can do leading-wildcard and other searches against a single field. So
your schema may look something like this:

[The example fieldType and field definitions were stripped by the list archive;
the gist is a single field type whose index-time analyzer includes
ReversedWildcardFilterFactory (withOriginal="true") and whose query-time analyzer
does not.]
On Wed, Jan 18, 2012 at 4:19 PM, Jamie Johnson  wrote:

> I'm trying to determine when it is appropriate to use the
> solr.ReversedWildcardFilterFactory, specifically if I have a field
> content of type text (from default schema) which I want to be able to
> search with leading wildcards do I need to index this information into
> both a text field and a text_rev field, or is it sufficient to just
> index the information into a text_rev field?  I *think* that it only
> needs to be in text_rev, but I want to make sure before I go mucking
> with my schema.
>



-- 
Regards,

Dmitry Kan


Re: Trying to understand SOLR memory requirements

2012-01-18 Thread Dave
I'm using 3.5

On Tue, Jan 17, 2012 at 7:57 PM, Lance Norskog  wrote:

> Which version of Solr do you use? 3.1 and 3.2 had a memory leak bug in
> spellchecking. This was fixed in 3.3.
>
> On Tue, Jan 17, 2012 at 5:59 AM, Robert Muir  wrote:
> > I committed it already: so you can try out branch_3x if you want.
> >
> > you can either wait for a nightly build or compile from svn
> > (http://svn.apache.org/repos/asf/lucene/dev/branches/branch_3x/).
> >
> > On Tue, Jan 17, 2012 at 8:35 AM, Dave  wrote:
> >> Thank you Robert, I'd appreciate that. Any idea how long it will take to
> >> get a fix? Would I be better switching to trunk? Is trunk stable enough
> for
> >> someone who's very much a SOLR novice?
> >>
> >> Thanks,
> >> Dave
> >>
> >> On Mon, Jan 16, 2012 at 10:08 PM, Robert Muir  wrote:
> >>
> >>> looks like https://issues.apache.org/jira/browse/SOLR-2888.
> >>>
> >>> Previously, FST would need to hold all the terms in RAM during
> >>> construction, but with the patch it uses offline sorts/temporary
> >>> files.
> >>> I'll reopen the issue to backport this to the 3.x branch.
> >>>
> >>>
> >>> On Mon, Jan 16, 2012 at 8:31 PM, Dave  wrote:
> >>> > I'm trying to figure out what my memory needs are for a rather large
> >>> > dataset. I'm trying to build an auto-complete system for every
> >>> > city/state/country in the world. I've got a geographic database, and
> have
> >>> > setup the DIH to pull the proper data in. There are 2,784,937
> documents
> >>> > which I've formatted into JSON-like output, so there's a bit of data
> >>> > associated with each one. Here is an example record:
> >>> >
> >>> > Brooklyn, New York, United States?{ |id|: |2620829|,
> >>> > |timezone|:|America/New_York|,|type|: |3|, |country|: { |id| : |229|
> },
> >>> > |region|: { |id| : |3608| }, |city|: { |id|: |2616971|, |plainname|:
> >>> > |Brooklyn|, |name|: |Brooklyn, New York, United States| }, |hint|:
> >>> > |2300664|, |label|: |Brooklyn, New York, United States|, |value|:
> >>> > |Brooklyn, New York, United States|, |title|: |Brooklyn, New York,
> United
> >>> > States| }
> >>> >
> >>> > I've got the spellchecker / suggester module setup, and I can confirm
> >>> that
> >>> > everything works properly with a smaller dataset (i.e. just a couple
> of
> >>> > countries worth of cities/states). However I'm running into a big
> problem
> >>> > when I try to index the entire dataset. The
> >>> dataimport?command=full-import
> >>> > works and the system comes to an idle state. It generates the
> following
> >>> > data/index/ directory (I'm including it in case it gives any
> indication
> >>> on
> >>> > memory requirements):
> >>> >
> >>> > -rw-rw 1 root   root   2.2G Jan 17 00:13 _2w.fdt
> >>> > -rw-rw 1 root   root22M Jan 17 00:13 _2w.fdx
> >>> > -rw-rw 1 root   root131 Jan 17 00:13 _2w.fnm
> >>> > -rw-rw 1 root   root   134M Jan 17 00:13 _2w.frq
> >>> > -rw-rw 1 root   root16M Jan 17 00:13 _2w.nrm
> >>> > -rw-rw 1 root   root   130M Jan 17 00:13 _2w.prx
> >>> > -rw-rw 1 root   root   9.2M Jan 17 00:13 _2w.tii
> >>> > -rw-rw 1 root   root   1.1G Jan 17 00:13 _2w.tis
> >>> > -rw-rw 1 root   root 20 Jan 17 00:13 segments.gen
> >>> > -rw-rw 1 root   root291 Jan 17 00:13 segments_2
> >>> >
> >>> > Next I try to run the suggest?spellcheck.build=true command, and I
> get
> >>> the
> >>> > following error:
> >>> >
> >>> > Jan 16, 2012 4:01:47 PM org.apache.solr.spelling.suggest.Suggester
> build
> >>> > INFO: build()
> >>> > Jan 16, 2012 4:03:27 PM org.apache.solr.common.SolrException log
> >>> > SEVERE: java.lang.OutOfMemoryError: GC overhead limit exceeded
> >>> >  at java.util.Arrays.copyOfRange(Arrays.java:3209)
> >>> > at java.lang.String.(String.java:215)
> >>> >  at org.apache.lucene.index.TermBuffer.toTerm(TermBuffer.java:122)
> >>> > at
> org.apache.lucene.index.SegmentTermEnum.term(SegmentTermEnum.java:184)
> >>> >  at
> org.apache.lucene.index.TermInfosReader.get(TermInfosReader.java:203)
> >>> > at
> org.apache.lucene.index.TermInfosReader.get(TermInfosReader.java:172)
> >>> >  at
> org.apache.lucene.index.SegmentReader.docFreq(SegmentReader.java:509)
> >>> > at
> >>>
> org.apache.lucene.index.DirectoryReader.docFreq(DirectoryReader.java:719)
> >>> >  at
> >>>
> org.apache.solr.search.SolrIndexReader.docFreq(SolrIndexReader.java:309)
> >>> > at
> >>> >
> >>>
> org.apache.lucene.search.spell.HighFrequencyDictionary$HighFrequencyIterator.isFrequent(HighFrequencyDictionary.java:75)
> >>> >  at
> >>> >
> >>>
> org.apache.lucene.search.spell.HighFrequencyDictionary$HighFrequencyIterator.hasNext(HighFrequencyDictionary.java:125)
> >>> > at
> >>>
> org.apache.lucene.search.suggest.fst.FSTLookup.build(FSTLookup.java:157)
> >>> >  at org.apache.lucene.search.suggest.Lookup.build(Lookup.java:70)
> >>> > at
> org.apache.solr.spelling.suggest.Suggester.build(Suggester.java:133)
> >>> >  at
> >>> >
> >>>
> org.apache.solr.handler.component.SpellCheckComponent.prepare(SpellCheckComp

Re: Trying to understand SOLR memory requirements

2012-01-18 Thread Dave
Robert, where can I pull down a nightly build from? Will it include the
apache-solr-core-3.3.0.jar and lucene-core-3.3-SNAPSHOT.jar jars? I need to
re-build with a custom SpellingQueryConverter.java.

Thanks,
Dave

On Tue, Jan 17, 2012 at 8:59 AM, Robert Muir  wrote:

> I committed it already: so you can try out branch_3x if you want.
>
> you can either wait for a nightly build or compile from svn
> (http://svn.apache.org/repos/asf/lucene/dev/branches/branch_3x/).
>
> On Tue, Jan 17, 2012 at 8:35 AM, Dave  wrote:
> > Thank you Robert, I'd appreciate that. Any idea how long it will take to
> > get a fix? Would I be better switching to trunk? Is trunk stable enough
> for
> > someone who's very much a SOLR novice?
> >
> > Thanks,
> > Dave
> >
> > On Mon, Jan 16, 2012 at 10:08 PM, Robert Muir  wrote:
> >
> >> looks like https://issues.apache.org/jira/browse/SOLR-2888.
> >>
> >> Previously, FST would need to hold all the terms in RAM during
> >> construction, but with the patch it uses offline sorts/temporary
> >> files.
> >> I'll reopen the issue to backport this to the 3.x branch.
> >>
> >>
> >> On Mon, Jan 16, 2012 at 8:31 PM, Dave  wrote:
> >> > I'm trying to figure out what my memory needs are for a rather large
> >> > dataset. I'm trying to build an auto-complete system for every
> >> > city/state/country in the world. I've got a geographic database, and
> have
> >> > setup the DIH to pull the proper data in. There are 2,784,937
> documents
> >> > which I've formatted into JSON-like output, so there's a bit of data
> >> > associated with each one. Here is an example record:
> >> >
> >> > Brooklyn, New York, United States?{ |id|: |2620829|,
> >> > |timezone|:|America/New_York|,|type|: |3|, |country|: { |id| : |229|
> },
> >> > |region|: { |id| : |3608| }, |city|: { |id|: |2616971|, |plainname|:
> >> > |Brooklyn|, |name|: |Brooklyn, New York, United States| }, |hint|:
> >> > |2300664|, |label|: |Brooklyn, New York, United States|, |value|:
> >> > |Brooklyn, New York, United States|, |title|: |Brooklyn, New York,
> United
> >> > States| }
> >> >
> >> > I've got the spellchecker / suggester module setup, and I can confirm
> >> that
> >> > everything works properly with a smaller dataset (i.e. just a couple
> of
> >> > countries worth of cities/states). However I'm running into a big
> problem
> >> > when I try to index the entire dataset. The
> >> dataimport?command=full-import
> >> > works and the system comes to an idle state. It generates the
> following
> >> > data/index/ directory (I'm including it in case it gives any
> indication
> >> on
> >> > memory requirements):
> >> >
> >> > -rw-rw 1 root   root   2.2G Jan 17 00:13 _2w.fdt
> >> > -rw-rw 1 root   root22M Jan 17 00:13 _2w.fdx
> >> > -rw-rw 1 root   root131 Jan 17 00:13 _2w.fnm
> >> > -rw-rw 1 root   root   134M Jan 17 00:13 _2w.frq
> >> > -rw-rw 1 root   root16M Jan 17 00:13 _2w.nrm
> >> > -rw-rw 1 root   root   130M Jan 17 00:13 _2w.prx
> >> > -rw-rw 1 root   root   9.2M Jan 17 00:13 _2w.tii
> >> > -rw-rw 1 root   root   1.1G Jan 17 00:13 _2w.tis
> >> > -rw-rw 1 root   root 20 Jan 17 00:13 segments.gen
> >> > -rw-rw 1 root   root291 Jan 17 00:13 segments_2
> >> >
> >> > Next I try to run the suggest?spellcheck.build=true command, and I get
> >> the
> >> > following error:
> >> >
> >> > Jan 16, 2012 4:01:47 PM org.apache.solr.spelling.suggest.Suggester
> build
> >> > INFO: build()
> >> > Jan 16, 2012 4:03:27 PM org.apache.solr.common.SolrException log
> >> > SEVERE: java.lang.OutOfMemoryError: GC overhead limit exceeded
> >> >  at java.util.Arrays.copyOfRange(Arrays.java:3209)
> >> > at java.lang.String.(String.java:215)
> >> >  at org.apache.lucene.index.TermBuffer.toTerm(TermBuffer.java:122)
> >> > at
> org.apache.lucene.index.SegmentTermEnum.term(SegmentTermEnum.java:184)
> >> >  at
> org.apache.lucene.index.TermInfosReader.get(TermInfosReader.java:203)
> >> > at
> org.apache.lucene.index.TermInfosReader.get(TermInfosReader.java:172)
> >> >  at
> org.apache.lucene.index.SegmentReader.docFreq(SegmentReader.java:509)
> >> > at
> >>
> org.apache.lucene.index.DirectoryReader.docFreq(DirectoryReader.java:719)
> >> >  at
> >> org.apache.solr.search.SolrIndexReader.docFreq(SolrIndexReader.java:309)
> >> > at
> >> >
> >>
> org.apache.lucene.search.spell.HighFrequencyDictionary$HighFrequencyIterator.isFrequent(HighFrequencyDictionary.java:75)
> >> >  at
> >> >
> >>
> org.apache.lucene.search.spell.HighFrequencyDictionary$HighFrequencyIterator.hasNext(HighFrequencyDictionary.java:125)
> >> > at
> >> org.apache.lucene.search.suggest.fst.FSTLookup.build(FSTLookup.java:157)
> >> >  at org.apache.lucene.search.suggest.Lookup.build(Lookup.java:70)
> >> > at
> org.apache.solr.spelling.suggest.Suggester.build(Suggester.java:133)
> >> >  at
> >> >
> >>
> org.apache.solr.handler.component.SpellCheckComponent.prepare(SpellCheckComponent.java:109)
> >> > at
> >> >
> >>
> org.apache.solr.handler.component.Sea

RE: DataImportHandler in Solr 4.0

2012-01-18 Thread Dyer, James
You need to find "apache-solr-solrj-4.0.jar" from your distribution and put it 
in the classpath somewhere.  Perhaps the easiest thing is to include it in your 
core's "lib" directory.

James Dyer
E-Commerce Systems
Ingram Content Group
(615) 213-4311


-Original Message-
From: Rob [mailto:rlusa...@gmail.com] 
Sent: Tuesday, January 17, 2012 6:38 PM
To: solr-user@lucene.apache.org
Subject: Re: DataImportHandler in Solr 4.0

Not a java pro, and the documentation hasn't been updated to include these
instructions (at least that I could find). What do I need to do to perform
the steps that Alexandre is talking about?

--
View this message in context: 
http://lucene.472066.n3.nabble.com/DataImportHandler-in-Solr-4-0-tp2563053p3667942.html
Sent from the Solr - User mailing list archive at Nabble.com.


RE: Improving Solr Spell Checker Results

2012-01-18 Thread Dyer, James
Taking a quick look at DirectSolrSpellChecker I think I agree that using 
DirectSolrSpellChecker and the "thresholdTokenFrequency" parameter may provide 
an additional workaround for David's situation.  One caveat is that terms like 
"wever" need to always be low-frequency.  Also, DirectSolrSpellChecker is 
available only for 4.x/Trunk, where it is the default spellcheck impl.  But if 
using 4.x/Trunk, you can possibly do even better by applying the SOLR-2585 
patch:  even if the misspelled word is high-frequency yet wrong in context, this 
patch still would allow you to get suggestions.  (The downside being that 
SOLR-2585 is brand-new and hasn't seen much scrutiny yet.)

This is different behavior than IndexBasedSpellChecker, which will never give 
suggestions for a term in the index (unless of course you use 
"onlyMorePopular").  With IndexBasedSpellChecker, "thresholdTokenFrequency" 
only removes low-frequency terms from possibly being suggested.  It does not 
control which terms will generate suggestions.  IndexBasedSpellChecker is the 
default spellcheck impl for 3.x and earlier versions.

Thank you for clarifying this important difference between the two spellcheck 
impls.

James Dyer
E-Commerce Systems
Ingram Content Group
(615) 213-4311


-Original Message-
From: O. Klein [mailto:kl...@octoweb.nl] 
Sent: Wednesday, January 18, 2012 7:22 AM
To: solr-user@lucene.apache.org
Subject: RE: Improving Solr Spell Checker Results


Dyer, James wrote
> 
> David,
> 
> The spellchecker normally won't give suggestions for any term in your
> index.  So even if "wever" is misspelled in context, if it exists in the
> index the spell checker will not try correcting it.  There are 3
> workarounds:
> 1. Use the patch included with SOLR-2585 (this is for Trunk/4.x only). 
> See https://issues.apache.org/jira/browse/SOLR-2585
> 

When using trunk and DirectSolrSpellChecker I do get suggestions for terms
that are in the index. Lowering the thresholdTokenFrequency to 0.001 in my
case is giving me very good suggestions even if documents with the
misspelled word in them were found.

This combined with maxCollationTries (with all terms required) is giving
some sort of context sensitive suggestions.

Is this correct or is there something I'm missing?


--
View this message in context: 
http://lucene.472066.n3.nabble.com/Improving-Solr-Spell-Checker-Results-tp3658411p3669186.html
Sent from the Solr - User mailing list archive at Nabble.com.


replication, disk space

2012-01-18 Thread Jonathan Rochkind
So Solr 1.4. I have a solr master/slave, where it actually doesn't poll 
for replication, it only replicates irregularly when I issue a replicate 
command to it.


After the last replication, the slave, in solr_home, has a data/index 
directory as well as a data/index.20120113121302 directory.


The /admin/replication/index.jsp admin page reports:

Local Index
Index Version: 1326407139862, Generation: 183
Location: /opt/solr/solr_searcher/prod/data/index.20120113121302


So does this mean the index.<timestamp> directory is actually the one currently 
being used live, not the straight 'index'? Why?


I can't afford the disk space to leave both of these around 
indefinitely.  After replication completes and is committed, why would 
two index dirs be left?  And how can I restore this to one index dir, 
without downtime? If it's really using the "index.<timestamp>" directory, then 
I could just delete the "index" directory, but that's a bad idea, 
because next time the server starts it's going to be looking for 
"index", not "index.<timestamp>".  And if it's using the timestamped index 
directory now, I can't delete THAT one now either.


If I was willing to restart the tomcat container, then I could delete 
one, rename the other, etc. But I don't want downtime.


I really don't understand what's going on or how it got in this state. 
Any ideas?


Jonathan



Re: Solr Cloud Indexing

2012-01-18 Thread Sujatha Arun
Thanks for the input. I conclude that it does not make sense to do it this
way.

Regards
Sujatha

On Wed, Jan 18, 2012 at 6:26 AM, Lance Norskog  wrote:

> Cloud upload bandwidth is free, but download bandwidth costs money. If
> you upload a lot of data but do not query it often, Amazon can make
> sense.  You can also rent much cheaper hardware in other hosting
> services where you pay by the month or even by the year. If you know
> you have a cap on how much resource you will need at once, the cheaper
> sites make more sense.
>
> On Tue, Jan 17, 2012 at 7:36 AM, Erick Erickson 
> wrote:
> > This only really makes sense if you don't have enough in-house resources
> > to do your indexing locally, but it certainly is possible.
> >
> > Amazon's EC2 has been used, but really any hosting service should do.
> >
> > Best
> > Erick
> >
> > On Tue, Jan 17, 2012 at 12:09 AM, Sujatha Arun 
> wrote:
> >> Would it make sense to  Index on the cloud and periodically [2-4 times
> >> /day] replicate the index at  our server for searching .Which service
> to go
> >> with for solr Cloud Indexing ?
> >>
> >> Any good and tried services?
> >>
> >> Regards
> >> Sujatha
>
>
>
> --
> Lance Norskog
> goks...@gmail.com
>


Re: PositionIncrementGap inside a field

2012-01-18 Thread maurizio1976
This is actually a *nested proximity search*.
I think the query you wrote there, Mergio, will not work,
and I think there is no way in Solr to run a nested proximity query yet.
Do you know anything about that, Erik?

this is what you want to do:
http://www.slideshare.net/MarkHarwood/proposal-for-nested-document-support-in-lucene

the only way seems to be to do it through Lucene. have a look at what this
guy did:
http://blog.griddynamics.com/2011/10/solr-experience-search-parent-child.html

hope this helps. :)

Maurizio

--
View this message in context: 
http://lucene.472066.n3.nabble.com/PositionIncrementGap-inside-a-field-tp3666243p3669830.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: replication, disk space

2012-01-18 Thread Artem Lokotosh
Which OS are you using?
Maybe related to this Solr bug
https://issues.apache.org/jira/browse/SOLR-1781

On Wed, Jan 18, 2012 at 6:32 PM, Jonathan Rochkind  wrote:
> So Solr 1.4. I have a solr master/slave, where it actually doesn't poll for
> replication, it only replicates irregularly when I issue a replicate command
> to it.
>
> After the last replication, the slave, in solr_home, has a data/index
> directory as well as a data/index.20120113121302 directory.
>
> The /admin/replication/index.jsp admin page reports:
>
> Local Index
> Index Version: 1326407139862, Generation: 183
> Location: /opt/solr/solr_searcher/prod/data/index.20120113121302
>
>
> So does this mean the index. file is actually the one currently being
> used live, not the straight 'index'? Why?
>
> I can't afford the disk space to leave both of these around indefinitely.
>  After replication completes and is committed, why would two index dirs be
> left?  And how can I restore this to one index dir, without downtime? If
> it's really using the "index.X" directory, then I could just delete the
> "index" directory, but that's a bad idea, because next time the server
> starts it's going to be looking for "index", not "index.".  And if it's
> using the timestamped index file now, I can't delete THAT one now either.
>
> If I was willing to restart the tomcat container, then I could delete one,
> rename the other, etc. But I don't want downtime.
>
> I really don't understand what's going on or how it got in this state. Any
> ideas?
>
> Jonathan
>



-- 
Best regards,
Artem Lokotosh        mailto:arco...@gmail.com


RE: replication, disk space

2012-01-18 Thread Dyer, James
I've seen this happen when the configuration files change on the master and 
replication deems it necessary to do a core reload on the slave. In this case, 
replication copies the entire index to the new directory, then does a core 
reload to make the new config files and new index directory go live.  Because 
it keeps the old searcher running while the new searcher is being started, 
both index copies exist until the swap is complete.  I remember having the 
same concern about re-starts, but I believe I tested this and Solr will look at 
the "replication.properties" file on startup and determine the correct index 
dir to use from that.  So (if my memory is correct) you can safely delete 
"index" so long as "replication.properties" points to the other directory.

I wasn't familiar with SOLR-1781.  Maybe replication is supposed to clean up 
the extra directories and sometimes doesn't?  In any case, I've found that 
whenever it happens it's OK to go out and delete the one(s) not being used, 
even if that means deleting "index".

James Dyer
E-Commerce Systems
Ingram Content Group
(615) 213-4311

-Original Message-
From: Artem Lokotosh [mailto:arco...@gmail.com] 
Sent: Wednesday, January 18, 2012 12:24 PM
To: solr-user@lucene.apache.org
Subject: Re: replication, disk space

Which OS do you using?
Maybe related to this Solr bug
https://issues.apache.org/jira/browse/SOLR-1781

On Wed, Jan 18, 2012 at 6:32 PM, Jonathan Rochkind  wrote:
> So Solr 1.4. I have a solr master/slave, where it actually doesn't poll for
> replication, it only replicates irregularly when I issue a replicate command
> to it.
>
> After the last replication, the slave, in solr_home, has a data/index
> directory as well as a data/index.20120113121302 directory.
>
> The /admin/replication/index.jsp admin page reports:
>
> Local Index
> Index Version: 1326407139862, Generation: 183
> Location: /opt/solr/solr_searcher/prod/data/index.20120113121302
>
>
> So does this mean the index. file is actually the one currently being
> used live, not the straight 'index'? Why?
>
> I can't afford the disk space to leave both of these around indefinitely.
>  After replication completes and is committed, why would two index dirs be
> left?  And how can I restore this to one index dir, without downtime? If
> it's really using the "index.X" directory, then I could just delete the
> "index" directory, but that's a bad idea, because next time the server
> starts it's going to be looking for "index", not "index.".  And if it's
> using the timestamped index file now, I can't delete THAT one now either.
>
> If I was willing to restart the tomcat container, then I could delete one,
> rename the other, etc. But I don't want downtime.
>
> I really don't understand what's going on or how it got in this state. Any
> ideas?
>
> Jonathan
>



-- 
Best regards,
Artem Lokotosh        mailto:arco...@gmail.com


Re: "index-time" over boosted

2012-01-18 Thread Jan Høydahl
> I've come accros a problem where newly indexed pages almost always come
> first even when the term frequency is relatively slow.

There is no inherent index-time boost, so this must be something else.
Can you give us an example of a query? Which query parser do you use?

> I read the posts below on "fieldNorm" and "omitNorms" but setting
> "omitNorms=true" doesn't change anything for me on the calculation of
> fieldNorm.

Are you sure you have spelled omitNorms="true" correctly, then restarted Solr 
(to refresh config)? The effect of Norms on your score will be that shorter 
fields score higher than long fields.
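
For reference, with the stock Lucene/Solr DefaultSimilarity (stated from memory, so
treat it as an assumption to verify):

  fieldNorm = doc_boost * field_boost * (1 / sqrt(number_of_terms_in_field))

and the result is quantized into a single byte at index time, which is why only
coarse values such as 0.01953125 show up in the explain output. Norms are also baked
in at index time: flipping omitNorms="true" only affects newly indexed documents, so
a full rebuild of the index is needed before the fieldNorm factor stops influencing
the score.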

Perhaps you can instead tell us your use-case. What kind of ranking are
you trying to achieve? Then we can help suggest how to get there.

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Solr Training - www.solrtraining.com

Re: replication, disk space

2012-01-18 Thread Tomás Fernández Löbbe
As far as I know, replication is supposed to delete the old index
directory. However, the initial question is "why is this new index directory
being created?". Are you adding/updating documents on the slave? What about
optimizing it? Are you rebuilding the index from scratch on the master?

Also, What OS are you on?

Tomás

On Wed, Jan 18, 2012 at 3:41 PM, Dyer, James wrote:

> I've seen this happen when the configuration files change on the master
> and replication deems it necessary to do a core-reload on the slave. In
> this case, replication copies the entire index to the new directory then
> does a core re-load to make the new config files and new index directory go
> live.  Because it is keeping the old searcher running while the new
> searcher is being started, both index copies to exist until the swap is
> complete.  I remember having the same concern about re-starts, but I
> believe I tested this and solr will look at the "replication.properties"
> file on startup and determine the correct index dir to use from that.  So
> (If my memory is correct) you can safely delete "index" so long as
> "replication.properties" points to the other directory.
>
> I wasn't familiar with SOLR-1781.  Maybe replication is supposed to clean
> up the extra directories and doesn't sometimes?  In any case, I've found
> whenever it happens its ok to go out and delete the one(s) not being used,
> even if that means deleting "index".
>
> James Dyer
> E-Commerce Systems
> Ingram Content Group
> (615) 213-4311
>
> -Original Message-
> From: Artem Lokotosh [mailto:arco...@gmail.com]
> Sent: Wednesday, January 18, 2012 12:24 PM
> To: solr-user@lucene.apache.org
> Subject: Re: replication, disk space
>
> Which OS do you using?
> Maybe related to this Solr bug
> https://issues.apache.org/jira/browse/SOLR-1781
>
> On Wed, Jan 18, 2012 at 6:32 PM, Jonathan Rochkind 
> wrote:
> > So Solr 1.4. I have a solr master/slave, where it actually doesn't poll
> for
> > replication, it only replicates irregularly when I issue a replicate
> command
> > to it.
> >
> > After the last replication, the slave, in solr_home, has a data/index
> > directory as well as a data/index.20120113121302 directory.
> >
> > The /admin/replication/index.jsp admin page reports:
> >
> > Local Index
> > Index Version: 1326407139862, Generation: 183
> > Location: /opt/solr/solr_searcher/prod/data/index.20120113121302
> >
> >
> > So does this mean the index. file is actually the one currently being
> > used live, not the straight 'index'? Why?
> >
> > I can't afford the disk space to leave both of these around indefinitely.
> >  After replication completes and is committed, why would two index dirs
> be
> > left?  And how can I restore this to one index dir, without downtime? If
> > it's really using the "index.X" directory, then I could just delete
> the
> > "index" directory, but that's a bad idea, because next time the server
> > starts it's going to be looking for "index", not "index.<timestamp>".  And if it's
> > using the timestamped index file now, I can't delete THAT one now either.
> >
> > If I was willing to restart the tomcat container, then I could delete
> one,
> > rename the other, etc. But I don't want downtime.
> >
> > I really don't understand what's going on or how it got in this state.
> Any
> > ideas?
> >
> > Jonathan
> >
>
>
>
> --
> Best regards,
> Artem Lokotosh, arco...@gmail.com
>


Pdf Portfolios

2012-01-18 Thread Lucas Simão
Hello ,

I am trying to index PDF files in Solr. When the PDF file is simple,
everything is fine, but when I use a PDF Portfolio
(
http://help.adobe.com/en_US/Acrobat/9.0/Standard/WSA2872EA8-9756-4a8c-9F20-8E93D59D91CE.html
 )

using Tika it does not work.

Does someone know how to extract data from PDF Portfolios?


att,
Lucas


Re: Trying to understand SOLR memory requirements

2012-01-18 Thread Dave
Ok, I've been able to pull the code from SVN, build it, and compile my
SpellingQueryConverter against it. However, I'm at a loss as to where to
find / how to build the solr.war file?

On Tue, Jan 17, 2012 at 8:59 AM, Robert Muir  wrote:

> I committed it already: so you can try out branch_3x if you want.
>
> you can either wait for a nightly build or compile from svn
> (http://svn.apache.org/repos/asf/lucene/dev/branches/branch_3x/).
>
> On Tue, Jan 17, 2012 at 8:35 AM, Dave  wrote:
> > Thank you Robert, I'd appreciate that. Any idea how long it will take to
> > get a fix? Would I be better switching to trunk? Is trunk stable enough
> for
> > someone who's very much a SOLR novice?
> >
> > Thanks,
> > Dave
> >
> > On Mon, Jan 16, 2012 at 10:08 PM, Robert Muir  wrote:
> >
> >> looks like https://issues.apache.org/jira/browse/SOLR-2888.
> >>
> >> Previously, FST would need to hold all the terms in RAM during
> >> construction, but with the patch it uses offline sorts/temporary
> >> files.
> >> I'll reopen the issue to backport this to the 3.x branch.
> >>
> >>
> >> On Mon, Jan 16, 2012 at 8:31 PM, Dave  wrote:
> >> > I'm trying to figure out what my memory needs are for a rather large
> >> > dataset. I'm trying to build an auto-complete system for every
> >> > city/state/country in the world. I've got a geographic database, and
> have
> >> > setup the DIH to pull the proper data in. There are 2,784,937
> documents
> >> > which I've formatted into JSON-like output, so there's a bit of data
> >> > associated with each one. Here is an example record:
> >> >
> >> > Brooklyn, New York, United States?{ |id|: |2620829|,
> >> > |timezone|:|America/New_York|,|type|: |3|, |country|: { |id| : |229|
> },
> >> > |region|: { |id| : |3608| }, |city|: { |id|: |2616971|, |plainname|:
> >> > |Brooklyn|, |name|: |Brooklyn, New York, United States| }, |hint|:
> >> > |2300664|, |label|: |Brooklyn, New York, United States|, |value|:
> >> > |Brooklyn, New York, United States|, |title|: |Brooklyn, New York,
> United
> >> > States| }
> >> >
> >> > I've got the spellchecker / suggester module setup, and I can confirm
> >> that
> >> > everything works properly with a smaller dataset (i.e. just a couple
> of
> >> > countries worth of cities/states). However I'm running into a big
> problem
> >> > when I try to index the entire dataset. The
> >> dataimport?command=full-import
> >> > works and the system comes to an idle state. It generates the
> following
> >> > data/index/ directory (I'm including it in case it gives any
> indication
> >> on
> >> > memory requirements):
> >> >
> >> > -rw-rw 1 root   root   2.2G Jan 17 00:13 _2w.fdt
> >> > -rw-rw 1 root   root    22M Jan 17 00:13 _2w.fdx
> >> > -rw-rw 1 root   root    131 Jan 17 00:13 _2w.fnm
> >> > -rw-rw 1 root   root   134M Jan 17 00:13 _2w.frq
> >> > -rw-rw 1 root   root    16M Jan 17 00:13 _2w.nrm
> >> > -rw-rw 1 root   root   130M Jan 17 00:13 _2w.prx
> >> > -rw-rw 1 root   root   9.2M Jan 17 00:13 _2w.tii
> >> > -rw-rw 1 root   root   1.1G Jan 17 00:13 _2w.tis
> >> > -rw-rw 1 root   root     20 Jan 17 00:13 segments.gen
> >> > -rw-rw 1 root   root    291 Jan 17 00:13 segments_2
> >> >
> >> > Next I try to run the suggest?spellcheck.build=true command, and I get
> >> the
> >> > following error:
> >> >
> >> > Jan 16, 2012 4:01:47 PM org.apache.solr.spelling.suggest.Suggester
> build
> >> > INFO: build()
> >> > Jan 16, 2012 4:03:27 PM org.apache.solr.common.SolrException log
> >> > SEVERE: java.lang.OutOfMemoryError: GC overhead limit exceeded
> >> >  at java.util.Arrays.copyOfRange(Arrays.java:3209)
> >> > at java.lang.String.<init>(String.java:215)
> >> >  at org.apache.lucene.index.TermBuffer.toTerm(TermBuffer.java:122)
> >> > at
> org.apache.lucene.index.SegmentTermEnum.term(SegmentTermEnum.java:184)
> >> >  at
> org.apache.lucene.index.TermInfosReader.get(TermInfosReader.java:203)
> >> > at
> org.apache.lucene.index.TermInfosReader.get(TermInfosReader.java:172)
> >> >  at
> org.apache.lucene.index.SegmentReader.docFreq(SegmentReader.java:509)
> >> > at
> >>
> org.apache.lucene.index.DirectoryReader.docFreq(DirectoryReader.java:719)
> >> >  at
> >> org.apache.solr.search.SolrIndexReader.docFreq(SolrIndexReader.java:309)
> >> > at
> >> >
> >>
> org.apache.lucene.search.spell.HighFrequencyDictionary$HighFrequencyIterator.isFrequent(HighFrequencyDictionary.java:75)
> >> >  at
> >> >
> >>
> org.apache.lucene.search.spell.HighFrequencyDictionary$HighFrequencyIterator.hasNext(HighFrequencyDictionary.java:125)
> >> > at
> >> org.apache.lucene.search.suggest.fst.FSTLookup.build(FSTLookup.java:157)
> >> >  at org.apache.lucene.search.suggest.Lookup.build(Lookup.java:70)
> >> > at
> org.apache.solr.spelling.suggest.Suggester.build(Suggester.java:133)
> >> >  at
> >> >
> >>
> org.apache.solr.handler.component.SpellCheckComponent.prepare(SpellCheckComponent.java:109)
> >> > at
> >> >
> >>
> org.apache.solr.handler.component.SearchHandler.handleRequestBody(Se

RE: Trying to understand SOLR memory requirements

2012-01-18 Thread Steven A Rowe
Hi Dave,

Try 'ant usage' from the solr/ directory.
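
If memory serves, on branch_3x something like

  cd solr
  ant dist

should drop an apache-solr-*.war into solr/dist/, but check the 'ant usage'
output for the exact target names on your checkout.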

Steve

> -Original Message-
> From: Dave [mailto:dla...@gmail.com]
> Sent: Wednesday, January 18, 2012 2:11 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Trying to understand SOLR memory requirements
> 
> Ok, I've been able to pull the code from SVN, build it, and compile my
> SpellingQueryConverter against it. However, I'm at a loss as to where to
> find / how to build the solr.war file?
> 
> On Tue, Jan 17, 2012 at 8:59 AM, Robert Muir  wrote:
> 
> > I committed it already: so you can try out branch_3x if you want.
> >
> > you can either wait for a nightly build or compile from svn
> > (http://svn.apache.org/repos/asf/lucene/dev/branches/branch_3x/).
> >
> > On Tue, Jan 17, 2012 at 8:35 AM, Dave  wrote:
> > > Thank you Robert, I'd appreciate that. Any idea how long it will take
> to
> > > get a fix? Would I be better switching to trunk? Is trunk stable
> enough
> > for
> > > someone who's very much a SOLR novice?
> > >
> > > Thanks,
> > > Dave
> > >
> > > On Mon, Jan 16, 2012 at 10:08 PM, Robert Muir 
> wrote:
> > >
> > >> looks like https://issues.apache.org/jira/browse/SOLR-2888.
> > >>
> > >> Previously, FST would need to hold all the terms in RAM during
> > >> construction, but with the patch it uses offline sorts/temporary
> > >> files.
> > >> I'll reopen the issue to backport this to the 3.x branch.
> > >>
> > >>
> > >> On Mon, Jan 16, 2012 at 8:31 PM, Dave  wrote:
> > >> > I'm trying to figure out what my memory needs are for a rather
> large
> > >> > dataset. I'm trying to build an auto-complete system for every
> > >> > city/state/country in the world. I've got a geographic database,
> and
> > have
> > >> > setup the DIH to pull the proper data in. There are 2,784,937
> > documents
> > >> > which I've formatted into JSON-like output, so there's a bit of
> data
> > >> > associated with each one. Here is an example record:
> > >> >
> > >> > Brooklyn, New York, United States?{ |id|: |2620829|,
> > >> > |timezone|:|America/New_York|,|type|: |3|, |country|: { |id| :
> |229|
> > },
> > >> > |region|: { |id| : |3608| }, |city|: { |id|: |2616971|,
> |plainname|:
> > >> > |Brooklyn|, |name|: |Brooklyn, New York, United States| }, |hint|:
> > >> > |2300664|, |label|: |Brooklyn, New York, United States|, |value|:
> > >> > |Brooklyn, New York, United States|, |title|: |Brooklyn, New York,
> > United
> > >> > States| }
> > >> >
> > >> > I've got the spellchecker / suggester module setup, and I can
> confirm
> > >> that
> > >> > everything works properly with a smaller dataset (i.e. just a
> couple
> > of
> > >> > countries worth of cities/states). However I'm running into a big
> > problem
> > >> > when I try to index the entire dataset. The
> > >> dataimport?command=full-import
> > >> > works and the system comes to an idle state. It generates the
> > following
> > >> > data/index/ directory (I'm including it in case it gives any
> > indication
> > >> on
> > >> > memory requirements):
> > >> >
> > >> > -rw-rw 1 root   root   2.2G Jan 17 00:13 _2w.fdt
> > >> > -rw-rw 1 root   root22M Jan 17 00:13 _2w.fdx
> > >> > -rw-rw 1 root   root131 Jan 17 00:13 _2w.fnm
> > >> > -rw-rw 1 root   root   134M Jan 17 00:13 _2w.frq
> > >> > -rw-rw 1 root   root16M Jan 17 00:13 _2w.nrm
> > >> > -rw-rw 1 root   root   130M Jan 17 00:13 _2w.prx
> > >> > -rw-rw 1 root   root   9.2M Jan 17 00:13 _2w.tii
> > >> > -rw-rw 1 root   root   1.1G Jan 17 00:13 _2w.tis
> > >> > -rw-rw 1 root   root 20 Jan 17 00:13 segments.gen
> > >> > -rw-rw 1 root   root291 Jan 17 00:13 segments_2
> > >> >
> > >> > Next I try to run the suggest?spellcheck.build=true command, and I
> get
> > >> the
> > >> > following error:
> > >> >
> > >> > Jan 16, 2012 4:01:47 PM org.apache.solr.spelling.suggest.Suggester
> > build
> > >> > INFO: build()
> > >> > Jan 16, 2012 4:03:27 PM org.apache.solr.common.SolrException log
> > >> > SEVERE: java.lang.OutOfMemoryError: GC overhead limit exceeded
> > >> >  at java.util.Arrays.copyOfRange(Arrays.java:3209)
> > >> > at java.lang.String.(String.java:215)
> > >> >  at org.apache.lucene.index.TermBuffer.toTerm(TermBuffer.java:122)
> > >> > at
> > org.apache.lucene.index.SegmentTermEnum.term(SegmentTermEnum.java:184)
> > >> >  at
> > org.apache.lucene.index.TermInfosReader.get(TermInfosReader.java:203)
> > >> > at
> > org.apache.lucene.index.TermInfosReader.get(TermInfosReader.java:172)
> > >> >  at
> > org.apache.lucene.index.SegmentReader.docFreq(SegmentReader.java:509)
> > >> > at
> > >>
> >
> org.apache.lucene.index.DirectoryReader.docFreq(DirectoryReader.java:719)
> > >> >  at
> > >>
> org.apache.solr.search.SolrIndexReader.docFreq(SolrIndexReader.java:309)
> > >> > at
> > >> >
> > >>
> >
> org.apache.lucene.search.spell.HighFrequencyDictionary$HighFrequencyIterat
> or.isFrequent(HighFrequencyDictionary.java:75)
> > >> >  at
> > >> >
> > >>
> >
> org.apache.lucene.search.spell.HighFrequencyDicti

Re: How to return the distance geo distance on solr 3.5 with bbox filtering

2012-01-18 Thread Mikhail Khludnev
Can you try to specify two fqs, geodist as a function query, sort by score?

fq={!bbox}&.&sort=score%20asc&fq=trafficRouteId:235&q={!func}geodist()&fl=*,score
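
Spelled out with the spatial parameters from your original query, the whole
request would look something like this (host and values illustrative):

http://localhost:8983/solr/select?q={!func}geodist()&sfield=loc&pt=39.738548,-73.130322&d=100&fq={!bbox}&fq=trafficRouteId:235&sort=score%20asc&fl=*,score

The distance then comes back as the score of each document.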

On Wed, Jan 18, 2012 at 4:46 PM, Maxim Veksler  wrote:

> Hello Mikhail,
>
> Please see reply inline.
>
> On Wed, Jan 18, 2012 at 11:00 AM, Mikhail Khludnev <
> mkhlud...@griddynamics.com> wrote:
>
> > Maxim,
> >
> > Which version of Solr you are using?
> >
>
> As mentioned in the title, I'm using Solr 3.5.
>

I see


>
>
> > Why the second approach at the link doesn't work for you?
> >
> I'm not sure I'm following. I'm trying to use bbox instead of
>
> I use the fq fields to define the bbox filtering.
> I also need to query by another parameter (trafficRouteId).
>
you can put trafficRouteId as a second fq, as I did above


>
> I would optimally would be happy to get the distance calculation from Solr
> but that doesn't seem to work in any format of query I tried.
>
pls try my approach above and let me know what you get.



>
> Being new to Solr query language I'm not sure how to form the search terms
> to combine all of this with the score.
>
>
>
> > what do you get on this case? pls provide, logs, exception, and debug
> > response.
> >
> > Thanks
> >
> >
> > On Tue, Jan 17, 2012 at 10:06 PM, Maxim Veksler 
> > wrote:
> >
> > > Hello,
> > >
> > > I'm querying with bbox which should be faster than geodist, my queries
> > are
> > > looking like this:
> > >
> > >
> >
> http://localhost:8983/solr/select?indent=true&fq={!bbox}&sfield=loc&pt=39.738548,-73.130322&d=100&sort=geodist()%20asc&q=trafficRouteId:235
> > <
> >
> http://localhost:8983/solr/select?indent=true&fq=%7B%21bbox%7D&sfield=loc&pt=39.738548,-73.130322&d=100&sort=geodist%28%29%20asc&q=trafficRouteId:235
> > >
> > >
> > > the trouble is, that with bbox solr does not return the distance of
> each
> > > document, I couldn't get it to work even with tips from
> > > http://wiki.apache.org/solr/SpatialSearch#Returning_the_distance
> > >
> > > Something I'm missing ?
> > >
> >
> >
> >
> > --
> > Sincerely yours
> > Mikhail Khludnev
> > Lucid Certified
> > Apache Lucene/Solr Developer
> > Grid Dynamics
> >
> > 
> >  
> >
>



-- 
Sincerely yours
Mikhail Khludnev
Lucid Certified
Apache Lucene/Solr Developer
Grid Dynamics


 


Re: How can I index this?

2012-01-18 Thread ahammad
That would certainly work.

Just as a general thing, how would one go about indexing Sharepoint content
anyway? I heard about the Sharepoint connector for Lucene but I know nothing
about it. Is there a standard best practice method?

Also, what are your thoughts on extending the DIH? Is that recommended?

Thanks for the input :)

--
View this message in context: 
http://lucene.472066.n3.nabble.com/How-can-I-index-this-tp3666106p3670392.html
Sent from the Solr - User mailing list archive at Nabble.com.


Solr hides some facet.fields when doing a distributed search over multiple shards

2012-01-18 Thread Daniel Bruegge
Hi,

I have asked the question already over Stackoverflow (
http://stackoverflow.com/questions/8913654/solr-hides-some-facet-fields-when-doing-a-distributed-search),
but maybe someone here can give me a hint how to solve this issue:

I am searching over 6 Solr shards (Solr version 3.5). What I recognized is
that when I am doing the search in my normal standalone instance, which
contains the same data, I get 2 facet_fields in the facet_counts section.
This is what I expect:




...
...





As you can see, there are 2 facet_fields. When I do the same query
using multiple shards (same data), I always get just one facet_field:




...





I am also using tagging and excluding filters in my Query. Could this be
the problem?


Thanks & regards


Daniel


Re: Solr hides some facet.fields when doing a distributed search over multiple shards

2012-01-18 Thread Yonik Seeley
On Wed, Jan 18, 2012 at 3:36 PM, Daniel Bruegge
 wrote:
>
> Hi,
>
> I have asked the question already over Stackoverflow (
> http://stackoverflow.com/questions/8913654/solr-hides-some-facet-fields-when-doing-a-distributed-search),
> but maybe someone here can give me a hint how to solve this issue:
>
> I am searching over 6 Solr shards (Solr version 3.5). What I recognized is
> that when I am doing the search in my normal standalone instance, which
> contains the same data I get 2 facet_fields in thefacet_counts section.
> This is was I except:
>
> 
> 
> 
> ...
> ...
> 
> 
> 
> 
>
> As you can see there are 2 facet_fields. When I am doing the same query
> using multiple shards (same data), I am getting always just one facet_field:
>
> 
> 
> 
> ...
> 
> 
> 
> 
>
> I am also using tagging and excluding filters in my Query. Could this be
> the problem?

Yeah, that must be it.
Try giving one of them a different name... something like:
  facet.field={!ex=my_exclusions, key=url2}url
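
(For that to work, the ex tag has to match a tag= local param on the filter
query it should ignore; an illustrative pair of parameters, with field and
tag names made up for the example:

  fq={!tag=my_exclusions}domain:example.com
  facet.field={!ex=my_exclusions key=url2}url
  facet.field=url

The key=url2 part just gives the excluded variant its own name in the
response so the two facet blocks don't collide.)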

-Yonik
http://www.lucidimagination.com


>
> Thanks & regards
>
>
> Daniel


Re: Solr hides some facet.fields when doing a distributed search over multiple shards

2012-01-18 Thread Daniel Bruegge
Thanks a lot. That worked like a charm.


On Wed, Jan 18, 2012 at 9:50 PM, Yonik Seeley wrote:

> On Wed, Jan 18, 2012 at 3:36 PM, Daniel Bruegge
>  wrote:
> >
> > Hi,
> >
> > I have asked the question already over Stackoverflow (
> >
> http://stackoverflow.com/questions/8913654/solr-hides-some-facet-fields-when-doing-a-distributed-search
> ),
> > but maybe someone here can give me a hint how to solve this issue:
> >
> > I am searching over 6 Solr shards (Solr version 3.5). What I recognized
> is
> > that when I am doing the search in my normal standalone instance, which
> > contains the same data I get 2 facet_fields in thefacet_counts section.
> > This is was I except:
> >
> > 
> > 
> > 
> > ...
> > ...
> > 
> > 
> > 
> > 
> >
> > As you can see there are 2 facet_fields. When I am doing the same query
> > using multiple shards (same data), I am getting always just one
> facet_field:
> >
> > 
> > 
> > 
> > ...
> > 
> > 
> > 
> > 
> >
> > I am also using tagging and excluding filters in my Query. Could this be
> > the problem?
>
> Yeah, that must be it.
> Try giving one of them a different name... something like:
>  facet.field={!ex=my_exclusions, key=url2}url
>
> -Yonik
> http://www.lucidimagination.com
>
>
> >
> > Thanks & regards
> >
> >
> > Daniel
>


conditional field weighting

2012-01-18 Thread Jack Kanaska
Hello Solr Users,

I am wondering if there's any mechanism to achieve conditional field
weighting.

For example, let's say I have 3 fields which are being searched: NAME,
DESCRIPTION, LOCATION

I want the weights to be applied according to these rules:

1) If search term is found in NAME, use weight 10 for that field, and apply
weight 5 to the remaining fields.
2) If search term is not found in NAME but it is found in DESCRIPTION,
apply weight 10 to that field and weight 5 to the remaining field.

So basically a reduction-based weighting schedule to be applied to fields as
terms are found in them.

Any suggestions on how this might be achieved in Solr?  Or is it a custom
hack job?

Thanks,
Jack.


Re: Trying to understand SOLR memory requirements

2012-01-18 Thread Dave
Unfortunately, that doesn't look like it solved my problem. I built the new
.war file, dropped it in, and restarted the server. When I tried to build
the spellchecker index, it ran out of memory again. Is there anything I
needed to change in the configuration? Did I need to upload new .jar files,
or was replacing the .war file enough?

Jan 18, 2012 2:20:25 PM org.apache.solr.spelling.suggest.Suggester build
INFO: build()


Jan 18, 2012 2:22:06 PM org.apache.solr.common.SolrException log
SEVERE: java.lang.OutOfMemoryError: Java heap space
at org.apache.lucene.util.ArrayUtil.grow(ArrayUtil.java:344)
at org.apache.lucene.util.ArrayUtil.grow(ArrayUtil.java:352)
 at org.apache.lucene.util.fst.FST$BytesWriter.writeByte(FST.java:975)
at org.apache.lucene.util.fst.FST.writeLabel(FST.java:395)
 at org.apache.lucene.util.fst.FST.addNode(FST.java:499)
at org.apache.lucene.util.fst.Builder.compileNode(Builder.java:182)
 at org.apache.lucene.util.fst.Builder.freezeTail(Builder.java:270)
at org.apache.lucene.util.fst.Builder.add(Builder.java:365)
 at
org.apache.lucene.search.suggest.fst.FSTCompletionBuilder.buildAutomaton(FSTCompletionBuilder.java:228)
at
org.apache.lucene.search.suggest.fst.FSTCompletionBuilder.build(FSTCompletionBuilder.java:202)
 at
org.apache.lucene.search.suggest.fst.FSTCompletionLookup.build(FSTCompletionLookup.java:199)
at org.apache.lucene.search.suggest.Lookup.build(Lookup.java:70)
 at org.apache.solr.spelling.suggest.Suggester.build(Suggester.java:133)
at
org.apache.solr.handler.component.SpellCheckComponent.prepare(SpellCheckComponent.java:109)
 at
org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:174)
at
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
 at org.apache.solr.core.SolrCore.execute(SolrCore.java:1375)
at
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:358)
 at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:253)
at
org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
 at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
at
org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
 at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
 at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
at
org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
 at
org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
 at org.mortbay.jetty.Server.handle(Server.java:326)
at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
 at
org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:928)
at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:549)


On Tue, Jan 17, 2012 at 8:59 AM, Robert Muir  wrote:

> I committed it already: so you can try out branch_3x if you want.
>
> you can either wait for a nightly build or compile from svn
> (http://svn.apache.org/repos/asf/lucene/dev/branches/branch_3x/).
>
> On Tue, Jan 17, 2012 at 8:35 AM, Dave  wrote:
> > Thank you Robert, I'd appreciate that. Any idea how long it will take to
> > get a fix? Would I be better switching to trunk? Is trunk stable enough
> for
> > someone who's very much a SOLR novice?
> >
> > Thanks,
> > Dave
> >
> > On Mon, Jan 16, 2012 at 10:08 PM, Robert Muir  wrote:
> >
> >> looks like https://issues.apache.org/jira/browse/SOLR-2888.
> >>
> >> Previously, FST would need to hold all the terms in RAM during
> >> construction, but with the patch it uses offline sorts/temporary
> >> files.
> >> I'll reopen the issue to backport this to the 3.x branch.
> >>
> >>
> >> On Mon, Jan 16, 2012 at 8:31 PM, Dave  wrote:
> >> > I'm trying to figure out what my memory needs are for a rather large
> >> > dataset. I'm trying to build an auto-complete system for every
> >> > city/state/country in the world. I've got a geographic database, and
> have
> >> > setup the DIH to pull the proper data in. There are 2,784,937
> documents
> >> > which I've formatted into JSON-like output, so there's a bit of data
> >> > associated with each one. Here is an example record:
> >> >
> >> > Brooklyn, New York, United States?{ |id|: |2620829|,
> >> > |timezone|:|America/New_York|,|type|: |3|, |country|: { |id| : |229|
> },
> >> > |region|: { |id| : |3608| }, |city|: { |id|: |2616971|, |plainname|:
> >> > |Brooklyn|, |name|: |Brooklyn, New York, United States| }, |hint|:
> >> > |2300664|, |label|: |Brooklyn, New York, United States|, |value|:
> >> > |Brooklyn, New York, United States|, |title|: |Brooklyn, New York,
> United
> >> > States| }
> >> >
> >> > I've got the spellchecker / suggester module setup, and I can confirm
> >> that
> >> > everything works properly wi

Highlighting more than 1 term

2012-01-18 Thread Tim Hibbs
Hello,

 

I have so far been unable to get more than one term highlighted for a
given field. In the example below, I expected (and want) both the words
"Scheduling" and "Pickup" to be surrounded with <em> tags, but only one
word is ever highlighted. Any advice would be greatly appreciated.

 

Points of interest:

-  Using solr 3.3 and solrj, but my issue manifests as well in
the solr admin Full Interface search page (currently running locally
under eclipse/ WebLogic:
http://127.0.0.1:8983/apache-solr-3.3.0/kms-region1/admin/form.jsp)

-  Relevant schema details: two field definitions (stripped from this archive)

-  Query is simply "scheduled pickup"

 

A snippet of the results returned from the Solr admin search page is
below. The response XML markup was stripped from this archive; the
fragments that survive show the query "scheduled pickup", the field list
Title,TOC,score, a matching document with score 2.0976832 and Title
"Overview - Scheduling a Pickup, System Procedures", and highlighting
entries whose snippets contain only "Scheduling a Pickup".


Takes a while to see changes in data even after comit

2012-01-18 Thread abhayd
hi
we have a small index. Whenever we commit new data we still see some old
data coming from SOLR.
( NOT A BROWSER CACHE ISSUE)

We do have autowarmCount set. As I read it, autowarm gets entries from the
old cache to pre-populate the filter cache. Can this cause this type of issue?



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Takes-a-while-to-see-changes-in-data-even-after-comit-tp3670881p3670881.html
Sent from the Solr - User mailing list archive at Nabble.com.


Enforce overall Solr timeout

2012-01-18 Thread Jose Aguilar
Hi all,

Is there a setting to enforce an overall timeout for Solr? For example, we are
setting timeAllowed=2000 in solrconfig.xml (using version 3.5), but as
far as I can tell, that only applies to the search part, which returns partial
results if it takes more than 2 seconds and sets partialResults=true; the
other processing time (faceting, highlighting, etc.) is not covered by this
timeAllowed setting.

Is there something that can be done so that, for example, if a Solr call overall
takes more than say 5 seconds, Solr kills the request and returns an error, an
empty response, or something?

--
Jose Aguilar.


How can a distributed Solr setup scale to TB-data, if URL limitations are 4000 for distributed shard search?

2012-01-18 Thread Daniel Bruegge
Hi,

I am just wondering how I can 'grow' a distributed Solr setup to an index
size of a couple of terabytes, when one of the distributed Solr limitations
is the max. 4000-character URI length. See:

*The number of shards is limited by number of characters allowed for GET
> method's URI; most Web servers generally support at least 4000 characters,
> but many servers limit URI length to reduce their vulnerability to Denial
> of Service (DoS) attacks.
> *



> *(via
> http://lucidworks.lucidimagination.com/display/solr/Distributed+Search+with+Index+Sharding
> )*
>

Is the only way then to make multiple distributed solr clusters and query
them independently and merge them in application code?

Thanks. Daniel


Re: How can a distributed Solr setup scale to TB-data, if URL limitations are 4000 for distributed shard search?

2012-01-18 Thread Mark Miller
You can raise the limit to a point.

On Jan 18, 2012, at 5:59 PM, Daniel Bruegge wrote:

> Hi,
> 
> I am just wondering how I can 'grow' a distributed Solr setup to an index
> size of a couple of terabytes, when one of the distributed Solr limitations
> is max. 4000 characters in URI limitation. See:
> 
> *The number of shards is limited by number of characters allowed for GET
>> method's URI; most Web servers generally support at least 4000 characters,
>> but many servers limit URI length to reduce their vulnerability to Denial
>> of Service (DoS) attacks.
>> *
> 
> 
> 
>> *(via
>> http://lucidworks.lucidimagination.com/display/solr/Distributed+Search+with+Index+Sharding
>> )*
>> 
> 
> Is the only way then to make multiple distributed solr clusters and query
> them independently and merge them in application code?
> 
> Thanks. Daniel

- Mark Miller
lucidimagination.com













Re: Highlighting more than 1 term

2012-01-18 Thread aronitin
Hi Tim,

Can you share the "text_en" type definition? Do check if your have Stemmer
configured in the type definition.

If not then that might be the reason of scheduled not matching with
scheduling.

Thanks
Nitin



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Highlighting-more-than-1-term-tp3670862p3671004.html
Sent from the Solr - User mailing list archive at Nabble.com.


How to boost the relevancy of a field

2012-01-18 Thread Dean Del Ponte
I'm indexing some web pages and would like the "title" field to carry more
relevance than any other field.

What's the best way to do this?

For example, if I search for the word "Solr", a web page with a title of
"Solr" should rank higher than a web page with a title of "Nutch", even if
the nutch page has the word Solr in it.

Thanks!

Dean


RE: How to accelerate your Solr-Lucene appication by 4x

2012-01-18 Thread Steven A Rowe
Hi Peter,

Commercial solicitations are taboo here, except in the context of a request for 
help that is directly relevant to a product or service.

Please don’t do this again.

Steve Rowe

From: Peter Velikin [mailto:pe...@velobit.com]
Sent: Wednesday, January 18, 2012 6:33 PM
To: solr-user@lucene.apache.org
Subject: How to accelerate your Solr-Lucene appication by 4x

Hello Solr users,

Did you know that you can boost the performance of your Solr application using 
your existing servers? All you need is commodity SSD and plug-and-play software 
like VeloBit.

At ZoomInfo, a leading business information provider, VeloBit increased the 
performance of the Solr-Lucene-powered application by 4x.

I would love to tell you more about VeloBit and find out if we can deliver same 
business benefits at your company. Click 
here for a 15-minute 
briefing on the VeloBit technology.

Here is more information on how VeloBit helped ZoomInfo:

  *   Increased Solr-Lucene performance by 4x using existing servers and 
commodity SSD
  *   Installed VeloBit plug-and-play SSD caching software in 5-minutes 
transparent to running applications and storage infrastructure
  *   Reduced by 75% the hardware and monthly operating costs required to 
support service level agreements

Technical Details:

  *   Environment: Solr‐Lucene indexed directory search service fronted by J2EE 
web application technology
  *   Index size: 600 GB
  *   Number of items indexed: 50 million
  *   Primary storage: 6 x SAS HDD
  *   SSD Cache: VeloBit software + OCZ Vertex 3

Click here to read more 
about the ZoomInfo Solr-Lucene case 
study.

You can also sign 
up for our 
Early Access 
Program and 
try VeloBit HyperCache for free.

Also, feel free to write to me directly at 
pe...@velobit.com.

Best regards,

Peter Velikin
VP Online Marketing, VeloBit, Inc.
pe...@velobit.com
tel. 978-263-4800
mob. 617-306-7165
[Description: VeloBit with tagline]
VeloBit provides plug & play SSD caching software that dramatically accelerates 
applications at a remarkably low cost. The software installs seamlessly in less 
than 10 minutes and automatically tunes for fastest application speed. Visit 
www.velobit.com for details.


Re: How can a distributed Solr setup scale to TB-data, if URL limitations are 4000 for distributed shard search?

2012-01-18 Thread Daniel Bruegge
But you read so often about huge Solr clusters, and I am wondering how
they do this. Because I also often read that the index size of one shard
should fit into RAM, or at least that the heap size should be as big as the
index size. So I see a lot of limitations hardware-wise. Or am I on the
totally wrong track?

Daniel

On Thu, Jan 19, 2012 at 12:14 AM, Mark Miller  wrote:

> You can raise the limit to a point.
>
> On Jan 18, 2012, at 5:59 PM, Daniel Bruegge wrote:
>
> > Hi,
> >
> > I am just wondering how I can 'grow' a distributed Solr setup to an index
> > size of a couple of terabytes, when one of the distributed Solr
> limitations
> > is max. 4000 characters in URI limitation. See:
> >
> > *The number of shards is limited by number of characters allowed for GET
> >> method's URI; most Web servers generally support at least 4000
> characters,
> >> but many servers limit URI length to reduce their vulnerability to
> Denial
> >> of Service (DoS) attacks.
> >> *
> >
> >
> >
> >> *(via
> >>
> http://lucidworks.lucidimagination.com/display/solr/Distributed+Search+with+Index+Sharding
> >> )*
> >>
> >
> > Is the only way then to make multiple distributed solr clusters and query
> > them independently and merge them in application code?
> >
> > Thanks. Daniel
>
> - Mark Miller
> lucidimagination.com
>
>
>
>
>
>
>
>
>
>
>
>


RE: Highlighting more than 1 term

2012-01-18 Thread Tim Hibbs
Aro, thanks for your interest and response.

I'm using the "stock" definition in the supplied config.xml, as follows:

(the text_en fieldType definition did not survive in this archive)

When viewing the debug output of the results, I have:



2.097683 = (MATCH) sum of:
  1.072057 = (MATCH) weight(text:schedul in 1595), product of:
0.75786966 = queryWeight(text:schedul), product of:
  4.355735 = idf(docFreq=59, maxDocs=1720)
  0.17399353 = queryNorm
1.4145664 = (MATCH) fieldWeight(text:schedul in 1595), product of:
  1.7320508 = tf(termFreq(text:schedul)=3)
  4.355735 = idf(docFreq=59, maxDocs=1720)
  0.1875 = fieldNorm(field=text, doc=1595)
  1.0256261 = (MATCH) weight(text:pickup in 1595), product of:
0.652406 = queryWeight(text:pickup), product of:
  3.7495992 = idf(docFreq=109, maxDocs=1720)
  0.17399353 = queryNorm
1.5720673 = (MATCH) fieldWeight(text:pickup in 1595), product of:
  2.236068 = tf(termFreq(text:pickup)=5)
  3.7495992 = idf(docFreq=109, maxDocs=1720)
  0.1875 = fieldNorm(field=text, doc=1595)


The stem "schedul" is an indication that stemming has occurred on the
query. However, you gave me an idea; I HAVE changed what I thought were
small things to the config.xml without reindexing the content corpus.
It's possible I shot myself in the proverbial foot if I changed to
"text_en" without reindexing. I'll do that shortly (meaning tomorrow
morning) and will report back with my results.

Appreciate the interest...
Tim


Re: How can a distributed Solr setup scale to TB-data, if URL limitations are 4000 for distributed shard search?

2012-01-18 Thread Darren Govoni

Try changing the URI/HTTP/GET size limitation on your app server.
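
Where that limit lives depends on the container; illustrative snippets (exact
attribute names vary by version):

  Tomcat server.xml, on the HTTP Connector:   maxHttpHeaderSize="65536"
  Jetty 6 (the jetty.xml shipped with the Solr example), on the connector:
      <Set name="headerBufferSize">65536</Set>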

On 01/18/2012 05:59 PM, Daniel Bruegge wrote:

Hi,

I am just wondering how I can 'grow' a distributed Solr setup to an index
size of a couple of terabytes, when one of the distributed Solr limitations
is max. 4000 characters in URI limitation. See:

*The number of shards is limited by number of characters allowed for GET

method's URI; most Web servers generally support at least 4000 characters,
but many servers limit URI length to reduce their vulnerability to Denial
of Service (DoS) attacks.
*




*(via
http://lucidworks.lucidimagination.com/display/solr/Distributed+Search+with+Index+Sharding
)*


Is the only way then to make multiple distributed solr clusters and query
them independently and merge them in application code?

Thanks. Daniel





Re: How to boost the relevancy of a field

2012-01-18 Thread aronitin
Hi Dean,

You can use query-time boosting, where you specify the boost value in the
query itself, e.g. title:solr^2 OR body:solr

Thanks
Nitin

--
View this message in context: 
http://lucene.472066.n3.nabble.com/How-to-boost-the-relevancy-of-a-field-tp3671020p3671118.html
Sent from the Solr - User mailing list archive at Nabble.com.


RE: Question on Reverse Indexing

2012-01-18 Thread Shyam Bhaskaran
Dimitry,

Completed a clean index and I still see the same behavior.

Did not use Luke, but from the search page we use, leading wildcard search is
working.

-Shyam

-Original Message-
From: Dmitry Kan [mailto:dmitry@gmail.com] 
Sent: Wednesday, January 18, 2012 5:07 PM
To: solr-user@lucene.apache.org
Subject: Re: Question on Reverse Indexing

Shyam,

You still didn't say if you have started re-indexing from the clean index,
i.e. if you have removed all the data prior to re-indexing.
You can use the luke (http://code.google.com/p/luke/) to check the contents
of your text field, and see if it still contains reversed sequences.

On Wed, Jan 18, 2012 at 1:09 PM, Shyam Bhaskaran <
shyam.bhaska...@synopsys.com> wrote:

> Dimitry,
>
> We are using Solr 4.0. To confirm server caching issues I have restarted
> our tomcat webserver after performing a re-index.
>
> For reverseIndexing we have defined a fieldType "text_rev" and this
> fieldyType was used against the fields.
>
>   omitNorms="true">
> 
> class="com.es.solr.backend.analysis.standard.SolvNetTokenizerFactory"/>
> words="stopwords.txt" ignoreCase="true"/>
> class="com.es.solr.backend.analysis.standard.SolvNetFilterFactory"/>
> synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
> class="com.es.solr.backend.analysis.standard.SpecialCharSynonymFilterFactory"/>
>
>  withOriginal="true"
>maxPosAsterisk="3" maxPosQuestion="2"
> maxFractionAsterisk="0.33"/>
>  
> 
> class="com.es.solr.backend.analysis.standard.SolvNetTokenizerFactory"/>
> words="stopwords.txt" ignoreCase="true"/>
> class="com.es.solr.backend.analysis.standard.SolvNetFilterFactory"/>
>
> words="stopwords.txt" ignoreCase="true"/>
> 
>  
>
> But when it was found that ReversedWildcardFilterFactory was adding a
> performance burden, we removed the ReversedWildcardFilterFactory filter
> <filter class="solr.ReversedWildcardFilterFactory" withOriginal="true"
>  maxPosAsterisk="3" maxPosQuestion="2" maxFractionAsterisk="0.33"/>
> and the whole collection was re-indexed.
>
> But even after removing the ReversedWildcardFilterFactory leading wild
> card search like *lock is working.
>
> -Shyam
>
> -Original Message-
> From: Dmitry Kan [mailto:dmitry@gmail.com]
> Sent: Wednesday, January 18, 2012 4:26 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Question on Reverse Indexing
>
> OK. Not sure what is your system architecture there, but could your queries
> stay cached in some server caches even after you have re-indexed your data?
> The way the index level leading wildcard works (reading SOLR 3.4 code, but
> seems to be true circa 1.4) is that the following check is done for the
> analysis chain:
>
> [code src=SolrQueryParser.java]
> boolean allow = false;
> ...
>  if (factory instanceof ReversedWildcardFilterFactory) {
>allow = true;
>...
>  }
> ...
>if (allow) {
>  setAllowLeadingWildcard(true);
>}
> [/code]
>
> so practically what you described can happen if
> the ReversedWildcardFilterFactory is still mentioned in one of your shards.
> A weird question, but have you reindexed your data to a clean index or on
> top of the existing one?
>
> On Wed, Jan 18, 2012 at 12:35 PM, Shyam Bhaskaran <
> shyam.bhaska...@synopsys.com> wrote:
>
> > Dimitry,
> >
> > Using http://localhost:7070/solr/docs/admin/analysis.jsp passed the
> query
> > *lock and did not find ReversedWildcardFilterFactory to the indexer or
> any
> > other filters that could do the reversing.
> >
> > -Shyam
> >
> > -Original Message-
> > From: Dmitry Kan [mailto:dmitry@gmail.com]
> > Sent: Wednesday, January 18, 2012 2:26 PM
> > To: solr-user@lucene.apache.org
> > Subject: Re: Question on Reverse Indexing
> >
> > Just to play safe here, can you double check that the reversing is not
> any
> > more the case by issuing a query through the admin analysis page?
> >
> > Dmitry
> >
> > On Wed, Jan 18, 2012 at 4:23 AM, Shyam Bhaskaran <
> > shyam.bhaska...@synopsys.com> wrote:
> >
> > > Hi Francois,
> > >
> > > I understand that disabling of ReversedWildcardFilterFactory has
> improved
> > > the performance.
> > >
> > > But I am puzzled over how the leading wild card search like *lock is
> > > working even though I have now disabled the
> ReversedWildcardFilterFactory
> > > and the indexes have been created without ReversedWildcardFilter ?
> > >
> > > How does reverse indexing work even after disabling
> > > ReversedWildcardFilterFactory?
> > >
> > > Can anyone explain me how this feature is working.
> > >
> > > -Shyam
> > >
> > > -Original Message-
> > > From: François Schiettecatte [mailto:fschietteca...@gmail.com]
> > > Sent: Wednesday, January 18, 2012 7:49 AM
> > > To: solr-user@lucene.apache.org
> > > Subject: Re: Question on Reverse Indexing

Re: Takes a while to see changes in data even after comit

2012-01-18 Thread Jan Høydahl
Hi,

What Solr version? How many docs?
What do you use as autowarm count? If it's too high, it may take time.
Do you use spellcheck and buildOnCommit?
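
For reference, autowarmCount sits on the cache definitions in solrconfig.xml,
something like (values illustrative):

  <filterCache class="solr.FastLRUCache" size="512" initialSize="512" autowarmCount="128"/>

With the default useColdSearcher=false, a new searcher is not registered until
its warming finishes, so a large autowarmCount or heavy newSearcher queries
delay the moment the committed data becomes visible.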

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Solr Training - www.solrtraining.com

On 18. jan. 2012, at 23:45, abhayd wrote:

> hi 
> we have a small index . Whenever we commit new data we still see some old
> data coming from SOLR.
> ( NOT A BROWSER CACHE ISSUE)
> 
> We do have autowarmcount set. As i read auto-warm count gets entries from
> old cache to pre-populate filter cache. Can this cause such type of issue?
> 
> 
> 
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Takes-a-while-to-see-changes-in-data-even-after-comit-tp3670881p3670881.html
> Sent from the Solr - User mailing list archive at Nabble.com.



Re: How to boost the relevancy of a field

2012-01-18 Thread Jan Høydahl
And using dismax query parser makes this easier: 
http://wiki.apache.org/solr/DisMaxQParserPlugin

Example:
q=solr&defType=edismax&qf=title^10 body^0.5

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Solr Training - www.solrtraining.com

On 19. jan. 2012, at 01:29, aronitin wrote:

> Hi Dean,
> 
> You can use Query Time boosting where you specify the boost value in the
> query itself that title:solr^2 OR body:solr
> 
> Thanks
> Nitin
> 
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/How-to-boost-the-relevancy-of-a-field-tp3671020p3671118.html
> Sent from the Solr - User mailing list archive at Nabble.com.



Re: How can I index this?

2012-01-18 Thread Matthew Parker
I just started trying Apache ManifoldCF, which has a SharePoint connector
that appears to integrate through Sharepoint's web services.

Nutch also has a SharePoint connector, and it can publish documents into
SOLR for indexing.

On Wed, Jan 18, 2012 at 3:34 PM, ahammad  wrote:

> That would certainly work.
>
> Just as a general thing, how would one go about indexing Sharepoint content
> anyway? I heard about the Sharepoint connector for Lucene but I know
> nothing
> about it. Is there a standard best practice method?
>
> Also, what are your thoughts on extending the DIH? Is that recommended?
>
> Thanks for the input :)
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/How-can-I-index-this-tp3666106p3670392.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>



-- 
Regards,

Matt Parker (CTR)
Senior Software Architect
Apogee Integration, LLC
5180 Parkstone Drive, Suite #160
Chantilly, Virginia 20151
703.272.4797 (site)
703.474.1918 (cell)
www.apogeeintegration.com

--
This e-mail and any files transmitted with it may be proprietary.  Please note 
that any views or opinions presented in this e-mail are solely those of the 
author and do not necessarily represent those of Apogee Integration.


Re: How to accelerate your Solr-Lucene appication by 4x

2012-01-18 Thread Jason Rutherglen
Steven,

If you are going to admonish people for advertising, it should be
equally dished out or not at all.

On Wed, Jan 18, 2012 at 6:38 PM, Steven A Rowe  wrote:
> Hi Peter,
>
> Commercial solicitations are taboo here, except in the context of a request 
> for help that is directly relevant to a product or service.
>
> Please don’t do this again.
>
> Steve Rowe
>
> From: Peter Velikin [mailto:pe...@velobit.com]
> Sent: Wednesday, January 18, 2012 6:33 PM
> To: solr-user@lucene.apache.org
> Subject: How to accelerate your Solr-Lucene appication by 4x
>
> Hello Solr users,
>
> Did you know that you can boost the performance of your Solr application 
> using your existing servers? All you need is commodity SSD and plug-and-play 
> software like VeloBit.
>
> At ZoomInfo, a leading business information provider, VeloBit increased the 
> performance of the Solr-Lucene-powered application by 4x.
>
> I would love to tell you more about VeloBit and find out if we can deliver 
> same business benefits at your company. Click 
> here for a 15-minute 
> briefing on the VeloBit technology.
>
> Here is more information on how VeloBit helped ZoomInfo:
>
>  *   Increased Solr-Lucene performance by 4x using existing servers and 
> commodity SSD
>  *   Installed VeloBit plug-and-play SSD caching software in 5-minutes 
> transparent to running applications and storage infrastructure
>  *   Reduced by 75% the hardware and monthly operating costs required to 
> support service level agreements
>
> Technical Details:
>
>  *   Environment: Solr‐Lucene indexed directory search service fronted by 
> J2EE web application technology
>  *   Index size: 600 GB
>  *   Number of items indexed: 50 million
>  *   Primary storage: 6 x SAS HDD
>  *   SSD Cache: VeloBit software + OCZ Vertex 3
>
> Click here to read more 
> about the ZoomInfo Solr-Lucene case 
> study.
>
> You can also sign 
> up for 
> our Early Access 
> Program 
> and try VeloBit HyperCache for free.
>
> Also, feel free to write to me directly at 
> pe...@velobit.com.
>
> Best regards,
>
> Peter Velikin
> VP Online Marketing, VeloBit, Inc.
> pe...@velobit.com
> tel. 978-263-4800
> mob. 617-306-7165
> [Description: VeloBit with tagline]
> VeloBit provides plug & play SSD caching software that dramatically 
> accelerates applications at a remarkably low cost. The software installs 
> seamlessly in less than 10 minutes and automatically tunes for fastest 
> application speed. Visit www.velobit.com for details.


Re: How to accelerate your Solr-Lucene appication by 4x

2012-01-18 Thread Darren Govoni
And to be honest, many people on this list are professionals who not 
only build their own solutions, but also buy tools and tech.


I don't see what the big deal is if some clever company has something of
imminent value here to share. Considering that it's a rare event.


On 01/18/2012 08:28 PM, Jason Rutherglen wrote:

Steven,

If you are going to admonish people for advertising, it should be
equally dished out or not at all.

On Wed, Jan 18, 2012 at 6:38 PM, Steven A Rowe  wrote:

Hi Peter,

Commercial solicitations are taboo here, except in the context of a request for 
help that is directly relevant to a product or service.

Please don’t do this again.

Steve Rowe

From: Peter Velikin [mailto:pe...@velobit.com]
Sent: Wednesday, January 18, 2012 6:33 PM
To: solr-user@lucene.apache.org
Subject: How to accelerate your Solr-Lucene appication by 4x

Hello Solr users,

Did you know that you can boost the performance of your Solr application using 
your existing servers? All you need is commodity SSD and plug-and-play software 
like VeloBit.

At ZoomInfo, a leading business information provider, VeloBit increased the 
performance of the Solr-Lucene-powered application by 4x.

I would love to tell you more about VeloBit and find out if we can deliver same business 
benefits at your company. Click here  for a 
15-minute briefing  on the VeloBit technology.

Here is more information on how VeloBit helped ZoomInfo:

  *   Increased Solr-Lucene performance by 4x using existing servers and 
commodity SSD
  *   Installed VeloBit plug-and-play SSD caching software in 5-minutes 
transparent to running applications and storage infrastructure
  *   Reduced by 75% the hardware and monthly operating costs required to 
support service level agreements

Technical Details:

  *   Environment: Solr‐Lucene indexed directory search service fronted by J2EE 
web application technology
  *   Index size: 600 GB
  *   Number of items indexed: 50 million
  *   Primary storage: 6 x SAS HDD
  *   SSD Cache: VeloBit software + OCZ Vertex 3

Click here  to read more about 
the ZoomInfo Solr-Lucene case 
study.

You can also sign 
up  for our Early 
Access Program  
and try VeloBit HyperCache for free.

Also, feel free to write to me directly at 
pe...@velobit.com.

Best regards,

Peter Velikin
VP Online Marketing, VeloBit, Inc.
pe...@velobit.com
tel. 978-263-4800
mob. 617-306-7165
[Description: VeloBit with tagline]
VeloBit provides plug&  play SSD caching software that dramatically accelerates 
applications at a remarkably low cost. The software installs seamlessly in less than 10 
minutes and automatically tunes for fastest application speed. Visit 
www.velobit.com  for details.




Re: first time query is very slow

2012-01-18 Thread gabriel shen
Hi Yonik,

The index I am querying against is 20GB, containing 200,000 documents; some
of the documents are quite big, and the schema contains more than 50 fields.
The main content field is defined as both stored and indexed, with
HTML-stripping, standard tokenization, decompounding and stemming filters
applied, and without term vectors. The Solr 3.3 installation runs on a 64-bit
JVM with 12GB of memory. The default cache option (512) is applied.

First I did a query with default query parser and a single query field
called 'maintext',
http://xxx:/solr/document/select?q=maintext:most%20populous%20
city&start=0&rows=25
It took 727 milliseconds in QueryComponent which is fine

http://xxx:/solr/document/select?q=maintext:most%20populous%20
city&start=0&rows=25
&sort=sumlevel1%20asc,%20sumlevel2%20asc,%20domdate%20desc,%20score%20desc&facet=true&facet.field=sumlevel1
It took 157 milliseconds in QueryComponent


And then I did another dismax query with the same query keywords (I
suppose most documents, sorting, filtering are being cached)

http://xxx:/solr/document/select?q=most%20populous%20city
&qt=dismax&start=0&rows=25&qf=superdocid^1000%20popular-name^1000%20author^100%20target-id^50%20title_simple^50%20title^25%20summary_simple^25%20summary^10%20maintext_simple^5%20annotation_DEF_simple^5%20maintext%20annotation_DEF&pf=popular-name^1000%20author^100%20title_simple^50%20title^25%20summary_simple^25%20summary^10%20maintext_simple^5%20annotation_DEF_simple^5%20maintext%20annotation_DEF&sort=sumlevel1%20asc,%20sumlevel2%20asc,%20domdate%20desc,%20score%20desc&facet=true&facet.field=sumlevel1&debugQuery=true

It took more than 15-20 seconds before the browser showed the result, and it
displays 4781 milliseconds in QueryComponent.

Then I cleaned the browser cache and ran the same dismax URL again.
It still took 2500 milliseconds in QueryComponent, and on the server
machine I only observed a brief CPU spike of 84%, which returned to 2%
immediately during the query.

Can you see what took the most time here? Is there any way to improve the
speed?

thanks,
shen


On Tue, Jan 17, 2012 at 11:25 PM, Yonik Seeley
wrote:

> On Tue, Jan 17, 2012 at 9:39 AM, gabriel shen  wrote:
> > For those customers who unluckily send un-prewarmed query, they will
> suffer
> > from bad response time, it is not too pleasant anyway.
>
> The "warming caches" part isn't about unique queries, but more about
> caches used for sorting and faceting (and those are reused across many
> different queries).
> Can you give an example of the complete request you were sending that
> takes a long time?
>
> -Yonik
> http://www.lucidimagination.com
>


Re: Enforce overall Solr timeout

2012-01-18 Thread Otis Gospodnetic
Jose,

I'm not aware of such functionality in Solr.  But there may be something of 
that sort doable on the servlet container or, if you are using SolrJ to talk to 
Solr, you should be able to set the socket/HTTP connection timeout via the 
underlying HttpClient API.
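
A minimal SolrJ 3.x sketch of that client-side approach (URL and timeout values
are just examples; note this only makes the client give up, Solr will still
finish the request on the server side):

  import java.net.MalformedURLException;
  import org.apache.solr.client.solrj.SolrServer;
  import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

  public class TimeoutSolrClientFactory {
    public static SolrServer create(String url) throws MalformedURLException {
      CommonsHttpSolrServer server = new CommonsHttpSolrServer(url);
      server.setConnectionTimeout(1000); // ms to establish the HTTP connection
      server.setSoTimeout(5000);         // ms socket read timeout for the response
      return server;
    }
  }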

Otis 

Performance Monitoring SaaS for Solr - 
http://sematext.com/spm/solr-performance-monitoring/index.html 


- Original Message -
> From: Jose Aguilar 
> To: "solr-user@lucene.apache.org" 
> Cc: 
> Sent: Wednesday, January 18, 2012 5:50 PM
> Subject: Enforce overall Solr timeout
> 
> Hi all,
> 
> Is there a setting to enforce an overall timeout for Solr? For example, we 
> are 
> using setting timeallowed=2000 in solrconfig.xml (using version 3.5), but as 
> far 
> as I can tell, that only applies to the search part that returns partial 
> results 
> if it takes more than 2 seconds and returns partialResults=true, but the 
> other 
> processing time (facetting, highlighting, etc) is not covered in this 
> timeallowed setting.
> 
> Is there something that can be done so that for example if a Solr call 
> overall 
> takes more than say 5 seconds, kill the request it and return an error, or 
> empty 
> response or something?
> 
> --
> Jose Aguilar.
>


Re: conditional field weighting

2012-01-18 Thread csscouter
Jack,

Did you see this response to a similar question? I think this is how to
refer to it: 
http://lucene.472066.n3.nabble.com/How-to-boost-the-relevancy-of-a-field-tp3671020p3671020.html
How to boost the relevancy of a field 


I have / had a similar question to yours, and the response to this question
seemed relevant.

Tim

--
View this message in context: 
http://lucene.472066.n3.nabble.com/conditional-field-weighting-tp3670544p3671201.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: How can a distributed Solr setup scale to TB-data, if URL limitations are 4000 for distributed shard search?

2012-01-18 Thread Otis Gospodnetic
Hi Daniel,

>
> From: Daniel Bruegge 
>Subject: Re: How can a distributed Solr setup scale to TB-data, if URL 
>limitations are 4000 for distributed shard search?
> 
>But you can read so often about huge solr clusters and I am wondering how
>they do this. 

Huge is relative. ;)
Huge Solr clusters also often have huge hardware. Servers with 16 cores and 32
GB of RAM are becoming very common, for example.
Another thing to keep in mind is that while lots of organizations have huge 
indices, only some portions of them may be hot at any one time.  We've had a 
number of clients who index social media or news data and while all of them 
have giant indices, typically only the most recent data is really actively 
searched.

> Because I also read often, that the Index size of one shard 
>should fit into RAM. 

Nah.  Don't take this as "the whole index needs to fit in RAM".  Just "the hot 
parts of the index should fit in RAM".  This is related to what I wrote above.

> Or at least the heap size should be as big as the
> index size. So I see a lots of limitations hardware-wise. Or am I on the
> totally wrong track?

Regarding heap - nah, that's not correct.  The heap is usually much smaller 
than the index and RAM is given to the OS to use for data caching.

Otis

Performance Monitoring SaaS for Solr 
- http://sematext.com/spm/solr-performance-monitoring/index.html



>On Thu, Jan 19, 2012 at 12:14 AM, Mark Miller  wrote:
>
>> You can raise the limit to a point.
>>
>> On Jan 18, 2012, at 5:59 PM, Daniel Bruegge wrote:
>>
>> > Hi,
>> >
>> > I am just wondering how I can 'grow' a distributed Solr setup to an index
>> > size of a couple of terabytes, when one of the distributed Solr
>> limitations
>> > is max. 4000 characters in URI limitation. See:
>> >
>> > *The number of shards is limited by number of characters allowed for GET
>> >> method's URI; most Web servers generally support at least 4000
>> characters,
>> >> but many servers limit URI length to reduce their vulnerability to
>> Denial
>> >> of Service (DoS) attacks.
>> >> *
>> >
>> >
>> >
>> >> *(via
>> >>
>> http://lucidworks.lucidimagination.com/display/solr/Distributed+Search+with+Index+Sharding
>> >> )*
>> >>
>> >
>> > Is the only way then to make multiple distributed solr clusters and query
>> > them independently and merge them in application code?
>> >
>> > Thanks. Daniel
>>
>> - Mark Miller
>> lucidimagination.com
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>
>
> 



Re: first time query is very slow

2012-01-18 Thread Otis Gospodnetic
Gabriel,

It sounds like it's not the CPU.
Are you watching disk IO?  Maybe the time is spent reading from disk?  Although 
if you are repeating the same query the results should be cached by Solr if you 
have query cache enabled.
Or JVM/GC?  Maybe the heap is too small and the JVM is busy GCing?  Actually, 
not likely, that would keep the CPU busy.

See link in signature, man vmstat, iostat, try jconsole or visualvm...
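
For example, while re-running the slow query, something along these lines on
the Solr box will show whether the time goes to disk or to GC (common Linux
tools, flags illustrative):

  vmstat 2                         # watch the 'bi' (blocks in) and 'wa' (IO wait) columns
  iostat -x 2                      # per-device utilization and await
  jstat -gcutil <solr-pid> 2000    # GC activity of the Solr JVM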

Otis 


Performance Monitoring SaaS for Solr - 
http://sematext.com/spm/solr-performance-monitoring/index.html 


- Original Message -
> From: gabriel shen 
> To: solr-user@lucene.apache.org; yo...@lucidimagination.com
> Cc: 
> Sent: Wednesday, January 18, 2012 10:15 PM
> Subject: Re: first time query is very slow
> 
> Hi Yonik,
> 
> The index I am querying against is 20gb, containing 200,000documents, some
> of the documents are quite big, the schema contains more than 50 fields.
> Main content field are defined as both stored and indexed, applied
> htmlstripping, standardtokenization, decompounding, stemming filters,
> without termvector. The solr3.3 installation runs on top of jvm64 with 12gb
> memory. Default cache option(512) is applied.
> 
> First I did a query with default query parser and a single query field
> called 'maintext',
> http://xxx:/solr/document/select?q=maintext:most%20populous%20
> city&start=0&rows=25
> It took 727 milliseconds in QueryComponent which is fine
> 
> http://xxx:/solr/document/select?q=maintext:most%20populous%20
> city&start=0&rows=25
> &sort=sumlevel1%20asc,%20sumlevel2%20asc,%20domdate%20desc,%20score%20desc&facet=true&facet.field=sumlevel1
> It took 157 milliseconds in QueryComponent
> 
> 
> And then I did another dismax query with the same query keywords (I
> suppose most documents, sorting and filtering are being cached)
> 
> http://xxx:/solr/document/select?q=most%20populous%20city
> &qt=dismax&start=0&rows=25&qf=superdocid^1000%20popular-name^1000%20author^100%20target-id^50%20title_simple^50%20title^25%20summary_simple^25%20summary^10%20maintext_simple^5%20annotation_DEF_simple^5%20maintext%20annotation_DEF&pf=popular-name^1000%20author^100%20title_simple^50%20title^25%20summary_simple^25%20summary^10%20maintext_simple^5%20annotation_DEF_simple^5%20maintext%20annotation_DEF&sort=sumlevel1%20asc,%20sumlevel2%20asc,%20domdate%20desc,%20score%20desc&facet=true&facet.field=sumlevel1&debugQuery=true
> 
> It took more than 15-20 seconds before the browser showed the result, and it
> reported 4781 milliseconds in QueryComponent.
> 
> Then I cleared the browser cache and ran the same dismax URL again.
> It still took 2500 milliseconds in QueryComponent, and on the server
> machine I only observed a brief CPU spike of 84% that returned to 2%
> immediately during the query.
> 
> Can you see what took the most time here? Is there any way to improve the
> speed?
> 
> thanks,
> shen
> 
> 
> On Tue, Jan 17, 2012 at 11:25 PM, Yonik Seeley
> wrote:
> 
>>  On Tue, Jan 17, 2012 at 9:39 AM, gabriel shen  
> wrote:
>>  > For those customers who unluckily send un-prewarmed query, they will
>>  suffer
>>  > from bad response time, it is not too pleasant anyway.
>> 
>>  The "warming caches" part isn't about unique queries, but more about
>>  caches used for sorting and faceting (and those are reused across many
>>  different queries).
>>  Can you give an example of the complete request you were sending that
>>  takes a long time?
>> 
>>  -Yonik
>>  http://www.lucidimagination.com
>> 
>


Re: conditional field weighting

2012-01-18 Thread Jack Kanaska
Hi Tim,

Unfortunately that's not what I am looking for.  I understand how to use
the relevancy of a field as described in that example, but it doesn't do
what I asked, which is conditional field weighting.

The difference is that specifying a query with something like

&qf=name^10 description^5 location^3

would apply all those weights to each field if the term was found in all
three fields.

what I want to happen is:

1) if term is found in name field, then use weights exactly as shown.

2) if term is not found in name, but is found in description and location,
then the weights should be adjusted to the equivalent of:
&qf=description^10 location^5

3) if term is not found in name or description but is found in location,
then the weights should be adjusted to be the equivalent of:
&qf=location^10


the goal of what I'm trying to do is to have a priority/importance order
for the fields and apply the highest weight to the first field that matches
(going in order of importance), then apply progressively less weight to each
remaining field that matches.

so far my research seems to indicate that this is not possible with any
current solr feature ...

thanks anyway,

jack



On Wed, Jan 18, 2012 at 6:16 PM, csscouter  wrote:

> Jack,
>
> Did you see this response to a similar question? I think this is how to
> refer to it:
>
> http://lucene.472066.n3.nabble.com/How-to-boost-the-relevancy-of-a-field-tp3671020p3671020.html
> How to boost the relevancy of a field
>
>
> I have / had a similar question to yours, and the response to this question
> seemed relevant.
>
> Tim
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/conditional-field-weighting-tp3670544p3671201.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


RE: How to accelerate your Solr-Lucene appication by 4x

2012-01-18 Thread Steven A Rowe
Why Jason, I declare, whatever do you mean?


> -Original Message-
> From: Jason Rutherglen [mailto:jason.rutherg...@gmail.com]
> Sent: Wednesday, January 18, 2012 8:29 PM
> To: solr-user@lucene.apache.org
> Subject: Re: How to accelerate your Solr-Lucene appication by 4x
> 
> Steven,
> 
> If you are going to admonish people for advertising, it should be
> equally dished out or not at all.
> 
> On Wed, Jan 18, 2012 at 6:38 PM, Steven A Rowe  wrote:
> > Hi Peter,
> >
> > Commercial solicitations are taboo here, except in the context of a
> request for help that is directly relevant to a product or service.
> >
> > Please don’t do this again.
> >
> > Steve Rowe
> >
> > From: Peter Velikin [mailto:pe...@velobit.com]
> > Sent: Wednesday, January 18, 2012 6:33 PM
> > To: solr-user@lucene.apache.org
> > Subject: How to accelerate your Solr-Lucene appication by 4x
> >
> > Hello Solr users,
> >
> > Did you know that you can boost the performance of your Solr application
> using your existing servers? All you need is commodity SSD and plug-and-
> play software like VeloBit.
> >
> > At ZoomInfo, a leading business information provider, VeloBit increased
> the performance of the Solr-Lucene-powered application by 4x.
> >
> > I would love to tell you more about VeloBit and find out if we can
> deliver same business benefits at your company. Click
> here for a 15-minute
> briefing on the VeloBit
> technology.
> >
> > Here is more information on how VeloBit helped ZoomInfo:
> >
> >  *   Increased Solr-Lucene performance by 4x using existing servers and
> commodity SSD
> >  *   Installed VeloBit plug-and-play SSD caching software in 5-minutes
> transparent to running applications and storage infrastructure
> >  *   Reduced by 75% the hardware and monthly operating costs required to
> support service level agreements
> >
> > Technical Details:
> >
> >  *   Environment: Solr‐Lucene indexed directory search service fronted
> by J2EE web application technology
> >  *   Index size: 600 GB
> >  *   Number of items indexed: 50 million
> >  *   Primary storage: 6 x SAS HDD
> >  *   SSD Cache: VeloBit software + OCZ Vertex 3
> >
> > Click here to read more about the ZoomInfo Solr-Lucene case study.
> >
> > You can also sign up for our Early Access Program and try VeloBit HyperCache for free.
> >
> > Also, feel free to write to me directly at
> pe...@velobit.com.
> >
> > Best regards,
> >
> > Peter Velikin
> > VP Online Marketing, VeloBit, Inc.
> > pe...@velobit.com
> > tel. 978-263-4800
> > mob. 617-306-7165
> > VeloBit provides plug & play SSD caching software that dramatically
> accelerates applications at a remarkably low cost. The software installs
> seamlessly in less than 10 minutes and automatically tunes for fastest
> application speed. Visit www.velobit.com for
> details.


RE: How to accelerate your Solr-Lucene appication by 4x

2012-01-18 Thread Steven A Rowe
Hi Darren,

I think it's rare because it's rare: if this were found to be a useful 
advertising space, rare would cease to be descriptive of it.  But I could be 
wrong.

Steve

> -Original Message-
> From: Darren Govoni [mailto:dar...@ontrenet.com]
> Sent: Wednesday, January 18, 2012 8:40 PM
> To: solr-user@lucene.apache.org
> Subject: Re: How to accelerate your Solr-Lucene appication by 4x
> 
> And to be honest, many people on this list are professionals who not
> only build their own solutions, but also buy tools and tech.
> 
> I don't see what the big deal is if some clever company has something of
> imminent value here to share it. Considering that it's a rare event.
> 
> On 01/18/2012 08:28 PM, Jason Rutherglen wrote:
> > Steven,
> >
> > If you are going to admonish people for advertising, it should be
> > equally dished out or not at all.
> >
> > On Wed, Jan 18, 2012 at 6:38 PM, Steven A Rowe  wrote:
> >> Hi Peter,
> >>
> >> Commercial solicitations are taboo here, except in the context of a
> request for help that is directly relevant to a product or service.
> >>
> >> Please don’t do this again.
> >>
> >> Steve Rowe
> >>
> >> From: Peter Velikin [mailto:pe...@velobit.com]
> >> Sent: Wednesday, January 18, 2012 6:33 PM
> >> To: solr-user@lucene.apache.org
> >> Subject: How to accelerate your Solr-Lucene appication by 4x
> >>
> >> Hello Solr users,
> >>
> >> Did you know that you can boost the performance of your Solr
> application using your existing servers? All you need is commodity SSD and
> plug-and-play software like VeloBit.
> >>
> >> At ZoomInfo, a leading business information provider, VeloBit increased
> the performance of the Solr-Lucene-powered application by 4x.
> >>
> >> I would love to tell you more about VeloBit and find out if we can
> deliver same business benefits at your company. Click
> here  for a 15-minute
> briefing  on the VeloBit
> technology.
> >>
> >> Here is more information on how VeloBit helped ZoomInfo:
> >>
> >>   *   Increased Solr-Lucene performance by 4x using existing servers
> and commodity SSD
> >>   *   Installed VeloBit plug-and-play SSD caching software in 5-minutes
> transparent to running applications and storage infrastructure
> >>   *   Reduced by 75% the hardware and monthly operating costs required
> to support service level agreements
> >>
> >> Technical Details:
> >>
> >>   *   Environment: Solr‐Lucene indexed directory search service fronted
> by J2EE web application technology
> >>   *   Index size: 600 GB
> >>   *   Number of items indexed: 50 million
> >>   *   Primary storage: 6 x SAS HDD
> >>   *   SSD Cache: VeloBit software + OCZ Vertex 3
> >>
> >> Click here to read more about the ZoomInfo Solr-Lucene case study.
> >>
> >> You can also sign up for our Early Access Program and try VeloBit HyperCache for free.
> >>
> >> Also, feel free to write to me directly at
> pe...@velobit.com.
> >>
> >> Best regards,
> >>
> >> Peter Velikin
> >> VP Online Marketing, VeloBit, Inc.
> >> pe...@velobit.com
> >> tel. 978-263-4800
> >> mob. 617-306-7165
> >> VeloBit provides plug & play SSD caching software that dramatically
> accelerates applications at a remarkably low cost. The software installs
> seamlessly in less than 10 minutes and automatically tunes for fastest
> application speed. Visit www.velobit.com  for
> details.



Re: How to accelerate your Solr-Lucene appication by 4x

2012-01-18 Thread Jason Rutherglen
Steven,

Fun-NY...

17 hits for this spam:

http://search-lucene.com/?q=%22Performance+Monitoring+SaaS+for+Solr%22

Though this was already partially discussed with Chris @ fucu.org
which according to him, should have already been moved to Lucene
General.

On Wed, Jan 18, 2012 at 11:04 PM, Steven A Rowe  wrote:
> Why Jason, I declare, whatever do you mean?
>
>
>> -Original Message-
>> From: Jason Rutherglen [mailto:jason.rutherg...@gmail.com]
>> Sent: Wednesday, January 18, 2012 8:29 PM
>> To: solr-user@lucene.apache.org
>> Subject: Re: How to accelerate your Solr-Lucene appication by 4x
>>
>> Steven,
>>
>> If you are going to admonish people for advertising, it should be
>> equally dished out or not at all.
>>
>> On Wed, Jan 18, 2012 at 6:38 PM, Steven A Rowe  wrote:
>> > Hi Peter,
>> >
>> > Commercial solicitations are taboo here, except in the context of a
>> request for help that is directly relevant to a product or service.
>> >
>> > Please don’t do this again.
>> >
>> > Steve Rowe
>> >
>> > From: Peter Velikin [mailto:pe...@velobit.com]
>> > Sent: Wednesday, January 18, 2012 6:33 PM
>> > To: solr-user@lucene.apache.org
>> > Subject: How to accelerate your Solr-Lucene appication by 4x
>> >
>> > Hello Solr users,
>> >
>> > Did you know that you can boost the performance of your Solr application
>> using your existing servers? All you need is commodity SSD and plug-and-
>> play software like VeloBit.
>> >
>> > At ZoomInfo, a leading business information provider, VeloBit increased
>> the performance of the Solr-Lucene-powered application by 4x.
>> >
>> > I would love to tell you more about VeloBit and find out if we can
>> deliver same business benefits at your company. Click
>> here for a 15-minute
>> briefing on the VeloBit
>> technology.
>> >
>> > Here is more information on how VeloBit helped ZoomInfo:
>> >
>> >  *   Increased Solr-Lucene performance by 4x using existing servers and
>> commodity SSD
>> >  *   Installed VeloBit plug-and-play SSD caching software in 5-minutes
>> transparent to running applications and storage infrastructure
>> >  *   Reduced by 75% the hardware and monthly operating costs required to
>> support service level agreements
>> >
>> > Technical Details:
>> >
>> >  *   Environment: Solr‐Lucene indexed directory search service fronted
>> by J2EE web application technology
>> >  *   Index size: 600 GB
>> >  *   Number of items indexed: 50 million
>> >  *   Primary storage: 6 x SAS HDD
>> >  *   SSD Cache: VeloBit software + OCZ Vertex 3
>> >
>> > Click here to read more about the ZoomInfo Solr-Lucene case study.
>> >
>> > You can also sign up for our Early Access Program and try VeloBit HyperCache for free.
>> >
>> > Also, feel free to write to me directly at
>> pe...@velobit.com.
>> >
>> > Best regards,
>> >
>> > Peter Velikin
>> > VP Online Marketing, VeloBit, Inc.
>> > pe...@velobit.com
>> > tel. 978-263-4800
>> > mob. 617-306-7165
>> > VeloBit provides plug & play SSD caching software that dramatically
>> accelerates applications at a remarkably low cost. The software installs
>> seamlessly in less than 10 minutes and automatically tunes for fastest
>> application speed. Visit www.velobit.com for
>> details.


Ngram autocompleter and term frequency boosting

2012-01-18 Thread Cuong Hoang
Hi guys,

I'm trying to build a Ngram-based autocompleter that takes term frequency
into account.

Let's say I have the following documents:

D1: title => "Java Developer"
D2: title => "Java Programmer"
D3: title => "Java Developer"

When the user types in "Java", I want to display

1. "Java Developer"
2. "Java Programmer"

Basically "Java Developer" ranks first because it appears twice in the
index while "Java Programmer" only appears once. Is it possible?

I'm using the following config for "title" field:

(fieldType definition stripped from the archived message)

Thanks
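
The stripped block above is not recoverable from the archive; a typical edge n-gram autocomplete type looks roughly like the sketch below. The type name, tokenizer choice and gram sizes are assumptions for illustration, not the poster's actual settings.

    <!-- index side emits leading-edge grams ("j", "ja", "jav", "java", ...);
         the query side stays un-grammed so "Java" matches the stored grams -->
    <fieldType name="title_autocomplete" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25" side="front"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>

A field type like this only handles the prefix matching; ordering completions by how many documents contain them, which is what the question asks, is a separate problem.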


Re: How to accelerate your Solr-Lucene appication by 4x

2012-01-18 Thread Ted Dunning
On Thu, Jan 19, 2012 at 1:40 AM, Darren Govoni  wrote:

> And to be honest, many people on this list are professionals who not only
> build their own solutions, but also buy tools and tech.
>
> I don't see what the big deal is if some clever company has something of
> imminent value here to share it. Considering that it's a rare event.
>

I would consider it if it were of eminent value, but not if it were
imminent or immanent.

Seriously, let's set the bar that blatantly commercial postings be at least
responsive to something as opposed to just spam that happens to be slightly
related to the mailing list.

For instance, if a SAS rep wanted to post an answer of the form "Mahout
doesn't, but SAS does" on the Mahout mailing list, I would be thrilled.  If
they posted their monthly newsletter, I would be pissed.  The first kind of
answer adds value, the second siphons value off.


RE: Question on Reverse Indexing

2012-01-18 Thread Shyam Bhaskaran
Dimitry,

I downloaded Luke but it was not working for me against solr indexes.

But using the solr analysis page I did not find any reversed sequences on the 
field.

-Shyam


-Original Message-
From: Shyam Bhaskaran [mailto:shyam.bhaska...@synopsys.com] 
Sent: Thursday, January 19, 2012 6:29 AM
To: solr-user@lucene.apache.org
Subject: RE: Question on Reverse Indexing

Dimitry,

Completed a clean index and I still see the same behavior.

Did not use Luke, but from the search page we use, leading wildcard search is 
working.

-Shyam

-Original Message-
From: Dmitry Kan [mailto:dmitry@gmail.com] 
Sent: Wednesday, January 18, 2012 5:07 PM
To: solr-user@lucene.apache.org
Subject: Re: Question on Reverse Indexing

Shyam,

You still didn't say whether you started re-indexing from a clean index,
i.e. whether you removed all the data prior to re-indexing.
You can use Luke (http://code.google.com/p/luke/) to check the contents
of your text field and see if it still contains reversed sequences.

On Wed, Jan 18, 2012 at 1:09 PM, Shyam Bhaskaran <
shyam.bhaska...@synopsys.com> wrote:

> Dimitry,
>
> We are using Solr 4.0. To rule out server caching issues I have restarted
> our Tomcat web server after performing a re-index.
>
> For reverse indexing we have defined a fieldType "text_rev" and this
> fieldType was used against the fields.
>
>   omitNorms="true">
> 
> class="com.es.solr.backend.analysis.standard.SolvNetTokenizerFactory"/>
> words="stopwords.txt" ignoreCase="true"/>
> class="com.es.solr.backend.analysis.standard.SolvNetFilterFactory"/>
> synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
> class="com.es.solr.backend.analysis.standard.SpecialCharSynonymFilterFactory"/>
>
>  withOriginal="true"
>maxPosAsterisk="3" maxPosQuestion="2"
> maxFractionAsterisk="0.33"/>
>  
> 
> class="com.es.solr.backend.analysis.standard.SolvNetTokenizerFactory"/>
> words="stopwords.txt" ignoreCase="true"/>
> class="com.es.solr.backend.analysis.standard.SolvNetFilterFactory"/>
>
> words="stopwords.txt" ignoreCase="true"/>
> 
>  
>
> But when it was found that ReversedWildcardFilterFactory was adding a
> performance burden, we removed the ReversedWildcardFilterFactory filter
> <filter class="solr.ReversedWildcardFilterFactory" withOriginal="true"
>         maxPosAsterisk="3" maxPosQuestion="2" maxFractionAsterisk="0.33"/>
> and the whole collection was re-indexed.
>
> But even after removing the ReversedWildcardFilterFactory leading wild
> card search like *lock is working.
>
> -Shyam
>
> -Original Message-
> From: Dmitry Kan [mailto:dmitry@gmail.com]
> Sent: Wednesday, January 18, 2012 4:26 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Question on Reverse Indexing
>
> OK. Not sure what your system architecture is there, but could your queries
> stay cached in some server caches even after you have re-indexed your data?
> The way the index-level leading wildcard works (reading Solr 3.4 code, but it
> seems to have been true since circa 1.4) is that the following check is done
> for the analysis chain:
>
> [code src=SolrQueryParser.java]
> boolean allow = false;
> ...
>  if (factory instanceof ReversedWildcardFilterFactory) {
>allow = true;
>...
>  }
> ...
>if (allow) {
>  setAllowLeadingWildcard(true);
>}
> [/code]
>
> so practically what you described can happen if
> the ReversedWildcardFilterFactory is still mentioned in one of your shards.
> A weird question, but have you reindexed your data to a clean index or on
> top of the existing one?
>
> On Wed, Jan 18, 2012 at 12:35 PM, Shyam Bhaskaran <
> shyam.bhaska...@synopsys.com> wrote:
>
> > Dimitry,
> >
> > Using http://localhost:7070/solr/docs/admin/analysis.jsp I passed the query
> > *lock and did not find ReversedWildcardFilterFactory in the indexer chain or
> > any other filters that could do the reversing.
> >
> > -Shyam
> >
> > -Original Message-
> > From: Dmitry Kan [mailto:dmitry@gmail.com]
> > Sent: Wednesday, January 18, 2012 2:26 PM
> > To: solr-user@lucene.apache.org
> > Subject: Re: Question on Reverse Indexing
> >
> > Just to play it safe here, can you double-check that the reversing is no
> > longer happening by issuing a query through the admin analysis page?
> >
> > Dmitry
> >
> > On Wed, Jan 18, 2012 at 4:23 AM, Shyam Bhaskaran <
> > shyam.bhaska...@synopsys.com> wrote:
> >
> > > Hi Francois,
> > >
> > > I understand that disabling of ReversedWildcardFilterFactory has
> improved
> > > the performance.
> > >
> > > But I am puzzled over how the leading wild card search like *lock is
> > > working even though I have now disabled the
> ReversedWildcardFilterFactory
> > > and the indexes have been created without ReversedWildcardFilter ?
> > >
> > > How does reverse indexing work e

Re: "index-time" over boosted

2012-01-18 Thread remi tassing
Hi,

Just some background on my setup: I'm crawling with Nutch 1.2, and I tried
Solr 1.4 and Solr 3.5 with the same result. Solr is still using the default
settings.

I found this problem just by accident. I queried "mobile broadband"; page
A has 2 occurrences and scores higher than page B, which has 19 occurrences. I
found it weird and that's why I started investigating.

The debug results are given below. You can see that queryWeight, idf
and queryNorm are the same; tf is higher in B, as expected, but what makes
the difference is clearly fieldNorm.

A:
0.010779975 = (MATCH) weight(content:"mobil broadband" in 18730), product of:
  1.0 = queryWeight(content:"mobil broadband"), product of:
    6.2444286 = idf(content: mobil=4922 broadband=2290)
    0.16014275 = queryNorm
  0.010779975 = fieldWeight(content:"mobil broadband" in 18730), product of:
    1.4142135 = tf(phraseFreq=2.0)
    6.2444286 = idf(content: mobil=4922 broadband=2290)
    0.0012207031 = fieldNorm(field=content, doc=18730)

B:
8.5223187E-4 = (MATCH) weight(content:"mobil broadband" in 14391), product of:
  1.0 = queryWeight(content:"mobil broadband"), product of:
    6.2444286 = idf(content: mobil=4922 broadband=2290)
    0.16014275 = queryNorm
  8.5223187E-4 = fieldWeight(content:"mobil broadband" in 14391), product of:
    4.472136 = tf(phraseFreq=20.0)
    6.2444286 = idf(content: mobil=4922 broadband=2290)
    3.0517578E-5 = fieldNorm(field=content, doc=14391)
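
The fieldNorm factor is the length normalization (multiplied by any index-time boost) that Lucene stored for the field, which is what penalizes the much longer page B here. If length normalization is not wanted, the usual route is omitNorms plus a full re-index; a minimal schema.xml sketch, with the field name and type assumed for illustration:

    <!-- omitNorms="true" stops storing length normalization (and index-time
         boost) for this field, so long documents are no longer penalized -->
    <field name="content" type="text" indexed="true" stored="true" omitNorms="true"/>

Norms that were already written stay in the index until the documents are re-indexed, which may be why flipping omitNorms seemed to change nothing.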

Remi

On Wed, Jan 18, 2012 at 8:52 PM, Jan Høydahl  wrote:

> > I've come across a problem where newly indexed pages almost always come
> > first even when the term frequency is relatively low.
>
> There is no inherent index-time boost, so this must be something else.
> Can you give us an example of a query? Which query parser do you use?
>
> > I read the posts below on "fieldNorm" and "omitNorms" but setting
> > "omitNorms=true" doesn't change anything for me on the calculation of
> > fieldNorm.
>
> Are you sure you have spelled omitNorms="true" correctly, then restarted
> Solr (to refresh config)? The effect of Norms on your score will be that
> shorter fields score higher than long fields.
>
> Perhaps you can instead tell us your use case. What kind of ranking
> are you trying to achieve? Then we can help suggest how to get there.
>
> --
> Jan Høydahl, search solution architect
> Cominvent AS - www.cominvent.com
> Solr Training - www.solrtraining.com


RE: Question on Reverse Indexing

2012-01-18 Thread Shyam Bhaskaran
Dimitry,

I have used lukeall-3.5.0.jar and when trying to open the index it gives me the 
error "No Valid Directory at the location, try another location"

When using the below command I see this error "luke 
java.lang.ArrayIndexOutOfBoundsException: 1"
java -cp C:\lukeall-3.5.0.jar org.getopt.luke.Luke -index 
C:\solr\home\data\docs_index\index\ 

We are using Solr 4.0

-Shyam

-Original Message-
From: Shyam Bhaskaran [mailto:shyam.bhaska...@synopsys.com] 
Sent: Thursday, January 19, 2012 11:49 AM
To: solr-user@lucene.apache.org
Subject: RE: Question on Reverse Indexing

Dimitry,

I downloaded Luke but it was not working for me against solr indexes.

But using the solr analysis page I did not find any reversed sequences on the 
field.

-Shyam


-Original Message-
From: Shyam Bhaskaran [mailto:shyam.bhaska...@synopsys.com] 
Sent: Thursday, January 19, 2012 6:29 AM
To: solr-user@lucene.apache.org
Subject: RE: Question on Reverse Indexing

Dimitry,

Completed a clean index and I still see the same behavior.

Did not use Luke, but from the search page we use, leading wildcard search is 
working.

-Shyam

-Original Message-
From: Dmitry Kan [mailto:dmitry@gmail.com] 
Sent: Wednesday, January 18, 2012 5:07 PM
To: solr-user@lucene.apache.org
Subject: Re: Question on Reverse Indexing

Shyam,

You still didn't say whether you started re-indexing from a clean index,
i.e. whether you removed all the data prior to re-indexing.
You can use Luke (http://code.google.com/p/luke/) to check the contents
of your text field and see if it still contains reversed sequences.

On Wed, Jan 18, 2012 at 1:09 PM, Shyam Bhaskaran <
shyam.bhaska...@synopsys.com> wrote:

> Dimitry,
>
> We are using Solr 4.0. To rule out server caching issues I have restarted
> our Tomcat web server after performing a re-index.
>
> For reverse indexing we have defined a fieldType "text_rev" and this
> fieldType was used against the fields.
>
>   omitNorms="true">
> 
> class="com.es.solr.backend.analysis.standard.SolvNetTokenizerFactory"/>
> words="stopwords.txt" ignoreCase="true"/>
> class="com.es.solr.backend.analysis.standard.SolvNetFilterFactory"/>
> synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
> class="com.es.solr.backend.analysis.standard.SpecialCharSynonymFilterFactory"/>
>
>  withOriginal="true"
>maxPosAsterisk="3" maxPosQuestion="2"
> maxFractionAsterisk="0.33"/>
>  
> 
> class="com.es.solr.backend.analysis.standard.SolvNetTokenizerFactory"/>
> words="stopwords.txt" ignoreCase="true"/>
> class="com.es.solr.backend.analysis.standard.SolvNetFilterFactory"/>
>
> words="stopwords.txt" ignoreCase="true"/>
> 
>  
>
> But when it was found that ReversedWildcardFilterFactory was adding a
> performance burden, we removed the ReversedWildcardFilterFactory filter
> <filter class="solr.ReversedWildcardFilterFactory" withOriginal="true"
>         maxPosAsterisk="3" maxPosQuestion="2" maxFractionAsterisk="0.33"/>
> and the whole collection was re-indexed.
>
> But even after removing the ReversedWildcardFilterFactory leading wild
> card search like *lock is working.
>
> -Shyam
>
> -Original Message-
> From: Dmitry Kan [mailto:dmitry@gmail.com]
> Sent: Wednesday, January 18, 2012 4:26 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Question on Reverse Indexing
>
> OK. Not sure what your system architecture is there, but could your queries
> stay cached in some server caches even after you have re-indexed your data?
> The way the index-level leading wildcard works (reading Solr 3.4 code, but it
> seems to have been true since circa 1.4) is that the following check is done
> for the analysis chain:
>
> [code src=SolrQueryParser.java]
> boolean allow = false;
> ...
>  if (factory instanceof ReversedWildcardFilterFactory) {
>allow = true;
>...
>  }
> ...
>if (allow) {
>  setAllowLeadingWildcard(true);
>}
> [/code]
>
> so practically what you described can happen if
> the ReversedWildcardFilterFactory is still mentioned in one of your shards.
> A weird question, but have you reindexed your data to a clean index or on
> top of the existing one?
>
> On Wed, Jan 18, 2012 at 12:35 PM, Shyam Bhaskaran <
> shyam.bhaska...@synopsys.com> wrote:
>
> > Dimitry,
> >
> > Using http://localhost:7070/solr/docs/admin/analysis.jsp I passed the query
> > *lock and did not find ReversedWildcardFilterFactory in the indexer chain or
> > any other filters that could do the reversing.
> >
> > -Shyam
> >
> > -Original Message-
> > From: Dmitry Kan [mailto:dmitry@gmail.com]
> > Sent: Wednesday, January 18, 2012 2:26 PM
> > To: solr-user@lucene.apache.org
> > Subject: Re: Question on Reverse Indexing
> >
> > Just to play safe here, can you double check that the reversing is not
> any
> > more th