Re: Solr - Tomcat new versions
Hi, I installed Apache Tomcat on Windows (Vista) and Solr, but I have a problem between Tomcat 7.0.23 and Solr 3.5. There is no problem if I install Solr 1.4.1 with the same version of Tomcat. (I checked it with both the binary and the source code installation of Tomcat, but the result is the same.) Is it a bug? thank you Alessio
Re: SolrJ Embedded
On Tue, Jan 17, 2012 at 3:13 AM, Erick Erickson wrote: > I don't see why not. I'm assuming a *nix system here so when Solr > updated an index, any deleted files would hang around. > > But I have to ask why bother with the Embedded server in the > first place? You already have a Solr instance up and running, > why not just query that instead, perhaps using SolrJ? > > Wouldn't querying the Solr server using the HTTP interface be slower? > Best > Erick > > On Mon, Jan 16, 2012 at 3:00 PM, wrote: > > Hi, > > > > is it possible to use the same index in a solr webapp and additionally > in a > > EmbeddedSolrServer? The embbedded one would be read only. > > > > Thank you. > > >
Re: Query regarding solr custom sort order
Hi, Let me clarify the situation here in detail. The default sort which WebSphere Commerce provides is based on the name & price of an item, and we have unique values for every item, so sorting works fine either as integer or as string. But during preprocessing we generate some temporary tables like TI_CATGPENREL_0, where the sequence number has multiple values for a particular catentry id (item). This field (sequence) is declared as varchar because it can contain multiple values separated by ";", which Solr returns, hence sorting based on sequence happens lexicographically as discussed in the thread. So, can we restrict it to send single values based on a certain category ID or something like that? Thanks in advance, Uma Shankar -- View this message in context: http://lucene.472066.n3.nabble.com/Query-regarding-solr-custom-sort-order-tp3631854p3665545.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Solr - Tomcat new versions
Hi Alessio, I've seen Solr 3.5 running within Tomcat 7.0.23, it shouldn't be a bug I guess. Could you please provide some more details about the problem you have? Do you have a stacktrace? Are you upgrading an existing Solr 1.4.1, right? By the way, which jdk are you using? Thanks Luca On Tue, Jan 17, 2012 at 9:40 AM, Alessio Crisantemi < alessio.crisant...@gioconews.it> wrote: > Hi, > I installed Apache tomct on Windows (Vista) and Solr. > But I have any problem between Tomcat 7.0.23 and Solr 3.5 > > No problem if I install Solr 1.4.1 with the same version of Tomcat. > (I check it with binary and source code installation for omcat but the > result is the same). > It's a bug, I think? > thank you > Alessio > > > > > > > > > -- Luca Cavanna E-mail: cavannal...@gmail.com Skype: just.cavanna Italian Mobile: +39 329 0170084 Dutch Mobile: +31 6 22255262 Website: http://www.javanna.net Stack Overflow Careers: http://careers.stackoverflow.com/lucacavanna Linkedin: http://it.linkedin.com/in/lucacavanna
Re: Solr - Tomcat new versions
Dear Luca, I followed the Solr installation procedure described in the official guide, but it doesn't work with Solr 3.5, while with Solr 1.4.1 everything is all right. I don't know why... but for now I work with Solr 1.4.1. One more thing: I would like to install Tika 1.0 on Solr 1.4.1. Is that possible? How can I do it? Can you help me? best, a. --- -Original Message- From: Luca Cavanna Sent: Tuesday, January 17, 2012 10:16 AM To: solr-user@lucene.apache.org ; Alessio Crisantemi Subject: Re: Solr - Tomcat new versions Hi Alessio, I've seen Solr 3.5 running within Tomcat 7.0.23, it shouldn't be a bug I guess. Could you please provide some more details about the problem you have? Do you have a stacktrace? Are you upgrading an existing Solr 1.4.1, right? By the way, which jdk are you using? Thanks Luca On Tue, Jan 17, 2012 at 9:40 AM, Alessio Crisantemi < alessio.crisant...@gioconews.it> wrote: Hi, I installed Apache tomct on Windows (Vista) and Solr. But I have any problem between Tomcat 7.0.23 and Solr 3.5 No problem if I install Solr 1.4.1 with the same version of Tomcat. (I check it with binary and source code installation for omcat but the result is the same). It's a bug, I think? thank you Alessio -- Luca Cavanna E-mail: cavannal...@gmail.com Skype: just.cavanna Italian Mobile: +39 329 0170084 Dutch Mobile: +31 6 22255262 Website: http://www.javanna.net Stack Overflow Careers: http://careers.stackoverflow.com/lucacavanna Linkedin: http://it.linkedin.com/in/lucacavanna
Re: FacetComponent: suppress original query
Yes, that's what I have started to use already. Probably, this is the easiest solution. Thanks. On Tue, Jan 17, 2012 at 3:03 AM, Erick Erickson wrote: > Why not just up the maxBooleanClauses parameter in solrconfig.xml? > > Best > Erick > > On Sat, Jan 14, 2012 at 1:41 PM, Dmitry Kan wrote: > > OK, let me clarify it: > > > > if solrconfig has maxBooleanClauses set to 1000 for example, than queries > > with clauses more than 1000 in number will be rejected with the mentioned > > exception. > > What I want to do is automatically split such queries into sub-queries > with > > at most 1000 clauses inside SOLR and send them to shards. I have already > > done the splitting and sending code, but how to bypass the > > maxBooleanClauses check? > > > > Dmitry > > > > On Fri, Jan 13, 2012 at 7:40 PM, Chris Hostetter > > wrote: > > > >> > >> : I would like to "by-pass" the maxBooleanClauses limit in such a way, > that > >> : those queries that contain boolean clauses more than > maxBooleanClauses in > >> : the number, would be automatically split into sub-queries. That part > is > >> : done. > >> : > >> : Now, when such a query arrives, solr throws > >> : > >> : org.apache.lucene.queryParser.ParseException: Cannot parse > >> : 'AccessionNumber:(TS-E_284668 OR TS-E_284904 OR 950123-11-086962 > OR > >> : TS-AS_292840 OR TS-AS_295661 OR TS-AS_296320 OR TS-AS_296805 OR > >> : TS-AS_296819 OR TS-AS_296820)': too many boolean clauses > >> > >> I don't understand your question/issue ... you say you've already worked > >> arround the maxBooleanClauses (ie: "That part is done") but you didn't > say > >> how, and in your followup quesiton, it sounds like you are still hitting > >> the limit of maxBooleanClauses. > >> > >> So what exactly have you changed/done that is "done" and what is the > >> new problem? > >> > >> > >> -Hoss > >> > > > > > > > > -- > > Regards, > > > > Dmitry Kan > -- Regards, Dmitry Kan
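[For reference, the setting Erick refers to lives in the <query> section of solrconfig.xml; a minimal sketch follows, where the value 4096 is only an illustrative choice, not a recommendation:

<config>
  <query>
    <!-- raise the cap on the number of clauses allowed in a single BooleanQuery -->
    <maxBooleanClauses>4096</maxBooleanClauses>
  </query>
</config>

Note the limit still exists to protect against runaway queries, so raising it trades safety for convenience.]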
Re: Solr - Tomcat new versions
Hi Alessio, in order to help you, we'd need to know something more about what's going wrong. Could you give us a stacktrace or an error you're reading? How do you know solr isn't working? Thanks Luca On Tue, Jan 17, 2012 at 10:52 AM, Alessio Crisantemi < alessio.crisant...@gioconews.it> wrote: > Dear Luca, > I follow the Solr installation procedures signed on Official guide, but > with Solr 3,5 don't works. While with solr 1.4.1 it's all right. > > I don't know why...but now I work with Solr 1.4.1 > > and more: > I would install TIKA 1.0 on Solr 1.4.1. Is possible? > How can i do? can you help me? > best, > a. > > > --**--** > --**- > > > > > > > > > -Messaggio originale- From: Luca Cavanna > Sent: Tuesday, January 17, 2012 10:16 AM > To: solr-user@lucene.apache.org ; Alessio Crisantemi > Subject: Re: Solr - Tomcat new versions > > Hi Alessio, > I've seen Solr 3.5 running within Tomcat 7.0.23, it shouldn't be a bug I > guess. > Could you please provide some more details about the problem you have? Do > you have a stacktrace? > Are you upgrading an existing Solr 1.4.1, right? > By the way, which jdk are you using? > > Thanks > Luca > > On Tue, Jan 17, 2012 at 9:40 AM, Alessio Crisantemi < > alessio.crisantemi@gioconews.**it > > wrote: > > Hi, >> I installed Apache tomct on Windows (Vista) and Solr. >> But I have any problem between Tomcat 7.0.23 and Solr 3.5 >> >> No problem if I install Solr 1.4.1 with the same version of Tomcat. >> (I check it with binary and source code installation for omcat but the >> result is the same). >> It's a bug, I think? >> thank you >> Alessio >> >> >> >> >> >> >> >> >> >>
really slow performance when trying to get facet.field
Hi, I have 2 Solr shards. One is filled with approx. 25mio documents (local index 6GB), the other with 10mio documents (2.7GB size). I am trying to create some kind of 'word cloud' to see the frequency of words for a *text_general* field. For this I am currently using a facet over this field, and I am also restricting the documents by using some other filters in the query. The performance is really bad for the first call and then pretty fast for the following calls. The maximum Java heap size is 3G for each shard. Both shards are running on the same physical server, which has 12G RAM. Question: Should I reduce the documents in one shard, so that the index is equal to or less than the Java heap size for this shard? Or is there another method to avoid these slow calls? Thank you Daniel
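[A request of the shape described above might look roughly like the following; the field and filter names (content, site_id) are made up for illustration since the real schema isn't shown in the thread:

http://localhost:8983/solr/select?q=*:*&fq=site_id:1234&rows=0&facet=true&facet.field=content&facet.limit=100&facet.mincount=2

Faceting a free-text field like this has to build large term/field caches the first time, which is why the cost is mostly paid on the first call.]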
Re: really slow performance when trying to get facet.field
I had a similar problem for a similar task. And in my case merging the results from two shards turned out to be a culprit. If you can logically store your data just in one shard, your faceting should become faster. Size wise it should not be a problem for SOLR. Also, you didn't say anything about the facet.limit value, cache parameters, usage of filter queries. Some of these can be interconnected. Dmitry On Tue, Jan 17, 2012 at 2:49 PM, Daniel Bruegge < daniel.brue...@googlemail.com> wrote: > Hi, > > I have 2 Solr-shards. One is filled with approx. 25mio documents (local > index 6GB), the other with 10mio documents (2.7GB size). > I am trying to create some kind of 'word cloud' to see the frequency of > words for a *text_general *field. > For this I am currently using a facet over this field and I am also > restricting the documents by using some other filters in the query. > > The performance is really bad for the first call and then pretty fast for > the following calls. > > The maximum Java heap size is 3G for each shard. Both shards are running on > the same physical server which has 12G RAM. > > Question: Should I reduce the documents in one shard, so that the index is > equal or less the Java Heap size for this shard? Or is > there another method to avoid this slow calls? > > Thank you > > Daniel > -- Regards, Dmitry Kan
Re: Solr - Tomcat new versions
Perhaps this the known issue with the 3.5 example schema being used in Tomcat and the VelocityResponseWriter issue? I'm on my mobile now so don't have easy access to a pointer with details but check the archives if this seems to be the issue on how to resolve it. Erik On Jan 17, 2012, at 4:52, "Alessio Crisantemi" wrote: > Dear Luca, > I follow the Solr installation procedures signed on Official guide, but with > Solr 3,5 don't works. While with solr 1.4.1 it's all right. > > I don't know why...but now I work with Solr 1.4.1 > > and more: > I would install TIKA 1.0 on Solr 1.4.1. Is possible? > How can i do? can you help me? > best, > a. > > > --- > > > > > > > > -Messaggio originale- From: Luca Cavanna > Sent: Tuesday, January 17, 2012 10:16 AM > To: solr-user@lucene.apache.org ; Alessio Crisantemi > Subject: Re: Solr - Tomcat new versions > > Hi Alessio, > I've seen Solr 3.5 running within Tomcat 7.0.23, it shouldn't be a bug I > guess. > Could you please provide some more details about the problem you have? Do > you have a stacktrace? > Are you upgrading an existing Solr 1.4.1, right? > By the way, which jdk are you using? > > Thanks > Luca > > On Tue, Jan 17, 2012 at 9:40 AM, Alessio Crisantemi < > alessio.crisant...@gioconews.it> wrote: > >> Hi, >> I installed Apache tomct on Windows (Vista) and Solr. >> But I have any problem between Tomcat 7.0.23 and Solr 3.5 >> >> No problem if I install Solr 1.4.1 with the same version of Tomcat. >> (I check it with binary and source code installation for omcat but the >> result is the same). >> It's a bug, I think? >> thank you >> Alessio >> >> >> >> >> >> >> >> >> > > > -- > Luca Cavanna > E-mail: cavannal...@gmail.com > Skype: just.cavanna > Italian Mobile: +39 329 0170084 > Dutch Mobile: +31 6 22255262 > Website: http://www.javanna.net > Stack Overflow Careers: http://careers.stackoverflow.com/lucacavanna > Linkedin: http://it.linkedin.com/in/lucacavanna
Re: really slow performance when trying to get facet.field
Hi Dmitry, I had everything on one Solr Instance before, but this got to heavy and I had the same issue here, that the 1st facet.query was really slow. When querying the facet: - facet.limit = 100 Cache settings are like this: How big was your index? Did it fit into the RAM which you gave the Solr instance? Thanks On Tue, Jan 17, 2012 at 1:56 PM, Dmitry Kan wrote: > I had a similar problem for a similar task. And in my case merging the > results from two shards turned out to be a culprit. If you can logically > store your data just in one shard, your faceting should become faster. Size > wise it should not be a problem for SOLR. > > Also, you didn't say anything about the facet.limit value, cache > parameters, usage of filter queries. Some of these can be interconnected. > > Dmitry > > On Tue, Jan 17, 2012 at 2:49 PM, Daniel Bruegge < > daniel.brue...@googlemail.com> wrote: > > > Hi, > > > > I have 2 Solr-shards. One is filled with approx. 25mio documents (local > > index 6GB), the other with 10mio documents (2.7GB size). > > I am trying to create some kind of 'word cloud' to see the frequency of > > words for a *text_general *field. > > For this I am currently using a facet over this field and I am also > > restricting the documents by using some other filters in the query. > > > > The performance is really bad for the first call and then pretty fast for > > the following calls. > > > > The maximum Java heap size is 3G for each shard. Both shards are running > on > > the same physical server which has 12G RAM. > > > > Question: Should I reduce the documents in one shard, so that the index > is > > equal or less the Java Heap size for this shard? Or is > > there another method to avoid this slow calls? > > > > Thank you > > > > Daniel > > > > > > -- > Regards, > > Dmitry Kan >
Re: really slow performance when trying to get facet.field
Hi Daniel, My index is 6,5G. I'm sure it can be bigger. facet.limit we ask for is beyond 100 thousand. It is sub-second speed. I run it with -Xms1024m -Xmx12000m under tomcat, it currently takes 5,4G of RAM. Amount of docs is over 6,5 million. Do you see any evictions in your caches? What kind of server is it, in terms of CPU and OS? How often do you commit to the index? Dmitry On Tue, Jan 17, 2012 at 3:01 PM, Daniel Bruegge < daniel.brue...@googlemail.com> wrote: > Hi Dmitry, > > I had everything on one Solr Instance before, but this got to heavy and I > had the same issue here, that the 1st facet.query was really slow. > > When querying the facet: > - facet.limit = 100 > > Cache settings are like this: > > size="16384" > initialSize="4096" > autowarmCount="4096"/> > > size="512" > initialSize="512" > autowarmCount="0"/> > > size="512" > initialSize="512" > autowarmCount="0"/> > > How big was your index? Did it fit into the RAM which you gave the Solr > instance? > > Thanks > > > On Tue, Jan 17, 2012 at 1:56 PM, Dmitry Kan wrote: > > > I had a similar problem for a similar task. And in my case merging the > > results from two shards turned out to be a culprit. If you can logically > > store your data just in one shard, your faceting should become faster. > Size > > wise it should not be a problem for SOLR. > > > > Also, you didn't say anything about the facet.limit value, cache > > parameters, usage of filter queries. Some of these can be interconnected. > > > > Dmitry > > > > On Tue, Jan 17, 2012 at 2:49 PM, Daniel Bruegge < > > daniel.brue...@googlemail.com> wrote: > > > > > Hi, > > > > > > I have 2 Solr-shards. One is filled with approx. 25mio documents (local > > > index 6GB), the other with 10mio documents (2.7GB size). > > > I am trying to create some kind of 'word cloud' to see the frequency of > > > words for a *text_general *field. > > > For this I am currently using a facet over this field and I am also > > > restricting the documents by using some other filters in the query. > > > > > > The performance is really bad for the first call and then pretty fast > for > > > the following calls. > > > > > > The maximum Java heap size is 3G for each shard. Both shards are > running > > on > > > the same physical server which has 12G RAM. > > > > > > Question: Should I reduce the documents in one shard, so that the index > > is > > > equal or less the Java Heap size for this shard? Or is > > > there another method to avoid this slow calls? > > > > > > Thank you > > > > > > Daniel > > > > > > > > > > > -- > > Regards, > > > > Dmitry Kan > > > -- Regards, Dmitry Kan
Re: really slow performance when trying to get facet.field
Evictions are 0 for all cache types. Your server max heap space with 12G is pretty huge. Which is good I think. The CPU on my server is a 8-Core Intel i7 965. Commit frequency is low, because shards are added and old shards exist for historical reasons. Old shards will be then cleaned after couple of months. I will try to add maximum 15mio per shard and see what will happen here. This thing is, that I will add more shards over time, so that I can handle maybe 500-800mio documents. Maybe more. It depends. On Tue, Jan 17, 2012 at 2:14 PM, Dmitry Kan wrote: > Hi Daniel, > > My index is 6,5G. I'm sure it can be bigger. facet.limit we ask for is > beyond 100 thousand. It is sub-second speed. I run it with -Xms1024m > -Xmx12000m under tomcat, it currently takes 5,4G of RAM. Amount of docs is > over 6,5 million. > > Do you see any evictions in your caches? What kind of server is it, in > terms of CPU and OS? How often do you commit to the index? > > Dmitry > > On Tue, Jan 17, 2012 at 3:01 PM, Daniel Bruegge < > daniel.brue...@googlemail.com> wrote: > > > Hi Dmitry, > > > > I had everything on one Solr Instance before, but this got to heavy and I > > had the same issue here, that the 1st facet.query was really slow. > > > > When querying the facet: > > - facet.limit = 100 > > > > Cache settings are like this: > > > > > size="16384" > > initialSize="4096" > > autowarmCount="4096"/> > > > > > size="512" > > initialSize="512" > > autowarmCount="0"/> > > > > > size="512" > > initialSize="512" > > autowarmCount="0"/> > > > > How big was your index? Did it fit into the RAM which you gave the Solr > > instance? > > > > Thanks > > > > > > On Tue, Jan 17, 2012 at 1:56 PM, Dmitry Kan > wrote: > > > > > I had a similar problem for a similar task. And in my case merging the > > > results from two shards turned out to be a culprit. If you can > logically > > > store your data just in one shard, your faceting should become faster. > > Size > > > wise it should not be a problem for SOLR. > > > > > > Also, you didn't say anything about the facet.limit value, cache > > > parameters, usage of filter queries. Some of these can be > interconnected. > > > > > > Dmitry > > > > > > On Tue, Jan 17, 2012 at 2:49 PM, Daniel Bruegge < > > > daniel.brue...@googlemail.com> wrote: > > > > > > > Hi, > > > > > > > > I have 2 Solr-shards. One is filled with approx. 25mio documents > (local > > > > index 6GB), the other with 10mio documents (2.7GB size). > > > > I am trying to create some kind of 'word cloud' to see the frequency > of > > > > words for a *text_general *field. > > > > For this I am currently using a facet over this field and I am also > > > > restricting the documents by using some other filters in the query. > > > > > > > > The performance is really bad for the first call and then pretty fast > > for > > > > the following calls. > > > > > > > > The maximum Java heap size is 3G for each shard. Both shards are > > running > > > on > > > > the same physical server which has 12G RAM. > > > > > > > > Question: Should I reduce the documents in one shard, so that the > index > > > is > > > > equal or less the Java Heap size for this shard? Or is > > > > there another method to avoid this slow calls? > > > > > > > > Thank you > > > > > > > > Daniel > > > > > > > > > > > > > > > > -- > > > Regards, > > > > > > Dmitry Kan > > > > > > > > > -- > Regards, > > Dmitry Kan >
Function in facet.query like min,max
Hi Solr community, Is it possible to return the lowest, highest and average price of a search result using facets? I tried something like: facet.query={!max(price,0)} Is it possible and what is the correct syntax? q=htc android facet=true facet.query=price:[* TO 10] facet.query=price:[11 TO 100] facet.query=price:[101 TO *] ??? facet.query={!max(price,0)} Thanks & Regards Ericz
Re: Trying to understand SOLR memory requirements
Thank you Robert, I'd appreciate that. Any idea how long it will take to get a fix? Would I be better switching to trunk? Is trunk stable enough for someone who's very much a SOLR novice? Thanks, Dave On Mon, Jan 16, 2012 at 10:08 PM, Robert Muir wrote: > looks like https://issues.apache.org/jira/browse/SOLR-2888. > > Previously, FST would need to hold all the terms in RAM during > construction, but with the patch it uses offline sorts/temporary > files. > I'll reopen the issue to backport this to the 3.x branch. > > > On Mon, Jan 16, 2012 at 8:31 PM, Dave wrote: > > I'm trying to figure out what my memory needs are for a rather large > > dataset. I'm trying to build an auto-complete system for every > > city/state/country in the world. I've got a geographic database, and have > > setup the DIH to pull the proper data in. There are 2,784,937 documents > > which I've formatted into JSON-like output, so there's a bit of data > > associated with each one. Here is an example record: > > > > Brooklyn, New York, United States?{ |id|: |2620829|, > > |timezone|:|America/New_York|,|type|: |3|, |country|: { |id| : |229| }, > > |region|: { |id| : |3608| }, |city|: { |id|: |2616971|, |plainname|: > > |Brooklyn|, |name|: |Brooklyn, New York, United States| }, |hint|: > > |2300664|, |label|: |Brooklyn, New York, United States|, |value|: > > |Brooklyn, New York, United States|, |title|: |Brooklyn, New York, United > > States| } > > > > I've got the spellchecker / suggester module setup, and I can confirm > that > > everything works properly with a smaller dataset (i.e. just a couple of > > countries worth of cities/states). However I'm running into a big problem > > when I try to index the entire dataset. The > dataimport?command=full-import > > works and the system comes to an idle state. 
It generates the following > > data/index/ directory (I'm including it in case it gives any indication > on > > memory requirements): > > > > -rw-rw 1 root root 2.2G Jan 17 00:13 _2w.fdt > > -rw-rw 1 root root22M Jan 17 00:13 _2w.fdx > > -rw-rw 1 root root131 Jan 17 00:13 _2w.fnm > > -rw-rw 1 root root 134M Jan 17 00:13 _2w.frq > > -rw-rw 1 root root16M Jan 17 00:13 _2w.nrm > > -rw-rw 1 root root 130M Jan 17 00:13 _2w.prx > > -rw-rw 1 root root 9.2M Jan 17 00:13 _2w.tii > > -rw-rw 1 root root 1.1G Jan 17 00:13 _2w.tis > > -rw-rw 1 root root 20 Jan 17 00:13 segments.gen > > -rw-rw 1 root root291 Jan 17 00:13 segments_2 > > > > Next I try to run the suggest?spellcheck.build=true command, and I get > the > > following error: > > > > Jan 16, 2012 4:01:47 PM org.apache.solr.spelling.suggest.Suggester build > > INFO: build() > > Jan 16, 2012 4:03:27 PM org.apache.solr.common.SolrException log > > SEVERE: java.lang.OutOfMemoryError: GC overhead limit exceeded > > at java.util.Arrays.copyOfRange(Arrays.java:3209) > > at java.lang.String.(String.java:215) > > at org.apache.lucene.index.TermBuffer.toTerm(TermBuffer.java:122) > > at org.apache.lucene.index.SegmentTermEnum.term(SegmentTermEnum.java:184) > > at org.apache.lucene.index.TermInfosReader.get(TermInfosReader.java:203) > > at org.apache.lucene.index.TermInfosReader.get(TermInfosReader.java:172) > > at org.apache.lucene.index.SegmentReader.docFreq(SegmentReader.java:509) > > at > org.apache.lucene.index.DirectoryReader.docFreq(DirectoryReader.java:719) > > at > org.apache.solr.search.SolrIndexReader.docFreq(SolrIndexReader.java:309) > > at > > > org.apache.lucene.search.spell.HighFrequencyDictionary$HighFrequencyIterator.isFrequent(HighFrequencyDictionary.java:75) > > at > > > org.apache.lucene.search.spell.HighFrequencyDictionary$HighFrequencyIterator.hasNext(HighFrequencyDictionary.java:125) > > at > org.apache.lucene.search.suggest.fst.FSTLookup.build(FSTLookup.java:157) > > at org.apache.lucene.search.suggest.Lookup.build(Lookup.java:70) > > at org.apache.solr.spelling.suggest.Suggester.build(Suggester.java:133) > > at > > > org.apache.solr.handler.component.SpellCheckComponent.prepare(SpellCheckComponent.java:109) > > at > > > org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:173) > > at > > > org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129) > > at org.apache.solr.core.SolrCore.execute(SolrCore.java:1372) > > at > > > org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:356) > > at > > > org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:252) > > at > > > org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) > > at > org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399) > > at > > > org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216) > > at > org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182) > > at > org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.
How can I index this?
Hello, I am looking into indexing two data sources. One of those is a standard website and the other is a Sharepoint site. The problem is that I have no direct database access. Normally I would just use the DIH and get what I need from the DB. I do have a java DAO (data access object) class that I am using to directly to fetch information for a different purpose. In cases like this, what would be the best way to index the data? Should I somehow integrate Nutch as the crawler? Should I write a custom DIH? Can I use the DAO that I have in conjunction with the DIH? I am really looking for some recommendations here. I do have a few hacks that can be done (copy the data in a DB and index with DIH), but I am interested in the proper way. Any insight will be greatly appreciated. Cheers -- View this message in context: http://lucene.472066.n3.nabble.com/How-can-I-index-this-tp3666106p3666106.html Sent from the Solr - User mailing list archive at Nabble.com.
first time query is very slow
hi, I have a Solr 3.3 index of 200,000 documents; all text is stored and the total index size is 27gb. I use a dismax query with over 10 qf and pf boosting fields each, plus sorting on score and 2 other fields. It takes quite a few seconds (5-8) for a first-time query to return any result (no highlighting is involved), and even longer for a phrase query. My question is, what is the bottleneck of the query speed? The Lucene query part? Scoring? Or filling the document cache with document content? Can anyone answer? Is there any way of improving the first-time query speed? thanks in advance, shen
Re: Trying to understand SOLR memory requirements
I committed it already: so you can try out branch_3x if you want. you can either wait for a nightly build or compile from svn (http://svn.apache.org/repos/asf/lucene/dev/branches/branch_3x/). On Tue, Jan 17, 2012 at 8:35 AM, Dave wrote: > Thank you Robert, I'd appreciate that. Any idea how long it will take to > get a fix? Would I be better switching to trunk? Is trunk stable enough for > someone who's very much a SOLR novice? > > Thanks, > Dave > > On Mon, Jan 16, 2012 at 10:08 PM, Robert Muir wrote: > >> looks like https://issues.apache.org/jira/browse/SOLR-2888. >> >> Previously, FST would need to hold all the terms in RAM during >> construction, but with the patch it uses offline sorts/temporary >> files. >> I'll reopen the issue to backport this to the 3.x branch. >> >> >> On Mon, Jan 16, 2012 at 8:31 PM, Dave wrote: >> > I'm trying to figure out what my memory needs are for a rather large >> > dataset. I'm trying to build an auto-complete system for every >> > city/state/country in the world. I've got a geographic database, and have >> > setup the DIH to pull the proper data in. There are 2,784,937 documents >> > which I've formatted into JSON-like output, so there's a bit of data >> > associated with each one. Here is an example record: >> > >> > Brooklyn, New York, United States?{ |id|: |2620829|, >> > |timezone|:|America/New_York|,|type|: |3|, |country|: { |id| : |229| }, >> > |region|: { |id| : |3608| }, |city|: { |id|: |2616971|, |plainname|: >> > |Brooklyn|, |name|: |Brooklyn, New York, United States| }, |hint|: >> > |2300664|, |label|: |Brooklyn, New York, United States|, |value|: >> > |Brooklyn, New York, United States|, |title|: |Brooklyn, New York, United >> > States| } >> > >> > I've got the spellchecker / suggester module setup, and I can confirm >> that >> > everything works properly with a smaller dataset (i.e. just a couple of >> > countries worth of cities/states). However I'm running into a big problem >> > when I try to index the entire dataset. The >> dataimport?command=full-import >> > works and the system comes to an idle state. 
It generates the following >> > data/index/ directory (I'm including it in case it gives any indication >> on >> > memory requirements): >> > >> > -rw-rw 1 root root 2.2G Jan 17 00:13 _2w.fdt >> > -rw-rw 1 root root 22M Jan 17 00:13 _2w.fdx >> > -rw-rw 1 root root 131 Jan 17 00:13 _2w.fnm >> > -rw-rw 1 root root 134M Jan 17 00:13 _2w.frq >> > -rw-rw 1 root root 16M Jan 17 00:13 _2w.nrm >> > -rw-rw 1 root root 130M Jan 17 00:13 _2w.prx >> > -rw-rw 1 root root 9.2M Jan 17 00:13 _2w.tii >> > -rw-rw 1 root root 1.1G Jan 17 00:13 _2w.tis >> > -rw-rw 1 root root 20 Jan 17 00:13 segments.gen >> > -rw-rw 1 root root 291 Jan 17 00:13 segments_2 >> > >> > Next I try to run the suggest?spellcheck.build=true command, and I get >> the >> > following error: >> > >> > Jan 16, 2012 4:01:47 PM org.apache.solr.spelling.suggest.Suggester build >> > INFO: build() >> > Jan 16, 2012 4:03:27 PM org.apache.solr.common.SolrException log >> > SEVERE: java.lang.OutOfMemoryError: GC overhead limit exceeded >> > at java.util.Arrays.copyOfRange(Arrays.java:3209) >> > at java.lang.String.(String.java:215) >> > at org.apache.lucene.index.TermBuffer.toTerm(TermBuffer.java:122) >> > at org.apache.lucene.index.SegmentTermEnum.term(SegmentTermEnum.java:184) >> > at org.apache.lucene.index.TermInfosReader.get(TermInfosReader.java:203) >> > at org.apache.lucene.index.TermInfosReader.get(TermInfosReader.java:172) >> > at org.apache.lucene.index.SegmentReader.docFreq(SegmentReader.java:509) >> > at >> org.apache.lucene.index.DirectoryReader.docFreq(DirectoryReader.java:719) >> > at >> org.apache.solr.search.SolrIndexReader.docFreq(SolrIndexReader.java:309) >> > at >> > >> org.apache.lucene.search.spell.HighFrequencyDictionary$HighFrequencyIterator.isFrequent(HighFrequencyDictionary.java:75) >> > at >> > >> org.apache.lucene.search.spell.HighFrequencyDictionary$HighFrequencyIterator.hasNext(HighFrequencyDictionary.java:125) >> > at >> org.apache.lucene.search.suggest.fst.FSTLookup.build(FSTLookup.java:157) >> > at org.apache.lucene.search.suggest.Lookup.build(Lookup.java:70) >> > at org.apache.solr.spelling.suggest.Suggester.build(Suggester.java:133) >> > at >> > >> org.apache.solr.handler.component.SpellCheckComponent.prepare(SpellCheckComponent.java:109) >> > at >> > >> org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:173) >> > at >> > >> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129) >> > at org.apache.solr.core.SolrCore.execute(SolrCore.java:1372) >> > at >> > >> org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:356) >> > at >> > >> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:252) >> > at >> > >> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(Serv
[Job] Sales Engineer at Lucid Imagination
Hi Solr Users, Lucid Imagination is looking for a sales engineer. If you know search, Solr and like working with customers, the sales engineer job may be of interest to you. I've included the job description below. If you are interested, please send your resume (off-list) to melissa.qu...@lucidimagination.com. The position is based out of our Redwood City, CA office. Cheers, Grant Technical Sales Professional/Sales Engineer Responsibilities include: support business development team and help conduct product demonstrations and assist prospective customers to understand the value of our product; help close sales; craft responses to RFP's and RFI's; build proofs of concepts; develop and conduct training on occasion. Qualifications: BS or higher in Engineering or Computer Science preferred. 3+ years of IT Consulting and/or Professional Services experience required. Experience working with Lucene and/or Solr required. Experience with enterprise search applications are a big plus; some Java development experience; some experience with common scripting languages; enterprise search, eCommerce, and/or Business Intelligence experience a plus.
Re: first time query is very slow
First query will cause the index caches to be warmed up and this is why the first query takes some time. You can prewarm the caches with a query (when solr starts up) of your choosing in the config file. Google around the SolrWiki on cache/index warming. hth > hi, > > I had an solr3.3 index of 200,000 documents, all text are stored and the > total index size is 27gb. > I used dismax query with over 10 qf and pf boosting field each, plus > sorting on score and other 2 fields. It took quite a few seconds(5-8) for > the first time query to return any result(no highlighting is invloved). > (even slower for phrase query) > > My question is, what is the bottle neck of the query speed? lucene query > part? Scoring? or fill document cache with document content? Can anyone > answer? > > Is there anyway of improving the first time query speed? > > thanks in advance, > shen >
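[A minimal sketch of the static warming mentioned above, as it would appear in the <query> section of solrconfig.xml; the query and facet field are placeholders and should be replaced by whatever your expensive first request actually looks like:

<listener event="firstSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <lst>
      <str name="q">*:*</str>
      <str name="facet">true</str>
      <str name="facet.field">content</str>
      <str name="facet.limit">100</str>
    </lst>
  </arr>
</listener>

A matching newSearcher listener keeps caches warm after each commit, at the cost of longer commit times.]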
PositionIncrementGap inside a field
Hi. At the moment I have a multivalued field where I would like to add information with gaps at the end of every line (value) of the multivalued field, and I would also like to add gaps in the middle of the lines. For instance: IBM Corporation some information *"here a gap"* more information IBM Limited more info *"here a gap"* and some more data Do you know how to add a *positionincrementgap* at the points marked *"here a gap"*? Thanks -- View this message in context: http://lucene.472066.n3.nabble.com/PositionIncrementGap-inside-a-field-tp3666243p3666243.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: first time query is very slow
Thanks darren, I understand it will take longer before warming up. What I am trying to find out is, in the situation where we have no cache, why it takes so long to complete the query, and what the bottleneck is. For example, if I remove all qf and pf fields, the query speed improves dramatically. Does that indicate a performance hole in the boosting part of the code? A predefined warming query improves the speed a lot if the query string or documents are cached, but it does not improve the speed of a new query; that will still spend the same time doing the scoring, boosting, filtering etc. And memory is also a problem for big indexes; in my case, I have in total over 100gb of indexes in one Solr installation. I can't even imagine how Solr could handle complex queries for 1tb of index data in my case. Customers who unluckily send an un-prewarmed query will suffer from bad response times, which is not too pleasant anyway. best regards, shen On Tue, Jan 17, 2012 at 3:18 PM, wrote: > First query will cause the index caches to be warmed up and this is why > the first query takes some time. > > You can prewarm the caches with a query (when solr starts up) of your > choosing in the config file. Google around the SolrWiki on cache/index > warming. > > hth > > > hi, > > > > I had an solr3.3 index of 200,000 documents, all text are stored and the > > total index size is 27gb. > > I used dismax query with over 10 qf and pf boosting field each, plus > > sorting on score and other 2 fields. It took quite a few seconds(5-8) for > > the first time query to return any result(no highlighting is invloved). > > (even slower for phrase query) > > > > My question is, what is the bottle neck of the query speed? lucene query > > part? Scoring? or fill document cache with document content? Can anyone > > answer? > > > > Is there anyway of improving the first time query speed? > > > > thanks in advance, > > shen > >
Re: SolrJ Embedded
Quantify slower, does it matter? At issue is that usually Solr spends far more time doing the search than transmitting the query and response over HTTP. Http is not really slow *as a protocol* in the first place. The usual place people have problems here is when there are a bunch of requests made over a network, a "chatty" connection. Especially if the other end of the connection is far away. But in Solr's case, there's one request and one response per search, so there's not much chat to worry about. But regardless of all that, never, never, never make your environment more complex than it needs to be before you *demonstrate* that you need to. The efficiency savings are often negligible and the cost of maintaining the complexity are often far more than estimated. Premature optimization and all that. I will allow that on some rare occasions you *can* know that you have to get complex from the start, but I can't tell you how many times I've been *sure* I knew where the bottleneck would beand been wrong. Measure first, fix second has become my mantra. Best Erick On Tue, Jan 17, 2012 at 3:49 AM, Maxim Veksler wrote: > On Tue, Jan 17, 2012 at 3:13 AM, Erick Erickson > wrote: > >> I don't see why not. I'm assuming a *nix system here so when Solr >> updated an index, any deleted files would hang around. >> >> But I have to ask why bother with the Embedded server in the >> first place? You already have a Solr instance up and running, >> why not just query that instead, perhaps using SolrJ? >> >> > Wouldn't querying the Solr server using the HTTP interface be slower? > > >> Best >> Erick >> >> On Mon, Jan 16, 2012 at 3:00 PM, wrote: >> > Hi, >> > >> > is it possible to use the same index in a solr webapp and additionally >> in a >> > EmbeddedSolrServer? The embbedded one would be read only. >> > >> > Thank you. >> > >>
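[To make the "just use SolrJ over HTTP" suggestion concrete, here is a minimal sketch against the SolrJ 3.x API; the URL, query string and field names are placeholders:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class SimpleQuery {
  public static void main(String[] args) throws Exception {
    // talk to the already-running Solr webapp instead of embedding a second reader
    SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
    SolrQuery query = new SolrQuery("title:lucene");  // placeholder query
    query.setRows(10);
    QueryResponse rsp = server.query(query);
    for (SolrDocument doc : rsp.getResults()) {
      System.out.println(doc.getFieldValue("id"));
    }
  }
}

The HTTP round trip here is a single request/response per search, which is usually negligible next to the search itself.]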
Re: PositionIncrementGap inside a field
This is just adding the field repeatedly, something like <field>IBM Corporation some information</field> <field>IBM limited more info</field> for a field whose schema declaration ends with multiValued="true"/> > > > > IBM Corporation some information *"here a gap"* more information > > > IBM Limited more info "here a gap" and some more data > > > > Do you know how to add a *positionincrementgap* here *"here a gap"* > Thanks > > -- > View this message in context: > http://lucene.472066.n3.nabble.com/PositionIncrementGap-inside-a-field-tp3666243p3666243.html > Sent from the Solr - User mailing list archive at Nabble.com.
Re: Solr Cloud Indexing
This only really makes sense if you don't have enough in-house resources to do your indexing locally, but it certainly is possible. Amazon's EC2 has been used, but really any hosting service should do. Best Erick On Tue, Jan 17, 2012 at 12:09 AM, Sujatha Arun wrote: > Would it make sense to Index on the cloud and periodically [2-4 times > /day] replicate the index at our server for searching .Which service to go > with for solr Cloud Indexing ? > > Any good and tried services? > > Regards > Sujatha
Re: Function in facet.query like min,max
have you seen the Stats component? See: http://wiki.apache.org/solr/StatsComponent Best Erick On Tue, Jan 17, 2012 at 8:34 AM, Eric Grobler wrote: > Hi Solr community, > > Is it possible to return the lowest, highest and average price of a search > result using facets? > I tried something like: facet.query={!max(price,0)} > Is it possible and what is the correct syntax? > > q=htc android > facet=true > facet.query=price:[* TO 10] > facet.query=price:[11 TO 100] > facet.query=price:[101 TO *] > ??? facet.query={!max(price,0)} > > > Thanks & Regards > Ericz
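[An example of what that looks like as a request, using the parameters documented on the StatsComponent wiki page and the price field from the original question:

q=htc android&stats=true&stats.field=price

The response then includes min, max, sum, count, mean and other statistics for price, computed over the documents that match the query and any filters, not over the whole index.]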
Re: How can I index this?
This sounds like, for the database source, that using SolrJ would be the way to go. Assuming you can access the database from Java this is pretty easy. As for the website, Nutch is certainly an option... But I'm a little puzzled. You mention a website, and sharepoint as your sources, then ask about accessing the DB. How are all these related? Best Erick On Tue, Jan 17, 2012 at 8:38 AM, ahammad wrote: > Hello, > > I am looking into indexing two data sources. One of those is a standard > website and the other is a Sharepoint site. The problem is that I have no > direct database access. Normally I would just use the DIH and get what I > need from the DB. I do have a java DAO (data access object) class that I am > using to directly to fetch information for a different purpose. > > In cases like this, what would be the best way to index the data? Should I > somehow integrate Nutch as the crawler? Should I write a custom DIH? Can I > use the DAO that I have in conjunction with the DIH? > > I am really looking for some recommendations here. I do have a few hacks > that can be done (copy the data in a DB and index with DIH), but I am > interested in the proper way. Any insight will be greatly appreciated. > > Cheers > > -- > View this message in context: > http://lucene.472066.n3.nabble.com/How-can-I-index-this-tp3666106p3666106.html > Sent from the Solr - User mailing list archive at Nabble.com.
Re: How can I index this?
Perhaps I was a little confusing... Normally when I have DB access, I do a regular indexing process using DIH. For these two sources, I do not have direct DB access. I can only view the two sources like any end-user would. I do have a java class that can get the information that I need. That class gets that information (through HTTP requests) and does not have DB access. That class is currently being used for other purposes but I can take it and use it for Solr as well. Does that make sense? Knowing all that, namely the fact that I cannot directly access the DB, and I can make HTTP requests to get the info, how can I index that info? Please let me know if this clarifies what I am trying to do. Regards -- View this message in context: http://lucene.472066.n3.nabble.com/How-can-I-index-this-tp3666106p3666590.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Function in facet.query like min,max
Yes, I have, but unfortunately it works on the whole index and not for a particular query. On Tue, Jan 17, 2012 at 3:37 PM, Erick Erickson wrote: > have you seen the Stats component? See: > http://wiki.apache.org/solr/StatsComponent > > Best > Erick > > On Tue, Jan 17, 2012 at 8:34 AM, Eric Grobler > wrote: > > Hi Solr community, > > > > Is it possible to return the lowest, highest and average price of a > search > > result using facets? > > I tried something like: facet.query={!max(price,0)} > > Is it possible and what is the correct syntax? > > > > q=htc android > > facet=true > > facet.query=price:[* TO 10] > > facet.query=price:[11 TO 100] > > facet.query=price:[101 TO *] > > ??? facet.query={!max(price,0)} > > > > > > Thanks & Regards > > Ericz >
Re: PositionIncrementGap inside a field
Hi Erick. Thanks for your answer. This is almost what I want to do, but my problem is that I want to be able to introduce two different sizes of gaps. Something like IBM Corporation some information *gap of 30* more information *gap of 100* IBM Limited more info *gap of 30* and some more data *gap of 100* Do you know how I can achieve that? -- View this message in context: http://lucene.472066.n3.nabble.com/PositionIncrementGap-inside-a-field-tp3666243p322.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: How can I index this?
Well, if you can make an HTTP request, you can parse the return and stuff it into a SolrInputDocument in SolrJ and then send it to Solr. At least that seems possible if I'm understanding your setup. There are other Solr clients that allow similar processes, but the Java version is the one I know best. Best Erick On Tue, Jan 17, 2012 at 11:10 AM, ahammad wrote: > Perhaps I was a little confusing... > > Normally when I have DB access, I do a regular indexing process using DIH. > For these two sources, I do not have direct DB access. I can only view the > two sources like any end-user would. > > I do have a java class that can get the information that I need. That class > gets that information (through HTTP requests) and does not have DB access. > That class is currently being used for other purposes but I can take it and > use it for Solr as well. Does that make sense? > > Knowing all that, namely the fact that I cannot directly access the DB, and > I can make HTTP requests to get the info, how can I index that info? > > Please let me know if this clarifies what I am trying to do. > > Regards > > -- > View this message in context: > http://lucene.472066.n3.nabble.com/How-can-I-index-this-tp3666106p3666590.html > Sent from the Solr - User mailing list archive at Nabble.com.
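[A rough sketch of that approach with SolrJ 3.x; the class name, URL and field names are placeholders, and the literal field values stand in for whatever your existing HTTP/DAO class returns:

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class Indexer {
  public static void main(String[] args) throws Exception {
    SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
    // build one SolrInputDocument per record fetched over HTTP
    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", "page-123");                          // value would come from your DAO
    doc.addField("title", "Example page title");             // value would come from your DAO
    doc.addField("body", "Body text extracted from the page");
    server.add(doc);
    server.commit();  // in a real loop, commit once after the batch rather than per document
  }
}]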
Re: Function in facet.query like min,max
I don't believe that's the case, have you tried it? From the page I referenced: "The stats component returns simple statistics for indexed numeric fields within the DocSet." And running a very quick test on the example data, I get different results when I used *:* and name:maxtor. That said, I'm not all that familiar with the stats component so I could well be wrong. Best Erick On Tue, Jan 17, 2012 at 11:16 AM, Eric Grobler wrote: > Yes, I have, but unfortunately it works on the whole index and not for a > particular query. > > > On Tue, Jan 17, 2012 at 3:37 PM, Erick Erickson > wrote: > >> have you seen the Stats component? See: >> http://wiki.apache.org/solr/StatsComponent >> >> Best >> Erick >> >> On Tue, Jan 17, 2012 at 8:34 AM, Eric Grobler >> wrote: >> > Hi Solr community, >> > >> > Is it possible to return the lowest, highest and average price of a >> search >> > result using facets? >> > I tried something like: facet.query={!max(price,0)} >> > Is it possible and what is the correct syntax? >> > >> > q=htc android >> > facet=true >> > facet.query=price:[* TO 10] >> > facet.query=price:[11 TO 100] >> > facet.query=price:[101 TO *] >> > ??? facet.query={!max(price,0)} >> > >> > >> > Thanks & Regards >> > Ericz >>
Re: PositionIncrementGap inside a field
Hmmm, no I don't know how to do that out of the box. Two things: 1> why do you want to do this? Perhaps if you describe the high-level problem you're trying to solve there might be other ways to approach it. 2> I *think* you could write your own Tokenizer that recognized the special tokens you'd have to put into your input stream and adjusted the token offsets accordingly, but I confess I haven't tried it myself... Best Erick On Tue, Jan 17, 2012 at 11:23 AM, marotosg wrote: > Hi Erick. Thanks for your asnwer. > > This is almost what i want to do but my problem is that i want to be able to > introduce two different sizes of gaps. > > Something like > > > IBM Corporation some information *gap of 30* more information *gap > of 100* > > > IBM Limited more info *gap of 30* and some more data *gap of 100* > > > > Do you know how can i achieve that? > > > > > > -- > View this message in context: > http://lucene.472066.n3.nabble.com/PositionIncrementGap-inside-a-field-tp3666243p322.html > Sent from the Solr - User mailing list archive at Nabble.com.
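[To make the second suggestion a bit more concrete, here is a rough, untested sketch of a TokenFilter (rather than a full Tokenizer) against the Lucene 3.x analysis API. MarkerGapFilter and the _gap30_/_gap100_ marker tokens are invented for illustration, and the markers would have to survive the tokenizer in front of this filter (e.g. WhitespaceTokenizer):

import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

public final class MarkerGapFilter extends TokenFilter {
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final PositionIncrementAttribute posIncAtt = addAttribute(PositionIncrementAttribute.class);
  private int pendingGap = 0;

  public MarkerGapFilter(TokenStream input) {
    super(input);
  }

  @Override
  public boolean incrementToken() throws IOException {
    while (input.incrementToken()) {
      String term = termAtt.toString();
      if ("_gap30_".equals(term)) {
        pendingGap += 30;    // swallow the marker, remember the gap
      } else if ("_gap100_".equals(term)) {
        pendingGap += 100;
      } else {
        if (pendingGap > 0) {
          // push the next real token further away by the accumulated gap
          posIncAtt.setPositionIncrement(posIncAtt.getPositionIncrement() + pendingGap);
          pendingGap = 0;
        }
        return true;
      }
    }
    return false;
  }

  @Override
  public void reset() throws IOException {
    super.reset();
    pendingGap = 0;
  }
}]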
Re: PositionIncrementGap inside a field
Hi Erik, what I'm trying to achieve here is to be able to run a query like this: "\""IBM Ltd"~15\" \""Dublin Ireland"~15\""~100 on a field where the gaps are like this: IBM Ireland Ltd *gap of 30* Dublin USA *gap of 300* IBM Ltd *gap of 30* Dublin Ireland *gap of 300* The first line contains all the words I'm looking for, but the word "Ireland" is not within 15 tokens of the word "Dublin", so the query will not match it; it will match the second line. The two lines stay separated by a 300-token gap, so that there is no risk of false positives from words coming from the second line. I don't know if I was clear enough. -- View this message in context: http://lucene.472066.n3.nabble.com/PositionIncrementGap-inside-a-field-tp3666243p3666765.html Sent from the Solr - User mailing list archive at Nabble.com.
How to return the distance geo distance on solr 3.5 with bbox filtering
Hello, I'm querying with bbox, which should be faster than geodist. My queries look like this: http://localhost:8983/solr/select?indent=true&fq={!bbox}&sfield=loc&pt=39.738548,-73.130322&d=100&sort=geodist()%20asc&q=trafficRouteId:235 The trouble is that with bbox Solr does not return the distance of each document. I couldn't get it to work even with the tips from http://wiki.apache.org/solr/SpatialSearch#Returning_the_distance Is there something I'm missing?
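[One workaround often suggested for 3.x, sketched here but not verified against 3.5: move the keyword part into an fq and make geodist() the main function query, so the distance comes back as the score:

http://localhost:8983/solr/select?q={!func}geodist()&sfield=loc&pt=39.738548,-73.130322&fq={!bbox}&d=100&fq=trafficRouteId:235&fl=*,score&sort=score%20asc

True pseudo-fields such as fl=_dist_:geodist() only arrive in later Solr versions.]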
Re: Sorting results within the fields
It's been almost a week and there has been no response to the question I asked. Does the question lack details, or is there no way to achieve this in Lucene? -- View this message in context: http://lucene.472066.n3.nabble.com/Sorting-results-within-the-fields-tp3656049p3666983.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Function in facet.query like min,max
Hi Erick Thanks for your feedback. I will try it tomorrow - if it works it will be perfect for my needs. Have a nice day Ericz On Tue, Jan 17, 2012 at 4:28 PM, Erick Erickson wrote: > I don't believe that's the case, have you tried it? From the page > I referenced: > > "The stats component returns simple statistics for indexed > numeric fields within the DocSet." > > And running a very quick test on the example data, I get different > results when I used *:* and name:maxtor. > > That said, I'm not all that familiar with the stats component so I > could well be wrong. > > Best > Erick > > On Tue, Jan 17, 2012 at 11:16 AM, Eric Grobler > wrote: > > Yes, I have, but unfortunately it works on the whole index and not for a > > particular query. > > > > > > On Tue, Jan 17, 2012 at 3:37 PM, Erick Erickson >wrote: > > > >> have you seen the Stats component? See: > >> http://wiki.apache.org/solr/StatsComponent > >> > >> Best > >> Erick > >> > >> On Tue, Jan 17, 2012 at 8:34 AM, Eric Grobler < > impalah...@googlemail.com> > >> wrote: > >> > Hi Solr community, > >> > > >> > Is it possible to return the lowest, highest and average price of a > >> search > >> > result using facets? > >> > I tried something like: facet.query={!max(price,0)} > >> > Is it possible and what is the correct syntax? > >> > > >> > q=htc android > >> > facet=true > >> > facet.query=price:[* TO 10] > >> > facet.query=price:[11 TO 100] > >> > facet.query=price:[101 TO *] > >> > ??? facet.query={!max(price,0)} > >> > > >> > > >> > Thanks & Regards > >> > Ericz > >> >
Re: really slow performance when trying to get facet.field
Ok, I have now changed the static warming in solrconfig.xml using first- and newSearcher listeners. "content" is the field I facet on. Commits now take longer, which is OK for me, but searches are much faster. I also reduced the number of documents to 15mio per shard, so the index is about 3.5G, which I hope also fits in memory. Both listeners run the same warming query: q=*:* with facet=true and facet.field=content (plus a couple of facet parameters set to 1).

On Tue, Jan 17, 2012 at 2:36 PM, Daniel Bruegge <daniel.brue...@googlemail.com> wrote:

> Evictions are 0 for all cache types.
>
> Your server max heap space with 12G is pretty huge. Which is good I think. The CPU on my server is an 8-core Intel i7 965.
>
> Commit frequency is low, because shards are added and old shards exist for historical reasons. Old shards will then be cleaned up after a couple of months.
>
> I will try to add a maximum of 15mio per shard and see what happens.
>
> The thing is that I will add more shards over time, so that I can handle maybe 500-800mio documents. Maybe more. It depends.
>
> On Tue, Jan 17, 2012 at 2:14 PM, Dmitry Kan wrote:
>
>> Hi Daniel,
>>
>> My index is 6,5G. I'm sure it can be bigger. The facet.limit we ask for is beyond 100 thousand. It is sub-second speed. I run it with -Xms1024m -Xmx12000m under tomcat, it currently takes 5,4G of RAM. The number of docs is over 6,5 million.
>>
>> Do you see any evictions in your caches? What kind of server is it, in terms of CPU and OS? How often do you commit to the index?
>>
>> Dmitry
>>
>> On Tue, Jan 17, 2012 at 3:01 PM, Daniel Bruegge <daniel.brue...@googlemail.com> wrote:
>>
>> > Hi Dmitry,
>> >
>> > I had everything on one Solr instance before, but this got too heavy and I had the same issue there: the 1st facet.query was really slow.
>> >
>> > When querying the facet:
>> > - facet.limit = 100
>> >
>> > Cache settings are like this:
>> >   size="16384" initialSize="4096" autowarmCount="4096"
>> >   size="512" initialSize="512" autowarmCount="0"
>> >   size="512" initialSize="512" autowarmCount="0"
>> >
>> > How big was your index? Did it fit into the RAM which you gave the Solr instance?
>> >
>> > Thanks
>> >
>> > On Tue, Jan 17, 2012 at 1:56 PM, Dmitry Kan wrote:
>> >
>> > > I had a similar problem for a similar task. And in my case merging the results from two shards turned out to be the culprit. If you can logically store your data in just one shard, your faceting should become faster. Size-wise it should not be a problem for SOLR.
>> > >
>> > > Also, you didn't say anything about the facet.limit value, cache parameters, or usage of filter queries. Some of these can be interconnected.
>> > >
>> > > Dmitry
>> > >
>> > > On Tue, Jan 17, 2012 at 2:49 PM, Daniel Bruegge <daniel.brue...@googlemail.com> wrote:
>> > >
>> > > > Hi,
>> > > >
>> > > > I have 2 Solr shards. One is filled with approx. 25mio documents (local index 6GB), the other with 10mio documents (2.7GB size). I am trying to create some kind of 'word cloud' to see the frequency of words in a *text_general* field. For this I am currently using a facet over this field, and I am also restricting the documents by using some other filters in the query.
>> > > >
>> > > > The performance is really bad for the first call and then pretty fast for the following calls.
>> > > >
>> > > > The maximum Java heap size is 3G for each shard. Both shards are running on the same physical server, which has 12G RAM.
>> > > >
>> > > > Question: Should I reduce the documents in one shard, so that the index is equal to or less than the Java heap size for this shard? Or is there another method to avoid these slow calls?
>> > > >
>> > > > Thank you
>> > > >
>> > > > Daniel
>> > >
>> > > --
>> > > Regards,
>> > >
>> > > Dmitry Kan
>>
>> --
>> Regards,
>>
>> Dmitry Kan
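In case it helps anyone else, this is a minimal sketch of the kind of firstSearcher/newSearcher listeners I mean in solrconfig.xml. The facet parameter values below are illustrative placeholders, not necessarily the exact ones from my config:

  <!-- Run a facet query on "content" whenever a new searcher is opened,
       so the facet structures are built before real traffic hits it. -->
  <listener event="firstSearcher" class="solr.QuerySenderListener">
    <arr name="queries">
      <lst>
        <str name="q">*:*</str>
        <str name="facet">true</str>
        <str name="facet.field">content</str>
        <str name="facet.mincount">1</str>
        <str name="facet.limit">1</str>
      </lst>
    </arr>
  </listener>
  <listener event="newSearcher" class="solr.QuerySenderListener">
    <arr name="queries">
      <lst>
        <str name="q">*:*</str>
        <str name="facet">true</str>
        <str name="facet.field">content</str>
        <str name="facet.mincount">1</str>
        <str name="facet.limit">1</str>
      </lst>
    </arr>
  </listener>

This is also why the commits take longer now: every commit triggers the newSearcher warming query before the new searcher is registered.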
Re: Sorting results within the fields
Hi, Complex problems like this are much better explained with concrete examples than with generalized text. Please create a real example with real documents and their content, along with real queries.

You don't explain what "the score value which is generate by my application" is - which application is that, is the score generated statically before indexing, or should the scoring call out to some external application to ask for the score, and what are the input and output criteria for such custom scoring?

> So, that the final results of the query
> will look like
>
> (D1, D2) (D3,D4) (D5,D6,D7).

Meaning that D1,D2 come first because they match field1, which you gave the highest boost? field1^8? Should D1,D2 always come before the others regardless of the "custom scoring from your application"? Will the order between D1,D2 be influenced by Solr scoring at all, or only by the "external application"?

Hope you see that being concrete is necessary for such questions.

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Solr Training - www.solrtraining.com

On 17. jan. 2012, at 19:38, aronitin wrote:

> It's been almost a week and there is no response to the question that I asked.
>
> Does the question lack details, or is there no way to achieve the same in Lucene?
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Sorting-results-within-the-fields-tp3656049p3666983.html
> Sent from the Solr - User mailing list archive at Nabble.com.
Re: first time query is very slow
On Tue, Jan 17, 2012 at 9:39 AM, gabriel shen wrote:
> For those customers who unluckily send an un-prewarmed query, they will suffer
> from a bad response time, which is not too pleasant anyway.

The "warming caches" part isn't about unique queries, but more about caches used for sorting and faceting (and those are reused across many different queries).

Can you give an example of the complete request you were sending that takes a long time?

-Yonik
http://www.lucidimagination.com
Facet auto-suggest
I don't even know what to call this feature. Here's a website that shows the problem: http://pulse.audiusanews.com/pulse/index.php Notice that you can end up in a situation where there are no results. For example, in order, press: People, Performance, Technology, Photos. The client wants it so that when you click a button, it disables buttons that would lead to a dead end. In other words, after clicking Technology, the Photos button would be disabled. Can Solr help with this? -jsd-
Re: Facet auto-suggest
Hi, Sure, you can use filters and facets for this. Start a query with ...&facet.field=source&facet.field=topics&facet.field=type

When you click a "button", you set the corresponding filter (fq=source:people), and the new query will return the same facets with new counts. In the Audi example, you would disable any button whose facet value comes back with 0 hits.

For a more in-depth treatment, see http://java.dzone.com/news/complex-solr-faceting

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Solr Training - www.solrtraining.com

On 17. jan. 2012, at 23:38, Jon Drukman wrote:

> I don't even know what to call this feature. Here's a website that shows
> the problem:
>
> http://pulse.audiusanews.com/pulse/index.php
>
> Notice that you can end up in a situation where there are no results.
> For example, in order, press: People, Performance, Technology, Photos. The client
> wants it so that when you click a button, it disables buttons that would
> lead to a dead end. In other words, after clicking Technology, the Photos
> button would be disabled.
>
> Can Solr help with this?
>
> -jsd-
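To make this concrete, here is a sketch of the two requests. The field and value names (source, topics, type, people) are only assumptions taken from the example above, not from any real schema:

  Initial request, no filters yet (rows=0 because only the counts are needed for the buttons):
  /select?q=*:*&rows=0&facet=true&facet.mincount=0&facet.field=source&facet.field=topics&facet.field=type

  After the user clicks "People", repeat it with the corresponding filter:
  /select?q=*:*&rows=0&facet=true&facet.mincount=0&facet.field=source&facet.field=topics&facet.field=type&fq=source:people

Every facet value that now comes back with a count of 0 corresponds to a button you can disable, because adding that filter would lead to an empty result set.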
Re: Solr Cloud Indexing
Cloud upload bandwidth is free, but download bandwidth costs money. If you upload a lot of data but do not query it often, Amazon can make sense. You can also rent much cheaper hardware from other hosting services where you pay by the month or even by the year. If you know you have a cap on how much resource you will need at once, the cheaper sites make more sense.

On Tue, Jan 17, 2012 at 7:36 AM, Erick Erickson wrote:
> This only really makes sense if you don't have enough in-house resources
> to do your indexing locally, but it certainly is possible.
>
> Amazon's EC2 has been used, but really any hosting service should do.
>
> Best
> Erick
>
> On Tue, Jan 17, 2012 at 12:09 AM, Sujatha Arun wrote:
>> Would it make sense to index on the cloud and periodically [2-4 times/day]
>> replicate the index to our server for searching? Which service should we go
>> with for Solr cloud indexing?
>>
>> Any good and tried services?
>>
>> Regards
>> Sujatha

--
Lance Norskog
goks...@gmail.com
Re: Trying to understand SOLR memory requirements
Which version of Solr do you use? 3.1 and 3.2 had a memory leak bug in spellchecking. This was fixed in 3.3. On Tue, Jan 17, 2012 at 5:59 AM, Robert Muir wrote: > I committed it already: so you can try out branch_3x if you want. > > you can either wait for a nightly build or compile from svn > (http://svn.apache.org/repos/asf/lucene/dev/branches/branch_3x/). > > On Tue, Jan 17, 2012 at 8:35 AM, Dave wrote: >> Thank you Robert, I'd appreciate that. Any idea how long it will take to >> get a fix? Would I be better switching to trunk? Is trunk stable enough for >> someone who's very much a SOLR novice? >> >> Thanks, >> Dave >> >> On Mon, Jan 16, 2012 at 10:08 PM, Robert Muir wrote: >> >>> looks like https://issues.apache.org/jira/browse/SOLR-2888. >>> >>> Previously, FST would need to hold all the terms in RAM during >>> construction, but with the patch it uses offline sorts/temporary >>> files. >>> I'll reopen the issue to backport this to the 3.x branch. >>> >>> >>> On Mon, Jan 16, 2012 at 8:31 PM, Dave wrote: >>> > I'm trying to figure out what my memory needs are for a rather large >>> > dataset. I'm trying to build an auto-complete system for every >>> > city/state/country in the world. I've got a geographic database, and have >>> > setup the DIH to pull the proper data in. There are 2,784,937 documents >>> > which I've formatted into JSON-like output, so there's a bit of data >>> > associated with each one. Here is an example record: >>> > >>> > Brooklyn, New York, United States?{ |id|: |2620829|, >>> > |timezone|:|America/New_York|,|type|: |3|, |country|: { |id| : |229| }, >>> > |region|: { |id| : |3608| }, |city|: { |id|: |2616971|, |plainname|: >>> > |Brooklyn|, |name|: |Brooklyn, New York, United States| }, |hint|: >>> > |2300664|, |label|: |Brooklyn, New York, United States|, |value|: >>> > |Brooklyn, New York, United States|, |title|: |Brooklyn, New York, United >>> > States| } >>> > >>> > I've got the spellchecker / suggester module setup, and I can confirm >>> that >>> > everything works properly with a smaller dataset (i.e. just a couple of >>> > countries worth of cities/states). However I'm running into a big problem >>> > when I try to index the entire dataset. The >>> dataimport?command=full-import >>> > works and the system comes to an idle state. 
It generates the following >>> > data/index/ directory (I'm including it in case it gives any indication >>> on >>> > memory requirements): >>> > >>> > -rw-rw 1 root root 2.2G Jan 17 00:13 _2w.fdt >>> > -rw-rw 1 root root 22M Jan 17 00:13 _2w.fdx >>> > -rw-rw 1 root root 131 Jan 17 00:13 _2w.fnm >>> > -rw-rw 1 root root 134M Jan 17 00:13 _2w.frq >>> > -rw-rw 1 root root 16M Jan 17 00:13 _2w.nrm >>> > -rw-rw 1 root root 130M Jan 17 00:13 _2w.prx >>> > -rw-rw 1 root root 9.2M Jan 17 00:13 _2w.tii >>> > -rw-rw 1 root root 1.1G Jan 17 00:13 _2w.tis >>> > -rw-rw 1 root root 20 Jan 17 00:13 segments.gen >>> > -rw-rw 1 root root 291 Jan 17 00:13 segments_2 >>> > >>> > Next I try to run the suggest?spellcheck.build=true command, and I get >>> the >>> > following error: >>> > >>> > Jan 16, 2012 4:01:47 PM org.apache.solr.spelling.suggest.Suggester build >>> > INFO: build() >>> > Jan 16, 2012 4:03:27 PM org.apache.solr.common.SolrException log >>> > SEVERE: java.lang.OutOfMemoryError: GC overhead limit exceeded >>> > at java.util.Arrays.copyOfRange(Arrays.java:3209) >>> > at java.lang.String.(String.java:215) >>> > at org.apache.lucene.index.TermBuffer.toTerm(TermBuffer.java:122) >>> > at org.apache.lucene.index.SegmentTermEnum.term(SegmentTermEnum.java:184) >>> > at org.apache.lucene.index.TermInfosReader.get(TermInfosReader.java:203) >>> > at org.apache.lucene.index.TermInfosReader.get(TermInfosReader.java:172) >>> > at org.apache.lucene.index.SegmentReader.docFreq(SegmentReader.java:509) >>> > at >>> org.apache.lucene.index.DirectoryReader.docFreq(DirectoryReader.java:719) >>> > at >>> org.apache.solr.search.SolrIndexReader.docFreq(SolrIndexReader.java:309) >>> > at >>> > >>> org.apache.lucene.search.spell.HighFrequencyDictionary$HighFrequencyIterator.isFrequent(HighFrequencyDictionary.java:75) >>> > at >>> > >>> org.apache.lucene.search.spell.HighFrequencyDictionary$HighFrequencyIterator.hasNext(HighFrequencyDictionary.java:125) >>> > at >>> org.apache.lucene.search.suggest.fst.FSTLookup.build(FSTLookup.java:157) >>> > at org.apache.lucene.search.suggest.Lookup.build(Lookup.java:70) >>> > at org.apache.solr.spelling.suggest.Suggester.build(Suggester.java:133) >>> > at >>> > >>> org.apache.solr.handler.component.SpellCheckComponent.prepare(SpellCheckComponent.java:109) >>> > at >>> > >>> org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:173) >>> > at >>> > >>> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129) >>> > at org.apache.solr.core.SolrCore.execute(SolrCore.java:1372) >>> > at
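For reference, an FST-based suggester setup of that era typically looks something like this in solrconfig.xml. This is only a sketch: the field name and threshold are placeholders, and the exact lookupImpl class name varies between 3.x releases. It does show where the HighFrequencyDictionary in the stack trace comes into play, though: it iterates every indexed term of the configured field and keeps only the terms whose document frequency is above the threshold.

  <searchComponent name="suggest" class="solr.SpellCheckComponent">
    <lst name="spellchecker">
      <str name="name">suggest</str>
      <str name="classname">org.apache.solr.spelling.suggest.Suggester</str>
      <!-- FST-based lookup; before SOLR-2888 the build held all terms in RAM -->
      <str name="lookupImpl">org.apache.solr.spelling.suggest.fst.FSTLookup</str>
      <!-- placeholder: whatever field the auto-complete terms come from -->
      <str name="field">label</str>
      <!-- only terms appearing in at least 0.5% of documents are kept -->
      <float name="threshold">0.005</float>
      <str name="buildOnCommit">true</str>
    </lst>
  </searchComponent>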
Re: Sorting results within the fields
Hi Jan,

Thanks for the reply. Here is a concrete explanation of the problem that I'm trying to solve.

*SOLR Schema*

There are 3 dynamic fields and 4 searchable fields:

1. *Description*: Data in this field is whitespace-tokenized, stemmed, and lowercased.
2. *Description*: Data in this field is only lowercased and the Keyword tokenizer is applied, so the data is not changed when stored in this field.
3. *Description*: Head terms are encoded in the format HEAD$Value.
4. *Description*: Tail terms are encoded in the format TAIL$Value.

The data that we store in these fields is cleaned-up data from large text: generally 1-word, 2-word or 3-word values, e.g.

D1 -> UI, UI Design, UI Programming, UI Design Document
D2 -> UI Mockup, UI development
D3 -> UI

When somebody queries *UI*, the internal query that is generated is

concepts_headtermencoded_concept:HEAD$ui^100.0 concepts:ui^50.0 concepts_tailtermencoded_concept:TAIL$ui^10.0

so that a head-term matched document is ranked higher than a partial match. The current implementation without our score ranks the documents like D1 > D2 > D3 (because Lucene uses TF/IDF while scoring the documents).

Now, we have created an *application specific score* for each concept and want to sort the results based on that score while preserving the boosts on the fields defined in the query. E.g.

D1 -> UI=90, UI Design=45, UI Programming=40, UI Design Document=85, Project Wolverine=40
D2 -> UI Mockup=55, UI Development=74, Project Management=39
D3 -> UI=95, Project Wolverine=35
D4 -> UI Dev=75, Video Project=42

1. If only an exact match was found, then sorting will happen based on the score value we have defined for that term.
2. If both exact and partial matches are there, then the exact-match documents should come on top, and the partially matched documents should be sorted among themselves based on the score.

*Examples*

*Search*: UI
*Desired results*: D3 > D1 > D4 > D2, where (D3, D1) contain an exact match and hence are scored among themselves, and D4, D2 both have a head match but the score of the head match in D4 > D2.

*Search*: Project
*Desired results*: D1 > D2 > D3 > D4, where D1, D2 and D3 are head-term matches sorted among themselves (D1, D2, D3) based on score, and D4 is a tail-term match (even though it has a better score, the tail-term boost is 1/10th of the head-term boost).

So, in all, we want to override Lucene's TF/IDF scoring and do the scoring based on our concept-specific score, while still giving higher preference to exact matches and then to partial matches.

Hope I explained the problem. Let me know if you have any specific questions.

Thanks
Nitin

--
View this message in context: http://lucene.472066.n3.nabble.com/Sorting-results-within-the-fields-tp3656049p3668047.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: Highlighting "text" field when query is for "string" field
Just to be clear, I do a phrase query on the string field, like q=keyword_text:"smooth skin", and I am expecting highlighting to be done on the excerpt field. What I see in the highlighting section is only the unique ids of the documents. Where are the excerpts with the highlighted text? Any idea?

Thanks

--
View this message in context: http://lucene.472066.n3.nabble.com/Highlighting-text-field-when-query-is-for-string-field-tp3475334p3668074.html
Sent from the Solr - User mailing list archive at Nabble.com.
Question on Reverse Indexing
Hi,

For reverse indexing we are using the ReversedWildcardFilterFactory on Solr 4.0:

<filter class="solr.ReversedWildcardFilterFactory" maxPosAsterisk="3" maxPosQuestion="2" maxFractionAsterisk="0.33"/>

ReversedWildcardFilterFactory was helping us to perform leading wildcard searches like *lock.

But we observed that search performance was not good after introducing the ReversedWildcardFilterFactory filter. Hence we disabled the ReversedWildcardFilterFactory filter and re-created the indexes, and this time we found Solr query performance to be faster.

But surprisingly, leading wildcard searches still work in spite of the ReversedWildcardFilterFactory filter being disabled.

This behavior is puzzling everyone, and we want to understand how reverse indexing works here. Can anyone shed some light on this Solr behavior?

-Shyam
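PS: for context, the filter sits in the index-time analyzer of the field type, roughly like this. The field type name, tokenizer and the other filter here are placeholders, not our exact chain:

  <fieldType name="text_rev" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <!-- indexes an additional reversed form of each token so that
           leading-wildcard queries can be rewritten as prefix queries -->
      <filter class="solr.ReversedWildcardFilterFactory" withOriginal="true"
              maxPosAsterisk="3" maxPosQuestion="2" maxFractionAsterisk="0.33"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>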
Re: Question on Reverse Indexing
Using ReversedWildcardFilterFactory will double the size of your dictionary (more or less), maybe the drop in performance that you are seeing is a result of that?

François

On Jan 17, 2012, at 9:01 PM, Shyam Bhaskaran wrote:

> Hi,
>
> For reverse indexing we are using the ReversedWildcardFilterFactory on Solr
> 4.0
>
> <filter class="solr.ReversedWildcardFilterFactory"
> maxPosAsterisk="3" maxPosQuestion="2" maxFractionAsterisk="0.33"/>
>
> ReversedWildcardFilterFactory was helping us to perform leading wild card
> searches like *lock.
>
> But it was observed that the performance of the searches was not good after
> introducing ReversedWildcardFilterFactory filter.
>
> Hence we disabled ReversedWildcardFilterFactory filter and re-created the
> indexes and this time we found the performance of Solr query to be faster.
>
> But surprisingly it is observed that leading wild card searches were still
> working in spite of disabling the ReversedWildcardFilterFactory filter.
>
> This behavior is puzzling everyone and wanted to know how this behavior of
> reverse indexing works?
>
> Can anyone share with me on this Solr behavior.
>
> -Shyam
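To make the doubling concrete: with withOriginal="true" (the default), a token like "unlock" is indexed twice, once as-is and once reversed with a marker character prepended, roughly:

  unlock
  \u0001kcolnu

A leading-wildcard query such as *lock can then be rewritten by the query parser into a fast prefix query on the reversed form (\u0001kcol*), but every field that uses the filter carries both variants of every token, hence the larger dictionary.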
RE: Question on Reverse Indexing
Hi Francois,

I understand that disabling ReversedWildcardFilterFactory has improved the performance. But I am puzzled about how a leading wildcard search like *lock is still working even though I have now disabled the ReversedWildcardFilterFactory and the indexes have been re-created without the ReversedWildcardFilter.

How does reverse indexing work even after disabling ReversedWildcardFilterFactory? Can anyone explain to me how this feature is working?

-Shyam

-----Original Message-----
From: François Schiettecatte [mailto:fschietteca...@gmail.com]
Sent: Wednesday, January 18, 2012 7:49 AM
To: solr-user@lucene.apache.org
Subject: Re: Question on Reverse Indexing

Using ReversedWildcardFilterFactory will double the size of your dictionary (more or less), maybe the drop in performance that you are seeing is a result of that?

François

On Jan 17, 2012, at 9:01 PM, Shyam Bhaskaran wrote:

> Hi,
>
> For reverse indexing we are using the ReversedWildcardFilterFactory on Solr
> 4.0
>
> <filter class="solr.ReversedWildcardFilterFactory"
> maxPosAsterisk="3" maxPosQuestion="2" maxFractionAsterisk="0.33"/>
>
> ReversedWildcardFilterFactory was helping us to perform leading wild card
> searches like *lock.
>
> But it was observed that the performance of the searches was not good after
> introducing ReversedWildcardFilterFactory filter.
>
> Hence we disabled ReversedWildcardFilterFactory filter and re-created the
> indexes and this time we found the performance of Solr query to be faster.
>
> But surprisingly it is observed that leading wild card searches were still
> working in spite of disabling the ReversedWildcardFilterFactory filter.
>
> This behavior is puzzling everyone and wanted to know how this behavior of
> reverse indexing works?
>
> Can anyone share with me on this Solr behavior.
>
> -Shyam
Re: DataImportHandler in Solr 4.0
I'm not a Java pro, and the documentation hasn't been updated to include these instructions (at least not that I could find). What do I need to do to perform the steps that Alexandre is talking about?

--
View this message in context: http://lucene.472066.n3.nabble.com/DataImportHandler-in-Solr-4-0-tp2563053p3667942.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: Can Apache Solr Handle TeraByte Large Data
Could indexing English Wikipedia dump over and over get you there? Otis Performance Monitoring SaaS for Solr - http://sematext.com/spm/solr-performance-monitoring/index.html > > From: Memory Makers >To: solr-user@lucene.apache.org >Sent: Tuesday, January 17, 2012 12:15 AM >Subject: Re: Can Apache Solr Handle TeraByte Large Data > >I've been toying with the idea of setting up an experiment to index a large >document set 1+ TB -- any thoughts on an open data set that one could use >for this purpose? > >Thanks. > >On Mon, Jan 16, 2012 at 5:00 PM, Burton-West, Tom wrote: > >> Hello , >> >> Searching real-time sounds difficult with that amount of data. With large >> documents, 3 million documents, and 5TB of data the index will be very >> large. With indexes that large your performance will probably be I/O bound. >> >> Do you plan on allowing phrase or proximity searches? If so, your >> performance will be even more I/O bound as documents that large will have >> huge positions indexes that will need to be read into memory for processing >> phrase queries. To reduce I/O you need as much of the index in memory >> (Lucene/Solr caches, and operating system disk cache). Every commit >> invalidates the Solr/Lucene caches (unless the newer nrt code has solved >> this for Solr). >> >> If you index and serve on the same server, you are also going to get >> terrible response time whenever your commits trigger a large merge. >> >> If you need to service 10-100 qps or more, you may need to look at putting >> your index on SSDs or spreading it over enough machines so it can stay in >> memory. >> >> What kind of response times are you looking for and what query rate? >> >> We have somewhat smaller documents. We have 10 million documents and about >> 6-8TB of data in HathiTrust and have spread the index over 12 shards on 4 >> machines (i.e. 3 shards per machine). We get an average of around >> 200-300ms response time but our 95th percentile times are about 800ms and >> 99th percentile are around 2 seconds. This is with an average load of less >> than 1 query/second. >> >> As Otis suggested, you may want to implement a strategy that allows users >> to search within the large documents by breaking the documents up into >> smaller units. What we do is have two Solr indexes. The first indexes >> complete documents. When the user clicks on a result, we index the entire >> document on a page level in a small Solr index on-the-fly. That way they >> can search within the document and get page level results. >> >> More details about our setup: >> http://www.hathitrust.org/blogs/large-scale-search >> >> Tom Burton-West >> University of Michigan Library >> www.hathitrust.org >> -Original Message- >> >> > > >
Re: Solr - Tika(?) memory leak
You'll need to reindex everything indeed.

Otis

Performance Monitoring SaaS for Solr - http://sematext.com/spm/solr-performance-monitoring/index.html

> From: Wayne W
> To: solr-user@lucene.apache.org
> Sent: Tuesday, January 17, 2012 12:36 AM
> Subject: Re: Solr - Tika(?) memory leak
>
> Thanks for the links - I've put a posting on the Tika ML.
> I've just checked and we are using tika-0.2.jar - does anyone know which
> version I can use with solr 1.3?
>
> Is there any info on upgrading from this far back to the latest
> version - is it even possible? Or would I need to re-index everything?
>
> On Tue, Jan 17, 2012 at 5:39 AM, P Williams wrote:
>> Hi,
>>
>> I'm not sure which version of Solr/Tika you're using but I had a similar
>> experience which turned out to be the result of a design change to PDFBox.
>>
>> https://issues.apache.org/jira/browse/SOLR-2886
>>
>> Tricia
>>
>> On Sat, Jan 14, 2012 at 12:53 AM, Wayne W wrote:
>>
>>> Hi,
>>>
>>> we're using Solr running on tomcat with 1GB in production, and of late
>>> we've been having a huge number of OutOfMemory issues. It seems from
>>> what I can tell this is coming from the tika extraction of the
>>> content. I've processed the java dump file using a memory analyzer and
>>> it's pretty clear at least which class is involved. It seems like a leak to
>>> me, as we don't parse any files larger than 20M, and these objects are
>>> taking up ~700M.
>>>
>>> I've attached 2 screen shots from the tool (not sure if you receive
>>> attachments).
>>>
>>> But to summarize (class, number of objects, used heap size, retained heap size):
>>>
>>> org.apache.xmlbeans.impl.store.Xob$ElementXObj   838,993   80,533,728   604,606,040
>>> org.apache.poi.openxml4j.opc.ZipPackage          2         112          87,009,848
>>> char[]                                           587       32,216,960   38,216,950
>>>
>>> We're really desperate to find a solution to this - any ideas or help
>>> is greatly appreciated.
>>> Wayne