Re: Solr - Tomcat new versions
Hi, I installed Apache Tomcat on Windows (Vista) and Solr, but I have a problem between Tomcat 7.0.23 and Solr 3.5. There is no problem if I install Solr 1.4.1 with the same version of Tomcat. (I checked it with both the binary and the source code installation of Tomcat, but the result is the same.) Is it a bug? thank you Alessio
Re: SolrJ Embedded
On Tue, Jan 17, 2012 at 3:13 AM, Erick Erickson wrote: > I don't see why not. I'm assuming a *nix system here so when Solr > updated an index, any deleted files would hang around. > > But I have to ask why bother with the Embedded server in the > first place? You already have a Solr instance up and running, > why not just query that instead, perhaps using SolrJ? > > Wouldn't querying the Solr server using the HTTP interface be slower? > Best > Erick > > On Mon, Jan 16, 2012 at 3:00 PM, wrote: > > Hi, > > > > is it possible to use the same index in a solr webapp and additionally > in a > > EmbeddedSolrServer? The embbedded one would be read only. > > > > Thank you. > > >
Re: Query regarding solr custom sort order
Hi, Let me clarify the situation here in detail. The default sort which WebSphere Commerce provides is based on the name & price of an item, and we have unique values for every item, so sorting works fine either as integer or as string. But during preprocessing we generate some temporary tables like TI_CATGPENREL_0, where the sequence number has multiple values for a particular catentry id (item). This field (sequence) is declared as varchar because it can contain multiple values separated by ";", which Solr returns, hence sorting based on sequence happens lexicographically as discussed in the thread. So, can we restrict it to send single values based on a certain category ID or something like that? Thanks in advance, Uma Shankar -- View this message in context: http://lucene.472066.n3.nabble.com/Query-regarding-solr-custom-sort-order-tp3631854p3665545.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Solr - Tomcat new versions
Hi Alessio, I've seen Solr 3.5 running within Tomcat 7.0.23, it shouldn't be a bug I guess. Could you please provide some more details about the problem you have? Do you have a stacktrace? Are you upgrading an existing Solr 1.4.1, right? By the way, which jdk are you using? Thanks Luca On Tue, Jan 17, 2012 at 9:40 AM, Alessio Crisantemi < alessio.crisant...@gioconews.it> wrote: > Hi, > I installed Apache tomct on Windows (Vista) and Solr. > But I have any problem between Tomcat 7.0.23 and Solr 3.5 > > No problem if I install Solr 1.4.1 with the same version of Tomcat. > (I check it with binary and source code installation for omcat but the > result is the same). > It's a bug, I think? > thank you > Alessio > > > > > > > > > -- Luca Cavanna E-mail: cavannal...@gmail.com Skype: just.cavanna Italian Mobile: +39 329 0170084 Dutch Mobile: +31 6 22255262 Website: http://www.javanna.net Stack Overflow Careers: http://careers.stackoverflow.com/lucacavanna Linkedin: http://it.linkedin.com/in/lucacavanna
Re: Solr - Tomcat new versions
Dear Luca, I followed the Solr installation procedure described in the official guide, but it doesn't work with Solr 3.5, while with Solr 1.4.1 everything is all right. I don't know why... but for now I work with Solr 1.4.1. One more thing: I would like to install Tika 1.0 on Solr 1.4.1. Is that possible? How can I do it? Can you help me? best, a. --- -Original Message- From: Luca Cavanna Sent: Tuesday, January 17, 2012 10:16 AM To: solr-user@lucene.apache.org ; Alessio Crisantemi Subject: Re: Solr - Tomcat new versions Hi Alessio, I've seen Solr 3.5 running within Tomcat 7.0.23, it shouldn't be a bug I guess. Could you please provide some more details about the problem you have? Do you have a stacktrace? Are you upgrading an existing Solr 1.4.1, right? By the way, which jdk are you using? Thanks Luca On Tue, Jan 17, 2012 at 9:40 AM, Alessio Crisantemi < alessio.crisant...@gioconews.it> wrote: Hi, I installed Apache tomct on Windows (Vista) and Solr. But I have any problem between Tomcat 7.0.23 and Solr 3.5 No problem if I install Solr 1.4.1 with the same version of Tomcat. (I check it with binary and source code installation for omcat but the result is the same). It's a bug, I think? thank you Alessio -- Luca Cavanna E-mail: cavannal...@gmail.com Skype: just.cavanna Italian Mobile: +39 329 0170084 Dutch Mobile: +31 6 22255262 Website: http://www.javanna.net Stack Overflow Careers: http://careers.stackoverflow.com/lucacavanna Linkedin: http://it.linkedin.com/in/lucacavanna
Re: FacetComponent: suppress original query
Yes, that's what I have started to use already. Probably, this is the easiest solution. Thanks. On Tue, Jan 17, 2012 at 3:03 AM, Erick Erickson wrote: > Why not just up the maxBooleanClauses parameter in solrconfig.xml? > > Best > Erick > > On Sat, Jan 14, 2012 at 1:41 PM, Dmitry Kan wrote: > > OK, let me clarify it: > > > > if solrconfig has maxBooleanClauses set to 1000 for example, than queries > > with clauses more than 1000 in number will be rejected with the mentioned > > exception. > > What I want to do is automatically split such queries into sub-queries > with > > at most 1000 clauses inside SOLR and send them to shards. I have already > > done the splitting and sending code, but how to bypass the > > maxBooleanClauses check? > > > > Dmitry > > > > On Fri, Jan 13, 2012 at 7:40 PM, Chris Hostetter > > wrote: > > > >> > >> : I would like to "by-pass" the maxBooleanClauses limit in such a way, > that > >> : those queries that contain boolean clauses more than > maxBooleanClauses in > >> : the number, would be automatically split into sub-queries. That part > is > >> : done. > >> : > >> : Now, when such a query arrives, solr throws > >> : > >> : org.apache.lucene.queryParser.ParseException: Cannot parse > >> : 'AccessionNumber:(TS-E_284668 OR TS-E_284904 OR 950123-11-086962 > OR > >> : TS-AS_292840 OR TS-AS_295661 OR TS-AS_296320 OR TS-AS_296805 OR > >> : TS-AS_296819 OR TS-AS_296820)': too many boolean clauses > >> > >> I don't understand your question/issue ... you say you've already worked > >> arround the maxBooleanClauses (ie: "That part is done") but you didn't > say > >> how, and in your followup quesiton, it sounds like you are still hitting > >> the limit of maxBooleanClauses. > >> > >> So what exactly have you changed/done that is "done" and what is the > >> new problem? > >> > >> > >> -Hoss > >> > > > > > > > > -- > > Regards, > > > > Dmitry Kan > -- Regards, Dmitry Kan
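[For reference, the setting Erick refers to lives in the <query> section of solrconfig.xml; a minimal sketch follows, where the value 4096 is only an illustrative choice, not a recommendation:

<config>
  <query>
    <!-- raise the cap on the number of clauses allowed in a single BooleanQuery -->
    <maxBooleanClauses>4096</maxBooleanClauses>
  </query>
</config>

Note the limit still exists to protect against runaway queries, so raising it trades safety for convenience.]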
Re: Solr - Tomcat new versions
Hi Alessio, in order to help you, we'd need to know something more about what's going wrong. Could you give us a stacktrace or an error you're reading? How do you know solr isn't working? Thanks Luca On Tue, Jan 17, 2012 at 10:52 AM, Alessio Crisantemi < alessio.crisant...@gioconews.it> wrote: > Dear Luca, > I follow the Solr installation procedures signed on Official guide, but > with Solr 3,5 don't works. While with solr 1.4.1 it's all right. > > I don't know why...but now I work with Solr 1.4.1 > > and more: > I would install TIKA 1.0 on Solr 1.4.1. Is possible? > How can i do? can you help me? > best, > a. > > > --**--** > --**- > > > > > > > > > -Messaggio originale- From: Luca Cavanna > Sent: Tuesday, January 17, 2012 10:16 AM > To: solr-user@lucene.apache.org ; Alessio Crisantemi > Subject: Re: Solr - Tomcat new versions > > Hi Alessio, > I've seen Solr 3.5 running within Tomcat 7.0.23, it shouldn't be a bug I > guess. > Could you please provide some more details about the problem you have? Do > you have a stacktrace? > Are you upgrading an existing Solr 1.4.1, right? > By the way, which jdk are you using? > > Thanks > Luca > > On Tue, Jan 17, 2012 at 9:40 AM, Alessio Crisantemi < > alessio.crisantemi@gioconews.**it > > wrote: > > Hi, >> I installed Apache tomct on Windows (Vista) and Solr. >> But I have any problem between Tomcat 7.0.23 and Solr 3.5 >> >> No problem if I install Solr 1.4.1 with the same version of Tomcat. >> (I check it with binary and source code installation for omcat but the >> result is the same). >> It's a bug, I think? >> thank you >> Alessio >> >> >> >> >> >> >> >> >> >>
really slow performance when trying to get facet.field
Hi, I have 2 Solr shards. One is filled with approx. 25mio documents (local index 6GB), the other with 10mio documents (2.7GB size). I am trying to create some kind of 'word cloud' to see the frequency of words for a *text_general* field. For this I am currently using a facet over this field, and I am also restricting the documents by using some other filters in the query. The performance is really bad for the first call and then pretty fast for the following calls. The maximum Java heap size is 3G for each shard. Both shards are running on the same physical server, which has 12G RAM. Question: Should I reduce the documents in one shard, so that the index is equal to or less than the Java heap size for this shard? Or is there another method to avoid these slow calls? Thank you Daniel
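[A request of the shape described above might look roughly like the following; the field and filter names (content, site_id) are made up for illustration since the real schema isn't shown in the thread:

http://localhost:8983/solr/select?q=*:*&fq=site_id:1234&rows=0&facet=true&facet.field=content&facet.limit=100&facet.mincount=2

Faceting a free-text field like this has to build large term/field caches the first time, which is why the cost is mostly paid on the first call.]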
Re: really slow performance when trying to get facet.field
I had a similar problem for a similar task. And in my case merging the results from two shards turned out to be a culprit. If you can logically store your data just in one shard, your faceting should become faster. Size wise it should not be a problem for SOLR. Also, you didn't say anything about the facet.limit value, cache parameters, usage of filter queries. Some of these can be interconnected. Dmitry On Tue, Jan 17, 2012 at 2:49 PM, Daniel Bruegge < daniel.brue...@googlemail.com> wrote: > Hi, > > I have 2 Solr-shards. One is filled with approx. 25mio documents (local > index 6GB), the other with 10mio documents (2.7GB size). > I am trying to create some kind of 'word cloud' to see the frequency of > words for a *text_general *field. > For this I am currently using a facet over this field and I am also > restricting the documents by using some other filters in the query. > > The performance is really bad for the first call and then pretty fast for > the following calls. > > The maximum Java heap size is 3G for each shard. Both shards are running on > the same physical server which has 12G RAM. > > Question: Should I reduce the documents in one shard, so that the index is > equal or less the Java Heap size for this shard? Or is > there another method to avoid this slow calls? > > Thank you > > Daniel > -- Regards, Dmitry Kan
Re: Solr - Tomcat new versions
Perhaps this the known issue with the 3.5 example schema being used in Tomcat and the VelocityResponseWriter issue? I'm on my mobile now so don't have easy access to a pointer with details but check the archives if this seems to be the issue on how to resolve it. Erik On Jan 17, 2012, at 4:52, "Alessio Crisantemi" wrote: > Dear Luca, > I follow the Solr installation procedures signed on Official guide, but with > Solr 3,5 don't works. While with solr 1.4.1 it's all right. > > I don't know why...but now I work with Solr 1.4.1 > > and more: > I would install TIKA 1.0 on Solr 1.4.1. Is possible? > How can i do? can you help me? > best, > a. > > > --- > > > > > > > > -Messaggio originale- From: Luca Cavanna > Sent: Tuesday, January 17, 2012 10:16 AM > To: solr-user@lucene.apache.org ; Alessio Crisantemi > Subject: Re: Solr - Tomcat new versions > > Hi Alessio, > I've seen Solr 3.5 running within Tomcat 7.0.23, it shouldn't be a bug I > guess. > Could you please provide some more details about the problem you have? Do > you have a stacktrace? > Are you upgrading an existing Solr 1.4.1, right? > By the way, which jdk are you using? > > Thanks > Luca > > On Tue, Jan 17, 2012 at 9:40 AM, Alessio Crisantemi < > alessio.crisant...@gioconews.it> wrote: > >> Hi, >> I installed Apache tomct on Windows (Vista) and Solr. >> But I have any problem between Tomcat 7.0.23 and Solr 3.5 >> >> No problem if I install Solr 1.4.1 with the same version of Tomcat. >> (I check it with binary and source code installation for omcat but the >> result is the same). >> It's a bug, I think? >> thank you >> Alessio >> >> >> >> >> >> >> >> >> > > > -- > Luca Cavanna > E-mail: cavannal...@gmail.com > Skype: just.cavanna > Italian Mobile: +39 329 0170084 > Dutch Mobile: +31 6 22255262 > Website: http://www.javanna.net > Stack Overflow Careers: http://careers.stackoverflow.com/lucacavanna > Linkedin: http://it.linkedin.com/in/lucacavanna
Re: really slow performance when trying to get facet.field
Hi Dmitry, I had everything on one Solr Instance before, but this got to heavy and I had the same issue here, that the 1st facet.query was really slow. When querying the facet: - facet.limit = 100 Cache settings are like this: How big was your index? Did it fit into the RAM which you gave the Solr instance? Thanks On Tue, Jan 17, 2012 at 1:56 PM, Dmitry Kan wrote: > I had a similar problem for a similar task. And in my case merging the > results from two shards turned out to be a culprit. If you can logically > store your data just in one shard, your faceting should become faster. Size > wise it should not be a problem for SOLR. > > Also, you didn't say anything about the facet.limit value, cache > parameters, usage of filter queries. Some of these can be interconnected. > > Dmitry > > On Tue, Jan 17, 2012 at 2:49 PM, Daniel Bruegge < > daniel.brue...@googlemail.com> wrote: > > > Hi, > > > > I have 2 Solr-shards. One is filled with approx. 25mio documents (local > > index 6GB), the other with 10mio documents (2.7GB size). > > I am trying to create some kind of 'word cloud' to see the frequency of > > words for a *text_general *field. > > For this I am currently using a facet over this field and I am also > > restricting the documents by using some other filters in the query. > > > > The performance is really bad for the first call and then pretty fast for > > the following calls. > > > > The maximum Java heap size is 3G for each shard. Both shards are running > on > > the same physical server which has 12G RAM. > > > > Question: Should I reduce the documents in one shard, so that the index > is > > equal or less the Java Heap size for this shard? Or is > > there another method to avoid this slow calls? > > > > Thank you > > > > Daniel > > > > > > -- > Regards, > > Dmitry Kan >
Re: really slow performance when trying to get facet.field
Hi Daniel, My index is 6,5G. I'm sure it can be bigger. facet.limit we ask for is beyond 100 thousand. It is sub-second speed. I run it with -Xms1024m -Xmx12000m under tomcat, it currently takes 5,4G of RAM. Amount of docs is over 6,5 million. Do you see any evictions in your caches? What kind of server is it, in terms of CPU and OS? How often do you commit to the index? Dmitry On Tue, Jan 17, 2012 at 3:01 PM, Daniel Bruegge < daniel.brue...@googlemail.com> wrote: > Hi Dmitry, > > I had everything on one Solr Instance before, but this got to heavy and I > had the same issue here, that the 1st facet.query was really slow. > > When querying the facet: > - facet.limit = 100 > > Cache settings are like this: > > size="16384" > initialSize="4096" > autowarmCount="4096"/> > > size="512" > initialSize="512" > autowarmCount="0"/> > > size="512" > initialSize="512" > autowarmCount="0"/> > > How big was your index? Did it fit into the RAM which you gave the Solr > instance? > > Thanks > > > On Tue, Jan 17, 2012 at 1:56 PM, Dmitry Kan wrote: > > > I had a similar problem for a similar task. And in my case merging the > > results from two shards turned out to be a culprit. If you can logically > > store your data just in one shard, your faceting should become faster. > Size > > wise it should not be a problem for SOLR. > > > > Also, you didn't say anything about the facet.limit value, cache > > parameters, usage of filter queries. Some of these can be interconnected. > > > > Dmitry > > > > On Tue, Jan 17, 2012 at 2:49 PM, Daniel Bruegge < > > daniel.brue...@googlemail.com> wrote: > > > > > Hi, > > > > > > I have 2 Solr-shards. One is filled with approx. 25mio documents (local > > > index 6GB), the other with 10mio documents (2.7GB size). > > > I am trying to create some kind of 'word cloud' to see the frequency of > > > words for a *text_general *field. > > > For this I am currently using a facet over this field and I am also > > > restricting the documents by using some other filters in the query. > > > > > > The performance is really bad for the first call and then pretty fast > for > > > the following calls. > > > > > > The maximum Java heap size is 3G for each shard. Both shards are > running > > on > > > the same physical server which has 12G RAM. > > > > > > Question: Should I reduce the documents in one shard, so that the index > > is > > > equal or less the Java Heap size for this shard? Or is > > > there another method to avoid this slow calls? > > > > > > Thank you > > > > > > Daniel > > > > > > > > > > > -- > > Regards, > > > > Dmitry Kan > > > -- Regards, Dmitry Kan
Re: really slow performance when trying to get facet.field
Evictions are 0 for all cache types. Your server max heap space with 12G is pretty huge. Which is good I think. The CPU on my server is a 8-Core Intel i7 965. Commit frequency is low, because shards are added and old shards exist for historical reasons. Old shards will be then cleaned after couple of months. I will try to add maximum 15mio per shard and see what will happen here. This thing is, that I will add more shards over time, so that I can handle maybe 500-800mio documents. Maybe more. It depends. On Tue, Jan 17, 2012 at 2:14 PM, Dmitry Kan wrote: > Hi Daniel, > > My index is 6,5G. I'm sure it can be bigger. facet.limit we ask for is > beyond 100 thousand. It is sub-second speed. I run it with -Xms1024m > -Xmx12000m under tomcat, it currently takes 5,4G of RAM. Amount of docs is > over 6,5 million. > > Do you see any evictions in your caches? What kind of server is it, in > terms of CPU and OS? How often do you commit to the index? > > Dmitry > > On Tue, Jan 17, 2012 at 3:01 PM, Daniel Bruegge < > daniel.brue...@googlemail.com> wrote: > > > Hi Dmitry, > > > > I had everything on one Solr Instance before, but this got to heavy and I > > had the same issue here, that the 1st facet.query was really slow. > > > > When querying the facet: > > - facet.limit = 100 > > > > Cache settings are like this: > > > > > size="16384" > > initialSize="4096" > > autowarmCount="4096"/> > > > > > size="512" > > initialSize="512" > > autowarmCount="0"/> > > > > > size="512" > > initialSize="512" > > autowarmCount="0"/> > > > > How big was your index? Did it fit into the RAM which you gave the Solr > > instance? > > > > Thanks > > > > > > On Tue, Jan 17, 2012 at 1:56 PM, Dmitry Kan > wrote: > > > > > I had a similar problem for a similar task. And in my case merging the > > > results from two shards turned out to be a culprit. If you can > logically > > > store your data just in one shard, your faceting should become faster. > > Size > > > wise it should not be a problem for SOLR. > > > > > > Also, you didn't say anything about the facet.limit value, cache > > > parameters, usage of filter queries. Some of these can be > interconnected. > > > > > > Dmitry > > > > > > On Tue, Jan 17, 2012 at 2:49 PM, Daniel Bruegge < > > > daniel.brue...@googlemail.com> wrote: > > > > > > > Hi, > > > > > > > > I have 2 Solr-shards. One is filled with approx. 25mio documents > (local > > > > index 6GB), the other with 10mio documents (2.7GB size). > > > > I am trying to create some kind of 'word cloud' to see the frequency > of > > > > words for a *text_general *field. > > > > For this I am currently using a facet over this field and I am also > > > > restricting the documents by using some other filters in the query. > > > > > > > > The performance is really bad for the first call and then pretty fast > > for > > > > the following calls. > > > > > > > > The maximum Java heap size is 3G for each shard. Both shards are > > running > > > on > > > > the same physical server which has 12G RAM. > > > > > > > > Question: Should I reduce the documents in one shard, so that the > index > > > is > > > > equal or less the Java Heap size for this shard? Or is > > > > there another method to avoid this slow calls? > > > > > > > > Thank you > > > > > > > > Daniel > > > > > > > > > > > > > > > > -- > > > Regards, > > > > > > Dmitry Kan > > > > > > > > > -- > Regards, > > Dmitry Kan >
Function in facet.query like min,max
Hi Solr community, Is it possible to return the lowest, highest and average price of a search result using facets? I tried something like: facet.query={!max(price,0)} Is it possible and what is the correct syntax? q=htc android facet=true facet.query=price:[* TO 10] facet.query=price:[11 TO 100] facet.query=price:[101 TO *] ??? facet.query={!max(price,0)} Thanks & Regards Ericz
Re: Trying to understand SOLR memory requirements
Thank you Robert, I'd appreciate that. Any idea how long it will take to get a fix? Would I be better switching to trunk? Is trunk stable enough for someone who's very much a SOLR novice? Thanks, Dave On Mon, Jan 16, 2012 at 10:08 PM, Robert Muir wrote: > looks like https://issues.apache.org/jira/browse/SOLR-2888. > > Previously, FST would need to hold all the terms in RAM during > construction, but with the patch it uses offline sorts/temporary > files. > I'll reopen the issue to backport this to the 3.x branch. > > > On Mon, Jan 16, 2012 at 8:31 PM, Dave wrote: > > I'm trying to figure out what my memory needs are for a rather large > > dataset. I'm trying to build an auto-complete system for every > > city/state/country in the world. I've got a geographic database, and have > > setup the DIH to pull the proper data in. There are 2,784,937 documents > > which I've formatted into JSON-like output, so there's a bit of data > > associated with each one. Here is an example record: > > > > Brooklyn, New York, United States?{ |id|: |2620829|, > > |timezone|:|America/New_York|,|type|: |3|, |country|: { |id| : |229| }, > > |region|: { |id| : |3608| }, |city|: { |id|: |2616971|, |plainname|: > > |Brooklyn|, |name|: |Brooklyn, New York, United States| }, |hint|: > > |2300664|, |label|: |Brooklyn, New York, United States|, |value|: > > |Brooklyn, New York, United States|, |title|: |Brooklyn, New York, United > > States| } > > > > I've got the spellchecker / suggester module setup, and I can confirm > that > > everything works properly with a smaller dataset (i.e. just a couple of > > countries worth of cities/states). However I'm running into a big problem > > when I try to index the entire dataset. The > dataimport?command=full-import > > works and the system comes to an idle state. 
It generates the following > > data/index/ directory (I'm including it in case it gives any indication > on > > memory requirements): > > > > -rw-rw 1 root root 2.2G Jan 17 00:13 _2w.fdt > > -rw-rw 1 root root22M Jan 17 00:13 _2w.fdx > > -rw-rw 1 root root131 Jan 17 00:13 _2w.fnm > > -rw-rw 1 root root 134M Jan 17 00:13 _2w.frq > > -rw-rw 1 root root16M Jan 17 00:13 _2w.nrm > > -rw-rw 1 root root 130M Jan 17 00:13 _2w.prx > > -rw-rw 1 root root 9.2M Jan 17 00:13 _2w.tii > > -rw-rw 1 root root 1.1G Jan 17 00:13 _2w.tis > > -rw-rw 1 root root 20 Jan 17 00:13 segments.gen > > -rw-rw 1 root root291 Jan 17 00:13 segments_2 > > > > Next I try to run the suggest?spellcheck.build=true command, and I get > the > > following error: > > > > Jan 16, 2012 4:01:47 PM org.apache.solr.spelling.suggest.Suggester build > > INFO: build() > > Jan 16, 2012 4:03:27 PM org.apache.solr.common.SolrException log > > SEVERE: java.lang.OutOfMemoryError: GC overhead limit exceeded > > at java.util.Arrays.copyOfRange(Arrays.java:3209) > > at java.lang.String.(String.java:215) > > at org.apache.lucene.index.TermBuffer.toTerm(TermBuffer.java:122) > > at org.apache.lucene.index.SegmentTermEnum.term(SegmentTermEnum.java:184) > > at org.apache.lucene.index.TermInfosReader.get(TermInfosReader.java:203) > > at org.apache.lucene.index.TermInfosReader.get(TermInfosReader.java:172) > > at org.apache.lucene.index.SegmentReader.docFreq(SegmentReader.java:509) > > at > org.apache.lucene.index.DirectoryReader.docFreq(DirectoryReader.java:719) > > at > org.apache.solr.search.SolrIndexReader.docFreq(SolrIndexReader.java:309) > > at > > > org.apache.lucene.search.spell.HighFrequencyDictionary$HighFrequencyIterator.isFrequent(HighFrequencyDictionary.java:75) > > at > > > org.apache.lucene.search.spell.HighFrequencyDictionary$HighFrequencyIterator.hasNext(HighFrequencyDictionary.java:125) > > at > org.apache.lucene.search.suggest.fst.FSTLookup.build(FSTLookup.java:157) > > at org.apache.lucene.search.suggest.Lookup.build(Lookup.java:70) > > at org.apache.solr.spelling.suggest.Suggester.build(Suggester.java:133) > > at > > > org.apache.solr.handler.component.SpellCheckComponent.prepare(SpellCheckComponent.java:109) > > at > > > org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:173) > > at > > > org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129) > > at org.apache.solr.core.SolrCore.execute(SolrCore.java:1372) > > at > > > org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:356) > > at > > > org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:252) > > at > > > org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) > > at > org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399) > > at > > > org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216) > > at > org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182) > > at > org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.
How can I index this?
Hello, I am looking into indexing two data sources. One of those is a standard website and the other is a Sharepoint site. The problem is that I have no direct database access. Normally I would just use the DIH and get what I need from the DB. I do have a java DAO (data access object) class that I am using to directly to fetch information for a different purpose. In cases like this, what would be the best way to index the data? Should I somehow integrate Nutch as the crawler? Should I write a custom DIH? Can I use the DAO that I have in conjunction with the DIH? I am really looking for some recommendations here. I do have a few hacks that can be done (copy the data in a DB and index with DIH), but I am interested in the proper way. Any insight will be greatly appreciated. Cheers -- View this message in context: http://lucene.472066.n3.nabble.com/How-can-I-index-this-tp3666106p3666106.html Sent from the Solr - User mailing list archive at Nabble.com.
first time query is very slow
hi, I have a Solr 3.3 index of 200,000 documents; all text is stored and the total index size is 27gb. I use a dismax query with over 10 qf and pf boosting fields each, plus sorting on score and 2 other fields. It takes quite a few seconds (5-8) for a first-time query to return any result (no highlighting is involved), and even longer for a phrase query. My question is, what is the bottleneck of the query speed? The Lucene query part? Scoring? Or filling the document cache with document content? Can anyone answer? Is there any way of improving the first-time query speed? thanks in advance, shen
Re: Trying to understand SOLR memory requirements
I committed it already: so you can try out branch_3x if you want. you can either wait for a nightly build or compile from svn (http://svn.apache.org/repos/asf/lucene/dev/branches/branch_3x/). On Tue, Jan 17, 2012 at 8:35 AM, Dave wrote: > Thank you Robert, I'd appreciate that. Any idea how long it will take to > get a fix? Would I be better switching to trunk? Is trunk stable enough for > someone who's very much a SOLR novice? > > Thanks, > Dave > > On Mon, Jan 16, 2012 at 10:08 PM, Robert Muir wrote: > >> looks like https://issues.apache.org/jira/browse/SOLR-2888. >> >> Previously, FST would need to hold all the terms in RAM during >> construction, but with the patch it uses offline sorts/temporary >> files. >> I'll reopen the issue to backport this to the 3.x branch. >> >> >> On Mon, Jan 16, 2012 at 8:31 PM, Dave wrote: >> > I'm trying to figure out what my memory needs are for a rather large >> > dataset. I'm trying to build an auto-complete system for every >> > city/state/country in the world. I've got a geographic database, and have >> > setup the DIH to pull the proper data in. There are 2,784,937 documents >> > which I've formatted into JSON-like output, so there's a bit of data >> > associated with each one. Here is an example record: >> > >> > Brooklyn, New York, United States?{ |id|: |2620829|, >> > |timezone|:|America/New_York|,|type|: |3|, |country|: { |id| : |229| }, >> > |region|: { |id| : |3608| }, |city|: { |id|: |2616971|, |plainname|: >> > |Brooklyn|, |name|: |Brooklyn, New York, United States| }, |hint|: >> > |2300664|, |label|: |Brooklyn, New York, United States|, |value|: >> > |Brooklyn, New York, United States|, |title|: |Brooklyn, New York, United >> > States| } >> > >> > I've got the spellchecker / suggester module setup, and I can confirm >> that >> > everything works properly with a smaller dataset (i.e. just a couple of >> > countries worth of cities/states). However I'm running into a big problem >> > when I try to index the entire dataset. The >> dataimport?command=full-import >> > works and the system comes to an idle state. 
It generates the following >> > data/index/ directory (I'm including it in case it gives any indication >> on >> > memory requirements): >> > >> > -rw-rw 1 root root 2.2G Jan 17 00:13 _2w.fdt >> > -rw-rw 1 root root 22M Jan 17 00:13 _2w.fdx >> > -rw-rw 1 root root 131 Jan 17 00:13 _2w.fnm >> > -rw-rw 1 root root 134M Jan 17 00:13 _2w.frq >> > -rw-rw 1 root root 16M Jan 17 00:13 _2w.nrm >> > -rw-rw 1 root root 130M Jan 17 00:13 _2w.prx >> > -rw-rw 1 root root 9.2M Jan 17 00:13 _2w.tii >> > -rw-rw 1 root root 1.1G Jan 17 00:13 _2w.tis >> > -rw-rw 1 root root 20 Jan 17 00:13 segments.gen >> > -rw-rw 1 root root 291 Jan 17 00:13 segments_2 >> > >> > Next I try to run the suggest?spellcheck.build=true command, and I get >> the >> > following error: >> > >> > Jan 16, 2012 4:01:47 PM org.apache.solr.spelling.suggest.Suggester build >> > INFO: build() >> > Jan 16, 2012 4:03:27 PM org.apache.solr.common.SolrException log >> > SEVERE: java.lang.OutOfMemoryError: GC overhead limit exceeded >> > at java.util.Arrays.copyOfRange(Arrays.java:3209) >> > at java.lang.String.(String.java:215) >> > at org.apache.lucene.index.TermBuffer.toTerm(TermBuffer.java:122) >> > at org.apache.lucene.index.SegmentTermEnum.term(SegmentTermEnum.java:184) >> > at org.apache.lucene.index.TermInfosReader.get(TermInfosReader.java:203) >> > at org.apache.lucene.index.TermInfosReader.get(TermInfosReader.java:172) >> > at org.apache.lucene.index.SegmentReader.docFreq(SegmentReader.java:509) >> > at >> org.apache.lucene.index.DirectoryReader.docFreq(DirectoryReader.java:719) >> > at >> org.apache.solr.search.SolrIndexReader.docFreq(SolrIndexReader.java:309) >> > at >> > >> org.apache.lucene.search.spell.HighFrequencyDictionary$HighFrequencyIterator.isFrequent(HighFrequencyDictionary.java:75) >> > at >> > >> org.apache.lucene.search.spell.HighFrequencyDictionary$HighFrequencyIterator.hasNext(HighFrequencyDictionary.java:125) >> > at >> org.apache.lucene.search.suggest.fst.FSTLookup.build(FSTLookup.java:157) >> > at org.apache.lucene.search.suggest.Lookup.build(Lookup.java:70) >> > at org.apache.solr.spelling.suggest.Suggester.build(Suggester.java:133) >> > at >> > >> org.apache.solr.handler.component.SpellCheckComponent.prepare(SpellCheckComponent.java:109) >> > at >> > >> org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:173) >> > at >> > >> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129) >> > at org.apache.solr.core.SolrCore.execute(SolrCore.java:1372) >> > at >> > >> org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:356) >> > at >> > >> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:252) >> > at >> > >> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(Serv
[Job] Sales Engineer at Lucid Imagination
Hi Solr Users, Lucid Imagination is looking for a sales engineer. If you know search, Solr and like working with customers, the sales engineer job may be of interest to you. I've included the job description below. If you are interested, please send your resume (off-list) to melissa.qu...@lucidimagination.com. The position is based out of our Redwood City, CA office. Cheers, Grant Technical Sales Professional/Sales Engineer Responsibilities include: support business development team and help conduct product demonstrations and assist prospective customers to understand the value of our product; help close sales; craft responses to RFP's and RFI's; build proofs of concepts; develop and conduct training on occasion. Qualifications: BS or higher in Engineering or Computer Science preferred. 3+ years of IT Consulting and/or Professional Services experience required. Experience working with Lucene and/or Solr required. Experience with enterprise search applications are a big plus; some Java development experience; some experience with common scripting languages; enterprise search, eCommerce, and/or Business Intelligence experience a plus.
Re: first time query is very slow
First query will cause the index caches to be warmed up and this is why the first query takes some time. You can prewarm the caches with a query (when solr starts up) of your choosing in the config file. Google around the SolrWiki on cache/index warming. hth > hi, > > I had an solr3.3 index of 200,000 documents, all text are stored and the > total index size is 27gb. > I used dismax query with over 10 qf and pf boosting field each, plus > sorting on score and other 2 fields. It took quite a few seconds(5-8) for > the first time query to return any result(no highlighting is invloved). > (even slower for phrase query) > > My question is, what is the bottle neck of the query speed? lucene query > part? Scoring? or fill document cache with document content? Can anyone > answer? > > Is there anyway of improving the first time query speed? > > thanks in advance, > shen >
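[A minimal sketch of the static warming mentioned above, as it would appear in the <query> section of solrconfig.xml; the query and facet field are placeholders and should be replaced by whatever your expensive first request actually looks like:

<listener event="firstSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <lst>
      <str name="q">*:*</str>
      <str name="facet">true</str>
      <str name="facet.field">content</str>
      <str name="facet.limit">100</str>
    </lst>
  </arr>
</listener>

A matching newSearcher listener keeps caches warm after each commit, at the cost of longer commit times.]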
PositionIncrementGap inside a field
Hi. At the moment I have a multivalued field where I would like to add information with gaps at the end of every line (value) of the multivalued field, and I would also like to add gaps in the middle of the lines. For instance: IBM Corporation some information *"here a gap"* more information IBM Limited more info *"here a gap"* and some more data Do you know how to add a *positionincrementgap* at the points marked *"here a gap"*? Thanks -- View this message in context: http://lucene.472066.n3.nabble.com/PositionIncrementGap-inside-a-field-tp3666243p3666243.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: first time query is very slow
Thanks darren, I understand it will take longer before warming up. What I am trying to find out is, in the situation where we have no cache, why it takes so long to complete the query, and what the bottleneck is. For example, if I remove all qf and pf fields, the query speed improves dramatically. Does that indicate a performance hole in the boosting part of the code? A predefined warming query improves the speed a lot if the query string or documents are cached, but it does not improve the speed of a new query; that will still spend the same time doing the scoring, boosting, filtering etc. And memory is also a problem for big indexes; in my case, I have in total over 100gb of indexes in one Solr installation. I can't even imagine how Solr could handle complex queries for 1tb of index data in my case. Customers who unluckily send an un-prewarmed query will suffer from bad response times, which is not too pleasant anyway. best regards, shen On Tue, Jan 17, 2012 at 3:18 PM, wrote: > First query will cause the index caches to be warmed up and this is why > the first query takes some time. > > You can prewarm the caches with a query (when solr starts up) of your > choosing in the config file. Google around the SolrWiki on cache/index > warming. > > hth > > > hi, > > > > I had an solr3.3 index of 200,000 documents, all text are stored and the > > total index size is 27gb. > > I used dismax query with over 10 qf and pf boosting field each, plus > > sorting on score and other 2 fields. It took quite a few seconds(5-8) for > > the first time query to return any result(no highlighting is invloved). > > (even slower for phrase query) > > > > My question is, what is the bottle neck of the query speed? lucene query > > part? Scoring? or fill document cache with document content? Can anyone > > answer? > > > > Is there anyway of improving the first time query speed? > > > > thanks in advance, > > shen > >
Re: SolrJ Embedded
Quantify slower, does it matter? At issue is that usually Solr spends far more time doing the search than transmitting the query and response over HTTP. Http is not really slow *as a protocol* in the first place. The usual place people have problems here is when there are a bunch of requests made over a network, a "chatty" connection. Especially if the other end of the connection is far away. But in Solr's case, there's one request and one response per search, so there's not much chat to worry about. But regardless of all that, never, never, never make your environment more complex than it needs to be before you *demonstrate* that you need to. The efficiency savings are often negligible and the cost of maintaining the complexity are often far more than estimated. Premature optimization and all that. I will allow that on some rare occasions you *can* know that you have to get complex from the start, but I can't tell you how many times I've been *sure* I knew where the bottleneck would beand been wrong. Measure first, fix second has become my mantra. Best Erick On Tue, Jan 17, 2012 at 3:49 AM, Maxim Veksler wrote: > On Tue, Jan 17, 2012 at 3:13 AM, Erick Erickson > wrote: > >> I don't see why not. I'm assuming a *nix system here so when Solr >> updated an index, any deleted files would hang around. >> >> But I have to ask why bother with the Embedded server in the >> first place? You already have a Solr instance up and running, >> why not just query that instead, perhaps using SolrJ? >> >> > Wouldn't querying the Solr server using the HTTP interface be slower? > > >> Best >> Erick >> >> On Mon, Jan 16, 2012 at 3:00 PM, wrote: >> > Hi, >> > >> > is it possible to use the same index in a solr webapp and additionally >> in a >> > EmbeddedSolrServer? The embbedded one would be read only. >> > >> > Thank you. >> > >>
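[To make the "just use SolrJ over HTTP" suggestion concrete, here is a minimal sketch against the SolrJ 3.x API; the URL, query string and field names are placeholders:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class SimpleQuery {
  public static void main(String[] args) throws Exception {
    // talk to the already-running Solr webapp instead of embedding a second reader
    SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
    SolrQuery query = new SolrQuery("title:lucene");  // placeholder query
    query.setRows(10);
    QueryResponse rsp = server.query(query);
    for (SolrDocument doc : rsp.getResults()) {
      System.out.println(doc.getFieldValue("id"));
    }
  }
}

The HTTP round trip here is a single request/response per search, which is usually negligible next to the search itself.]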
Re: PositionIncrementGap inside a field
This is just adding the field repeatedly, something like <field>IBM Corporation some information</field> <field>IBM limited more info</field> for a field whose schema declaration ends with multiValued="true"/> > > > > IBM Corporation some information *"here a gap"* more information > > > IBM Limited more info "here a gap" and some more data > > > > Do you know how to add a *positionincrementgap* here *"here a gap"* > Thanks > > -- > View this message in context: > http://lucene.472066.n3.nabble.com/PositionIncrementGap-inside-a-field-tp3666243p3666243.html > Sent from the Solr - User mailing list archive at Nabble.com.
Re: Solr Cloud Indexing
This only really makes sense if you don't have enough in-house resources to do your indexing locally, but it certainly is possible. Amazon's EC2 has been used, but really any hosting service should do. Best Erick On Tue, Jan 17, 2012 at 12:09 AM, Sujatha Arun wrote: > Would it make sense to Index on the cloud and periodically [2-4 times > /day] replicate the index at our server for searching .Which service to go > with for solr Cloud Indexing ? > > Any good and tried services? > > Regards > Sujatha
Re: Function in facet.query like min,max
have you seen the Stats component? See: http://wiki.apache.org/solr/StatsComponent Best Erick On Tue, Jan 17, 2012 at 8:34 AM, Eric Grobler wrote: > Hi Solr community, > > Is it possible to return the lowest, highest and average price of a search > result using facets? > I tried something like: facet.query={!max(price,0)} > Is it possible and what is the correct syntax? > > q=htc android > facet=true > facet.query=price:[* TO 10] > facet.query=price:[11 TO 100] > facet.query=price:[101 TO *] > ??? facet.query={!max(price,0)} > > > Thanks & Regards > Ericz
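[An example of what that looks like as a request, using the parameters documented on the StatsComponent wiki page and the price field from the original question:

q=htc android&stats=true&stats.field=price

The response then includes min, max, sum, count, mean and other statistics for price, computed over the documents that match the query and any filters, not over the whole index.]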
Re: How can I index this?
This sounds like, for the database source, that using SolrJ would be the way to go. Assuming you can access the database from Java this is pretty easy. As for the website, Nutch is certainly an option... But I'm a little puzzled. You mention a website, and sharepoint as your sources, then ask about accessing the DB. How are all these related? Best Erick On Tue, Jan 17, 2012 at 8:38 AM, ahammad wrote: > Hello, > > I am looking into indexing two data sources. One of those is a standard > website and the other is a Sharepoint site. The problem is that I have no > direct database access. Normally I would just use the DIH and get what I > need from the DB. I do have a java DAO (data access object) class that I am > using to directly to fetch information for a different purpose. > > In cases like this, what would be the best way to index the data? Should I > somehow integrate Nutch as the crawler? Should I write a custom DIH? Can I > use the DAO that I have in conjunction with the DIH? > > I am really looking for some recommendations here. I do have a few hacks > that can be done (copy the data in a DB and index with DIH), but I am > interested in the proper way. Any insight will be greatly appreciated. > > Cheers > > -- > View this message in context: > http://lucene.472066.n3.nabble.com/How-can-I-index-this-tp3666106p3666106.html > Sent from the Solr - User mailing list archive at Nabble.com.
Re: How can I index this?
Perhaps I was a little confusing... Normally when I have DB access, I do a regular indexing process using DIH. For these two sources, I do not have direct DB access. I can only view the two sources like any end-user would. I do have a java class that can get the information that I need. That class gets that information (through HTTP requests) and does not have DB access. That class is currently being used for other purposes but I can take it and use it for Solr as well. Does that make sense? Knowing all that, namely the fact that I cannot directly access the DB, and I can make HTTP requests to get the info, how can I index that info? Please let me know if this clarifies what I am trying to do. Regards -- View this message in context: http://lucene.472066.n3.nabble.com/How-can-I-index-this-tp3666106p3666590.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Function in facet.query like min,max
Yes, I have, but unfortunately it works on the whole index and not for a particular query. On Tue, Jan 17, 2012 at 3:37 PM, Erick Erickson wrote: > have you seen the Stats component? See: > http://wiki.apache.org/solr/StatsComponent > > Best > Erick > > On Tue, Jan 17, 2012 at 8:34 AM, Eric Grobler > wrote: > > Hi Solr community, > > > > Is it possible to return the lowest, highest and average price of a > search > > result using facets? > > I tried something like: facet.query={!max(price,0)} > > Is it possible and what is the correct syntax? > > > > q=htc android > > facet=true > > facet.query=price:[* TO 10] > > facet.query=price:[11 TO 100] > > facet.query=price:[101 TO *] > > ??? facet.query={!max(price,0)} > > > > > > Thanks & Regards > > Ericz >
Re: PositionIncrementGap inside a field
Hi Erick. Thanks for your answer. This is almost what I want to do, but my problem is that I want to be able to introduce two different sizes of gaps. Something like IBM Corporation some information *gap of 30* more information *gap of 100* IBM Limited more info *gap of 30* and some more data *gap of 100* Do you know how I can achieve that? -- View this message in context: http://lucene.472066.n3.nabble.com/PositionIncrementGap-inside-a-field-tp3666243p322.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: How can I index this?
Well, if you can make an HTTP request, you can parse the return and stuff it into a SolrInputDocument in SolrJ and then send it to Solr. At least that seems possible if I'm understanding your setup. There are other Solr clients that allow similar processes, but the Java version is the one I know best. Best Erick On Tue, Jan 17, 2012 at 11:10 AM, ahammad wrote: > Perhaps I was a little confusing... > > Normally when I have DB access, I do a regular indexing process using DIH. > For these two sources, I do not have direct DB access. I can only view the > two sources like any end-user would. > > I do have a java class that can get the information that I need. That class > gets that information (through HTTP requests) and does not have DB access. > That class is currently being used for other purposes but I can take it and > use it for Solr as well. Does that make sense? > > Knowing all that, namely the fact that I cannot directly access the DB, and > I can make HTTP requests to get the info, how can I index that info? > > Please let me know if this clarifies what I am trying to do. > > Regards > > -- > View this message in context: > http://lucene.472066.n3.nabble.com/How-can-I-index-this-tp3666106p3666590.html > Sent from the Solr - User mailing list archive at Nabble.com.
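[A rough sketch of that approach with SolrJ 3.x; the class name, URL and field names are placeholders, and the literal field values stand in for whatever your existing HTTP/DAO class returns:

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class Indexer {
  public static void main(String[] args) throws Exception {
    SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
    // build one SolrInputDocument per record fetched over HTTP
    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", "page-123");                          // value would come from your DAO
    doc.addField("title", "Example page title");             // value would come from your DAO
    doc.addField("body", "Body text extracted from the page");
    server.add(doc);
    server.commit();  // in a real loop, commit once after the batch rather than per document
  }
}]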
Re: Function in facet.query like min,max
I don't believe that's the case, have you tried it? From the page I referenced: "The stats component returns simple statistics for indexed numeric fields within the DocSet." And running a very quick test on the example data, I get different results when I used *:* and name:maxtor. That said, I'm not all that familiar with the stats component so I could well be wrong. Best Erick On Tue, Jan 17, 2012 at 11:16 AM, Eric Grobler wrote: > Yes, I have, but unfortunately it works on the whole index and not for a > particular query. > > > On Tue, Jan 17, 2012 at 3:37 PM, Erick Erickson > wrote: > >> have you seen the Stats component? See: >> http://wiki.apache.org/solr/StatsComponent >> >> Best >> Erick >> >> On Tue, Jan 17, 2012 at 8:34 AM, Eric Grobler >> wrote: >> > Hi Solr community, >> > >> > Is it possible to return the lowest, highest and average price of a >> search >> > result using facets? >> > I tried something like: facet.query={!max(price,0)} >> > Is it possible and what is the correct syntax? >> > >> > q=htc android >> > facet=true >> > facet.query=price:[* TO 10] >> > facet.query=price:[11 TO 100] >> > facet.query=price:[101 TO *] >> > ??? facet.query={!max(price,0)} >> > >> > >> > Thanks & Regards >> > Ericz >>
Re: PositionIncrementGap inside a field
Hmmm, no I don't know how to do that out of the box. Two things: 1> why do you want to do this? Perhaps if you describe the high-level problem you're trying to solve there might be other ways to approach it. 2> I *think* you could write your own Tokenizer that recognized the special tokens you'd have to put into your input stream and adjusted the token offsets accordingly, but I confess I haven't tried it myself... Best Erick On Tue, Jan 17, 2012 at 11:23 AM, marotosg wrote: > Hi Erick. Thanks for your asnwer. > > This is almost what i want to do but my problem is that i want to be able to > introduce two different sizes of gaps. > > Something like > > > IBM Corporation some information *gap of 30* more information *gap > of 100* > > > IBM Limited more info *gap of 30* and some more data *gap of 100* > > > > Do you know how can i achieve that? > > > > > > -- > View this message in context: > http://lucene.472066.n3.nabble.com/PositionIncrementGap-inside-a-field-tp3666243p322.html > Sent from the Solr - User mailing list archive at Nabble.com.
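[To make the second suggestion a bit more concrete, here is a rough, untested sketch of a TokenFilter (rather than a full Tokenizer) against the Lucene 3.x analysis API. MarkerGapFilter and the _gap30_/_gap100_ marker tokens are invented for illustration, and the markers would have to survive the tokenizer in front of this filter (e.g. WhitespaceTokenizer):

import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

public final class MarkerGapFilter extends TokenFilter {
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final PositionIncrementAttribute posIncAtt = addAttribute(PositionIncrementAttribute.class);
  private int pendingGap = 0;

  public MarkerGapFilter(TokenStream input) {
    super(input);
  }

  @Override
  public boolean incrementToken() throws IOException {
    while (input.incrementToken()) {
      String term = termAtt.toString();
      if ("_gap30_".equals(term)) {
        pendingGap += 30;    // swallow the marker, remember the gap
      } else if ("_gap100_".equals(term)) {
        pendingGap += 100;
      } else {
        if (pendingGap > 0) {
          // push the next real token further away by the accumulated gap
          posIncAtt.setPositionIncrement(posIncAtt.getPositionIncrement() + pendingGap);
          pendingGap = 0;
        }
        return true;
      }
    }
    return false;
  }

  @Override
  public void reset() throws IOException {
    super.reset();
    pendingGap = 0;
  }
}]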
Re: PositionIncrementGap inside a field
Hi Erik, what I'm trying to achieve here is to be able to run a query like this: "\""IBM Ltd"~15\" \""Dublin Ireland"~15\""~100 on a field where the gaps are like this: IBM Ireland Ltd *gap of 30* Dublin USA *gap of 300* IBM Ltd *gap of 30* Dublin Ireland *gap of 300* The first line contains all the words I'm looking for, but the word "Ireland" is not within 15 tokens of the word "Dublin", so the query will not match it; it will match the second line. The two lines stay separated by a 300-token gap, so that there is no risk of false positives from words coming from the second line. I don't know if I was clear enough. -- View this message in context: http://lucene.472066.n3.nabble.com/PositionIncrementGap-inside-a-field-tp3666243p3666765.html Sent from the Solr - User mailing list archive at Nabble.com.
How to return the distance geo distance on solr 3.5 with bbox filtering
Hello, I'm querying with bbox, which should be faster than geodist. My queries look like this: http://localhost:8983/solr/select?indent=true&fq={!bbox}&sfield=loc&pt=39.738548,-73.130322&d=100&sort=geodist()%20asc&q=trafficRouteId:235 The trouble is that with bbox Solr does not return the distance of each document. I couldn't get it to work even with the tips from http://wiki.apache.org/solr/SpatialSearch#Returning_the_distance Is there something I'm missing?
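[One workaround often suggested for 3.x, sketched here but not verified against 3.5: move the keyword part into an fq and make geodist() the main function query, so the distance comes back as the score:

http://localhost:8983/solr/select?q={!func}geodist()&sfield=loc&pt=39.738548,-73.130322&fq={!bbox}&d=100&fq=trafficRouteId:235&fl=*,score&sort=score%20asc

True pseudo-fields such as fl=_dist_:geodist() only arrive in later Solr versions.]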
Re: Sorting results within the fields
It's been almost a week and there has been no response to the question I asked. Does the question lack details, or is there no way to achieve this in Lucene? -- View this message in context: http://lucene.472066.n3.nabble.com/Sorting-results-within-the-fields-tp3656049p3666983.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Function in facet.query like min,max
Hi Erick Thanks for your feedback. I will try it tomorrow - if it works it will be perfect for my needs. Have a nice day Ericz On Tue, Jan 17, 2012 at 4:28 PM, Erick Erickson wrote: > I don't believe that's the case, have you tried it? From the page > I referenced: > > "The stats component returns simple statistics for indexed > numeric fields within the DocSet." > > And running a very quick test on the example data, I get different > results when I used *:* and name:maxtor. > > That said, I'm not all that familiar with the stats component so I > could well be wrong. > > Best > Erick > > On Tue, Jan 17, 2012 at 11:16 AM, Eric Grobler > wrote: > > Yes, I have, but unfortunately it works on the whole index and not for a > > particular query. > > > > > > On Tue, Jan 17, 2012 at 3:37 PM, Erick Erickson >wrote: > > > >> have you seen the Stats component? See: > >> http://wiki.apache.org/solr/StatsComponent > >> > >> Best > >> Erick > >> > >> On Tue, Jan 17, 2012 at 8:34 AM, Eric Grobler < > impalah...@googlemail.com> > >> wrote: > >> > Hi Solr community, > >> > > >> > Is it possible to return the lowest, highest and average price of a > >> search > >> > result using facets? > >> > I tried something like: facet.query={!max(price,0)} > >> > Is it possible and what is the correct syntax? > >> > > >> > q=htc android > >> > facet=true > >> > facet.query=price:[* TO 10] > >> > facet.query=price:[11 TO 100] > >> > facet.query=price:[101 TO *] > >> > ??? facet.query={!max(price,0)} > >> > > >> > > >> > Thanks & Regards > >> > Ericz > >> >
Re: really slow performance when trying to get facet.field
Ok, I have now changed the static warming in solrconfig.xml using first- and newSearcher listeners. "content" is the field I facet on. Commits now take longer, which is OK for me, but searches are much faster. I also reduced the number of documents to 15mio per shard, so the index is about 3.5G, which I hope also fits in memory. Both listeners run the same warming query: q=*:* with facet=true and facet.field=content (plus a couple of facet parameters set to 1).

On Tue, Jan 17, 2012 at 2:36 PM, Daniel Bruegge <daniel.brue...@googlemail.com> wrote:

> Evictions are 0 for all cache types.
>
> Your server max heap space with 12G is pretty huge. Which is good I think. The CPU on my server is an 8-core Intel i7 965.
>
> Commit frequency is low, because shards are added and old shards exist for historical reasons. Old shards will then be cleaned up after a couple of months.
>
> I will try to add a maximum of 15mio per shard and see what happens.
>
> The thing is that I will add more shards over time, so that I can handle maybe 500-800mio documents. Maybe more. It depends.
>
> On Tue, Jan 17, 2012 at 2:14 PM, Dmitry Kan wrote:
>
>> Hi Daniel,
>>
>> My index is 6,5G. I'm sure it can be bigger. The facet.limit we ask for is beyond 100 thousand. It is sub-second speed. I run it with -Xms1024m -Xmx12000m under tomcat, it currently takes 5,4G of RAM. The number of docs is over 6,5 million.
>>
>> Do you see any evictions in your caches? What kind of server is it, in terms of CPU and OS? How often do you commit to the index?
>>
>> Dmitry
>>
>> On Tue, Jan 17, 2012 at 3:01 PM, Daniel Bruegge <daniel.brue...@googlemail.com> wrote:
>>
>> > Hi Dmitry,
>> >
>> > I had everything on one Solr instance before, but this got too heavy and I had the same issue there: the 1st facet.query was really slow.
>> >
>> > When querying the facet:
>> > - facet.limit = 100
>> >
>> > Cache settings are like this:
>> >   size="16384" initialSize="4096" autowarmCount="4096"
>> >   size="512" initialSize="512" autowarmCount="0"
>> >   size="512" initialSize="512" autowarmCount="0"
>> >
>> > How big was your index? Did it fit into the RAM which you gave the Solr instance?
>> >
>> > Thanks
>> >
>> > On Tue, Jan 17, 2012 at 1:56 PM, Dmitry Kan wrote:
>> >
>> > > I had a similar problem for a similar task. And in my case merging the results from two shards turned out to be the culprit. If you can logically store your data in just one shard, your faceting should become faster. Size-wise it should not be a problem for SOLR.
>> > >
>> > > Also, you didn't say anything about the facet.limit value, cache parameters, or usage of filter queries. Some of these can be interconnected.
>> > >
>> > > Dmitry
>> > >
>> > > On Tue, Jan 17, 2012 at 2:49 PM, Daniel Bruegge <daniel.brue...@googlemail.com> wrote:
>> > >
>> > > > Hi,
>> > > >
>> > > > I have 2 Solr shards. One is filled with approx. 25mio documents (local index 6GB), the other with 10mio documents (2.7GB size). I am trying to create some kind of 'word cloud' to see the frequency of words in a *text_general* field. For this I am currently using a facet over this field, and I am also restricting the documents by using some other filters in the query.
>> > > >
>> > > > The performance is really bad for the first call and then pretty fast for the following calls.
>> > > >
>> > > > The maximum Java heap size is 3G for each shard. Both shards are running on the same physical server, which has 12G RAM.
>> > > >
>> > > > Question: Should I reduce the documents in one shard, so that the index is equal to or less than the Java heap size for this shard? Or is there another method to avoid these slow calls?
>> > > >
>> > > > Thank you
>> > > >
>> > > > Daniel
>> > >
>> > > --
>> > > Regards,
>> > >
>> > > Dmitry Kan
>>
>> --
>> Regards,
>>
>> Dmitry Kan
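In case it helps anyone else, this is a minimal sketch of the kind of firstSearcher/newSearcher listeners I mean in solrconfig.xml. The facet parameter values below are illustrative placeholders, not necessarily the exact ones from my config:

  <!-- Run a facet query on "content" whenever a new searcher is opened,
       so the facet structures are built before real traffic hits it. -->
  <listener event="firstSearcher" class="solr.QuerySenderListener">
    <arr name="queries">
      <lst>
        <str name="q">*:*</str>
        <str name="facet">true</str>
        <str name="facet.field">content</str>
        <str name="facet.mincount">1</str>
        <str name="facet.limit">1</str>
      </lst>
    </arr>
  </listener>
  <listener event="newSearcher" class="solr.QuerySenderListener">
    <arr name="queries">
      <lst>
        <str name="q">*:*</str>
        <str name="facet">true</str>
        <str name="facet.field">content</str>
        <str name="facet.mincount">1</str>
        <str name="facet.limit">1</str>
      </lst>
    </arr>
  </listener>

This is also why the commits take longer now: every commit triggers the newSearcher warming query before the new searcher is registered.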
Re: Sorting results within the fields
Hi, Complex problems like this are much better explained with concrete examples than with generalized text. Please create a real example with real documents and their content, along with real queries.

You don't explain what "the score value which is generate by my application" is - which application is that, is the score generated statically before indexing, or should the scoring call out to some external application to ask for the score, and what are the input and output criteria for such custom scoring?

> So, that the final results of the query
> will look like
>
> (D1, D2) (D3,D4) (D5,D6,D7).

Meaning that D1,D2 come first because they match field1, which you gave the highest boost? field1^8? Should D1,D2 always come before the others regardless of the "custom scoring from your application"? Will the order between D1,D2 be influenced by Solr scoring at all, or only by the "external application"?

Hope you see that being concrete is necessary for such questions.

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Solr Training - www.solrtraining.com

On 17. jan. 2012, at 19:38, aronitin wrote:

> It's been almost a week and there is no response to the question that I asked.
>
> Does the question lack details, or is there no way to achieve the same in Lucene?
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Sorting-results-within-the-fields-tp3656049p3666983.html
> Sent from the Solr - User mailing list archive at Nabble.com.
Re: first time query is very slow
On Tue, Jan 17, 2012 at 9:39 AM, gabriel shen wrote:
> For those customers who unluckily send an un-prewarmed query, they will suffer
> from a bad response time, which is not too pleasant anyway.

The "warming caches" part isn't about unique queries, but more about caches used for sorting and faceting (and those are reused across many different queries).

Can you give an example of the complete request you were sending that takes a long time?

-Yonik
http://www.lucidimagination.com
Facet auto-suggest
I don't even know what to call this feature. Here's a website that shows the problem: http://pulse.audiusanews.com/pulse/index.php Notice that you can end up in a situation where there are no results. For example, in order, press: People, Performance, Technology, Photos. The client wants it so that when you click a button, it disables buttons that would lead to a dead end. In other words, after clicking Technology, the Photos button would be disabled. Can Solr help with this? -jsd-
Re: Facet auto-suggest
Hi, Sure, you can use filters and facets for this. Start a query with ...&facet.field=source&facet.field=topics&facet.field=type

When you click a "button", you set the corresponding filter (fq=source:people), and the new query will return the same facets with new counts. In the Audi example, you would disable any button whose facet value comes back with 0 hits.

For a more in-depth treatment, see http://java.dzone.com/news/complex-solr-faceting

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Solr Training - www.solrtraining.com

On 17. jan. 2012, at 23:38, Jon Drukman wrote:

> I don't even know what to call this feature. Here's a website that shows
> the problem:
>
> http://pulse.audiusanews.com/pulse/index.php
>
> Notice that you can end up in a situation where there are no results.
> For example, in order, press: People, Performance, Technology, Photos. The client
> wants it so that when you click a button, it disables buttons that would
> lead to a dead end. In other words, after clicking Technology, the Photos
> button would be disabled.
>
> Can Solr help with this?
>
> -jsd-
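To make this concrete, here is a sketch of the two requests. The field and value names (source, topics, type, people) are only assumptions taken from the example above, not from any real schema:

  Initial request, no filters yet (rows=0 because only the counts are needed for the buttons):
  /select?q=*:*&rows=0&facet=true&facet.mincount=0&facet.field=source&facet.field=topics&facet.field=type

  After the user clicks "People", repeat it with the corresponding filter:
  /select?q=*:*&rows=0&facet=true&facet.mincount=0&facet.field=source&facet.field=topics&facet.field=type&fq=source:people

Every facet value that now comes back with a count of 0 corresponds to a button you can disable, because adding that filter would lead to an empty result set.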
Re: Solr Cloud Indexing
Cloud upload bandwidth is free, but download bandwidth costs money. If you upload a lot of data but do not query it often, Amazon can make sense. You can also rent much cheaper hardware from other hosting services where you pay by the month or even by the year. If you know you have a cap on how much resource you will need at once, the cheaper sites make more sense.

On Tue, Jan 17, 2012 at 7:36 AM, Erick Erickson wrote:
> This only really makes sense if you don't have enough in-house resources
> to do your indexing locally, but it certainly is possible.
>
> Amazon's EC2 has been used, but really any hosting service should do.
>
> Best
> Erick
>
> On Tue, Jan 17, 2012 at 12:09 AM, Sujatha Arun wrote:
>> Would it make sense to index on the cloud and periodically [2-4 times/day]
>> replicate the index to our server for searching? Which service should we go
>> with for Solr cloud indexing?
>>
>> Any good and tried services?
>>
>> Regards
>> Sujatha

--
Lance Norskog
goks...@gmail.com
Re: Trying to understand SOLR memory requirements
Which version of Solr do you use? 3.1 and 3.2 had a memory leak bug in spellchecking. This was fixed in 3.3. On Tue, Jan 17, 2012 at 5:59 AM, Robert Muir wrote: > I committed it already: so you can try out branch_3x if you want. > > you can either wait for a nightly build or compile from svn > (http://svn.apache.org/repos/asf/lucene/dev/branches/branch_3x/). > > On Tue, Jan 17, 2012 at 8:35 AM, Dave wrote: >> Thank you Robert, I'd appreciate that. Any idea how long it will take to >> get a fix? Would I be better switching to trunk? Is trunk stable enough for >> someone who's very much a SOLR novice? >> >> Thanks, >> Dave >> >> On Mon, Jan 16, 2012 at 10:08 PM, Robert Muir wrote: >> >>> looks like https://issues.apache.org/jira/browse/SOLR-2888. >>> >>> Previously, FST would need to hold all the terms in RAM during >>> construction, but with the patch it uses offline sorts/temporary >>> files. >>> I'll reopen the issue to backport this to the 3.x branch. >>> >>> >>> On Mon, Jan 16, 2012 at 8:31 PM, Dave wrote: >>> > I'm trying to figure out what my memory needs are for a rather large >>> > dataset. I'm trying to build an auto-complete system for every >>> > city/state/country in the world. I've got a geographic database, and have >>> > setup the DIH to pull the proper data in. There are 2,784,937 documents >>> > which I've formatted into JSON-like output, so there's a bit of data >>> > associated with each one. Here is an example record: >>> > >>> > Brooklyn, New York, United States?{ |id|: |2620829|, >>> > |timezone|:|America/New_York|,|type|: |3|, |country|: { |id| : |229| }, >>> > |region|: { |id| : |3608| }, |city|: { |id|: |2616971|, |plainname|: >>> > |Brooklyn|, |name|: |Brooklyn, New York, United States| }, |hint|: >>> > |2300664|, |label|: |Brooklyn, New York, United States|, |value|: >>> > |Brooklyn, New York, United States|, |title|: |Brooklyn, New York, United >>> > States| } >>> > >>> > I've got the spellchecker / suggester module setup, and I can confirm >>> that >>> > everything works properly with a smaller dataset (i.e. just a couple of >>> > countries worth of cities/states). However I'm running into a big problem >>> > when I try to index the entire dataset. The >>> dataimport?command=full-import >>> > works and the system comes to an idle state. 
It generates the following >>> > data/index/ directory (I'm including it in case it gives any indication >>> on >>> > memory requirements): >>> > >>> > -rw-rw 1 root root 2.2G Jan 17 00:13 _2w.fdt >>> > -rw-rw 1 root root 22M Jan 17 00:13 _2w.fdx >>> > -rw-rw 1 root root 131 Jan 17 00:13 _2w.fnm >>> > -rw-rw 1 root root 134M Jan 17 00:13 _2w.frq >>> > -rw-rw 1 root root 16M Jan 17 00:13 _2w.nrm >>> > -rw-rw 1 root root 130M Jan 17 00:13 _2w.prx >>> > -rw-rw 1 root root 9.2M Jan 17 00:13 _2w.tii >>> > -rw-rw 1 root root 1.1G Jan 17 00:13 _2w.tis >>> > -rw-rw 1 root root 20 Jan 17 00:13 segments.gen >>> > -rw-rw 1 root root 291 Jan 17 00:13 segments_2 >>> > >>> > Next I try to run the suggest?spellcheck.build=true command, and I get >>> the >>> > following error: >>> > >>> > Jan 16, 2012 4:01:47 PM org.apache.solr.spelling.suggest.Suggester build >>> > INFO: build() >>> > Jan 16, 2012 4:03:27 PM org.apache.solr.common.SolrException log >>> > SEVERE: java.lang.OutOfMemoryError: GC overhead limit exceeded >>> > at java.util.Arrays.copyOfRange(Arrays.java:3209) >>> > at java.lang.String.(String.java:215) >>> > at org.apache.lucene.index.TermBuffer.toTerm(TermBuffer.java:122) >>> > at org.apache.lucene.index.SegmentTermEnum.term(SegmentTermEnum.java:184) >>> > at org.apache.lucene.index.TermInfosReader.get(TermInfosReader.java:203) >>> > at org.apache.lucene.index.TermInfosReader.get(TermInfosReader.java:172) >>> > at org.apache.lucene.index.SegmentReader.docFreq(SegmentReader.java:509) >>> > at >>> org.apache.lucene.index.DirectoryReader.docFreq(DirectoryReader.java:719) >>> > at >>> org.apache.solr.search.SolrIndexReader.docFreq(SolrIndexReader.java:309) >>> > at >>> > >>> org.apache.lucene.search.spell.HighFrequencyDictionary$HighFrequencyIterator.isFrequent(HighFrequencyDictionary.java:75) >>> > at >>> > >>> org.apache.lucene.search.spell.HighFrequencyDictionary$HighFrequencyIterator.hasNext(HighFrequencyDictionary.java:125) >>> > at >>> org.apache.lucene.search.suggest.fst.FSTLookup.build(FSTLookup.java:157) >>> > at org.apache.lucene.search.suggest.Lookup.build(Lookup.java:70) >>> > at org.apache.solr.spelling.suggest.Suggester.build(Suggester.java:133) >>> > at >>> > >>> org.apache.solr.handler.component.SpellCheckComponent.prepare(SpellCheckComponent.java:109) >>> > at >>> > >>> org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:173) >>> > at >>> > >>> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129) >>> > at org.apache.solr.core.SolrCore.execute(SolrCore.java:1372) >>> > at
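For reference, an FST-based suggester setup of that era typically looks something like this in solrconfig.xml. This is only a sketch: the field name and threshold are placeholders, and the exact lookupImpl class name varies between 3.x releases. It does show where the HighFrequencyDictionary in the stack trace comes into play, though: it iterates every indexed term of the configured field and keeps only the terms whose document frequency is above the threshold.

  <searchComponent name="suggest" class="solr.SpellCheckComponent">
    <lst name="spellchecker">
      <str name="name">suggest</str>
      <str name="classname">org.apache.solr.spelling.suggest.Suggester</str>
      <!-- FST-based lookup; before SOLR-2888 the build held all terms in RAM -->
      <str name="lookupImpl">org.apache.solr.spelling.suggest.fst.FSTLookup</str>
      <!-- placeholder: whatever field the auto-complete terms come from -->
      <str name="field">label</str>
      <!-- only terms appearing in at least 0.5% of documents are kept -->
      <float name="threshold">0.005</float>
      <str name="buildOnCommit">true</str>
    </lst>
  </searchComponent>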
Re: Sorting results within the fields
Hi Jan,

Thanks for the reply. Here is a concrete explanation of the problem that I'm trying to solve.

*SOLR Schema*

There are 3 dynamic fields and 4 searchable fields:

1. *Description*: Data in this field is whitespace-tokenized, stemmed, and lowercased.
2. *Description*: Data in this field is only lowercased and the Keyword tokenizer is applied, so the data is not changed when stored in this field.
3. *Description*: Head terms are encoded in the format HEAD$Value.
4. *Description*: Tail terms are encoded in the format TAIL$Value.

The data that we store in these fields is cleaned-up data from large text: generally 1-word, 2-word or 3-word values, e.g.

D1 -> UI, UI Design, UI Programming, UI Design Document
D2 -> UI Mockup, UI development
D3 -> UI

When somebody queries *UI*, the internal query that is generated is

concepts_headtermencoded_concept:HEAD$ui^100.0 concepts:ui^50.0 concepts_tailtermencoded_concept:TAIL$ui^10.0

so that a head-term matched document is ranked higher than a partial match. The current implementation without our score ranks the documents like D1 > D2 > D3 (because Lucene uses TF/IDF while scoring the documents).

Now, we have created an *application specific score* for each concept and want to sort the results based on that score while preserving the boosts on the fields defined in the query. E.g.

D1 -> UI=90, UI Design=45, UI Programming=40, UI Design Document=85, Project Wolverine=40
D2 -> UI Mockup=55, UI Development=74, Project Management=39
D3 -> UI=95, Project Wolverine=35
D4 -> UI Dev=75, Video Project=42

1. If only an exact match was found, then sorting will happen based on the score value we have defined for that term.
2. If both exact and partial matches are there, then the exact-match documents should come on top, and the partially matched documents should be sorted among themselves based on the score.

*Examples*

*Search*: UI
*Desired results*: D3 > D1 > D4 > D2, where (D3, D1) contain an exact match and hence are scored among themselves, and D4, D2 both have a head match but the score of the head match in D4 > D2.

*Search*: Project
*Desired results*: D1 > D2 > D3 > D4, where D1, D2 and D3 are head-term matches sorted among themselves (D1, D2, D3) based on score, and D4 is a tail-term match (even though it has a better score, the tail-term boost is 1/10th of the head-term boost).

So, in all, we want to override Lucene's TF/IDF scoring and do the scoring based on our concept-specific score, while still giving higher preference to exact matches and then to partial matches.

Hope I explained the problem. Let me know if you have any specific questions.

Thanks
Nitin

--
View this message in context: http://lucene.472066.n3.nabble.com/Sorting-results-within-the-fields-tp3656049p3668047.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: Highlighting "text" field when query is for "string" field
Just to be clear, I do a phrase query on the string field, like q=keyword_text:"smooth skin", and I am expecting highlighting to be done on the excerpt field. What I see in the highlighting section is only the unique ids of the documents. Where are the excerpts with the highlighted text? Any idea?

Thanks

--
View this message in context: http://lucene.472066.n3.nabble.com/Highlighting-text-field-when-query-is-for-string-field-tp3475334p3668074.html
Sent from the Solr - User mailing list archive at Nabble.com.
Question on Reverse Indexing
Hi,

For reverse indexing we are using the ReversedWildcardFilterFactory on Solr 4.0:

<filter class="solr.ReversedWildcardFilterFactory" maxPosAsterisk="3" maxPosQuestion="2" maxFractionAsterisk="0.33"/>

ReversedWildcardFilterFactory was helping us to perform leading wildcard searches like *lock.

But we observed that search performance was not good after introducing the ReversedWildcardFilterFactory filter. Hence we disabled the ReversedWildcardFilterFactory filter and re-created the indexes, and this time we found Solr query performance to be faster.

But surprisingly, leading wildcard searches still work in spite of the ReversedWildcardFilterFactory filter being disabled.

This behavior is puzzling everyone, and we want to understand how reverse indexing works here. Can anyone shed some light on this Solr behavior?

-Shyam
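PS: for context, the filter sits in the index-time analyzer of the field type, roughly like this. The field type name, tokenizer and the other filter here are placeholders, not our exact chain:

  <fieldType name="text_rev" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <!-- indexes an additional reversed form of each token so that
           leading-wildcard queries can be rewritten as prefix queries -->
      <filter class="solr.ReversedWildcardFilterFactory" withOriginal="true"
              maxPosAsterisk="3" maxPosQuestion="2" maxFractionAsterisk="0.33"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>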
Re: Question on Reverse Indexing
Using ReversedWildcardFilterFactory will double the size of your dictionary (more or less), maybe the drop in performance that you are seeing is a result of that?

François

On Jan 17, 2012, at 9:01 PM, Shyam Bhaskaran wrote:

> Hi,
>
> For reverse indexing we are using the ReversedWildcardFilterFactory on Solr
> 4.0
>
> <filter class="solr.ReversedWildcardFilterFactory"
> maxPosAsterisk="3" maxPosQuestion="2" maxFractionAsterisk="0.33"/>
>
> ReversedWildcardFilterFactory was helping us to perform leading wild card
> searches like *lock.
>
> But it was observed that the performance of the searches was not good after
> introducing ReversedWildcardFilterFactory filter.
>
> Hence we disabled ReversedWildcardFilterFactory filter and re-created the
> indexes and this time we found the performance of Solr query to be faster.
>
> But surprisingly it is observed that leading wild card searches were still
> working in spite of disabling the ReversedWildcardFilterFactory filter.
>
> This behavior is puzzling everyone and wanted to know how this behavior of
> reverse indexing works?
>
> Can anyone share with me on this Solr behavior.
>
> -Shyam
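To make the doubling concrete: with withOriginal="true" (the default), a token like "unlock" is indexed twice, once as-is and once reversed with a marker character prepended, roughly:

  unlock
  \u0001kcolnu

A leading-wildcard query such as *lock can then be rewritten by the query parser into a fast prefix query on the reversed form (\u0001kcol*), but every field that uses the filter carries both variants of every token, hence the larger dictionary.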
RE: Question on Reverse Indexing
Hi Francois,

I understand that disabling ReversedWildcardFilterFactory has improved the performance. But I am puzzled about how a leading wildcard search like *lock is still working even though I have now disabled the ReversedWildcardFilterFactory and the indexes have been re-created without the ReversedWildcardFilter.

How does reverse indexing work even after disabling ReversedWildcardFilterFactory? Can anyone explain to me how this feature is working?

-Shyam

-----Original Message-----
From: François Schiettecatte [mailto:fschietteca...@gmail.com]
Sent: Wednesday, January 18, 2012 7:49 AM
To: solr-user@lucene.apache.org
Subject: Re: Question on Reverse Indexing

Using ReversedWildcardFilterFactory will double the size of your dictionary (more or less), maybe the drop in performance that you are seeing is a result of that?

François

On Jan 17, 2012, at 9:01 PM, Shyam Bhaskaran wrote:

> Hi,
>
> For reverse indexing we are using the ReversedWildcardFilterFactory on Solr
> 4.0
>
> <filter class="solr.ReversedWildcardFilterFactory"
> maxPosAsterisk="3" maxPosQuestion="2" maxFractionAsterisk="0.33"/>
>
> ReversedWildcardFilterFactory was helping us to perform leading wild card
> searches like *lock.
>
> But it was observed that the performance of the searches was not good after
> introducing ReversedWildcardFilterFactory filter.
>
> Hence we disabled ReversedWildcardFilterFactory filter and re-created the
> indexes and this time we found the performance of Solr query to be faster.
>
> But surprisingly it is observed that leading wild card searches were still
> working in spite of disabling the ReversedWildcardFilterFactory filter.
>
> This behavior is puzzling everyone and wanted to know how this behavior of
> reverse indexing works?
>
> Can anyone share with me on this Solr behavior.
>
> -Shyam
Re: DataImportHandler in Solr 4.0
I'm not a Java pro, and the documentation hasn't been updated to include these instructions (at least not that I could find). What do I need to do to perform the steps that Alexandre is talking about?

--
View this message in context: http://lucene.472066.n3.nabble.com/DataImportHandler-in-Solr-4-0-tp2563053p3667942.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: Can Apache Solr Handle TeraByte Large Data
Could indexing English Wikipedia dump over and over get you there? Otis Performance Monitoring SaaS for Solr - http://sematext.com/spm/solr-performance-monitoring/index.html > > From: Memory Makers >To: solr-user@lucene.apache.org >Sent: Tuesday, January 17, 2012 12:15 AM >Subject: Re: Can Apache Solr Handle TeraByte Large Data > >I've been toying with the idea of setting up an experiment to index a large >document set 1+ TB -- any thoughts on an open data set that one could use >for this purpose? > >Thanks. > >On Mon, Jan 16, 2012 at 5:00 PM, Burton-West, Tom wrote: > >> Hello , >> >> Searching real-time sounds difficult with that amount of data. With large >> documents, 3 million documents, and 5TB of data the index will be very >> large. With indexes that large your performance will probably be I/O bound. >> >> Do you plan on allowing phrase or proximity searches? If so, your >> performance will be even more I/O bound as documents that large will have >> huge positions indexes that will need to be read into memory for processing >> phrase queries. To reduce I/O you need as much of the index in memory >> (Lucene/Solr caches, and operating system disk cache). Every commit >> invalidates the Solr/Lucene caches (unless the newer nrt code has solved >> this for Solr). >> >> If you index and serve on the same server, you are also going to get >> terrible response time whenever your commits trigger a large merge. >> >> If you need to service 10-100 qps or more, you may need to look at putting >> your index on SSDs or spreading it over enough machines so it can stay in >> memory. >> >> What kind of response times are you looking for and what query rate? >> >> We have somewhat smaller documents. We have 10 million documents and about >> 6-8TB of data in HathiTrust and have spread the index over 12 shards on 4 >> machines (i.e. 3 shards per machine). We get an average of around >> 200-300ms response time but our 95th percentile times are about 800ms and >> 99th percentile are around 2 seconds. This is with an average load of less >> than 1 query/second. >> >> As Otis suggested, you may want to implement a strategy that allows users >> to search within the large documents by breaking the documents up into >> smaller units. What we do is have two Solr indexes. The first indexes >> complete documents. When the user clicks on a result, we index the entire >> document on a page level in a small Solr index on-the-fly. That way they >> can search within the document and get page level results. >> >> More details about our setup: >> http://www.hathitrust.org/blogs/large-scale-search >> >> Tom Burton-West >> University of Michigan Library >> www.hathitrust.org >> -Original Message- >> >> > > >
Re: Solr - Tika(?) memory leak
You'll need to reindex everything indeed.

Otis

Performance Monitoring SaaS for Solr - http://sematext.com/spm/solr-performance-monitoring/index.html

> From: Wayne W
> To: solr-user@lucene.apache.org
> Sent: Tuesday, January 17, 2012 12:36 AM
> Subject: Re: Solr - Tika(?) memory leak
>
> Thanks for the links - I've put a posting on the Tika ML.
> I've just checked and we are using tika-0.2.jar - does anyone know which
> version I can use with solr 1.3?
>
> Is there any info on upgrading from this far back to the latest
> version - is it even possible? Or would I need to re-index everything?
>
> On Tue, Jan 17, 2012 at 5:39 AM, P Williams wrote:
>> Hi,
>>
>> I'm not sure which version of Solr/Tika you're using but I had a similar
>> experience which turned out to be the result of a design change to PDFBox.
>>
>> https://issues.apache.org/jira/browse/SOLR-2886
>>
>> Tricia
>>
>> On Sat, Jan 14, 2012 at 12:53 AM, Wayne W wrote:
>>
>>> Hi,
>>>
>>> we're using Solr running on tomcat with 1GB in production, and of late
>>> we've been having a huge number of OutOfMemory issues. It seems from
>>> what I can tell this is coming from the tika extraction of the
>>> content. I've processed the java dump file using a memory analyzer and
>>> it's pretty clear at least which class is involved. It seems like a leak to
>>> me, as we don't parse any files larger than 20M, and these objects are
>>> taking up ~700M.
>>>
>>> I've attached 2 screen shots from the tool (not sure if you receive
>>> attachments).
>>>
>>> But to summarize (class, number of objects, used heap size, retained heap size):
>>>
>>> org.apache.xmlbeans.impl.store.Xob$ElementXObj   838,993   80,533,728   604,606,040
>>> org.apache.poi.openxml4j.opc.ZipPackage          2         112          87,009,848
>>> char[]                                           587       32,216,960   38,216,950
>>>
>>> We're really desperate to find a solution to this - any ideas or help
>>> is greatly appreciated.
>>> Wayne