Re: Restricting HTML search?

2010-08-24 Thread Paul Libbrecht
Wouldn't the usage of NekoHTML (as an XML parser) and XPath be safer? I guess it all depends on the "quality" of the source document. paul On 25 Aug 2010 at 02:09, Lance Norskog wrote: I would do this with regular expressions. There is a Pattern Analyzer and a Tokenizer which do regul

How to delete documents from SOLR index using DIH

2010-08-24 Thread Pawan Darira
Hi I am using data import handler to build index. How can i delete documents from my index using DIH. -- Thanks, Pawan Darira

Re: Solr jam after all my jvm thread pool hang in blocked state

2010-08-24 Thread Erick Erickson
Has anyone changed your tomcat settings? You're logging information at the INFO level, I wonder if you used to be logging at WARN. I'd also take a look at your log directory to see if you've got a bazillion log files, and/or how big your log file is, has it been accumulating log messages forever?

Re: Query speed decreased dramatically, not sure why though?

2010-08-24 Thread Erick Erickson
If you attach &debugQuery=on to your query, you'll often get pointers as to what's actually happening under the covers... Best Erick On Tue, Aug 24, 2010 at 2:26 AM, C0re wrote: > > We have a query which takes the form of > > ".../select?q=*&sort=evalDate+desc,score+desc&start=0&rows=10" > > Th

Re: Why it's boosted up?

2010-08-24 Thread 朱炎詹
Thanks! That makes sense :) - Original Message - From: "Ahmet Arslan" To: Sent: Tuesday, August 24, 2010 4:30 PM Subject: Re: Why it's boosted up? Then why are short fields boosted up? In other words, longer documents are punished, because they possibly contain many terms/words. If

Re: Removing expired documents from Solr index

2010-08-24 Thread Erick Erickson
In addition to deletebyquery, you might want to optimize your index periodically to reclaim some space Best Erick On Tue, Aug 24, 2010 at 2:53 AM, Andreas Jung wrote: > -BEGIN PGP SIGNED MESSAGE- > Hash: SHA1 > > Andy wrote: > > My documents have an "expiration_datetime" field that
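A minimal sketch of the delete-by-query idea mentioned above, in Python. It only builds the XML update messages; the field name comes from the original question, while the helper name and the idea of posting the messages to an update URL of your own Solr instance are assumptions for illustration.

```python
from xml.sax.saxutils import escape


def build_delete_by_query(field: str, upper_bound: str = "NOW") -> str:
    """Build an XML update message that deletes every document whose
    `field` value falls in the range [* TO upper_bound]."""
    query = f"{field}:[* TO {upper_bound}]"
    # escape() protects &, < and > in case the query ever contains them.
    return f"<delete><query>{escape(query)}</query></delete>"


# Messages to send, in order, to the update handler of your Solr instance.
delete_body = build_delete_by_query("expiration_datetime")
commit_body = "<commit/>"      # makes the deletes visible to searchers
optimize_body = "<optimize/>"  # optional, reclaims space from deleted docs

print(delete_body)
```

Running the delete and commit from a periodic job, with an occasional optimize, would match the advice in this thread; how often to do either depends on index size and update rate.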

Re: Why it's boosted up?

2010-08-24 Thread 朱炎詹
Thanks for your clear explanation! I got it :) - Original Message - From: "MitchK" To: Sent: Tuesday, August 24, 2010 3:37 PM Subject: Re: Why it's boosted up? Hi Scott, (so shorter fields are automatically boosted up). " The theory behind that is the following (in easy word

Re: Tokenising on Each Letter

2010-08-24 Thread Erick Erickson
Another thing you might try is setting preserveOriginal=1 (just saw this in another thread). Which one is "better" usually depends on your problem space... Best Erick On Mon, Aug 23, 2010 at 9:16 AM, Scottie wrote: > > Nikolas, thanks a lot for that, I've just given it a quick test and it > definit

Re: Problem in setting the request writer in SolrJ (wiki page wrong?)

2010-08-24 Thread Lance Norskog
It uses the XmlUpdateRequestHandler internally; it does not really send XML. This is understandably confusing. Embedded Solr calls all of the Solr classes directly; it does not use HTTP or serialized data. On Tue, Aug 24, 2010 at 6:28 AM, Constantijn Visinescu wrote: > If my requests aren't seria

Re: Having problems with the java api in 1.4.0

2010-08-24 Thread Lance Norskog
We have found that 200-250mb per Lucene index is where efficiency drops off and Lucene gets slow. You will have to use a sharding approach: many small indexes, and all have different sets of documents. Solr has a tool for doing queries across many shards, called Distributed Search. http://wiki.apa
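A rough Python sketch of the sharding approach described above: each document is routed to exactly one small index, and queries list all shards. Routing by a CRC32 hash of the document id is an assumption of this sketch (any stable partitioning would do), and the host names are hypothetical.

```python
import zlib

# Hypothetical shard locations; replace with your own hosts.
SHARDS = ["host1:8983/solr", "host2:8983/solr", "host3:8983/solr"]


def shard_for(doc_id: str, num_shards: int) -> int:
    """Pick a shard index for a document; stable for a given id."""
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards


def shards_param(shard_urls) -> str:
    """Build the comma-separated value for the `shards` query parameter
    used by Distributed Search."""
    return ",".join(shard_urls)


target = SHARDS[shard_for("doc-42", len(SHARDS))]
print(target)
print(shards_param(SHARDS))
```

The index-time routing keeps the shards disjoint, which is what "all have different sets of documents" requires; the query-time parameter then fans the search out across them.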

Re: Restricting HTML search?

2010-08-24 Thread Lance Norskog
I would do this with regular expressions. There is a Pattern Analyzer and a Tokenizer which do regular expression-based text chopping. (I'm not sure how to make them do what you want). A more precise tool is the RegexTransformer in the DataImportHandler. Lance On Tue, Aug 24, 2010 at 7:08 AM, And

Re: 'Error 404: missing core name in path ' in adminconsole

2010-08-24 Thread Chris Hostetter
: we use the EmbeddedSolrServer in our JEE application. It works very : well. Now I wanted to create the admin JSPs. For that I have copied : the JSPs from the webroot of the Solr example. When I try to access : ...admin/index.jsp , I get 'Error 404: missing core name in path' just copying JSPs isn't eno

Re: Solr creates whitespace in dismax query

2010-08-24 Thread MitchK
Johann, try to remove the wordDelimiterFilter from the query-analyzer of your fieldType. If your index-analyzer-wordDelimiterFilter is well configured, it will find everything you want. Does this solve the problem? Kind regards, - Mitch -- View this message in context: http://lucene.472066.n

Solr creates whitespace in dismax query

2010-08-24 Thread Johann Höchtl
I have a fieldtype with the following definition: I have a value "blume2000.de" in a field with the fieldtype above. If I issue a query with select?q

Re: Having problems with the java api in 1.4.0

2010-08-24 Thread Liz Sommers
We will be ingesting gigabytes of new data per day, but have a lot of legacy data (petabytes) that will also need to be indexed. We will probably index many fields per record (ave. 50/record) and hope to add facets in the near future. If this solution gives us the speed and facet capabilities we

Re: Having problems with the java api in 1.4.0

2010-08-24 Thread Glen Newton
Liz, I've built terabyte (1-2 TB) test Lucene indexes, but have not reached the petabyte level, so I am not sure. Certainly there is overhead in using the http and xml marshaling/de-marshaling, which may or may not be a critical factor for you. Could you give more information with respect to

Re: Is there a SubstringTransformer?

2010-08-24 Thread Gonzalo Payo Navarro
Oh my God! That's awesome! Thank you guys 2010/8/24 Ahmet Arslan : > >> I need to get the first 100 chars of a string-type field, >> but I am not >> able to find something like a SubstringTransformer, >> therefore I am >> using the RegexTransformer, but I suspect that it eats a >> lot of time >>

Re: Having problems with the java api in 1.4.0

2010-08-24 Thread Liz Sommers
We do have synonyms.txt in our config directory. The config directory is a copy of the example directory. We will probably also run into this problem with stopwords.xml. We don't understand how to make it look in the correct directory. We thought it got the correct directory out of the solrconf

Re: Having problems with the java api in 1.4.0

2010-08-24 Thread Liz Sommers
I was worried that it wouldn't scale. We are going to be indexing petabytes of data. Does the httpserver solution scale? Thanks Liz Sommers lizswo...@gmail.com On Tue, Aug 24, 2010 at 12:23 PM, Thomas Joiner wrote: > Is there any reason you aren't using http://wiki.apache.org/solr/Solrj to >

Re: Having problems with the java api in 1.4.0

2010-08-24 Thread Rafał Kuć
Hello! The exception thrown by Solr says that you do not have a synonyms.txt file either in the classpath or in the Solr core config directory. Check your schema.xml file for a filter - SynonymFilterFactory. That filter uses the synonyms.txt file to read synonym definitions. If you don't need synonyms filter

Re: Having problems with the java api in 1.4.0

2010-08-24 Thread Thomas Joiner
Is there any reason you aren't using http://wiki.apache.org/solr/Solrj to interact with Solr? On Tue, Aug 24, 2010 at 11:12 AM, Liz Sommers wrote: > I am very new to the solr/lucene world. I am using solr 1.4.0 and cannot > move to 1.4.1. > > I have to index about 50 fields for each document, t

Having problems with the java api in 1.4.0

2010-08-24 Thread Liz Sommers
I am very new to the solr/lucene world. I am using solr 1.4.0 and cannot move to 1.4.1. I have to index about 50 fields for each document, these fields are already in key/value pairs by the time I get to my index methods. I was able to index them with lucene without any problem, but found that I

RE: ANNOUNCE: Stump Hoss @ Lucene Revolution

2010-08-24 Thread Ma, Xiaohui (NIH/NLM/LHC) [C]
Hello, I just started to investigate Solr several weeks ago. Our current project uses the Verity search engine, which is a commercial product, and the company is out of business. I am trying to evaluate if Solr can meet our requirements. I have the following questions. 1. Currently we use Verity and have

Restricting HTML search?

2010-08-24 Thread Andrew Cogan
I'm quite new to SOLR and wondering if the following is possible: in addition to normal full text search, my users want to have the option to search only HTML heading innertext, i.e. content inside of , , or tags. Thank you, Andy Cogan
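One way to approach this, sketched in Python: pull the heading text out of each page before indexing and store it in its own field, so a heading-only search can target that field. The tag names in the question were lost in the archive, so the set h1, h2, h3 below is an assumption, as are the class and function names.

```python
from html.parser import HTMLParser


class HeadingTextExtractor(HTMLParser):
    """Collect the inner text of heading elements, ignoring other content."""

    HEADING_TAGS = {"h1", "h2", "h3"}  # assumed set of heading tags

    def __init__(self):
        super().__init__()
        self._depth = 0      # > 0 while inside a heading element
        self._buffer = []    # text pieces of the current heading
        self.headings = []   # finished heading strings

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADING_TAGS:
            self._depth += 1

    def handle_endtag(self, tag):
        if tag in self.HEADING_TAGS and self._depth > 0:
            self._depth -= 1
            if self._depth == 0:
                text = "".join(self._buffer).strip()
                if text:
                    self.headings.append(text)
                self._buffer = []

    def handle_data(self, data):
        if self._depth > 0:
            self._buffer.append(data)


def extract_headings(html: str) -> list:
    parser = HeadingTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.headings


page = "<h1>Title</h1><p>body text</p><h2>Sub <em>part</em></h2>"
print(extract_headings(page))
```

The extracted strings could then go into a separate (hypothetically named) heading field at index time, while the full text goes into the normal search field. As noted elsewhere in this thread, how well any parser copes depends on the quality of the source HTML.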

FW: [Spatial] Geonames and extension to Spatial Solution for Solr

2010-08-24 Thread Mattmann, Chris A (388J)
Oops, forgot to include solr-user@ in the original email. FYI below... -- Forwarded Message From: "Mattmann, Chris A (388J)" Reply-To: Date: Tue, 24 Aug 2010 07:02:58 -0700 To: Subject: [Spatial] Geonames and extension to Spatial Solution for Solr Hi Folks, You may have noticed over the p

Re: Solr jam after all my jvm thread pool hang in blocked state

2010-08-24 Thread AlexxelA
Thread dump: I got about 240 threads like this: "http-8080-Processor222" daemon prio=10 tid=0x7fe36c010c00 nid=0x1e94 waiting for monitor entry [0x4caa6000..0x4caa6d20] java.lang.Thread.State: BLOCKED (on object monitor) at java.util.logging.StreamHandler.publish(Str

Re: Query speed decreased dramatically, not sure why though?

2010-08-24 Thread Constantijn Visinescu
What happens to your performance if you query for *:* instead of * ? (probably have to url encode the colon) On Tue, Aug 24, 2010 at 11:26 AM, C0re wrote: > > We have a query which takes the form of > > ".../select?q=*&sort=evalDate+desc,score+desc&start=0&rows=10" > > This query takes around 5 s
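For the URL-encoding part, a small Python sketch that builds the query string for the match-all query. The parameter values are taken from the original question; adding debugQuery=on follows the suggestion made in another reply to this thread, and the base URL is deliberately left out because the original only shows ".../select".

```python
from urllib.parse import urlencode, parse_qsl

params = [
    ("q", "*:*"),                           # match-all query instead of "*"
    ("sort", "evalDate desc,score desc"),
    ("start", "0"),
    ("rows", "10"),
    ("debugQuery", "on"),                   # optional: explains parsing and timing
]

# urlencode percent-encodes the colon (and the asterisks) and turns spaces into "+".
query_string = urlencode(params)
print("/select?" + query_string)
```

Comparing the debug output of q=* and q=*:* side by side should show whether the slow query is being parsed as something other than a match-all query.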

Re: Problem in setting the request writer in SolrJ (wiki page wrong?)

2010-08-24 Thread Constantijn Visinescu
If my requests aren't serialized via a request writer then why does my embedded solr crash when i comment out the following line in my solrconfig: it crashes with the exception that it can't with the /update URL. (I left in the javabin request handler). On Mon, Aug 23, 2010 at 10:40 PM, Rya

Re: 'Error 404: missing core name in path ' in adminconsole

2010-08-24 Thread Robert Naczinski
Unfortunately, when I use /admin/ I get the error too. My context root is not 'solr'; I use the EmbeddedSolrServer in another JEE app. Any other ideas? Robert 2010/8/24 Lucas F. A. Teixeira : > I hate when this happens. > > Look, if you enter the url: > > http://server:port/solr/core/admin > > yo

Re: 'Error 404: missing core name in path ' in adminconsole

2010-08-24 Thread Lucas F. A. Teixeira
I hate when this happens. Look, if you enter the URL: http://server:port/solr/core/admin you'll get the error you described... try adding a trailing slash to it: http://server:port/solr/core/admin/ and it will work. Lucas Frare Teixeira .·. - lucas...@gmail.com - lucastex.com.br - blog.lucastex.com - twitter

'Error 404: missing core name in path ' in adminconsole

2010-08-24 Thread Robert Naczinski
Hello, we use the EmbeddedSolrServer in our JEE application. It works very well. Now I wanted to create the admin JSPs. For that I have copied the JSPs from the webroot of the Solr example. When I try to access ...admin/index.jsp , I get 'Error 404: missing core name in path' We run the application on We

Re: Removing expired documents from Solr index

2010-08-24 Thread Andreas Jung
Andy wrote: > My documents have an "expiration_datetime" field that holds the expiration > datetime of the document. > > I use a filter query to exclude expired documents from my query results. > > Is it a good idea to periodically go through the in

Removing expired documents from Solr index

2010-08-24 Thread Andy
My documents have an "expiration_datetime" field that holds the expiration datetime of the document. I use a filter query to exclude expired documents from my query results. Is it a good idea to periodically go through the index and remove expired documents from it? If so what is the best way t

Re: minMergeDocs supported ?

2010-08-24 Thread Simon Willnauer
Hey, I guess this option has been removed in Lucene 2.0 - you could look at maxBufferedDocs and ramBufferSizeMB to control how many documents / how much heap space is used to buffer documents before they are flushed and merged into a new segment. Don't know what you are trying to do but those are the factor

Re: minMergeDocs supported ?

2010-08-24 Thread stockii
In Lucene, this option is available for the index configuration. Is it available in Solr too? -- View this message in context: http://lucene.472066.n3.nabble.com/minMergeDocs-supported-tp1302856p1307821.html Sent from the Solr - User mailing list archive at Nabble.com.

Query speed decreased dramatically, not sure why though?

2010-08-24 Thread C0re
We have a query which takes the form of ".../select?q=*&sort=evalDate+desc,score+desc&start=0&rows=10" This query takes around 5 seconds to complete. I changed the query to the following; ".../select?q=[* TO NOW]&sort=evalDate+desc,score+desc&start=0&rows=10" The query now returns in around 6

Re: Why it's boosted up?

2010-08-24 Thread Ahmet Arslan
> Then why are short fields boosted up? In other words, longer documents are punished, because they possibly contain many terms/words. If this mechanism did not exist, longer documents would take over and usually pop up on the first page.
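To make the length effect concrete, here is a small Python sketch. It assumes the classic default length normalization of Lucene's DefaultSimilarity from that era, 1/sqrt(number of terms in the field); a custom Similarity would behave differently, and the stored norm is additionally rounded when it is encoded into the index.

```python
import math


def length_norm(num_terms: int) -> float:
    """Length normalization factor: shrinks as the field gets longer."""
    return 1.0 / math.sqrt(num_terms)


short_field = length_norm(4)    # a 4-term title
long_field = length_norm(100)   # a 100-term body

print(short_field, long_field)  # 0.5 0.1
```

With the same term match in both fields, the 4-term field gets five times the normalization factor of the 100-term one, which is why the shorter field appears "boosted".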

Re: stream.url problem

2010-08-24 Thread satya swaroop
> > Hi all, > I got the solution for my problem. I changed my port number and i > kept the old one in the stream.url... so problem was that... > thanks all > > Now i got another problem, it is when i send any requests to remote > system for the files that have names with escape

Re: Doing Shingle but also keep special single word

2010-08-24 Thread Ahmet Arslan
> The request is from our business > team, they wish user of our product can > type in partial string of a word that exists in title or > body field. But now > I also doubt if this request is really necessary? "partial string of a word"? I think there is a misunderstanding here. ShingleFilter oper

Re: Proper Escaping of Ampersands

2010-08-24 Thread Nikolas Tautenhahn
Hi Chris, On 23.08.2010 21:37, Chris Hostetter wrote: > : The document is indexed correctly, a search for "at s" found it and all > : fields looked great ("at&s" and not, for example, "at&amp;s"). > : > : As my stopword list does not contain "at" or "&" or "&amp;", I don't > : quite understand, why my result

Re: Is there a SubstringTransformer?

2010-08-24 Thread Ahmet Arslan
> I need to get the first 100 chars of a string-type field, > but I am not > able to find something like a SubstringTransformer, > therefore I am > using the RegexTransformer, but I suspect that it eats a > lot of time > on indexation time. > > So, in short, I need something like a SubstringTrans
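For illustration only, the operation being asked for is a plain prefix cut. The Python sketch below contrasts a direct slice with an equivalent regular expression; both function names are made up for this example, and it says nothing about how any particular DataImportHandler transformer is implemented.

```python
import re

# Regex equivalent of "first 100 characters"; DOTALL lets "." match newlines.
FIRST_100 = re.compile(r"^(.{0,100})", re.DOTALL)


def first_chars_slice(value: str, limit: int = 100) -> str:
    """Direct prefix cut: no pattern matching involved."""
    return value[:limit]


def first_chars_regex(value: str) -> str:
    """Same result via a regular expression."""
    return FIRST_100.match(value).group(1)


text = "x" * 250
print(len(first_chars_slice(text)), len(first_chars_regex(text)))  # 100 100
```

Both give the same prefix; the slice simply avoids running a regex engine for every row, which is the cost the poster is worried about.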

Re: Why it's boosted up?

2010-08-24 Thread MitchK
Hi Scott, > (so shorter fields are automatically boosted up). > The theory behind that is the following (in easy words): Let's say you have two documents, and each doc contains only 1 field (as in my example). Additionally, we have a query that contains two words. Let's say doc1 contains o

Re: Is there a SubstringTransformer?

2010-08-24 Thread Gora Mohanty
On Tue, 24 Aug 2010 08:46:52 +0200 Gonzalo Payo Navarro wrote: > Hi everyone! > > I need to get the first 100 chars of a string-type field, but I > am not able to find something like a SubstringTransformer, > therefore I am using the RegexTransformer, but I suspect that it > eats a lot of time o