Accented chars (Portuguese)

2008-02-28 Thread Lucas Teixeira
Hello all, I'm using the solr.ISOLatin1AccentFilterFactory TokenFilter in my schema.xml inside both and tag, but I'm having some continuous problems with accented chars in Portuguese (áéíóúàèìòùãĩõũäëïöü.). And this is making my search engine handle this type of queries abnormally. I th
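
For readers unfamiliar with accent folding, here is a minimal Python sketch of the general idea: decompose each character and drop the combining marks. It is not the implementation of solr.ISOLatin1AccentFilterFactory; the function name fold_accents is hypothetical and used only for illustration. The point it illustrates is that the same folding has to be applied to both the indexed text and the query text for accented and unaccented forms to match.

```python
import unicodedata


def fold_accents(text: str) -> str:
    """Return text with combining accent marks removed.

    Illustration only: canonical decomposition (NFD) splits a character
    such as 'ã' into 'a' plus a combining tilde, and the combining marks
    are then dropped.
    """
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# The indexed term and the query term only match if both are folded.
indexed_term = fold_accents("ação")
query_term = fold_accents("acao")
print(indexed_term == query_term)
```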

Re: no support for CJK characters from Extension B in Solr

2008-02-28 Thread Erik Hatcher
Christian, Is this an issue with the encoding used when adding the documents to the index? There are two encodings that need to be gotten right, the one for the XML content POSTed to Solr, and also the HTTP header on that POST request. If you are getting mangled content back from a st
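
A minimal Python sketch of the two encodings Erik mentions, kept consistent with each other: the bytes of the XML body and the charset declared in the HTTP header. The URL, the field names, and the helper name build_update_request are assumptions made for illustration; the request is only built here, not sent.

```python
import urllib.request


def build_update_request(url: str, xml_text: str) -> urllib.request.Request:
    """Build (but do not send) a POST whose body and header agree on UTF-8."""
    body = xml_text.encode("utf-8")  # encoding of the XML content itself
    return urllib.request.Request(
        url,
        data=body,
        # encoding announced in the HTTP header of the POST
        headers={"Content-Type": "text/xml; charset=utf-8"},
        method="POST",
    )


# Hypothetical document and URL, for illustration only.
doc = (
    '<?xml version="1.0" encoding="UTF-8"?>'
    '<add><doc><field name="id">1</field>'
    '<field name="title">ação</field></doc></add>'
)
req = build_update_request("http://localhost:8983/solr/update", doc)
print(req.get_method(), req.get_header("Content-type"))
```

If either side declares a different encoding from the one actually used for the bytes, non-ASCII characters can come back mangled even though the request succeeds.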

Re: no support for CJK characters from Extension B in Solr

2008-02-28 Thread Erik Hatcher
Christian, This bit of trivia is probably useful to you as well. Lucene's StandardTokenizer uses these Unicode ranges for CJK characters: KOREAN = [\uac00-\ud7af\u1100-\u11ff] // Chinese, Japanese CJ = [\u3040-\u318f\u3100-\u312f\u3040-\u309F\u30A0-\u30FF \u31F0-\u31FF\u3300-\u
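
A small Python sketch of why four-digit \u ranges like these cannot cover Extension B: its characters start at U+20000, outside the Basic Multilingual Plane, so in UTF-16 each one is a surrogate pair rather than a single 16-bit unit. Only the ranges that appear complete in the message above are used; the CJ list is truncated there, so CJ_PARTIAL is knowingly incomplete. The helper names are hypothetical.

```python
# Ranges copied from the message above. The CJ list is cut off in the
# archive, so only the complete ranges are included here.
KOREAN = [(0xAC00, 0xD7AF), (0x1100, 0x11FF)]
CJ_PARTIAL = [
    (0x3040, 0x318F),
    (0x3100, 0x312F),
    (0x3040, 0x309F),
    (0x30A0, 0x30FF),
    (0x31F0, 0x31FF),
]


def in_ranges(ch: str, ranges) -> bool:
    """True if the code point of ch falls inside any (low, high) range."""
    cp = ord(ch)
    return any(low <= cp <= high for low, high in ranges)


def utf16_units(ch: str) -> list:
    """Return the 16-bit code units used to encode ch in UTF-16."""
    data = ch.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


ext_b = "\U00020000"  # first code point of CJK Extension B
print(in_ranges(ext_b, KOREAN + CJ_PARTIAL))
print([hex(unit) for unit in utf16_units(ext_b)])
```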

RE: Accented chars (Portuguese)

2008-02-28 Thread Steven A Rowe
Hi Lucas, Are you using any stemming? Steve On 02/28/2008 at 6:50 AM, Lucas Teixeira wrote: > Hello all, > > I'm using the solr.ISOLatin1AccentFilterFactory TokenFilter in my > schema.xml inside both and tag, but I'm having some > continuous problems with accented chars in Portuguese > (áéíó

Re: Solr in Windows XP + JDK 5 + Tomcat 6.0.13

2008-02-28 Thread newBea
Hi Alejandro, I followed the steps you gave on this forum for the Solr installation. It's working well... However, do you have any idea how to start Solr on a specific port? I mean, if I want Solr to run on a port other than 8080 using the same steps you provided, what would be the bet

Re: Solr in Windows XP + JDK 5 + Tomcat 6.0.13

2008-02-28 Thread Alejandro Valdez
Hi, I guess you should change Tomcat's configuration files. I'm new to Tomcat, so I don't have any clue about that :-( Maybe you can find the answer at: http://tomcat.apache.org/tomcat-6.0-doc/index.html Please let me know if you find out how to do that. Regards, Alejandro On 2/28/08, newBe

Re: /solr/admin not found

2008-02-28 Thread Bill Au
The problem was with the default configuration of Jetty, which expands the webapps in /tmp; some UNIX boxes are configured to purge old files from /tmp. A simple fix is to create a $(jetty.home)/work directory for Jetty to use. See bug SOLR-118 for details: https://issues.apache.org/jira/brows

Tomcat(8080) - Solr(80) port setup confusion??

2008-02-28 Thread newBea
Hi All, I have installed Solr through Tomcat (5.5.23). It's up and running on port 8080. That is, if Tomcat is running, Solr is running, and vice versa. I need Tomcat on port 8080 and Solr on port 80... Is this possible? Do I need to make changes in the server.xml of tomcat/conf... Is there any way to do

RE: Tomcat(8080) - Solr(80) port setup confusion??

2008-02-28 Thread Jae Joo
As far as I know, Tomcat is the server that supports servlets and JSP, and Solr is just one of the applications running in Tomcat. So a separate port number for Solr has no meaning. Thanks Jae Hi All, I have installed solr through tomcat(5.5.23). It's up and running on port 8080. It's like, if tomcat is running, solr i

Optimization taking days/weeks

2008-02-28 Thread F Knudson
Optimization time on solr index has turned into days/weeks. We are using solr 1.2. We use one box to build/optimize indexes. This index is copied to another box for searching purposes. We welcome suggestions/comments, etc. We are a bit stumped on this. Details are below. Box details Proc: 8 Dual

Re: no support for CJK characters from Extension B in Solr

2008-02-28 Thread Ken Krugler
Hi Christian, The documents I am trying to index with Solr contain characters from the CJK Extension B, which had been added to Unicode in version 3.1 (March 2001). Unfortunately, it seems to be the case that Solr (and maybe Lucene) do not yet support these characters. Solr seems to accept the

Two probably [hopefully] basic questions

2008-02-28 Thread Thomas Dowling
I have some sort of blind spot that's preventing me from seeing this in the online documentation, but how do I know if I'm using the standard or dismax request handlers, and how do I change between them? Also, is there something about boosting fields at index time that messes with the relevanc

Re: no support for CJK characters from Extension B in Solr

2008-02-28 Thread Erik Hatcher
To elaborate StandardTokenizer comes into play for indexing and querying (and only if you have that configured for that field in schema.xml). But the original issue seems to be with actually parsing the content properly and storing it in the Lucene index, which is separate from the tok

Re: Strategy for handling large (and growing) index: horizontal partitioning?

2008-02-28 Thread Walter Underwood
Have you timed how long it takes to copy the index files? Optimizing can never be faster than that, since it must read every byte and write a whole new set. Disc speed may be your bottleneck. You could also look at disc access rates in a monitoring tool. Is there read contention between the maste
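
A minimal Python sketch of the measurement Walter suggests: time a plain copy of the index directory to get a lower bound on what an optimize can achieve on the same disks. The function name and the paths in the usage comment are hypothetical.

```python
import shutil
import time


def time_index_copy(src: str, dst: str) -> float:
    """Copy the directory tree at src to dst and return elapsed seconds.

    dst must not exist yet. The result is a rough floor for optimize time,
    since optimize also has to read every byte and write a new set of files.
    """
    start = time.perf_counter()
    shutil.copytree(src, dst)
    return time.perf_counter() - start


# Example usage (hypothetical paths):
# seconds = time_index_copy("/path/to/solr/data/index", "/path/to/scratch/index-copy")
# print("copy took %.1f s" % seconds)
```

Run it when the box is otherwise idle, and ideally with source and destination on the same disk the optimize uses, so the figure reflects the same read and write contention.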

RE: no support for CJK characters from Extension B in Solr

2008-02-28 Thread Steven A Rowe
On 02/28/2008 at 11:26 AM, Ken Krugler wrote: > And as Erik mentioned, it appears that line 114 of > StandardTokenizerImpl.jflex: > > http://www.krugle.org/kse/files/svn/svn.apache.org/lucene/java/trunk/src/java/org/apache/lucene/analysis/standard/StandardTokenizerImpl.jflex > > needs to be updat

Re: Offsets?

2008-02-28 Thread Steve Suppe
Chris, Thanks, you're exactly right. I think the Search Component may be the right way to go. I will definitely look into this shortly - if it works out, is this something that should be 'donated' if at all possible? Steve At 05:47 PM 2/26/2008, you wrote: (moved to solr-user) analysis.

Re: no support for CJK characters from Extension B in Solr

2008-02-28 Thread Erik Hatcher
Wow - great stuff Steve! As for StandardTokenizer and Java version - no worries there really, as Solr itself requires Java 1.5+, so when such a tokenizer is made available it could be used just fine in Solr even if it isn't built into a core Lucene release for a while. Erik On

Re: Strategy for handling large (and growing) index: horizontal partitioning?

2008-02-28 Thread James Brady
Hi, yes a post-optimise copy takes 45 minutes at present. Disk IO is definitely the bottleneck, you're right -- iostat was showing 100% utilisation for the 5 hours it took to optimise yesterday... The master and slave are on the same disk, and it's definitely on my list to fix that, but the

Re: Optimization taking days/weeks

2008-02-28 Thread Yonik Seeley
Have you checked if this is due to running out of heap memory? When that happens, the garbage collector can start taking a lot of CPU. If you are using a Java6 JVM, it should have management enabled by default and you should be able to connect to it via jconsole and check. -Yonik On Thu, Feb 28,

Re: Strategy for handling large (and growing) index: horizontal partitioning?

2008-02-28 Thread Walter Underwood
We should probably work out a rule of thumb, like "10-20 minutes per gigabyte". I'll send a separate message to collect that info. wunder On 2/28/08 9:59 AM, "James Brady" <[EMAIL PROTECTED]> wrote: > Hi, yes a post-optimise copy takes 45 minutes at present. Disk IO is > definitely the bottlenec

How long does optimize take on your Solr installation?

2008-02-28 Thread Walter Underwood
Please answer with the size of your index (post-optimize) and how long an optimize takes. I'll collect the data and see if I can draw a line through it. 190 MB, 55 seconds $ du -sk /apps/wss/solr_home/data/index 191592 /apps/wss/solr_home/data/index $ grep commit /apps/wss/tomcat/logs/stdout.lo

Re: How long does optimize take on your Solr installation?

2008-02-28 Thread sfox
767 MB 76 seconds (single, local SATA 7200rpm disk, unloaded XServe G5) Sean Fox Walter Underwood wrote: Please answer with the size of your index (post-optimize) and how long an optimize takes. I'll collect the data and see if I can draw a line through it. 190 MB, 55 seconds $ du -sk /apps/

Re: Boost the results for filter value in a single query

2008-02-28 Thread Yonik Seeley
On Tue, Feb 26, 2008 at 11:24 PM, Vijay Khurana <[EMAIL PROTECTED]> wrote: > Thanks for the response Yonik. > The content source field is a single valued field. Sorting the results won't > work for me as the content source values are arbitrary strings and there is > no set pattern i.e it can be

Re: How long does optimize take on your Solr installation?

2008-02-28 Thread Grant Ingersoll
You might want to get info about mergeFactors, and Lucene/Solr versions in use. On Feb 28, 2008, at 1:15 PM, Walter Underwood wrote: Please answer with the size of your index (post-optimize) and how long an optimize takes. I'll collect the data and see if I can draw a line through it. 190 MB

Re: Two probably [hopefully] basic questions

2008-02-28 Thread Yonik Seeley
On Thu, Feb 28, 2008 at 11:44 AM, Thomas Dowling <[EMAIL PROTECTED]> wrote: > I have some sort of blind spot that's preventing me from seeing this in > the online documentation, but how do I know if I'm using the standard or > dismax request handlers, and how do I change between them? qt specifi
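
A short Python sketch of switching handlers per request with the qt parameter. The base URL and the handler name "dismax" are assumptions: the name has to match a request handler registered in your solrconfig.xml. The helper select_url is hypothetical.

```python
from urllib.parse import urlencode


def select_url(base: str, query: str, handler: str = None) -> str:
    """Build a /select URL, adding qt only when a handler name is given."""
    params = {"q": query}
    if handler:
        params["qt"] = handler  # name of a handler defined in solrconfig.xml
    return base + "/select?" + urlencode(params)


# Hypothetical base URL, for illustration only.
print(select_url("http://localhost:8983/solr", "solr faceting"))
print(select_url("http://localhost:8983/solr", "solr faceting", "dismax"))
```

With no qt parameter the request goes to whichever handler the configuration treats as the default.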

what's the schedule of the release of solr 1.3?

2008-02-28 Thread Feng Gao
Hi, There are so many new features in solr 1.3. What's the schedule of the release of solr 1.3? Thanks Feng

Re: no support for CJK characters from Extension B in Solr

2008-02-28 Thread Christian Wittern
Thanks to all for clearing this up. It seems we are still quite far away from full Unicode support:-( As to the questions about the encoding in previous messages, all of the other characters in the documents come through without a glitch, so there is definitely no other issue involved. Ch

Can optimization take constant time?

2008-02-28 Thread Nguyen Kien Trung
Hi, Given that: update/commit(optimize=false) is done at constant rate and index size is increasing sequentially The first optimization took X seconds. So if I do optimization everyday, can I eventually obtain constant time for optimization process? Trung

Re: no support for CJK characters from Extension B in Solr

2008-02-28 Thread Erik Hatcher
On Feb 28, 2008, at 6:56 PM, Christian Wittern wrote: Thanks to all for clearing this up. It seems we are still quite far away from full Unicode support:-( As to the questions about the encoding in previous messages, all of the other characters in the documents come through without a glitc

Re: no support for CJK characters from Extension B in Solr

2008-02-28 Thread Ken Krugler
Thanks to all for clearing this up. It seems we are still quite far away from full Unicode support:-( As to the questions about the encoding in previous messages, all of the other characters in the documents come through without a glitch, so there is definitely no other issue involved. What w

Re: Can optimization take constant time?

2008-02-28 Thread Otis Gospodnetic
If I understand your question/situation correctly, then I think the answer is negative - the time required for the full optimization operation will grow as your index grows. But I may not be 100% up to date with the latest Lucene changes around index segment merges and such. Otis -- Sematext -

Re: Optimization taking days/weeks

2008-02-28 Thread Otis Gospodnetic
That's a tiny little index there ;) Circa 100GB? What do you see if you run vmstat 2 while the optimization is happening? Non-idle CPU? A pile of IO? Is there a reason for such a small heap on a machine with 32GB of RAM? Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch --

Master/Slave setup

2008-02-28 Thread Alex Benjamen
I'm trying to figure out how best to handle the replication for our system. (We're not using the rsync mechanism because we don't want to have frequent updates on slaves) Current process: 1. Master builds new incremental index once an hour. Commit/Optimize, copy over index to an nfs export

Top N terms of an indexed field

2008-02-28 Thread Alex Benjamen
I was wondering if it is possible to retrieve the top 20 terms for a given field in an index. For example, if we're indexing user profile data and one of the fields is "interests" - it would be great to get the top 20 terms for interests found in the index. -Alex

RE: How long does optimize take on your Solr installation?

2008-02-28 Thread Alex Benjamen
It mostly depends on whether or not the index is completely new or incremental 4Gb, 28MM docs, ~30min (new index) 4Gb, 28MM docs, 30s (incremental)
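
A small Python sketch that puts the full-rebuild figures reported in this thread on the same scale as the proposed rule of thumb (minutes per gigabyte). Treating 1 GB as 1024 MB and taking "~30min" as exactly 30 minutes are assumptions; the helper name is hypothetical.

```python
def minutes_per_gb(size_gb: float, seconds: float) -> float:
    """Optimize time normalised to minutes per gigabyte of index."""
    return (seconds / 60.0) / size_gb


# (label, index size in GB, optimize time in seconds), as reported in the thread.
REPORTS = [
    ("190 MB, 55 seconds", 190 / 1024, 55),
    ("767 MB, 76 seconds", 767 / 1024, 76),
    ("4 GB, ~30 min (new index)", 4.0, 30 * 60),
]

for label, size_gb, seconds in REPORTS:
    print("%-28s %.2f min/GB" % (label, minutes_per_gb(size_gb, seconds)))
```

Three data points from different hardware are far too few to fit a line; the sketch only shows how the reports can be compared.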

Re: Top N terms of an indexed field

2008-02-28 Thread Ryan McKinley
Alex Benjamen wrote: I was wondering if it is possible to retrieve the top 20 terms for a given field in an index. For example, if we're indexing user profile data and one of the fields is "interests" - it would be great to get the top 20 terms for interests found in the index. check ou

RE: Optimization taking days/weeks

2008-02-28 Thread Alex Benjamen
This sounds too familiar... >java settings used - java -Xmx1024M -Xms1024M Sounds like your settings are pretty low... if you're using 64bit JVM, you should be able to set these much higher, maybe give it like 8gb. Another thing, you may want to look at reducing the index size... is there a

+fieldname sintax question

2008-02-28 Thread Leonardo Santagada
In our search, we need to specify some search criteria in some search fields; the only way to do that for us has been using the default query handler (and inside each field we escape the characters that we don't want to influence the search, more or less what dismax does for the whole quer
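
A minimal Python sketch of the per-field escaping described above: backslash-escape the characters that the Lucene query parser treats as syntax, then wrap the result in a required field clause. The helper names are hypothetical, and the sketch does not handle the AND, OR, and NOT keywords, which would need separate treatment.

```python
import re

# Characters and operators with special meaning in Lucene query syntax:
# + - && || ! ( ) { } [ ] ^ " ~ * ? : \
_SPECIAL = re.compile(r'([+\-!(){}\[\]^"~*?:\\]|&&|\|\|)')


def escape_term(text: str) -> str:
    """Prefix every special character or operator with a backslash."""
    return _SPECIAL.sub(r"\\\1", text)


def required_clause(field: str, user_text: str) -> str:
    """Build a '+field:(...)' clause from escaped user input."""
    return "+%s:(%s)" % (field, escape_term(user_text))


print(required_clause("authorOrCreator", "mcdonald"))
print(required_clause("Title", "C++ (2nd ed.)"))
```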

Re: Master/Slave setup

2008-02-28 Thread Otis Gospodnetic
Alex, I think you should rethink the approach you described and reconsider using the provided replication scripts. - How often the searchers see the new index depends on how often the snappuller + snapinstaller are run on slaves. - If you want the searchers to get a new and optimized index ever

Re: Top N terms of an indexed field

2008-02-28 Thread Otis Gospodnetic
Alex, You can also use HighFrequencyTerms class (or something with a very similar name) from Lucene contrib/misc (I believe). It's a command line app that will get you exactly what you want. Good for figuring out if you should add more terms to your stopword list, for example. Otis -- Semate
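
To make the idea of "top N terms of a field" concrete, here is a conceptual Python sketch that counts terms over a list of field values held in memory. It does not use Solr or the Lucene tool mentioned above, and whatever ranking that tool applies may differ; the whitespace tokenisation and the function name are assumptions.

```python
from collections import Counter


def top_terms(values, n=20):
    """Return the n most frequent lowercase, whitespace-separated terms."""
    counts = Counter()
    for value in values:
        counts.update(value.lower().split())
    return counts.most_common(n)


# Hypothetical "interests" values, for illustration only.
interests = ["hiking music", "music film", "Music"]
print(top_terms(interests, 2))
```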

Re: +fieldname sintax question

2008-02-28 Thread Otis Gospodnetic
Are you really trying to do the following? -Title:(Canada) +authorOrCreator:(mcdonald) Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message > From: Leonardo Santagada <[EMAIL PROTECTED]> > To: solr-user@lucene.apache.org > Sent: Thursday, February 28,

RE: Tomcat(8080) - Solr(80) port setup confusion??

2008-02-28 Thread newBea
Thanks for the reply... Ya tomcat is an application server for web applications. However as we can start solr on any port using jetty server as given below... Jetty.xml in apache-solr-1.1.0.0(Release version)\example\etc Above the default attribute can be any port unless n until it is not oc

Re: what's the schedule of the release of solr 1.3?

2008-02-28 Thread Otis Gospodnetic
Hi Feng, Somebody just asked this over on solr-dev. As far as I know, no concrete discussions about this had taken place recently, which means nothing planned for March if not longer. Do you need something that's in 1.3-dev? Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch -

Re: Index availability during merge

2008-02-28 Thread Otis Gospodnetic
To be honest, I don't follow, David. "Merging" and "Slave" in the same sentence sounds suspicious. Master is where you want to do index merging, as I described in my original reply. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message > From: David Pra

Re: Question regarding Solr ranking

2008-02-28 Thread Otis Gospodnetic
It's a little hard to read that message, but if I were you I'd go to the Solr admin page, analysis section, enter your query, and see what index and query time analyzers spit out. I think that should at least give you some hints. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nut

Re: +fieldname sintax question

2008-02-28 Thread Leonardo Santagada
Yes, but they should be the same, no? And the problem is that we are using something like "+fieldname: ()" to generate this query... so doing it like you said is not really easy. On 29/02/2008, at 01:25, Otis Gospodnetic wrote: Are you really trying to do the following? -Title:(Canada) +a

Re: Strategy for handling large (and growing) index: horizontal partitioning?

2008-02-28 Thread Otis Gospodnetic
Hola James, I think what Wunder was suggesting was really a copy (time cp -a oldIndex newIndex). I'm not sure why you have both the master and the slave on the same box :) As for the 10 minute Wiki thing - use the Wiki, please edit it, anyone can get an account and help with the Wiki. Oti

Re: Strategy for handling large (and growing) index: horizontal partitioning?

2008-02-28 Thread Otis Gospodnetic
James, Regarding your questions about n users per index - this is a fine approach. The largest Social Network that you know of uses the same approach for various things, including full-text indices (not Solr, but close). You'd have to maintain user->shard/index mapping somewhere, of course.
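
A minimal Python sketch of the user-to-shard mapping Otis mentions, assuming users are simply assigned to indices in arrival order, n users per index. The class name, the in-memory dictionary, and the assignment policy are all assumptions for illustration; a real deployment would persist the mapping somewhere durable.

```python
class ShardDirectory:
    """Assigns each new user to a shard, filling shards n users at a time."""

    def __init__(self, users_per_shard: int):
        if users_per_shard < 1:
            raise ValueError("users_per_shard must be at least 1")
        self.users_per_shard = users_per_shard
        self._shard_of = {}  # user id -> shard number

    def shard_for(self, user_id: str) -> int:
        """Return the shard for user_id, assigning one on first sight."""
        if user_id not in self._shard_of:
            # The next free slot is determined by how many users exist so far.
            self._shard_of[user_id] = len(self._shard_of) // self.users_per_shard
        return self._shard_of[user_id]


directory = ShardDirectory(users_per_shard=2)
print([directory.shard_for(u) for u in ["ana", "bo", "cy", "ana"]])
```

Keeping an explicit mapping, rather than hashing the user id, means existing users stay where they are when new indices are added.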

Re: +fieldname sintax question

2008-02-28 Thread Otis Gospodnetic
Well, you may want to massage that user-entered query string a little bit, perhaps just enough to get rid of AND, OR, and NOT. What if the user enters *foo? Leading wildcards won't work, most likely. You can either let things break or proactively try to fix things (like the annoying MS Word au

Re: no support for CJK characters from Extension B in Solr

2008-02-28 Thread Christian Wittern
Ken Krugler wrote: What was the actual format of the Extension B characters in the XML being posted? I tried both a binary (UTF-8) format and numeric character representation of the type 𠀀 -- the results were the same. Christian -- Christian Wittern Institute for Research in Humanitie
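
For reference, a short Python sketch of the two representations Christian describes for an Extension B character, using U+20000 as the example: the raw UTF-8 bytes and an XML numeric character reference. The helper names are hypothetical.

```python
def utf8_hex(ch: str) -> str:
    """Hex string of the UTF-8 bytes for ch."""
    return ch.encode("utf-8").hex()


def numeric_reference(ch: str) -> str:
    """XML hexadecimal numeric character reference for ch."""
    return "&#x%X;" % ord(ch)


ext_b = "\U00020000"  # first code point of CJK Extension B
print(utf8_hex(ext_b))           # four bytes, since the code point is above U+FFFF
print(numeric_reference(ext_b))
```

Both forms name the same code point, which is consistent with the two attempts giving the same result.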

Re: no support for CJK characters from Extension B in Solr

2008-02-28 Thread Christian Wittern
Erik Hatcher wrote: How are you POSTing the documents to Solr? What content-type are you using with the HTTP header? And what encoding are you using with the XML (file?) being POSTed, and is that encoding specified in the XML file itself? For these tests I used the script post.sh from the ex

Re: +fieldname sintax question

2008-02-28 Thread Leonardo Santagada
On 29/02/2008, at 01:52, Otis Gospodnetic wrote: Well, you may want to massage that user-entered query string a little bit, perhaps just enough to get rid of AND, OR, and NOT. What if the user enters *foo? Leading wildcards won't work, most likely. You can either let things break or pro

Re: Master/Slave setup

2008-02-28 Thread Walter Underwood
You have no cache at all when you stop and restart Solr. I recommend using the provided scripts for index distribution. Run snappuller and snapinstaller every two hours. The scripts already do the right thing. A snapshot is created after a commit on the indexer. Snappuller only copies over an inde

Re: How long does optimize take on your Solr installation?

2008-02-28 Thread Walter Underwood
Good point. My numbers are from a full rebuild. Let's collect maximum times, to keep it simple. --wunder On 2/28/08 7:28 PM, "Alex Benjamen" <[EMAIL PROTECTED]> wrote: > It mostly depends on whether or not the index is completely new or incremental > > 4Gb, 28MM docs, ~30min (new index) > 4Gb,

Re: How long does optimize take on your Solr installation?

2008-02-28 Thread Walter Underwood
And I could collect disk subsystem, JVM, processor, and so on, but we'd have a seven dimensional rule of thumb, which is kinda scary. wunder On 2/28/08 12:14 PM, "Grant Ingersoll" <[EMAIL PROTECTED]> wrote: > You might want to get info about mergeFactors, and Lucene/Solr > versions in use. > >

Re: Strategy for handling large (and growing) index: horizontal partitioning?

2008-02-28 Thread James Brady
Hi Otis, Thanks for your comments -- I didn't realise the wiki is open to editing; my apologies. I've put in a few words to try and clear things up a bit. So determining n will probably be a best guess followed by trial and error, that's fine. I'm still not clear about whether single Solr

Re: Strategy for handling large (and growing) index: horizontal partitioning?

2008-02-28 Thread Otis Gospodnetic
James, I can't comment more on the SN's arch choices. Here is the story about your questions - 1 Solr instance can hold 1+ indices, either via JNDI (see Wiki) or via the new multi-core support which works, but is still being hacked on - See SOLR-303 in JIRA for distributed search. Yonik committ