Hello all,
I'm using the solr.ISOLatin1AccentFilterFactory TokenFilter in
my schema.xml inside both the <analyzer type="index"> and
<analyzer type="query"> tags, but I'm having continuous problems
with accented characters in Portuguese (áéíóúàèìòùãĩõũäëïöü),
and this is making my search engine handle these queries abnormally.
I th
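For reference, a minimal sketch of the fieldType being described, assuming the stock example schema.xml layout (the name and surrounding tokenizer/filters are assumptions, not the poster's actual config):

    <fieldType name="text_pt" class="solr.TextField">
      <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.ISOLatin1AccentFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.ISOLatin1AccentFilterFactory"/>
      </analyzer>
    </fieldType>

The filter folds accented Latin-1 characters (á, é, ã, ...) to their unaccented equivalents at both index and query time, so the two analyzer chains must stay in sync.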
Christian,
Is this an issue with the encoding used when adding the documents to
the index? There are two encodings that need to be gotten right,
the one for the XML content POSTed to Solr, and also the HTTP header
on that POST request. If you are getting mangled content back from
a st
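As a concrete sketch of the two places the encoding has to be declared (URL and filename are assumptions; this mirrors what the example post.sh does with curl):

    $ curl http://localhost:8983/solr/update \
        -H 'Content-type: text/xml; charset=UTF-8' \
        --data-binary @docs.xml

and, inside docs.xml itself:

    <?xml version="1.0" encoding="UTF-8"?>

If either of these disagrees with the bytes actually sent, accented characters get mangled on the way in.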
Christian,
This bit of trivia is probably useful to you as well. Lucene's
StandardTokenizer uses these Unicode ranges for CJK characters:
KOREAN = [\uac00-\ud7af\u1100-\u11ff]
// Chinese, Japanese
CJ = [\u3040-\u318f\u3100-\u312f\u3040-\u309F\u30A0-\u30FF
\u31F0-\u31FF\u3300-\u
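For context, CJK Extension B starts at U+20000, above the Basic Multilingual Plane, so it cannot be expressed in the \uXXXX ranges above at all; in Java such characters are surrogate pairs. A small sketch (class name is mine):

    public class ExtBDemo {
        public static void main(String[] args) {
            // U+20000 is the first CJK Extension B code point.
            String s = new String(Character.toChars(0x20000));
            System.out.println(s.length());                             // 2: a surrogate pair
            System.out.println(Integer.toHexString(s.codePointAt(0)));  // 20000
        }
    }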
Hi Lucas,
Are you using any stemming?
Steve
On 02/28/2008 at 6:50 AM, Lucas Teixeira wrote:
> Hello all,
>
> I'm using the solr.ISOLatin1AccentFilterFactory TokenFilter in my
> schema.xml inside both the <analyzer type="index"> and <analyzer
> type="query"> tags, but I'm having continuous problems with
> accented characters in Portuguese
> (áéíó
Hi Alejandro,
I followed the steps you gave on this forum for the Solr installation. It's
working well...
However, do you have any idea about starting Solr on a specific port? I
mean, if I want Solr to run on a port other than 8080, with the same
steps you provided, what would be the bet
Hi, I guess you should change the Tomcat's configuration files, I'm new
to Tomcat so I haven't any clue about that :-(
Maybe you can find the answer at:
http://tomcat.apache.org/tomcat-6.0-doc/index.html
Please, let me know if you find out how to do that.
Regards,
Alejandro
On 2/28/08, newBe
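For what it's worth (my own pointer, not Alejandro's): the port is set on the HTTP Connector element in Tomcat's conf/server.xml, something like:

    <!-- conf/server.xml: change 8080 to the port you want -->
    <Connector port="8080" maxThreads="150" connectionTimeout="20000" />

Restart Tomcat after editing and every webapp, Solr included, moves to the new port.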
The problem was with the default configuration of Jetty, which expands the
webapps in /tmp, and some UNIX boxes are configured to purge old files from
/tmp. A simple fix is to create a $(jetty.home)/work directory for Jetty to
use. See bug SOLR-118 for details:
https://issues.apache.org/jira/browse/SOLR-118
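A sketch of that fix, assuming the bundled Jetty example layout:

    $ cd apache-solr-1.2.0/example
    $ mkdir work        # Jetty unpacks the webapp here instead of /tmp
    $ java -jar start.jar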
Hi All,
I have installed Solr through Tomcat (5.5.23). It's up and running on port
8080. It's like, if Tomcat is running, Solr is running, and vice versa. I need
Tomcat on port 8080 and Solr on port 80... Is this possible? Do I need to make
changes in the server.xml of tomcat/conf... Is there any way to do
As far as I know, Tomcat is the server that supports servlets and JSP, and
Solr is just one of the applications deployed in Tomcat.
So there is no separate port number for Solr.
Thanks
Jae
Hi All,
I have installed Solr through Tomcat (5.5.23). It's up and running on port
8080. It's like, if Tomcat is running, Solr i
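One option (my suggestion, with illustrative attributes): Tomcat can listen on several ports at once by declaring multiple Connector elements in conf/server.xml, though every webapp then answers on all of them:

    <Connector port="8080" />
    <Connector port="80" />

If you need Solr reachable only on 80 and the other apps only on 8080, you would typically front Tomcat with a proxy instead.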
Optimization time on our Solr index has grown to days/weeks.
We are using solr 1.2.
We use one box to build/optimize indexes. This index is copied to another
box for searching purposes.
We welcome suggestions/comments, etc. We are a bit stumped on this.
Details are below.
Box details
Proc: 8 Dual
Hi Christian,
The documents I am trying to index with Solr contain characters from the CJK
Extension B, which had been added to Unicode in version 3.1 (March 2001).
Unfortunately, it seems to be the case that Solr (and maybe Lucene) do not
yet support these characters.
Solr seems to accept the
I have some sort of blind spot that's preventing me from seeing this in
the online documentation, but how do I know if I'm using the standard or
dismax request handlers, and how do I change between them?
Also, is there something about boosting fields at index time that messes
with the relevanc
To elaborate, StandardTokenizer comes into play for indexing and
querying (and only if you have it configured for that field in
schema.xml). But the original issue seems to be with actually
parsing the content properly and storing it in the Lucene index,
which is separate from the tok
Have you timed how long it takes to copy the index files? Optimizing
can never be faster than that, since it must read every byte and write
a whole new set. Disc speed may be your bottleneck.
You could also look at disc access rates in a monitoring tool.
Is there read contention between the maste
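A concrete way to get that baseline (the same measurement Otis spells out later in the thread):

    $ time cp -a index index.copy   # an optimize can never finish faster than this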
On 02/28/2008 at 11:26 AM, Ken Krugler wrote:
> And as Erik mentioned, it appears that line 114 of
> StandardTokenizerImpl.jflex:
>
> http://www.krugle.org/kse/files/svn/svn.apache.org/lucene/java/trunk/src/java/org/apache/lucene/analysis/standard/StandardTokenizerImpl.jflex
>
> needs to be updat
Chris,
Thanks, you're exactly right. I think the Search Component may be the
right way to go. I will definitely look into this shortly - if it works
out, is this something that should be 'donated' if at all possible?
Steve
At 05:47 PM 2/26/2008, you wrote:
(moved to solr-user)
analysis.
Wow - great stuff Steve!
As for StandardTokenizer and Java version - no worries there really,
as Solr itself requires Java 1.5+, so when such a tokenizer is made
available it could be used just fine in Solr even if it isn't built
into a core Lucene release for a while.
Erik
On
Hi, yes a post-optimise copy takes 45 minutes at present. Disk IO is
definitely the bottleneck, you're right -- iostat was showing 100%
utilisation for the 5 hours it took to optimise yesterday...
The master and slave are on the same disk, and it's definitely on my
list to fix that, but the
Have you checked if this is due to running out of heap memory?
When that happens, the garbage collector can start taking a lot of CPU.
If you are using a Java6 JVM, it should have management enabled by
default and you should be able to connect to it via jconsole and
check.
-Yonik
On Thu, Feb 28,
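A quick sketch of checking this, assuming a local Java 6 JDK on the same box:

    $ jps -l           # find the Solr JVM's pid
    $ jconsole <pid>   # attach; watch heap and GC time under the Memory tab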
We should probably work out a rule of thumb, like "10-20 minutes per
gigabyte". I'll send a separate message to collect that info.
wunder
On 2/28/08 9:59 AM, "James Brady" <[EMAIL PROTECTED]> wrote:
> Hi, yes a post-optimise copy takes 45 minutes at present. Disk IO is
> definitely the bottlenec
Please answer with the size of your index (post-optimize) and how long
an optimize takes. I'll collect the data and see if I can draw a line
through it.
190 MB, 55 seconds
$ du -sk /apps/wss/solr_home/data/index
191592 /apps/wss/solr_home/data/index
$ grep commit /apps/wss/tomcat/logs/stdout.lo
767 MB 76 seconds
(single, local SATA 7200rpm disk, unloaded XServe G5)
Sean Fox
Walter Underwood wrote:
Please answer with the size of your index (post-optimize) and how long
an optimize takes. I'll collect the data and see if I can draw a line
through it.
190 MB, 55 seconds
$ du -sk /apps/
On Tue, Feb 26, 2008 at 11:24 PM, Vijay Khurana
<[EMAIL PROTECTED]> wrote:
> Thanks for the response Yonik.
> The content source field is a single valued field. Sorting the results won't
> work for me as the content source values are arbitrary strings and there is
> no set pattern, i.e., it can be
You might want to get info about mergeFactors, and Lucene/Solr
versions in use.
On Feb 28, 2008, at 1:15 PM, Walter Underwood wrote:
Please answer with the size of your index (post-optimize) and how long
an optimize takes. I'll collect the data and see if I can draw a line
through it.
190 MB
On Thu, Feb 28, 2008 at 11:44 AM, Thomas Dowling <[EMAIL PROTECTED]> wrote:
> I have some sort of blind spot that's preventing me from seeing this in
> the online documentation, but how do I know if I'm using the standard or
> dismax request handlers, and how do I change between them?
qt specifi
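A sketch of what the (cut-off) answer is pointing at: handlers are declared in solrconfig.xml and chosen per request with the qt parameter. Using the names from the example solrconfig.xml:

    <requestHandler name="standard" class="solr.StandardRequestHandler" default="true"/>
    <requestHandler name="dismax" class="solr.DisMaxRequestHandler"/>

    http://localhost:8983/solr/select?q=solr&qt=dismax

With no qt parameter, the handler marked default="true" (standard here) serves the query.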
Hi,
There are so many new features in solr 1.3. What's the schedule of the
release of solr 1.3?
Thanks
Feng
Thanks to all for clearing this up. It seems we are still quite far
away from full Unicode support:-(
As to the questions about the encoding in previous messages, all of the
other characters in the documents come through without a glitch, so
there is definitely no other issue involved.
Ch
Hi,
Given that updates/commits (with optimize=false) happen at a constant rate
and the index size grows steadily:
The first optimization took X seconds. So if I optimize every day, will I
eventually see a constant time for the optimization process?
Trung
On Feb 28, 2008, at 6:56 PM, Christian Wittern wrote:
Thanks to all for clearing this up. It seems we are still quite
far away from full Unicode support:-(
As to the questions about the encoding in previous messages, all of
the other characters in the documents come through without a
glitc
Thanks to all for clearing this up. It seems we are still quite far
away from full Unicode support:-(
As to the questions about the encoding in previous messages, all of
the other characters in the documents come through without a glitch,
so there is definitely no other issue involved.
What w
If I understand your question/situation correctly, then I think the answer is
negative - the time required for the full optimization operation will grow as
your index grows. But I may not be 100% up to date with the latest Lucene
changes around index segment merges and such.
Otis
--
Sematext -
That's a tiny little index there ;) Circa 100GB?
What do you see if you run vmstat 2 while the optimization is happening?
Non-idle CPU? A pile of IO? Is there a reason for such a small heap on a
machine with 32GB of RAM?
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
--
I'm trying to figure out how best to handle the replication for our system.
(We're
not using the rsync mechanism because we don't want to have frequent updates
on slaves)
Current process:
1. Master builds new incremental index once an hour. Commit/Optimize, copy over
index to an nfs export
I was wondering if it is possible to retrieve the top 20 terms for a given
fields in an index.
For example, if we're indexing user profile data and one of the fields
is "interests" - it would be great to get the top 20 terms for interests
found in the index.
-Alex
It mostly depends on whether the index is completely new or incremental:
4Gb, 28MM docs, ~30min (new index)
4Gb, 28MM docs, 30s (incremental)
Alex Benjamen wrote:
I was wondering if it is possible to retrieve the top 20 terms for a given
fields in an index.
For example, if we're indexing user profile data and one of the fields
is "interests" - it would be great to get the top 20 terms for interests
found in the index.
check ou
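One way to get this (my sketch; possibly what the cut-off reply was about) is field faceting, assuming interests is an indexed field:

    http://localhost:8983/solr/select?q=*:*&rows=0&facet=true&facet.field=interests&facet.limit=20

The facet counts in the response are exactly the top 20 terms in that field across the index.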
This sounds too familiar...
>java settings used - java -Xmx1024M -Xms1024M
Sounds like your settings are pretty low... if you're using 64bit JVM, you
should be able to set
these much higher, maybe give it like 8gb.
Another thing, you may want to look at reducing the index size... is there a
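For illustration, with the example Jetty launcher that would look something like (the exact heap size depends on your box):

    $ java -Xmx8192M -Xms8192M -jar start.jar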
On our search, we need to specify some search criteria in some search
fields; the only way to do that for us has been using the default
query handler (and inside each field we escape the characters that we
don't want to influence the search, more or less what dismax does for
the whole quer
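A minimal sketch of that per-field escaping (class name and character list are my own; Lucene's QueryParser.escape() does much the same):

    public class QueryEscaper {
        // Backslash-escape Lucene query syntax so user input is matched literally.
        public static String escape(String s) {
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < s.length(); i++) {
                char c = s.charAt(i);
                if ("\\+-!():^[]{}~*?\"&|".indexOf(c) >= 0) {
                    sb.append('\\');
                }
                sb.append(c);
            }
            return sb.toString();
        }
    }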
Alex,
I think you should rethink the approach you described and reconsider using the
provided replication scripts.
- How often the searchers see the new index depends on how often the snappuller
+ snapinstaller are run on slaves.
- If you want the searchers to get a new and optimized index ever
Alex,
You can also use HighFrequencyTerms class (or something with a very similar
name) from Lucene contrib/misc (I believe). It's a command line app that will
get you exactly what you want. Good for figuring out if you should add more
terms to your stopword list, for example.
Otis
--
Semate
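If memory serves, that is org.apache.lucene.misc.HighFreqTerms from contrib/misc; invocation is roughly (jar names and paths are guesses for a 2.x release):

    $ java -cp lucene-core-2.3.0.jar:lucene-misc-2.3.0.jar \
          org.apache.lucene.misc.HighFreqTerms /path/to/index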
Are you really trying to the following?
-Title:(Canada) +authorOrCreator:(mcdonald)
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
- Original Message
> From: Leonardo Santagada <[EMAIL PROTECTED]>
> To: solr-user@lucene.apache.org
> Sent: Thursday, February 28,
Thanks for the reply...
Yes, Tomcat is an application server for web applications. However, we can
start Solr on any port using the Jetty server, as given below...
See jetty.xml in apache-solr-1.1.0.0 (release version)\example\etc.
The default attribute there can be any port, as long as it is not
oc
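The line being referred to in example/etc/jetty.xml looks roughly like this (quoted from memory, so treat as approximate):

    <Set name="port"><SystemProperty name="jetty.port" default="8983"/></Set>

Because it reads a SystemProperty, you can also override the port at startup without editing the file:

    $ java -Djetty.port=8080 -jar start.jar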
Hi Feng,
Somebody just asked this over on solr-dev. As far as I know, no concrete
discussions about this had taken place recently, which means nothing planned
for March if not longer.
Do you need something that's in 1.3-dev?
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
-
To be honest, I don't follow, David. "Merging" and "Slave" in the same
sentence sounds suspicious. Master is where you want to do index merging, as I
described in my original reply.
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
- Original Message
> From: David Pra
It's a little hard to read that message, but if I were you I'd go to the Solr
admin page, analysis section, enter your query, and see what index and query
time analyzers spit out. I think that should at least give you some hints.
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nut
yes, but they should be the same, no?
And the problem is that we are using something like "+fieldname:
()" to generate this query... so
doing it the way you said is not really easy.
On 29/02/2008, at 01:25, Otis Gospodnetic wrote:
Are you really trying to the following?
-Title:(Canada) +a
Hola James,
I think what Wunder was suggesting was really a copy (time cp -a oldIndex
newIndex).
I'm not sure why you have both the master and the slave on the same box :)
As for the 10 minute Wiki thing - use the Wiki, please edit it, anyone can get
an account and help with the Wiki.
Oti
James,
Regarding your questions about n users per index - this is a fine approach.
The largest Social Network that you know of uses the same approach for various
things, including full-text indices (not Solr, but close). You'd have to
maintain user->shard/index mapping somewhere, of course.
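A minimal sketch of such a mapping (entirely illustrative; a real deployment would persist assignments rather than derive them):

    public class ShardMapper {
        private final int numShards;

        public ShardMapper(int numShards) { this.numShards = numShards; }

        // Stable hash: the same user always lands on the same index/shard.
        public int shardFor(String userId) {
            return (userId.hashCode() & 0x7fffffff) % numShards;
        }
    }

For example, new ShardMapper(4).shardFor("user42") always picks the same one of four Solr indices.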
Well, you may want to massage that user-entered query string a little bit,
perhaps just enough to get rid of AND, OR, and NOT. What if the user enters
*foo? Leading wildcards won't work, most likely. You can either let things
break or proactively try to fix thing (like the annoying MS Word au
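A rough sketch of that massaging (the method name and regexes are my own invention):

    public class QueryCleaner {
        // Neutralize boolean operators and leading wildcards in raw user input.
        public static String clean(String q) {
            String s = q.replaceAll("\\b(AND|OR|NOT)\\b", " ");  // drop operators
            s = s.replaceAll("(^|\\s)[*?]+", "$1");              // strip leading wildcards
            return s.trim();
        }
    }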
Ken Krugler wrote:
What was the actual format of the Extension B characters in the XML
being posted?
I tried both a binary (UTF-8) format and a numeric character
reference of the type 𠀀 -- the results were the same.
Christian
--
Christian Wittern
Institute for Research in Humanitie
Erik Hatcher wrote:
How are you POSTing the documents to Solr? What content-type are you
using with the HTTP header? And what encoding are you using with the
XML (file?) being POSTed, and is that encoding specified in the XML
file itself?
For these tests I used the script post.sh from the ex
On 29/02/2008, at 01:52, Otis Gospodnetic wrote:
Well, you may want to massage that user-entered query string a
little bit, perhaps just enough to get rid of AND, OR, and NOT.
What if the user enters *foo? Leading wildcards won't work, most
likely. You can either let things break or pro
You have no cache at all when you stop and restart Solr. I recommend
using the provided scripts for index distribution. Run snappuller
and snapinstaller every two hours.
The scripts already do the right thing. A snapshot is created after
a commit on the indexer. Snappuller only copies over an inde
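As a crontab sketch of that schedule (paths are assumptions; see the CollectionDistribution wiki page for the scripts' real options):

    0 */2 * * * /opt/solr/bin/snappuller && /opt/solr/bin/snapinstaller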
Good point. My numbers are from a full rebuild. Let's collect maximum
times, to keep it simple. --wunder
On 2/28/08 7:28 PM, "Alex Benjamen" <[EMAIL PROTECTED]> wrote:
> It mostly depends on whether or not the index is completely new or incremental
>
> 4Gb, 28MM docs, ~30min (new index)
> 4Gb,
And I could collect disk subsystem, JVM, processor, and so on, but
we'd have a seven dimensional rule of thumb, which is kinda scary.
wunder
On 2/28/08 12:14 PM, "Grant Ingersoll" <[EMAIL PROTECTED]> wrote:
> You might want to get info about mergeFactors, and Lucene/Solr
> versions in use.
>
>
Hi Otis,
Thanks for your comments -- I didn't realise the wiki is open to
editing; my apologies. I've put in a few words to try and clear
things up a bit.
So determining n will probably be a best guess followed by trial and
error, that's fine. I'm still not clear about whether single Solr
James,
I can't comment more on the SN's arch choices.
Here is the story about your questions
- 1 Solr instance can hold 1+ indices, either via JNDI (see Wiki) or via the
new multi-core support which works, but is still being hacked on
- See SOLR-303 in JIRA for distributed search. Yonik committ