Re: how to improve concurrent request performance and stress testing

2008-02-07 Thread Ziqi Zhang

Thank you so much! I will look into firstSearcher configuration next! thanks

--
From: "Chris Hostetter" <[EMAIL PROTECTED]>
Sent: Wednesday, February 06, 2008 8:56 PM
To: 
Subject: Re: how to improve concurrent request performance and stress 
testing



: > Also make sure that common filters, sort fields, and facets have been
: > warmed.
:
: I assume these are achieved by setting large cache size and large
: autowarmcount number in solr configuration? specifically

autowarming seeds the caches of a new Searcher using the keys of an old
searcher -- it does nothing to help you when you first start up Solr and
all of the caches are empty.

for that you either need to manually trigger some sample queries
externally (before your stress test) or configure something using the
firstSearcher event listener in solrconfig.xml.

If you saw all of your requests block until the first one finished, then
I suspect your queries involve a sort (or faceting) that uses the
FieldCache, which is initialized single-threaded and can't be
auto-warmed.  (You can put some simple queries that use those sort fields in
the newSearcher listener to ensure that they get reinitialized for each
new searcher.)

-Hoss
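For reference, a minimal sketch of what such event listeners might look like in solrconfig.xml, using Solr's QuerySenderListener. The query parameters below are placeholders -- substitute the sorts, filters, and facets your real traffic uses:

```xml
<!-- runs once, when Solr first starts, before any user queries are served -->
<listener event="firstSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <!-- placeholder warming queries: exercise your common sort fields and facets -->
    <lst><str name="q">solr</str><str name="sort">price asc</str></lst>
    <lst><str name="q">*:*</str><str name="facet">true</str><str name="facet.field">cat</str></lst>
  </arr>
</listener>

<!-- runs each time a new searcher is opened after a commit -->
<listener event="newSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <lst><str name="q">anything</str><str name="sort">price asc</str></lst>
  </arr>
</listener>
```

The firstSearcher queries cover the cold-start case that autowarming cannot, while the newSearcher queries re-populate the FieldCache for each new searcher.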




Re: Search not working for indexed words...

2008-02-07 Thread nithyavembu

Thanks Yonik and Ard.

Yes, it's the stemming problem, and I have removed
"solr.EnglishPorterFilterFactory" from the indexing and querying analyzers.
Now it's working fine. Will any other problem occur if I remove this?

Thanks,
nithya.



Yonik Seeley wrote:
> 
> It's stemming.  Administrator stems to administr
> Stemming isn't really possible for wildcard queries, so administrator*
> won't match.
> If you really need both wildcard queries and stemming, then use two
> different fields (via copyField).
> 
> -Yonik
> 
> On Feb 4, 2008 6:54 AM, nithyavembu <[EMAIL PROTECTED]> wrote:
>>
>> Hi All,
>>
>>   From past 6 months i am working and using SOLR. Now i am facing some
>> problem with that while searching.
>>   I have searched for some words but it doesnt return the result even its
>> existing and indexed in data folder in SOLR server(i meant solr tomcat).
>>
>>   I have given the following words :
>>"administrators",
>>"visitors",
>>
>>   The format of my search query is:
>>   Search word is : administrator*
>>
>> http://192.168.1.65:8085/solr/select/?q=administrator*&version=2.2&start=0&rows=10&indent=on
>>
>>   Its return nothing even the administrator existing in the data folder.
>>
>>   Search word is : administrator
>>
>> http://192.168.1.65:8085/solr/select/?q=administrator&version=2.2&start=0&rows=10&indent=on
>>
>>   If i search for "administrator" without giving "*", its searching and
>> returning the result.
>>
>>   Search word is : administrator/*
>>
>> http://192.168.1.65:8085/solr/select/?q=administrator%5C*&version=2.2&start=0&rows=10&indent=on
>>
>> ("/" decoded as %5C) here.
>> If i search for "administrator/*", its returning the result.
>>
>>  My query should be optimized, so that i can use it over my project. So i
>> need the query using wildcard character like "searchword+*"
>>  But now its not searching if i use "*". But if i use "/*" it can search.
>> But now i have faced the following problem.
>>
>> Search word is : admini\*
>>
>> http://192.168.1.65:8085/solr/select/?q=admini%5C*&version=2.2&start=0&rows=10&indent=on
>>
>> Not returning any result.
>>
>> Search word is : admini
>>
>> http://192.168.1.65:8085/solr/select/?q=admini&version=2.2&start=0&rows=10&indent=on
>>
>> Not returning any result.
>>
>> Search word is : admini*
>>
>> http://192.168.1.65:8085/solr/select/?q=admini*&version=2.2&start=0&rows=10&indent=on
>>
>> This returning result.
>>
>> Search word is : admin
>>
>> If i search the word "admin" or "admin*" or "admin\*", its return
>> the
>> result.
>>
>> I am using the same SolrConfig.xml and Schema.xml without any
>> change given
>> by solr during download and i didnt make any changes on that.
>>
>> Whether i have to change my query or i have to change Schema.xml
>> and
>> whether i have to add any words in stopwords.txt etc..,
>>
>> And likewise some words i am searching and i am getting the
>> result.But after some time if i search for the same word its not
>> searching.Its coming by random.
>>
>> If anyone know the solution and have any idea, please help me
>> out.
>>
>> Thanks in advance.
>>
>> with regards,
>> V.Nithya.
>> --
>> View this message in context:
>> http://www.nabble.com/Search-not-working-for-indexed-words...-tp15266626p15266626.html
>> Sent from the Solr - User mailing list archive at Nabble.com.
>>
>>
> 
> 

-- 
View this message in context: 
http://www.nabble.com/Search-not-working-for-indexed-words...-tp15266626p15331379.html
Sent from the Solr - User mailing list archive at Nabble.com.
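For reference, Yonik's suggestion above of two different fields via copyField might look like this in schema.xml. The field names here are illustrative, and "text_ws" (whitespace tokenization, no stemming) is just one choice of unstemmed type:

```xml
<!-- stemmed field for normal full-text search -->
<field name="body" type="text" indexed="true" stored="true"/>
<!-- unstemmed copy for wildcard/prefix queries; a type without
     EnglishPorterFilterFactory keeps whole tokens like "administrator" intact -->
<field name="body_exact" type="text_ws" indexed="true" stored="false"/>
<copyField source="body" dest="body_exact"/>
```

A query like body_exact:administrator* would then match, while searches against body still benefit from stemming.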




Re: For an "XML" fieldtype

2008-02-07 Thread Frédéric Glorieux (École nationale des chartes)


Thanks Chris,


this idea has been discussed before, most notably in this thread...

http://www.nabble.com/Indexing-XML-files-to7705775.html
...as discussed there, the crux of the issue is not a special fieldtype, 
but a custom ResponseWriter that outputs the XML you want, and leaves any 
field values you want unescaped (assuming you trust them to be well-formed).  
How you decide which field values to leave unescaped could either be 
hardcoded, or driven by the FieldType of each field (in which case you 
might write an XmlField that subclasses StrField, but you wouldn't need to 
override any methods -- just see that the FieldType is XmlField and use 
that as your guide).



Sorry for not having found this link. I discovered that I have done 
exactly the same as mirko-9:


xmlWriter.writePrim("xml", name, f.stringValue(), false);

So this is a good way to implement our need, but there are good reasons 
not to commit it to the Solr core: the XmlResponseWriter schema, and code 
injection risks. Such prudence makes us very confident in Solr.



: I would be glad that this class could be commited, so that I do not need to
: keep it up to date with future Solr release.

as long as you stick to the contracts of FieldType and/or ResponseWriter 
you don't need to worry -- these are published SolrPlugin APIs that Solr 
won't break ... we expect people to implement them, and people can expect 
their plugins to work when they upgrade Solr.




--
Frédéric Glorieux


Highlight on non-text fields and/or field-match list

2008-02-07 Thread jnagro

I've done some searching through the archives and Google, as well as some
tinkering on my own, to no avail. My goal is to get a list of the fields
that matched a particular query. At first I thought highlighting was the
solution; however, it's slowly becoming clear that it doesn't do what I need
it to. For example, if I have a field in a document such as "username", which
is a string that I'll do wild-card searches on, Solr will return document
matches but no highlight data for that field. The end goal is to know which
fields matched; at this point I don't need the highlighted fragment itself.
Is there any way to generate a list of matching fields? Something easier
than trying to parse debug information?

Thanks!


-- 
View this message in context: 
http://www.nabble.com/Highlight-on-non-text-fields-and-or-field-match-list-tp15337656p15337656.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Query with literal quote character: 6'2"

2008-02-07 Thread Yonik Seeley
On Feb 7, 2008 12:24 PM, Walter Underwood <[EMAIL PROTECTED]> wrote:
> We have a movie with this title: 6'2"
>
> I can get that string indexed, but I can't get it through the query
> parser and into DisMax. It goes through the analyzers fine. I can
> run the analysis tool in the admin interface and get a match with
> that exact string.
>
> These variants don't work:
>
> 6'2"
> 6'2\"
> 6\'2\"
>
> Any ideas? I'm still running 1.1. Been a bit busy to plan the upgrade.

I confirmed this behavior in trunk with the following query:
http://localhost:8983/solr/select?qt=dismax&q=6'2"&debugQuery=on&qf=cat&pf=cat

The result is that the double quote is dropped:
+DisjunctionMaxQuery((cat:6'2)~0.01) DisjunctionMaxQuery((cat:6'2)~0.01)

This seems like it's a bug (rather than by design), but I could be
wrong... Hoss?

-Yonik


Re: Lucene-based Distributed Index Leveraging Hadoop

2008-02-07 Thread Andrzej Bialecki

Doug Cutting wrote:

> Ning,
>
> I am also interested in starting a new project in this area.  The
> approach I have in mind is slightly different, but hopefully we can come
> to some agreement and collaborate.


I'm interested in this too.

> My current thinking is that the Solr search API is the appropriate
> model.  Solr's facets are an important feature that require low-level
> support to be practical.  Thus a useful distributed search system should
> support facets from the outset, rather than attempt to graft them on
> later.  In particular, I believe this requirement mandates disjoint shards.


I agree - shards should be disjoint also because if we eventually want 
to manage multiple replicas of each shard across the cluster (for 
reliability and performance) then overlapping documents would complicate 
both the query dispatching process and the merging of partial result sets.



> My primary difference with your proposal is that I would like to support
> online indexing.  Documents could be inserted and removed directly, and
> shards would synchronize changes amongst replicas, with an "eventual
> consistency" model.  Indexes would not be stored in HDFS, but directly
> on the local disk of each node.  Hadoop would perhaps not play a role.
> In many ways this would resemble CouchDB, but with explicit support for
> sharding and failover from the outset.


It's true that searching over HDFS is slow - but I'd hate to lose all 
other HDFS benefits and have to start from scratch ... I wonder what 
would be the performance of FsDirectory over an HDFS index that is 
"pinned" to a local disk, i.e. a full local replica is available, with 
block size of each index file equal to the file size.



> A particular client should be able to provide a consistent read/write
> view by bonding to particular replicas of a shard.  Thus a user who
> makes a modification should be able to generally see that modification
> in results immediately, while other users, talking to different
> replicas, may not see it until synchronization is complete.


This requires that we use versioning, and that we have a "shard manager" 
that knows the latest versions of each shard among the whole active set 
- or that clients discover this dynamically by querying the shard 
servers every now and then.


--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: uniqueKey gives duplicate values

2008-02-07 Thread Yonik Seeley
On Feb 7, 2008 2:51 PM, vijay_schi <[EMAIL PROTECTED]> wrote:
> I want to know, what type of analyzers can be used for the data 12345_r,
> 12346_r, 12345_c, 12346_c etc , type of data.
>
> I had text type for that uniqueKey and some query , index analyzers on it. i
> think thats making duplicates.

Yes, that is the problem.  uniqueKeys must be single tokens after analysis.
Use the string type instead.

-Yonik
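For reference, a minimal sketch of the relevant schema.xml pieces (the field name "id" is assumed here):

```xml
<!-- "string" (StrField) indexes the value verbatim as a single token,
     so ids like 12345_r survive intact and duplicate detection works -->
<field name="id" type="string" indexed="true" stored="true"/>
<uniqueKey>id</uniqueKey>
```

An analyzed "text" type would split 12345_r into multiple tokens, which is what breaks uniqueKey deduplication.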


Re: Query with literal quote character: 6'2"

2008-02-07 Thread Walter Underwood
Huh? Queries come in through URL parameters and this is all ASCII
anyway. Even in XML, entities and UTF-8 decode to the same characters
after parsing.

The glyph formerly known as Prince belongs in the private use area,
of course.

wunder

On 2/7/08 11:06 AM, "Lance Norskog" <[EMAIL PROTECTED]> wrote:

> Some people loathe UTF-8 and do all of their text in XML entities. This
> might work better for your punctuation needs.  But it still won't help you
> with Prince :)
> 
> -Original Message-
> From: Walter Underwood [mailto:[EMAIL PROTECTED]
> Sent: Thursday, February 07, 2008 9:25 AM
> To: solr-user@lucene.apache.org
> Subject: Query with literal quote character: 6'2"
> 
> We have a movie with this title: 6'2"
> 
> I can get that string indexed, but I can't get it through the query parser
> and into DisMax. It goes through the analyzers fine. I can run the analysis
> tool in the admin interface and get a match with that exact string.
> 
> These variants don't work:
> 
> 6'2"
> 6'2\"
> 6\'2\"
> 
> Any ideas? I'm still running 1.1. Been a bit busy to plan the upgrade.
> 
> wunder




Re: uniqueKey gives duplicate values

2008-02-07 Thread Yonik Seeley
On Feb 7, 2008 2:27 PM, vijay_schi <[EMAIL PROTECTED]> wrote:
> I'm new to solr. I have a uniqueKey on string which has the data of
> 12345_r,12346_r etc etc.
> when I'm posting xml with same data second time, it allows the docs to be
> added. when i search for id:12345_r on solr client , i'm getting multiple
> records. what might be the problem ?
>
> previously I'm using in integer, it was working fine. As I changed to
> string, its allwing duplicates.
>
> can anyone Clear/Help on this?

Double check that your uniqueKey correctly specifies the name of the
field that contains 12345_r,
and that you restarted Solr after the schema change.

The example schema that comes with Solr uses a string uniqueKey, and
it works fine.

-Yonik


uniqueKey gives duplicate values

2008-02-07 Thread vijay_schi

Hi,

I'm new to Solr. I have a uniqueKey on a string field which has data like
12345_r, 12346_r, etc.
When I post the same XML data a second time, it allows the docs to be
added. When I search for id:12345_r on the Solr client, I'm getting multiple
records. What might be the problem?

Previously I was using an integer and it was working fine. Since I changed to
string, it's allowing duplicates.

Can anyone help with this?
-- 
View this message in context: 
http://www.nabble.com/uniqueKey-gives-duplicate-values-tp15341288p15341288.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: how to improve concurrent request performance and stress testing

2008-02-07 Thread Chris Hostetter

: Thank you so much! I will look into firstSearcher configuration next! thanks

FYI: prompted by this thread, I added some blurbs about firstSearcher, 
newSearcher, and FieldCache to the SolrCaching wiki ... as a new user 
learning about this stuff, please feel free to update that wiki with any 
improvements you can think of...

http://wiki.apache.org/solr/SolrCaching

(the best people to write documentation are new users who (unlike the 
developers) don't have intrinsic knowledge of the subject in the back of 
their mind that they take for granted).


-Hoss



RE: Indexing Japanese & English

2008-02-07 Thread Lance Norskog
Here are the comments for CJKTokenizer.  First, is this what you want?
Remember, there are three Japanese writing systems.

/**
 * CJKTokenizer was modified from StopTokenizer which does a decent job for
 * most European languages. It performs other token methods for double-byte
 * Characters: the token will return at each two charactors with overlap
match.
 * Example: "java C1C2C3C4" will be segment to: "java" "C1C2" "C2C3" "C3C4"
it
 * also need filter filter zero length token ""
 * for Digit: digit, '+', '#' will token as letter
 * for more info on Asia language(Chinese Japanese Korean) text
segmentation:
 * please search google: http://www.google.com/search?q=word+chinese+segment
 *
 * @author Che, Dong
 */

-Original Message-
From: Paul Clegg [mailto:[EMAIL PROTECTED] 
Sent: Thursday, February 07, 2008 10:36 AM
To: solr-user@lucene.apache.org
Subject: Indexing Japanese & English

I hate asking stupid questions immediately after joining a mailing list, but
I'm in a bit of a pinch here.

 

I'm using Solr/Tomcat for a Ruby on Rails project (acts_as_solr) and I've
had a lot of success getting it working -- for English.  The problem I'm
running into is that our primary customers are actually Japanese.

 

I've done the searching around, and found the thread back in June about
using Lucene's CJKAnalyzer and CJKTokenizer, but apparently I need to write
my own factory or something.  It looks like it's only three lines of Java
code, and I can cut & paste with the best of them.

 

Here's the problem:  I know zip, zilch, zero about Java.  I just hate the
language with an absolute passion.  The reason I went with Solr (besides the
fact it's pretty much the only real game going) is that I could avoid the
Java parts by directly dealing with its XML, JSON and Ruby interfaces.

 

So I'm wondering if there are any "Adding CJKTokenizer to Solr for Dummies"
guides out there someone can point me to, to tell me, pretty much
step-by-step, what I need to do to get this configured.  I saw something
about unpacking the solr.war and repacking it, but, since I know dinkus
about Java, that really didn't mean a whole lot to me, even though I'm
guessing it's probably a grand total of four commands at the unix prompt.
:)

 

.Paul

 

Paul Clegg, Principal Software Engineer

My Digital Life, Inc. (www.mydl.com)

NetService Ventures Group (www.nsv.com)

2108 Sand Hill Road, Menlo Park, CA 94025

Email:  [EMAIL PROTECTED]

Cell: 650-619-1220

 




Re: Query with literal quote character: 6'2"

2008-02-07 Thread Chris Hostetter

: I confirmed this behavior in trunk with the following query:
: http://localhost:8983/solr/select?qt=dismax&q=6'2"&debugQuery=on&qf=cat&pf=cat
: 
: The result is that the double quote is dropped:
: +DisjunctionMaxQuery((cat:6'2)~0.01) DisjunctionMaxQuery((cat:6'2)~0.01)
: 
: This seems like it's a bug (rather than by design), but I could be
: wrong... Hoss?

It was by design ... but it could be handled better.  the idea is that if 
the input has balanced quotes (ie: an even number) then leave them alone 
so they are dealt with as phrase delimiters.  If there is an uneven number, 
strip them out, since we don't know whether they are a mistake (ie: an 
unclosed phrase) or intended to be literal.

auto-escaping them probably would have been a better way to go (ie: let 
the analyzer decide whether or not to strip them) ... i'm not sure why i 
didn't do that in the first place (I think at the time the lucene 
QueryParser didn't deal with escaped quotes very well)

the thing to keep in mind is that even if it did escape them, this still 
wouldn't work if the user input were...

 the 6'2" man dating the 5'3" woman

...because it would assume the even number of double-quote characters means 
that   " man dating the 5'3"  is a phrase.  i remember spending a day 
going over query logs trying to figure out a good set of heuristic rules 
for guessing when quote characters in user input should be interpreted as 
phrase delimiters vs "inch" markers before a coworker smacked me and made me 
realize it was a fairly intractable problem and simple rules would be 
easier to understand anyway.

FYI: this is all happening in 
SolrPluginUtils.stripUnbalancedQuotes(CharSequence), which 
DisMaxRequestHandler calls before passing the string to 
DisjunctionMaxQueryParser.



-Hoss



RE: Query with literal quote character: 6'2"

2008-02-07 Thread Fuad Efendi
This query works just fine: http://www.tokenizer.org/?q=Romeo+%2B+Juliet

%2B is the URL-encoded representation of +.
It shows, for instance, [Romeo & Juliet] in the output.


> -Original Message-
> From: Walter Underwood [mailto:[EMAIL PROTECTED] 
> Sent: Thursday, February 07, 2008 3:25 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Query with literal quote character: 6'2"
> 
> 
> Our users can blow up the parser without special characters.
> 
>   AND THE BAND PLAYED ON
>   TO HAVE AND HAVE NOT
> 
> Lower-casing in the front end avoids that.
> 
> We have auto-complete on titles, so the there are plenty
> of chances to inadvertently use special characters:
> 
>   Romeo + Juliet
>   Airplane! 
>   Shrek (Widescreen)
> 
> We also have people type "--" for a dash in titles.
> 
> wunder
> 
> On 2/7/08 12:00 PM, "Chris Hostetter" 
> <[EMAIL PROTECTED]> wrote:
> 
> > 
> > : How about the query parser respecting backslash escaping? I need
> > 
> > one of the orriginal design decisions was "no user 
> escaping" ... be able
> > to take in raw query strings from the user with only '+' '-' and '"'
> > treated as special characters ... if you allow backslash 
> escaping of those
> > characters, then by definition '\' becomes a special character too.
> > 
> > : free-text input, no syntax at all. Right now, I'm escaping every
> > : Lucene special character in the front end. I just figured out that
> > : it breaks for colon, can't search for "12:01" with "12\:01".
> > 
> > yeah ... your '\' character is being taken litterally.  you 
> shouldn't do
> > any escaping if you hand off to dismax.
> > 
> > the right thing to do is probably to expose more the "query 
> parsing" stuff
> > as options for hte handler ... let people configure it with what
> > characters should be escaped, and what should be left 
> alone.  We should
> > also stop using the static utility methods for things like partial
> > escaping and unbalanced quote striping and start using 
> helper methods
> > that subclasses can override.
> > 
> > 
> > -Hoss
> > 
> 
> 
> 



Re: Query with literal quote character: 6'2"

2008-02-07 Thread Walter Underwood
Our users can blow up the parser without special characters.

  AND THE BAND PLAYED ON
  TO HAVE AND HAVE NOT

Lower-casing in the front end avoids that.

We have auto-complete on titles, so there are plenty
of chances to inadvertently use special characters:

  Romeo + Juliet
  Airplane! 
  Shrek (Widescreen)

We also have people type "--" for a dash in titles.

wunder

On 2/7/08 12:00 PM, "Chris Hostetter" <[EMAIL PROTECTED]> wrote:

> 
> : How about the query parser respecting backslash escaping? I need
> 
> one of the orriginal design decisions was "no user escaping" ... be able
> to take in raw query strings from the user with only '+' '-' and '"'
> treated as special characters ... if you allow backslash escaping of those
> characters, then by definition '\' becomes a special character too.
> 
> : free-text input, no syntax at all. Right now, I'm escaping every
> : Lucene special character in the front end. I just figured out that
> : it breaks for colon, can't search for "12:01" with "12\:01".
> 
> yeah ... your '\' character is being taken litterally.  you shouldn't do
> any escaping if you hand off to dismax.
> 
> the right thing to do is probably to expose more the "query parsing" stuff
> as options for hte handler ... let people configure it with what
> characters should be escaped, and what should be left alone.  We should
> also stop using the static utility methods for things like partial
> escaping and unbalanced quote striping and start using helper methods
> that subclasses can override.
> 
> 
> -Hoss
> 



RE: Indexing Japanese & English

2008-02-07 Thread Paul Clegg
Yes, I've seen this bit.  Near as I can tell, it's what I want, so that our
Japanese users can search on a double-byte character and get back results
(since they don't use spaces to delineate words, it's impossible in the
default solr configuration to find a single double-byte character somewhere
"in the middle" of a sentence).

What I need are the directions for how to compile the four-line factory Java
code, where in the tree to put it, and how to get Solr to recognize it.  I
don't know the Java commands for doing any of that.  That's where I need
the help.  How do I compile the factory code fragment?  How do I get it into
the solr.war file?

I found this in the archives, but don't know what to do with it, primarily
because I don't know how the Java and Tomcat stuff works:

http://mail-archives.apache.org/mod_mbox/lucene-solr-user/200706.mbox/%3c200
[EMAIL PROTECTED]

...Paul

Paul Clegg, Principal Software Engineer
My Digital Life, Inc. (www.mydl.com)
NetService Ventures Group (www.nsv.com)
2108 Sand Hill Road, Menlo Park, CA 94025
Email:  [EMAIL PROTECTED]
Cell: 650-619-1220
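For reference, once such a factory class has been compiled into a jar and placed on Solr's classpath, wiring it up in schema.xml might look like the sketch below. Both the fieldtype name "text_cjk" and the factory class name are hypothetical -- the factory itself is the piece that has to be written and compiled first:

```xml
<!-- hypothetical factory wrapping Lucene's contrib CJKTokenizer -->
<fieldtype name="text_cjk" class="solr.TextField">
  <analyzer>
    <tokenizer class="org.example.solr.CJKTokenizerFactory"/>
  </analyzer>
</fieldtype>
```

Fields that should be searchable in Japanese would then be declared with type="text_cjk".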

-Original Message-
From: Lance Norskog [mailto:[EMAIL PROTECTED] 
Sent: Thursday, February 07, 2008 11:05 AM
To: solr-user@lucene.apache.org
Subject: RE: Indexing Japanese & English

Here are the comments for CJKTokenizer.  First, is this what you want?
Remember, there are three Japanese writing systems.

/**
 * CJKTokenizer was modified from StopTokenizer which does a decent job for
 * most European languages. It performs other token methods for double-byte
 * Characters: the token will return at each two charactors with overlap
match.
 * Example: "java C1C2C3C4" will be segment to: "java" "C1C2" "C2C3" "C3C4"
it
 * also need filter filter zero length token ""
 * for Digit: digit, '+', '#' will token as letter
 * for more info on Asia language(Chinese Japanese Korean) text
segmentation:
 * please search google: http://www.google.com/search?q=word+chinese+segment
 *
 * @author Che, Dong
 */

-Original Message-
From: Paul Clegg [mailto:[EMAIL PROTECTED] 
Sent: Thursday, February 07, 2008 10:36 AM
To: solr-user@lucene.apache.org
Subject: Indexing Japanese & English

I hate asking stupid questions immediately after joining a mailing list, but
I'm in a bit of a pinch here.

 

I'm using Solr/Tomcat for a Ruby on Rails project (acts_as_solr) and I've
had a lot of success getting it working -- for English.  The problem I'm
running into is that our primary customers are actually Japanese.

 

I've done the searching around, and found the thread back in June about
using Lucene's CJKAnalyzer and CJKTokenizer, but apparently I need to write
my own factor or something.  It looks like it's only three lines of Java
code, and I can cut & paste with the best of them.

 

Here's the problem:  I know zip, zilch, zero about Java.  I just hate the
language with an absolute passion.  The reason I went with Solr (besides the
fact it's pretty much the only real game going) is that I could avoid the
Java parts by directly dealing with its XML, JSON and Ruby interfaces.

 

So I'm wondering if there are any "Adding CJKTokenizer to Solr for Dummies"
guides out there someone can point me to, to tell me, pretty much
step-by-step, what I need to do to get this configured.  I saw something
about unpacking the solr.war and repacking it, but, since I know dinkus
about Java, that really didn't mean a whole lot to me, even though I'm
guessing it's probably a grand total of four commands at the unix prompt.
:)

 

.Paul

 

Paul Clegg, Principal Software Engineer

My Digital Life, Inc. (www.mydl.com)

NetService Ventures Group (www.nsv.com)

2108 Sand Hill Road, Menlo Park, CA 94025

Email:  [EMAIL PROTECTED]

Cell: 650-619-1220

 




RE: Query with literal quote character: 6'2"

2008-02-07 Thread Fuad Efendi
I have the same kind of queries working correctly on my site.

It's probably because I am using URL escaping:
http://www.tokenizer.org/?q=6%272%22



> -Original Message-
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf 
> Of Yonik Seeley
> Sent: Thursday, February 07, 2008 12:58 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Query with literal quote character: 6'2"
> 
> 
> On Feb 7, 2008 12:24 PM, Walter Underwood 
> <[EMAIL PROTECTED]> wrote:
> > We have a movie with this title: 6'2"
> >
> > I can get that string indexed, but I can't get it through the query
> > parser and into DisMax. It goes through the analyzers fine. I can
> > run the analysis tool in the admin interface and get a match with
> > that exact string.
> >
> > These variants don't work:
> >
> > 6'2"
> > 6'2\"
> > 6\'2\"
> >
> > Any ideas? I'm still running 1.1. Been a bit busy to plan 
> the upgrade.
> 
> I confirmed this behavior in trunk with the following query:
> http://localhost:8983/solr/select?qt=dismax&q=6'2"&debugQuery=on&qf=cat&pf=cat
> 
> The result is that the double quote is dropped:
> +DisjunctionMaxQuery((cat:6'2)~0.01) DisjunctionMaxQuery((cat:6'2)~0.01)
> 
> This seems like it's a bug (rather than by design), but I could be
> wrong... Hoss?
> 
> -Yonik




Re: Query with literal quote character: 6'2"

2008-02-07 Thread Chris Hostetter

: How about the query parser respecting backslash escaping? I need

one of the original design decisions was "no user escaping" ... be able 
to take in raw query strings from the user with only '+' '-' and '"' 
treated as special characters ... if you allow backslash escaping of those 
characters, then by definition '\' becomes a special character too.

: free-text input, no syntax at all. Right now, I'm escaping every
: Lucene special character in the front end. I just figured out that
: it breaks for colon, can't search for "12:01" with "12\:01".

yeah ... your '\' character is being taken literally.  you shouldn't do 
any escaping if you hand off to dismax.

the right thing to do is probably to expose more of the "query parsing" 
stuff as options for the handler ... let people configure which 
characters should be escaped, and which should be left 
alone.  We should also stop using the static utility methods for things 
like partial escaping and unbalanced quote stripping and start using 
helper methods that subclasses can override.


-Hoss



Re: Query with literal quote character: 6'2"

2008-02-07 Thread Walter Underwood
How about the query parser respecting backslash escaping? I need
free-text input, no syntax at all. Right now, I'm escaping every
Lucene special character in the front end. I just figured out that
it breaks for colon, can't search for "12:01" with "12\:01".

wunder

On 2/7/08 11:06 AM, "Chris Hostetter" <[EMAIL PROTECTED]> wrote:

> 
> : I confirmed this behavior in trunk with the following query:
> : 
> http://localhost:8983/solr/select?qt=dismax&q=6'2"&debugQuery=on&qf=cat&pf=cat
> : 
> : The result is that the double quote is dropped:
> : +DisjunctionMaxQuery((cat:6'2)~0.01) DisjunctionMaxQuery((cat:6'2)~0.01)
> : 
> : This seems like it's a bug (rather than by design), but I could be
> : wrong... Hoss?
> 
> It was by design ... but it could be handled better.  the idea is that if
> the input has balanced quotes (ie: an even number) then leave them alone
> so they are dealt with as phrase delimiters.  If there is an uneven number
> strip them out since we don't know wether they are a mistake (ie: unclosed
> phrase) or intended to be literal.
> 
> auto-escaping them probably would have been a better way to go (ie: let
> the analyzer decide wether or not to strip them) ... i'm not sure why i
> didn't do that in the first place (I think at the time the lucene
> QueryParser didn't deal with escaped quotes very well)
> 
> the thing to keep in mind, is that even if it did escape them, this still
> wouldn't work if the user input were...
> 
>  the 6'2" man dating the 5'3" woman
> 
> ...because it would assume the even number of double-quote characters mean
> that   " man dating the 5'3"  is a phrase.  i remember spending a day
> going over query loks trying tp figure out a good set of hueristic rules
> for guessing when quote characters in user input should be interpreted as
> phrase delims vs "inch" markers before a coworker smacked me and made me
> realize it was a fairly intractable problem and simple rules would be
> easier to understand anyway.
> 
> FYI: this is all happening in
> SolrPluginUtils.stripUnbalancedQuotes(CharSequence) which
> DisMax(RequestHanler) calls before passing the string to
> DisjunctionMaxQueryParser.
> 
> 
> 
> -Hoss
> 

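For anyone curious, the heuristic Hoss describes is simple enough to sketch. This is an illustrative re-implementation, not the actual SolrPluginUtils source, assuming only the rule stated above: an even quote count is treated as balanced phrase delimiters and kept, an odd count is stripped entirely.

```java
// Sketch of the "balanced quotes" heuristic: count double quotes; if the
// count is even, assume they are phrase delimiters and leave them alone;
// if odd, strip them all, since we can't tell an unclosed phrase from a
// literal inch mark.
public class QuoteStripper {
    public static String stripUnbalancedQuotes(CharSequence s) {
        int quotes = 0;
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) == '"') quotes++;
        }
        if (quotes % 2 == 0) {
            return s.toString();                    // balanced: keep as-is
        }
        return s.toString().replace("\"", "");      // unbalanced: strip all
    }

    public static void main(String[] args) {
        System.out.println(stripUnbalancedQuotes("6'2\""));   // -> 6'2
        System.out.println(stripUnbalancedQuotes("\"a b\"")); // -> "a b"
    }
}
```

Note how this also reproduces the intractable case from the email: `the 6'2" man dating the 5'3" woman` has an even quote count, so the quotes survive and get misread as a phrase.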


Re: solrcofig.xml - need some info

2008-02-07 Thread Chris Hostetter
: I am pretty new to solr. I was wondering what is this "mm" attribute in
: requestHandler in solrconfig.xml and how it works. Tried to search wiki
: could not find it

Hmmm... yeah, wiki search does mid-word matching, doesn't it?

the key thing to realize is that the requestHandler you were looking at 
when you saw that option was the DisMaxRequestHandler...

http://wiki.apache.org/solr/DisMaxRequestHandler



-Hoss
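For readers landing here from search: "mm" is the DisMax "minimum (should) match" parameter, set as a default on the handler in solrconfig.xml. A hedged sketch of such a handler (the qf fields and the thresholds are only illustrative, not from any particular config):

```xml
<requestHandler name="dismax" class="solr.DisMaxRequestHandler">
  <lst name="defaults">
    <str name="qf">title^2.0 body</str>
    <!-- mm controls how many of the optional query clauses must match:
         2<-1  : for more than 2 clauses, all but 1 must match
         5<-2  : for more than 5 clauses, all but 2 must match
         6<90% : for more than 6 clauses, 90% must match -->
    <str name="mm">2&lt;-1 5&lt;-2 6&lt;90%</str>
  </lst>
</requestHandler>
```

Note that inside the XML config the `<` characters of the mm spec must be escaped as `&lt;`.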



RE: Query with literal quote character: 6'2"

2008-02-07 Thread Lance Norskog
Some people loathe UTF-8 and do all of their text in XML entities. This
might work better for your punctuation needs.  But it still won't help you
with Prince :)

-Original Message-
From: Walter Underwood [mailto:[EMAIL PROTECTED] 
Sent: Thursday, February 07, 2008 9:25 AM
To: solr-user@lucene.apache.org
Subject: Query with literal quote character: 6'2"

We have a movie with this title: 6'2"

I can get that string indexed, but I can't get it through the query parser
and into DisMax. It goes through the analyzers fine. I can run the analysis
tool in the admin interface and get a match with that exact string.

These variants don't work:

6'2"
6'2\"
6\'2\"

Any ideas? I'm still running 1.1. Been a bit busy to plan the upgrade.

wunder




Search result not coming for normal special characters...

2008-02-07 Thread nithyavembu

Hi All,

 Now I am facing a problem with special character search.
 I tried with the following special characters
(!,@,#,$,%,^,&,*,(,),{,},[,]).
 My indexing data is :
!national!
@national@
#national#
$national$
%national%
^national^
&national&

 My search data is :
 @national@

 But when I search for "@national@", it returns the following result:

!national!
@national@
#national#
$national$
%national%
^national^
&national&

 But the actual result should be only "@national@".
 So it matches only "national" and returns those results; it didn't match "@".
 In the Solr UI I tried giving only "@" and there is no search result, but the
index data contains words with
special characters.
 Do I have to change any configuration in schema.xml?
 I am using the same solrconfig.xml and schema.xml shipped with the Solr
download.
 If anyone knows the solution, please help me.
 
with Regards,
nithya.
-- 
View this message in context: 
http://www.nabble.com/Search-result-not-coming-for-normal-special-characters...-tp15339827p15339827.html
Sent from the Solr - User mailing list archive at Nabble.com.



Indexing Japanese & English

2008-02-07 Thread Paul Clegg
I hate asking stupid questions immediately after joining a mailing list, but
I'm in a bit of a pinch here.

 

I'm using Solr/Tomcat for a Ruby on Rails project (acts_as_solr) and I've
had a lot of success getting it working -- for English.  The problem I'm
running into is that our primary customers are actually Japanese.

 

I've done the searching around, and found the thread back in June about
using Lucene's CJKAnalyzer and CJKTokenizer, but apparently I need to write
my own factory or something.  It looks like it's only three lines of Java
code, and I can cut & paste with the best of them.

 

Here's the problem:  I know zip, zilch, zero about Java.  I just hate the
language with an absolute passion.  The reason I went with Solr (besides the
fact it's pretty much the only real game going) is that I could avoid the
Java parts by directly dealing with its XML, JSON and Ruby interfaces.

 

So I'm wondering if there are any "Adding CJKTokenizer to Solr for Dummies"
guides out there someone can point me to, to tell me, pretty much
step-by-step, what I need to do to get this configured.  I saw something
about unpacking the solr.war and repacking it, but, since I know dinkus
about Java, that really didn't mean a whole lot to me, even though I'm
guessing it's probably a grand total of four commands at the unix prompt.
:)

 

.Paul

 

Paul Clegg, Principal Software Engineer

My Digital Life, Inc. (www.mydl.com)

NetService Ventures Group (www.nsv.com)

2108 Sand Hill Road, Menlo Park, CA 94025

Email:  [EMAIL PROTECTED]

Cell: 650-619-1220
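For the record, the factory the June thread describes really is tiny. An untested sketch follows; the class and package placement are your choice, and it must be compiled against the Lucene contrib (CJK analyzer) and Solr jars, dropped onto Solr's classpath, and then referenced from a `<tokenizer class="..."/>` element in schema.xml:

```java
import java.io.Reader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.cjk.CJKTokenizer;
import org.apache.solr.analysis.BaseTokenizerFactory;

// Minimal Solr factory wrapping Lucene's CJKTokenizer so schema.xml can
// reference it for a CJK-aware field type.
public class CJKTokenizerFactory extends BaseTokenizerFactory {
    public TokenStream create(Reader input) {
        return new CJKTokenizer(input);
    }
}
```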

 



Query with literal quote character: 6'2"

2008-02-07 Thread Walter Underwood
We have a movie with this title: 6'2"

I can get that string indexed, but I can't get it through the query
parser and into DisMax. It goes through the analyzers fine. I can
run the analysis tool in the admin interface and get a match with
that exact string.

These variants don't work:

6'2"
6'2\"
6\'2\"

Any ideas? I'm still running 1.1. Been a bit busy to plan the upgrade.

wunder



RE: Query with literal quote character: 6'2"

2008-02-07 Thread Fuad Efendi
This is what appears in Address Bar of IE:
http://localhost:8080/apache-solr-1.2.0/select/?q=item_name%3A%22Romeo%2BJul
iet%22%2Bcategory%3A%22books%22&version=2.2&start=0&rows=10&indent=on

Input was:
item_name:"Romeo+Juliet"+category:"books"

Another input which works just fine: item_name:"6'\"" (user input was just
6'2")

It is not a bug/problem of Solr. Solr can't be exposed directly to end
users. For handling user input and generating a Solr-specific query, use
something... So I don't really understand why we need HTTP caching
support in Solr if we can't use it without a "front-end" (off-topic; I use
HTTP caching at the front-end, and don't use Solr's HTTP caching at all,
especially because it can't reply to a request with an If-Modified-Since header).


> -Original Message-
> From: Fuad Efendi 

> I encapsulate user input into:
> Item_name:"Romeo+Juliet" AND category:"books"
> Item_name:"Romeo+Juliet"+category:"books"
> 



What should be the best config for a multilingual site

2008-02-07 Thread Leonardo Santagada
I'm working for a French/English site and I want to know what filters  
would be nice and are recommended. Should I use two stemmers, or is there  
a way to make one of them bilingual?  I am using the Latin-1 filter  
also; any more ideas?


[]'s
--
Leonardo Santagada





Re: strange updating inconsistency

2008-02-07 Thread Chris Hostetter
: odd behavior while updating. The use case is that a document gets indexed
: with a status, in this case it's -1 for documents that aren't ready to be
: searched yet and 1 otherwise. Initial indexing works perfectly, and getting
: a result set of documents with the status of -1 works as well. 
...
: Where things get messy is the update post after a document has been deemed
: ready for searching. The unique ID in my index is a long called clipId.
: Below are two documents that I've posted multiple times to my index. The
: first one updates the index, the second does not. Is there something I'm
: missing about?

I'm confused: the sentences above suggest that you aren't seeing updates 
when you try to index a document with the same uniqueKey value as a 
previous document (because some fields have changed) ... yet in the two 
example docs you listed the "clipId" value is different (they seem to be 
completely different docs)

what part of your question am I not understanding?

general suggestions: have you double checked that no interesting error 
messages are getting logged after the updates that don't seem to work?

what does the response (HTTP status code and body) look like for the 
updates?


-Hoss



Socket exception

2008-02-07 Thread Sundar Sankaranarayanan
Hi All,
   I have been using Solr for a couple of months now and am very
satisfied with it. My Solr dev environment runs on a Windows box with
1 GB of memory, and the solr.war is deployed on JBoss 4.0.5. While
investigating a "Solr commit not working sometimes" issue in our
application, I found that the server was sometimes throwing a
"socket exception: connection refused", and whenever this happened
the commit/optimize did not function properly. I am not sure why this is
happening: when the box also hosted the application, we never got the
issue, but now that it is used solely as a Solr server, we are getting
this. Any ideas / suggestions to solve this are appreciated.


Thanks and Regards
Sundar

P.S : The stack trace for the same :::


2008-02-06 17:10:08,101 [STDERR:152] ERROR  - Feb 6, 2008 5:10:08 PM
org.apache.solr.core.SolrException log
SEVERE: java.net.SocketException: Connection reset
 at java.net.SocketInputStream.read(SocketInputStream.java:168)
 at org.apache.coyote.http11.InternalInputBuffer.fill(InternalInputBuffer.java:747)
 at org.apache.coyote.http11.InternalInputBuffer$InputStreamInputBuffer.doRead(InternalInputBuffer.java:777)
 at org.apache.coyote.http11.filters.IdentityInputFilter.doRead(IdentityInputFilter.java:115)
 at org.apache.coyote.http11.InternalInputBuffer.doRead(InternalInputBuffer.java:712)
 at org.apache.coyote.Request.doRead(Request.java:418)
 at org.apache.catalina.connector.InputBuffer.realReadBytes(InputBuffer.java:284)
 at org.apache.tomcat.util.buf.ByteChunk.substract(ByteChunk.java:404)
 at org.apache.catalina.connector.InputBuffer.read(InputBuffer.java:299)
 at org.apache.catalina.connector.CoyoteInputStream.read(CoyoteInputStream.java:192)
 at sun.nio.cs.StreamDecoder$CharsetSD.readBytes(StreamDecoder.java:411)
 at sun.nio.cs.StreamDecoder$CharsetSD.implRead(StreamDecoder.java:453)
 at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:183)
 at java.io.InputStreamReader.read(InputStreamReader.java:167)
 at org.xmlpull.mxp1.MXParser.fillBuf(MXParser.java:2972)
 at org.xmlpull.mxp1.MXParser.more(MXParser.java:3026)
 at org.xmlpull.mxp1.MXParser.parseAttribute(MXParser.java:2026)
 at org.xmlpull.mxp1.MXParser.parseStartTag(MXParser.java:1799)
 at org.xmlpull.mxp1.MXParser.nextImpl(MXParser.java:1259)
 at org.xmlpull.mxp1.MXParser.next(MXParser.java:1093)
 at org.xmlpull.mxp1.MXParser.nextTag(MXParser.java:1078)
 at org.apache.solr.handler.XmlUpdateRequestHandler.readDoc(XmlUpdateRequestHandler.java:298)
 at org.apache.solr.handler.XmlUpdateRequestHandler.update(XmlUpdateRequestHandler.java:162)
 at org.apache.solr.handler.XmlUpdateRequestHandler.handleRequestBody(XmlUpdateRequestHandler.java:84)
 at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:77)
 at org.apache.solr.core.SolrCore.execute(SolrCore.java:658)
 at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:191)
 at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:159)
 at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:202)
 at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:173)
 at org.jboss.web.tomcat.filters.ReplyHeaderFilter.doFilter(ReplyHeaderFilter.java:96)
 at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:202)
 at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:173)
 at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:213)
 at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:178)
 at org.jboss.web.tomcat.security.SecurityAssociationValve.invoke(SecurityAssociationValve.java:175)
 at org.jboss.web.tomcat.security.JaccContextValve.invoke(JaccContextValve.java:74)
 at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:126)
 at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:105)
 at org.jboss.web.tomcat.tc5.jca.CachedConnectionValve.invoke(CachedConnectionValve.java:156)
 at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:107)
 at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:148)
 at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:869)
 at org.apache.coyote.http11.Http11BaseProtocol$Http11ConnectionHandler.processConnection(Http11BaseProtocol.java:664)
 at org.apache.tomcat.util.net.PoolTcpEndpoint.processSocket(PoolTcpEndpoint.java:527)
 at org.apache.tomcat.util.net.MasterSlaveWorkerThread.run(MasterSlaveWorkerThread.java:112)
 at java.lang.Thread.run(Thread.java:595)


RE: Query with literal quote character: 6'2"

2008-02-07 Thread Fuad Efendi
> while I agree that you don't want to expose your end users 
> directly to 
> Solr (largely for security reasons) that doesn't mean you *must* 
> preprocess user entered strings before handing them to dismax 
> ... dismax's 
> whole goal is to make it possible for apps to not have to worry about 
> sanitizing user-entered query strings.

I am using org.apache.solr.client.solrj.SolrQuery to preprocess user entered
strings.
And I am using dismax & facets:

INFO: /select
facet.limit=100&wt=xml&rows=100&start=0&facet=true&facet.mincount=1&q=Romeo%
2BJuliet&fl=id,item_name,category,price,url,host,country&qt=dismax&version=2
.2&facet.field=country&facet.field=host&facet.field=category&fq=category:"ar
mani"&hl=true 0 1943

(catalina.out file of SOLR,
http://www.tokenizer.org/armani/price.htm?q=Romeo%2bJuliet from production)

As you can see, the + sign is properly encoded in the URL: %2B
Unfortunately, DisMax queries via the console do not support that. Fortunately,
SolrJ does.

(Sorry for the mistake in my previous email: it was a direct Solr request via
the admin console with the "standard" handler.)

===
About https://issues.apache.org/jira/browse/SOLR-127

- We do not need this!!!

Simply add a request parameter http.header="If-Modified-Since: Tue, 05 Feb
2008 03:50:00 GMT", and let Solr respond via a standard XML message "Not
Modified", avoiding 400/500/304!!!

Let others manage "Reverse-Proxy" via PHP, HTTPD, Tomcat+Spring, etc.; Solr
can use the exact "last-modified" timestamp from the index.


I am going to comment SOLR-127...








Memory improvements

2008-02-07 Thread Sundar Sankaranarayanan
Hi All,
  I am running an application in which I have to index
about 300,000 records of a table which has 6 columns. I am committing to
the Solr server after every 10,000 rows, and I observed that by the
end of about 150,000 the process eats up about 1 GB of memory; and
since my server has only 1 GB, it throws an Out of Memory error. However,
if I commit after every 1,000 rows, it is able to process about
200,000 rows before throwing out of memory. This is just a dev server, and
the production data will be much bigger. It would be great if
someone could suggest a way to improve this scenario.
 
 
Regards
Sundar Sankarnarayanan


solrcofig.xml - need some info

2008-02-07 Thread Ismail Siddiqui
I am pretty new to Solr. I was wondering what this "mm" attribute in
requestHandler in solrconfig.xml is and how it works. I tried to search the
wiki but could not find it.




    <str name="mm">2<-1 5<-2 6<90%</str>



thanks


Ismail


Re: Query with literal quote character: 6'2"

2008-02-07 Thread Chris Hostetter
: Our users can blow up the parser without special characters.
: 
:   AND THE BAND PLAYED ON
:   TO HAVE AND HAVE NOT

Grrr... yeah, I'd forgotten about that problem.  I was hoping LUCENE-682 
could solve that (by "unregistering" AND/OR/NOT as operators) but that 
issue is fairly dead in the water since the performance difference was fairly 
significant.

DisMaxQueryParser should really just have its own grammar instead of the 
hacks I put in to subclass QueryParser.

: We have auto-complete on titles, so there are plenty
: of chances to inadvertently use special characters:
: 
:   Romeo + Juliet
:   Airplane! 
:   Shrek (Widescreen)
: 
: We also have people type "--" for a dash in titles.

Only the '+' and '-' characters are special in those examples ... the 
others will be treated as literal characters by dismax.  but like I said, 
we could patch dismax to have an option containing the list of characters to 
auto-escape that would default to the current hardcoded list ... for your 
use case you could have all the special characters (including '+' and '-') 
.. still wouldn't solve your quote problem though -- that's where 
we'd need hooks for subclasses to override the quote stripping. 


-Hoss



RE: Query with literal quote character: 6'2"

2008-02-07 Thread Fuad Efendi
Try this query with an asterisk *

http://192.168.1.5:18080/apache-solr-1.2.0/select/?q=*&version=2.2&start=0&r
ows=10&indent=on


Response:
HTTP Status 400 - Query parsing error: Cannot parse '*': '*' or '?' not
allowed as first character in WildcardQuery




type Status report

message Query parsing error: Cannot parse '*': '*' or '?' not allowed as
first character in WildcardQuery

description The request sent by the client was syntactically incorrect
(Query parsing error: Cannot parse '*': '*' or '?' not allowed as first
character in WildcardQuery).





Apache Tomcat/6.0.13



I tried to discuss it last year... It shouldn't be HTTP 400. I do not have
such problems, probably because I encapsulate * (using a front-end layer) into
Item_name:"*"

I encapsulate user input into:
Item_name:"Romeo+Juliet" AND category:"books"
Item_name:"Romeo+Juliet"+category:"books"

Of course I use URL encoding before calling SOLR...


-Fuad
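The "encapsulate user input" step above can be sketched as a plain escaping pass over the Lucene special characters before building a standard-handler query string. This is an illustrative helper, not Solr code (if I recall correctly, newer SolrJ versions ship something similar as ClientUtils.escapeQueryChars):

```java
// Backslash-escape characters the standard Lucene query parser treats as
// syntax, so raw user input can be embedded in a query string literally.
public class QueryEscaper {
    // Characters treated as query syntax by the standard parser.
    private static final String SPECIAL = "\\+-!():^[]\"{}~*?";

    public static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (SPECIAL.indexOf(c) >= 0) sb.append('\\');
            sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("6'2\""));          // -> 6'2\"
        System.out.println(escape("Romeo + Juliet")); // -> Romeo \+ Juliet
    }
}
```

As the rest of the thread points out, this is only appropriate for the standard handler; dismax expects unescaped input.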



RE: Query with literal quote character: 6'2"

2008-02-07 Thread Fuad Efendi
> (catalina.out file of SOLR,
> http://www.tokenizer.org/armani/price.htm?q=Romeo%2bJuliet 
> from production)
> ...
> ... DISMAX queries via CONSOLE do not support 
> that...

Oops... another mistake, sorry.
http://192.168.1.5:18080/apache-solr-1.2.0/select?indent=on&version=2.2&q=Ro
meo%2BJuliet&start=0&rows=10&fl=*%2Cscore&qt=dismax&wt=standard&explainOther
=&hl.fl=


Anyway, I can't understand where the problem is?!! Everything works fine with
dismax/standard/escaping/encoding. Can we use the AND operator with dismax, by
the way? I think: no. And 6'2" works just as prescribed:
http://www.tokenizer.org/shimano/price.htm?q=6'2%22



RE: Query with literal quote character: 6'2"

2008-02-07 Thread Chris Hostetter


: http://192.168.1.5:18080/apache-solr-1.2.0/select/?q=*&version=2.2&start=0&r
: ows=10&indent=on

That's using the standard request handler, right? ... that's a much different 
discussion -- when using standard you must of course be aware of the 
syntax and the special characters ... Walter and I have specifically been 
talking about dismax, which attempts to protect you from these things.

: description The request sent by the client was syntactically incorrect
: (Query parsing error: Cannot parse '*': '*' or '?' not allowed as first
: character in WildcardQuery).

: I tried to discuss it last year... It shouldn't be HTTP 400. I do not have

why not?  the "client" sent a bad request (with a query string that is 
"malformed syntax" according to the contract of the standard request handler)



-Hoss



RE: Query with literal quote character: 6'2"

2008-02-07 Thread Chris Hostetter

: It is not a bug/problem of SOLR. SOLR can't be exposed directly to end
: users. For handling user input and generating SOLR-specific query, use

while I agree that you don't want to expose your end users directly to 
Solr (largely for security reasons) that doesn't mean you *must* 
preprocess user entered strings before handing them to dismax ... dismax's 
whole goal is to make it possible for apps to not have to worry about 
sanitizing user-entered query strings.

: something... So that I don't really understand why do we need HTTP caching
: support at SOLR if we can't use it without "front-end" (off-topic; I use

yeah ... this is *WAY* off topic ... but the short answer is: the issues 
are orthogonal.  whether or not you let "humans using web browsers" talk to 
Solr directly doesn't change the fact that Solr should be "well 
behaved" regarding HTTP -- which includes output response headers useful 
for HTTP caching, and understanding incoming request headers related to 
HTTP caching ... just because you expect some application to sit between 
your end user browsers and Solr doesn't exclude the possibility of having 
an HTTP cache sitting between that application and Solr.

-Hoss


RE: Query with literal quote character: 6'2"

2008-02-07 Thread Fuad Efendi
I forgot to mention: the default operator is AND; DisMax.
Without URL-encoding, some queries will show exceptions even with dismax.

> -Original Message-
> From: Fuad Efendi [mailto:[EMAIL PROTECTED] 
> Sent: Thursday, February 07, 2008 3:31 PM
> To: solr-user@lucene.apache.org
> Subject: RE: Query with literal quote character: 6'2"
> 
> 
> This query works just fine: 
> http://www.tokenizer.org/?q=Romeo+%2B+Juliet
> 
> %2B is URL-Encoded presentation of +
> It shows, for instance, [Romeo & Juliet] in output.
> 
> 
> > -Original Message-
> > From: Walter Underwood [mailto:[EMAIL PROTECTED] 
> > Sent: Thursday, February 07, 2008 3:25 PM
> > To: solr-user@lucene.apache.org
> > Subject: Re: Query with literal quote character: 6'2"
> > 
> > 
> > Our users can blow up the parser without special characters.
> > 
> >   AND THE BAND PLAYED ON
> >   TO HAVE AND HAVE NOT
> > 
> > Lower-casing in the front end avoids that.
> > 
> > We have auto-complete on titles, so there are plenty
> > of chances to inadvertently use special characters:
> > 
> >   Romeo + Juliet
> >   Airplane! 
> >   Shrek (Widescreen)
> > 
> > We also have people type "--" for a dash in titles.
> > 
> > wunder
> > 
> > On 2/7/08 12:00 PM, "Chris Hostetter" 
> > <[EMAIL PROTECTED]> wrote:
> > 
> > > 
> > > : How about the query parser respecting backslash escaping? I need
> > > 
> > > one of the original design decisions was "no user 
> > escaping" ... be able
> > > to take in raw query strings from the user with only '+' 
> '-' and '"'
> > > treated as special characters ... if you allow backslash 
> > escaping of those
> > > characters, then by definition '\' becomes a special 
> character too.
> > > 
> > > : free-text input, no syntax at all. Right now, I'm escaping every
> > > : Lucene special character in the front end. I just 
> figured out that
> > > : it breaks for colon, can't search for "12:01" with "12\:01".
> > > 
> > > yeah ... your '\' character is being taken literally.  you 
> > shouldn't do
> > > any escaping if you hand off to dismax.
> > > 
> > > the right thing to do is probably to expose more of the "query 
> > parsing" stuff
> > > as options for the handler ... let people configure it with what
> > > characters should be escaped, and what should be left 
> > alone.  We should
> > > also stop using the static utility methods for things like partial
> > > escaping and unbalanced quote stripping and start using 
> > helper methods
> > > that subclasses can override.
> > > 
> > > 
> > > -Hoss
> > > 
> > 
> > 
> > 
> 
> 
> 



Re: uniqueKey gives duplicate values

2008-02-07 Thread vijay_schi

I want to know what type of analyzers can be used for data like 12345_r,
12346_r, 12345_c, 12346_c, etc.

I had a text type for that uniqueKey with some query and index analyzers on
it; I think that's what is making the duplicates.




Yonik Seeley wrote:
> 
> On Feb 7, 2008 2:27 PM, vijay_schi <[EMAIL PROTECTED]> wrote:
>> I'm new to solr. I have a uniqueKey on string which has the data of
>> 12345_r,12346_r etc etc.
>> when I'm posting xml with same data second time, it allows the docs to be
>> added. when i search for id:12345_r on solr client , i'm getting multiple
>> records. what might be the problem ?
>>
>> previously I was using an integer, and it was working fine. As I changed to
>> string, it's allowing duplicates.
>>
>> can anyone Clear/Help on this?
> 
> Double check that your uniqueKey correctly specifies the name of the
> field that contains 12345_r,
> and that you restarted Solr after the schema change.
> 
> The example schema that comes with Solr uses a string uniqueKey, and
> it works fine.
> 
> -Yonik
> 
> 

-- 
View this message in context: 
http://www.nabble.com/uniqueKey-gives-duplicate-values-tp15341288p15341772.html
Sent from the Solr - User mailing list archive at Nabble.com.
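Following Yonik's pointer: the usual fix is to give the uniqueKey field a non-analyzed type, since an analyzed text type can tokenize values like 12345_r and defeat duplicate detection. A schema.xml sketch along the lines of the Solr example schema (the field name is illustrative):

```xml
<!-- StrField stores the value verbatim: no tokenization, no filters -->
<fieldType name="string" class="solr.StrField" sortMissingLast="true"/>

<field name="id" type="string" indexed="true" stored="true"/>

<uniqueKey>id</uniqueKey>
```

Remember to re-index (and restart Solr) after changing the type, as documents indexed under the old analyzed type will not line up with the new one.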



Re: Query with literal quote character: 6'2"

2008-02-07 Thread Yonik Seeley
On Feb 7, 2008 6:35 PM, Fuad Efendi <[EMAIL PROTECTED]> wrote:
> Anyway I can't understand where is the problem?!! Everything works fine with
> dismax/standard/escaping/encoding.

> Can we use AND operator with dismax by
> the way?

No.

> I think: no. And 6'2" works just as prescribed:

Not really... it depends on the analyzer.  If the index analyzer for
the field ends up stripping off the trailing quote anyway, then the
dismax query (which also dropped the quote) will match documents.
That's why you don't see any issues.

-Yonik


RE: Query with literal quote character: 6'2"

2008-02-07 Thread Fuad Efendi
> > I think: no. And 6'2" works just as prescribed:
> 
> Not really... it depends on the analyzer.  If the index analyzer for
> the field ends up stripping off the trailing quote anyway, then the
> dismax query (which also dropped the quote) will match documents.
> That's why you don't see any issues.
> 

Yes, my analyzer drops trailing quote for some fields... 

However, debugQuery output for the query 6'2" shows the trailing quote already
stripped before parsing:

  q:           6'2"
  querystring: 6'2
  parsedquery: +DisjunctionMaxQuery((item_name:"6 2"^2.0)) ()
  toString:    +(item_name:"6 2"^2.0) ()

- is it a bug of DisMax? ... It happens even before the request reaches dismax.

What about this:
  6'2\" 
  6'2\\ 

  \ 
  \\ 
 
- looks really strange. 




It does not happen with standard queries, prefixed with field names:
  host:"6'2\"" 
  host:"6'2\"" 
  host:6'2" 
  host:6'2" 


P.S.
Try to execute the simplest test case with DisMax -- a query that is just a
single double-quote character:
"

500!!!





Re: Query with literal quote character: 6'2"

2008-02-07 Thread Yonik Seeley
On Feb 7, 2008 8:35 PM, Fuad Efendi <[EMAIL PROTECTED]> wrote:
> - is it a bug of DixMax?... It happens even before request reaches dismax.

That's what this whole thread has been about :-)
Stripping unbalanced quotes is part of dismax.

-Yonik


RE: Query with literal quote character: 6'2"

2008-02-07 Thread Fuad Efendi
With DisMax, and a simple query which is a single double-quote character, Solr
responds with 

500
org.apache.solr.common.SolrException: Cannot parse '': Encountered "" at
line 1, column 0. Was expecting one of: ... " " ... "-" ... "(" ... "*" ...
... ... ... ... "[" ... "{" ... ...
org.apache.lucene.queryParser.ParseException: 


It is polite neither to the user's input nor to the HTTP specs...

> while I agree that you don't want to expose your end users 
> directly to 
> Solr (largely for security reasons) that doesn't mean you *must* 
> preprocess user entered strings before handing them to dismax 
> ... dismax's 
> whole goal is to make it possible for apps to not have to worry about 
> sanitizing user-entered query strings.



RE: Query with literal quote character: 6'2"

2008-02-07 Thread Fuad Efendi
> With DisMax, and simple query which is single double-quote 
> character, SOLR
> responds with 
> 500
> org.apache.solr.common.SolrException: Cannot parse '': 
...
> It is not polite neither to user's input nor to HTTP specs...


Ooohh... Sorry again: it is the only case where Solr is polite with 500; 500
should be used in case of a bug...



Many updates slow down SOLR performance, no commit/autocommit

2008-02-07 Thread Fuad Efendi
Question:


Why do constant updates slow down Solr performance even if I am not executing
a commit? I just noticed this... A thread dump shows something like "Lucene ...
clone()", and significant CPU usage. I did about 5 million updates via HTTP
XML, a single document at a time, without commit, and performance went down to
100% CPU...

After Commit/Optimize it is stabilized, 0.5 - 2 seconds per page generation
(100 facets + 100 products), 15%-25% CPU:

filterCache   
class:  org.apache.solr.search.LRUCache   
version:  1.0   
description:  LRU Cache(maxSize=200, initialSize=100)   
stats:  lookups : 109294990 
hits : 107637040 
hitratio : 0.98 
inserts : 1658092 
evictions : 0 
size : 879637 
cumulative_lookups : 341225983 
cumulative_hits : 337721881 
cumulative_hitratio : 0.98 
cumulative_inserts : 3504573 
cumulative_evictions : 0 
 

Performance of SOLR itself is good/acceptable (even with huge facet
distribution), but it goes down when I do a lot of updates (without
commit/autocommit)

Thanks,
Fuad
http://www.tokenizer.org
 



Lucene index verifier

2008-02-07 Thread Lance Norskog
(Sorry, my Lucene java-user access is wonky.)
 
I would like to verify that my snapshots are not corrupt before I enable
them.
 
What is the simplest program to verify that a Lucene index is not corrupt?
 
Or, what is a Solr query that will verify that there is no corruption, in
the minimum amount of time?
 
Thanks,
 
Lance Norskog
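One possibility, assuming your Lucene version ships it (the class was only added to lucene-core fairly recently, so this is a guess about your setup): Lucene's CheckIndex tool walks every segment and reports corruption, and can be pointed at a snapshot directory before you enable it. The jar and index paths below are illustrative:

```shell
# Run Lucene's index checker against a snapshot before promoting it
java -cp lucene-core.jar org.apache.lucene.index.CheckIndex /path/to/snapshot/index
```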


RE: Memory improvements

2008-02-07 Thread Lance Norskog
Solr 1.2 has a bug where if you say "commit after N documents" it does not do
so. But it does honor the "commit after N milliseconds" directive. 

This is fixed in Solr 1.3. 

-Original Message-
From: Sundar Sankaranarayanan [mailto:[EMAIL PROTECTED] 
Sent: Thursday, February 07, 2008 3:30 PM
To: solr-user@lucene.apache.org
Subject: Memory improvements

Hi All,
  I am running an application in which I have to index about
300,000 records of a table which has 6 columns. I am committing to the Solr
server after every 10,000 rows, and I observed that by the end of about
150,000 the process eats up about 1 GB of memory; and since my server has
only 1 GB, it throws an Out of Memory error. However, if I commit after
every 1,000 rows, it is able to process about 200,000 rows before throwing
out of memory. This is just a dev server, and the production data will be
much bigger. It would be great if someone could suggest a way to improve this
scenario.
 
 
Regards
Sundar Sankarnarayanan
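For reference, the time-based directive Lance mentions is configured on the update handler in solrconfig.xml; a sketch (the 60-second value is only an illustration, and maxDocs is the element affected by the 1.2 bug):

```xml
<updateHandler class="solr.DirectUpdateHandler2">
  <autoCommit>
    <!-- commit pending documents at most every 60 seconds;
         a <maxDocs> element would commit after N documents,
         but that is the directive broken in Solr 1.2 -->
    <maxTime>60000</maxTime>
  </autoCommit>
</updateHandler>
```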