Re: DismaxRequestHandler

2010-06-17 Thread Joe Calderon
the qs parameter affects matching, but you have to wrap your query in
double quotes, e.g.

q="oil spill"&qf=title description&qs=4&defType=dismax

I'm not sure how to formulate such a query to apply that rule just to
the description field; maybe with nested queries ...

On Thu, Jun 17, 2010 at 12:01 PM, Blargy  wrote:
>
> I have a title field and a description field. I am searching across both
> fields but I don't want description matches unless they are within some slop
> of each other. How can I query for this? It seems that I'm getting back crazy
> results when there are matches that are nowhere near each other
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/DismaxRequestHandler-tp903641p903641.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: Exact match on a filter

2010-06-17 Thread Joe Calderon
use a copyField and index the copy as type string; exact matches on
that field should then work since the text won't be tokenized
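e.g. something along these lines (the brand_exact name is just an example):

<field name="brand" type="text" indexed="true" stored="true"/>
<field name="brand_exact" type="string" indexed="true" stored="false"/>
<copyField source="brand" dest="brand_exact"/>

and then filter with fq=brand_exact:apple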

On Thu, Jun 17, 2010 at 3:13 PM, Pete Chudykowski
 wrote:
> Hi,
>
> I'm trying with no luck to filter on the exact-match value of a field.
> Specifically:
>  fq=brand:apple
> returns documents whose 'brand' field contains values like "apple bottoms".
>
> Is there a way to formulate the fq expression to match precisely and only 
> "apple" ?
>
> Thanks in advance for your help.
> Pete.
>


Re: DismaxRequestHandler

2010-06-17 Thread Joe Calderon
see Yonik's post on nested queries
http://www.lucidimagination.com/blog/2009/03/31/nested-queries-in-solr/

so for example I thought you could possibly do a dismax query across
the main fields (in this case just title) and OR that with
_query_:"description:\"oil spill\"~4"
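i.e. the full request might look something like this (untested):

q=_query_:"{!dismax qf=title}oil spill" OR _query_:"description:\"oil spill\"~4"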

On Thu, Jun 17, 2010 at 3:01 PM, MitchK  wrote:
>
> Joe,
>
> please, can you provide an example of what you are thinking of?
>
> Subqueries with Solr... I've never seen something like that before.
>
> Thank you!
>
> Kind regards
> - Mitch
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/DismaxRequestHandler-tp903641p904142.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: federated / meta search

2010-06-17 Thread Joe Calderon
yes, you can use distributed search across shards with different
schemas as long as the query only references overlapping fields. I
usually test adding new fields or tokenizers on one shard and deploy
only after I've verified it's working properly
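e.g. (host names made up):

http://host1:8983/solr/select?shards=host1:8983/solr,host2:8983/solr&q=title:solr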

On Thu, Jun 17, 2010 at 1:10 PM, Markus Jelsma  wrote:
> Hi,
>
>
>
> Check out Solr sharding [1] capabilities. I never tested it with different 
> schemas, but if each node is queried with fields that it supports, it should 
> return useful results.
>
>
>
> [1]: http://wiki.apache.org/solr/DistributedSearch
>
>
>
> Cheers.
>
> -Original message-
> From: Sascha Szott 
> Sent: Thu 17-06-2010 19:44
> To: solr-user@lucene.apache.org;
> Subject: federated / meta search
>
> Hi folks,
>
> if I'm seeing it right Solr currently does not provide any support for
> federated / meta searching. Therefore, I'd like to know if anyone has
> already put efforts into this direction? Moreover, is federated / meta
> search considered a scenario Solr should be able to deal with at all or
> is it (far) beyond the scope of Solr?
>
> To be more precise, I'll give you a short explanation of my
> requirements. Assume, there are a couple of Solr instances running at
> different places. The documents stored within those instances are all
> from the same domain (bibliographic records), but it cannot be ensured
> that the schema definitions conform 100%. But let's say there are at
> least some index fields that are present in all instances (fields with
> the same name and type definition). Now, I'd like to perform a search on
> all instances at the same time (with the restriction that the query
> contains only those fields that overlap among the different schemas) and
> combine the results in a reasonable way by utilizing the score
> information associated with each hit. Please note, that due to legal
> issues it is not feasible to build a single index that integrates the
> documents of all Solr instances under consideration.
>
> Thanks in advance,
> Sascha
>
>


Re: Comma-delimited words shown in terms like one word.

2010-06-18 Thread Joe Calderon
set generateWordParts=1 on the WordDelimiterFilter, or use
PatternTokenizerFactory to split on commas

http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.PatternTokenizerFactory
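e.g. a comma-splitting field type could look something like this (the
"text_csv" name is made up):

<fieldType name="text_csv" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.PatternTokenizerFactory" pattern="\s*,\s*"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>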


you can use the analysis page to see what your filter chains are going
to do before you index

/admin/analysis.jsp

On Fri, Jun 18, 2010 at 6:41 AM, Vitaliy Avdeev  wrote:
> Hello.
> In the indexed text I have a string like John,Mark,Sam. When I look at it in
> TermVectorComponent it looks like this: johnmarksam.
>
> I am using this type for storing data (opening tags partially stripped
> by the list archive):
>
> <fieldType ... class="solr.TextField" positionIncrementGap="100">
>   <analyzer>
>     <tokenizer .../>
>     <filter class="solr.SynonymFilterFactory" ... ignoreCase="true" expand="false"/>
>     <filter class="solr.StopFilterFactory" ... words="stopwords.txt"/>
>     <filter class="solr.WordDelimiterFilterFactory" generateWordParts="0"
>             generateNumberParts="0" catenateWords="1" catenateNumbers="1"
>             catenateAll="0"/>
>     ...
>   </analyzer>
> </fieldType>
>
> What filter do I need to use to get John, Mark, Sam as different words?
>


Re: SOLR partial string matching question

2010-06-22 Thread Joe Calderon
you want a combination of WhitespaceTokenizer and EdgeNGramFilter
http://lucene.apache.org/solr/api/org/apache/solr/analysis/WhitespaceTokenizerFactory.html
http://lucene.apache.org/solr/api/org/apache/solr/analysis/EdgeNGramFilterFactory.html

the first will create tokens for each word; the second will create
multiple prefix tokens from each word

use the analysis link from the admin page to test your filter chain
and make sure it's doing what you want.
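e.g. something like this (the field type name is made up), with the ngrams
only on the index side so query terms are matched as plain prefixes:

<fieldType name="text_prefix" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="15" side="front"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

with that, "of ameri" tokenizes to [of, ameri] at query time, and both
should match index-side grams of "bank of america"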


On Tue, Jun 22, 2010 at 4:06 PM, Vladimir Sutskever
 wrote:
> Hi,
>
> Can you guys make a recommendation for which types/filters to use to accomplish 
> the following partial keyword match:
>
>
> A. Actual Indexed Term:  "bank of america"
>
> B. User Enters Search Term:  "of ameri"
>
>
> I would like SOLR to match document "bank of america" with the partial string 
> "of ameri"
>
> Any suggestions?
>
>
>
> Kind regards,
>
> Vladimir Sutskever
> Investment Bank - Technology
> JPMorgan Chase, Inc.
>
>
>


Re: preside != president

2010-06-28 Thread Joe Calderon
the general consensus among people who run into the problem you have
is to use a plurals only stemmer, a synonyms file or a combination of
both (for irregular nouns etc)

if you search the archives you can find info on a plurals stemmer

On Mon, Jun 28, 2010 at 6:49 AM,   wrote:
> Thanks for the tip. Yeah, I think the stemming confounds search results as
> it stands (porter stemmer).
>
> I was also thinking of using my dictionary of 500,000 words with their
> complete morphologies and conjugations and create a synonyms.txt to
> provide english accurate morphology.
>
> Is this a good idea?
>
> Darren
>
>> Hi Darren,
>>
>> You might want to look at the KStemmer
>> (http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters/Kstem)
>> instead of the standard PorterStemmer. It essentially has a 'dictionary'
>> of exception words where stemming stops if found, so in your case
>> president won't be stemmed any further than president (but presidents will
>> be stemmed to president). You will have to integrate it into solr
>> yourself, but that's straightforward.
>>
>> HTH
>> Brendan
>>
>>
>> On Jun 28, 2010, at 8:04 AM, Darren Govoni wrote:
>>
>>> Hi,
>>>  It seems to me that because the stemming does not produce
>>> grammatically correct stems in many of the cases,
>>> search anomalies can occur like the one I am seeing where I have a
>>> document with "president" in it and it is returned
>>> when I search for "preside", a different word entirely.
>>>
>>> Is this correct or acceptable behavior? In previous discussions here on
>>> stemming, I was told it's ok as long as all the words reduce
>>> to the same stem, but when different words reduce to the same stem it
>>> seems to affect search results in a "bad way".
>>>
>>> Darren
>>
>>
>
>


Re: Strange query behavior

2010-06-28 Thread Joe Calderon
splitOnCaseChange is creating multiple tokens from 3dsMax; disable it
or enable catenateAll. use the analysis page in the admin tool to see
exactly how your text will be indexed by the analyzers without having
to reindex your documents; once you have it right you can do a full
reindex.
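e.g. the catenateAll variant of your filter would look something like:

<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
        generateNumberParts="1" catenateWords="1" catenateNumbers="1"
        catenateAll="1" splitOnCaseChange="1" preserveOriginal="1"/>

so 3dsMax also yields a single catenated token alongside the split parts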

On Mon, Jun 28, 2010 at 5:48 AM, Marc Ghorayeb  wrote:
>
> Hello,
> I have a title that says "3DVIA Studio & Virtools Maya and 3dsMax Exporters". 
> The analysis tool for this field gives me these 
> tokens:3dviadviastudio&;virtoolmaya3dsmaxdssystèmmaxexport
>
>
> However, when I search for "3dsmax", I get no results :( Furthermore, if I 
> search for "dsmax" I get the spellchecker suggesting "3dsmax" even 
> though it doesn't find any results. If I search for any other token ("3dvia", 
> or "max" for example), the document is found. "3dsmax" is the only token that 
> doesn't seem to work!! :(
> Here is my schema for this field (tags partially stripped by the list archive):
>
> <fieldType name="..." class="solr.TextField" positionIncrementGap="100">
>   <analyzer type="index">
>     <tokenizer .../>
>     <filter class="solr.WordDelimiterFilterFactory"
>             generateWordParts="1"
>             generateNumberParts="1"
>             catenateWords="0"
>             catenateNumbers="0"
>             catenateAll="0"
>             splitOnCaseChange="1"
>             preserveOriginal="1"/>
>     ...
>     <filter class="solr.StopFilterFactory" words="stopwords.txt" enablePositionIncrements="true"/>
>     <filter class="solr.SynonymFilterFactory" ... ignoreCase="true" expand="true"/>
>     ...
>     <filter class="solr.SnowballPorterFilterFactory" language="${Language}" protected="protwords.txt"/>
>   </analyzer>
>
>   <analyzer type="query">
>     <tokenizer .../>
>     <filter class="solr.WordDelimiterFilterFactory"
>             generateWordParts="1"
>             generateNumberParts="1"
>             catenateWords="1"
>             catenateNumbers="1"
>             catenateAll="0"
>             splitOnCaseChange="1"
>             preserveOriginal="1"/>
>     ...
>     <filter class="solr.StopFilterFactory" words="stopwords.txt" enablePositionIncrements="true"/>
>     ...
>     <filter class="solr.SnowballPorterFilterFactory" language="${Language}" protected="protwords.txt"/>
>   </analyzer>
> </fieldType>
> Can anyone help me out please? :(
> PS: the ${Language} is set to "en" (for english) in this case...
>


Re: questions about Solr shards

2010-06-28 Thread Joe Calderon
there is a first-pass query to retrieve all matching document ids from
every shard along with the relevant sorting information; the document ids
are then sorted and limited to the amount needed, and a second query
is sent for the rest of the documents' metadata.
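roughly speaking, the per-shard requests look something like this
(paraphrased, not the exact parameter set):

pass 1:  /select?q=...&isShard=true&rows=N&fl=id,score
pass 2:  /select?isShard=true&ids=id1,id2,...&fl=*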

On Sun, Jun 27, 2010 at 7:32 PM, Babak Farhang  wrote:
> Otis,
>
> Belated thanks for your reply.
>
>>> 2. "The index could change between stages, e.g. a
>>> document that matched a
>>> query and was subsequently changed may no
>>> longer match but will still be
>>> retrieved."
>
>> 2. This describes the situation where, for instance, a
>> document with ID=10 is updated between the 2 calls
>> to the Solr instance/shard where that doc ID=10 lives.
>
> Can you explain why this happens? (I.e. does each query to the sharded
> index somehow involve 2 calls to each shard instance from the base
> instance?)
>
> -Babak
>
> On Thu, Jun 24, 2010 at 10:14 PM, Otis Gospodnetic
>  wrote:
>> Hi Babak,
>>
>> 1. Yes, you are reading that correctly.
>>
>> 2. This describes the situation where, for instance, a document with ID=10 
>> is updated between the 2 calls to the Solr instance/shard where that doc 
>> ID=10 lives.
>>
>> 3. Yup, orthogonal.  You can have a master with multiple cores for sharded 
>> and non-sharded indices and you can have a slave with cores that hold 
>> complete indices or just their shards.
>>  Otis
>> 
>> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
>> Lucene ecosystem search :: http://search-lucene.com/
>>
>>
>>
>> - Original Message 
>>> From: Babak Farhang 
>>> To: solr-user@lucene.apache.org
>>> Sent: Thu, June 24, 2010 6:32:54 PM
>>> Subject: questions about Solr shards
>>>
>> Hi everyone,
>>
>> There are a couple of notes on the limitations of this approach at
>> http://wiki.apache.org/solr/DistributedSearch which I'm having trouble
>> understanding.
>>
>> 1. "When duplicate doc IDs are received, Solr chooses the first doc
>>    and discards subsequent ones"
>>
>> "Received" here is from the perspective of the base Solr instance at
>> query time, right?  I.e. if you inadvertently indexed 2 versions of
>> the document with the same unique ID but different contents to 2
>> shards, then at query time, the "first" document (putting aside for
>> the moment what exactly "first" means) would win.  Am I reading this
>> right?
>>
>> 2. "The index could change between stages, e.g. a document that matched a
>>    query and was subsequently changed may no longer match but will still be
>>    retrieved."
>>
>> I have no idea what this second statement means.
>>
>> And one other question about shards:
>>
>> 3. The examples I've seen documented do not illustrate sharded,
>>    multicore setups; only sharded monolithic cores.  I assume sharding
>>    works with multicore as well (i.e. the two issues are orthogonal).  Is
>>    this right?
>>
>> Any help on interpreting the above would be much appreciated.
>>
>> Thank you,
>> -Babak
>>
>


Re: dismax request handler without q

2010-07-20 Thread Joe Calderon
try something like this:
q.alt=*:*&fq=keyphrase:hotel

though if you don't need to query across multiple fields, dismax is
probably not the best choice

On Tue, Jul 20, 2010 at 4:57 AM, olivier sallou
 wrote:
> q will search in defaultSearchField if no field name is set, but you can
> specify in your "q" param the fields you want to search into.
>
> Dismax is a handler where you can specify to look in a number of fields for
> the input query. In this case, you do not specify the fields and dismax will
> look in the fields specified in its configuration.
> However, by default, dismax is not used; it needs to be enabled with the
> query type parameter (qt=dismax).
>
> In the default solr config, you can call ...solr/select?q=keyphrase:hotel if
> keyphrase is a declared field in your schema
>
> 2010/7/20 Chamnap Chhorn 
>
>> I can't put q=keyphrase:hotel in my request using dismax handler. It
>> returns
>> no result.
>>
>> On Tue, Jul 20, 2010 at 1:19 PM, Chamnap Chhorn > >wrote:
>>
>> > There are some default configuration on my solrconfig.xml that I didn't
>> > show you. I'm a little confused when reading
>> > http://wiki.apache.org/solr/DisMaxRequestHandler#q. I think q is for
>> > plain user input query.
>> >
>> >
>> > On Tue, Jul 20, 2010 at 12:08 PM, olivier sallou <
>> olivier.sal...@gmail.com
>> > > wrote:
>> >
>> >> Hi,
>> >> this is not very clear, if you need to query only keyphrase, why don't
>> >> you query it directly? e.g. q=keyphrase:hotel ?
>> >> Furthermore, why dismax if only keyphrase field is of interest? dismax
>> is
>> >> used to query multiple fields automatically.
>> >>
>> >> At least dismax does not appear in your query (via the query type
>> >> parameter). Is it set in your config for your default request handler?
>> >>
>> >> 2010/7/20 Chamnap Chhorn 
>> >>
>> >> > I wonder how I could make a query to return only *all books* that have
>> >> > the keyphrase "web development" using the dismax handler? A book has multiple
>> >> > keyphrases (keyphrase is a multivalued column). Do I have to pass the q
>> >> > parameter?
>> >> >
>> >> >
>> >> > Is it the correct one?
>> >> > http://locahost:8081/solr/select?&q=hotel&fq=keyphrase:%20hotel
>> >> >
>> >> > --
>> >> > Chhorn Chamnap
>> >> > http://chamnapchhorn.blogspot.com/
>> >> >
>> >>
>> >
>> >
>> >
>> > --
>> > Chhorn Chamnap
>> > http://chamnapchhorn.blogspot.com/
>> >
>>
>>
>>
>> --
>> Chhorn Chamnap
>> http://chamnapchhorn.blogspot.com/
>>
>


dealing with duplicates

2009-07-31 Thread Joe Calderon
hello all, I have a collection of a few million documents with many
duplicates. they have been clustered with a simple algorithm; I have a
field called 'duplicate' which is 0 or 1 and fields called
'description, tags, meta'. documents are clustered on different
criteria, and the text I search against could be very different among
members of a cluster.

I'm currently using a dismax handler to search across the text fields
with different boosts, and a filter query to restrict to masters
(duplicate:0)

my question then is: how do I best query for documents which are
masters OR match the text but are not included in the matched set of
masters?

does this make sense?


Re: dealing with duplicates

2009-08-01 Thread Joe Calderon
hello, thanks for the response. I did take a look at that document, but
in my application I actually want the duplicates; as I mentioned, the
matching text could be very different among cluster members, and what
joins them together is a similar set of numeric features.

currently I do a query with fq=duplicate:0 and show a link to
optionally show the "dupes" by querying for all dupes of the
master id; however I'm currently missing any documents that matched the
query but are duplicates of other masters not included in that result
set.

in a relational database (fulltext indexing aside) I would use a
subquery; I imagine a similar approach could be used with lucene, I
just don't know the syntax

best,

--joe

On Fri, Jul 31, 2009 at 11:32 PM, Otis
Gospodnetic wrote:
> Joe,
>
> Maybe we can take a step back first.  Would it be better if your index was 
> cleaner and didn't have flagged duplicates in the first place?  If so, have 
> you tried using http://wiki.apache.org/solr/Deduplication ?
>
>  Otis
> --
> Sematext is hiring -- http://sematext.com/about/jobs.html?mls
> Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR
>
>
>
> - Original Message 
>> From: Joe Calderon 
>> To: solr-user@lucene.apache.org
>> Sent: Friday, July 31, 2009 5:06:48 PM
>> Subject: dealing with duplicates
>>
>> hello all, i have a collection of a few million documents; i have many
>> duplicates in this collection. they have been clustered with a simple
>> algorithm, i have a field called 'duplicate' which is 0 or 1 and a
>> fields called 'description, tags, meta', documents are clustered on
>> different criteria and the text i search against could be very
>> different among members of a cluster.
>>
>> im currently using a dismax handler to search across the text fields
>> with different boosts, and a filter query to restrict to masters
>> (duplicate: 0)
>>
>> my question is then, how do i best query for documents which are
>> masters OR match text but are not included in the matched set of
>> masters?
>>
>> does this make sense?
>
>


concurrent csv loading

2009-08-06 Thread Joe Calderon
for first-time loads I currently post to
/update/csv?commit=false&separator=%09&escape=\&stream.file=workfile.txt&map=NULL:&keepEmpty=false
and this works well, finishing in about 20 minutes for my workload.

this is mostly cpu bound; I have an 8-core box and it seems one core
takes the brunt of the work.

if I wanted to optimize, would I see any benefit to splitting
workfile.txt in two and doing two posts?
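i.e. something like this, run concurrently, followed by a single commit
(part.aa/part.ab being the two halves from split -l):

/update/csv?commit=false&separator=%09&escape=\&stream.file=part.aa&map=NULL:&keepEmpty=false
/update/csv?commit=false&separator=%09&escape=\&stream.file=part.ab&map=NULL:&keepEmpty=false
/update?stream.body=<commit/>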

I'm running Lucid's build of solr 1.3.0 on jetty 6; IO is not a
bottleneck as the data folder is on tmpfs

thx much
--joe


Re: dealing with duplicates

2009-08-10 Thread Joe Calderon
in case someone can help me with the query syntax: the
relational query I would use for this would be something like:

SELECT * FROM videos
WHERE
title LIKE 'family guy'
AND desc LIKE 'stewie%'
AND (
  ( is_dup = 0 )
  OR
  ( is_dup = 1 AND id NOT IN
(
SELECT id FROM videos
WHERE
title LIKE 'family guy'
AND desc LIKE 'stewie%'
AND is_dup = 0
)
  )
)
ORDER BY views
LIMIT 10

can a similar query be written in lucene, or do I need to structure my
index differently to be able to do such a query?

thx much

--joe


On Sat, Aug 1, 2009 at 9:15 AM, Joe Calderon wrote:
> hello, thanks for the response, i did take a look at that document but
> in my application i actually want the duplicates, as i mentioned, the
> matching text could be very different among cluster members, what
> joins them together is a similar set of numeric features.
>
> currently i do a query with fq=duplicate:0 and show a link to
> optionally show the "dupes" via by querying for all dupes of the
> master id, however im currently missing any documents that matched the
> query but are duplicates of other masters not included in that result
> set.
>
> in a relational database (fulltext indexing aside) i would use a
> subquery, i imagine a similar approach could be used with lucene, i
> just dont know the syntax
>
> best,
>
> --joe
>
> On Fri, Jul 31, 2009 at 11:32 PM, Otis
> Gospodnetic wrote:
>> Joe,
>>
>> Maybe we can take a step back first.  Would it be better if your index was 
>> cleaner and didn't have flagged duplicates in the first place?  If so, have 
>> you tried using http://wiki.apache.org/solr/Deduplication ?
>>
>>  Otis
>> --
>> Sematext is hiring -- http://sematext.com/about/jobs.html?mls
>> Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR
>>
>>
>>
>> - Original Message 
>>> From: Joe Calderon 
>>> To: solr-user@lucene.apache.org
>>> Sent: Friday, July 31, 2009 5:06:48 PM
>>> Subject: dealing with duplicates
>>>
>>> hello all, i have a collection of a few million documents; i have many
>>> duplicates in this collection. they have been clustered with a simple
>>> algorithm, i have a field called 'duplicate' which is 0 or 1 and a
>>> fields called 'description, tags, meta', documents are clustered on
>>> different criteria and the text i search against could be very
>>> different among members of a cluster.
>>>
>>> im currently using a dismax handler to search across the text fields
>>> with different boosts, and a filter query to restrict to masters
>>> (duplicate: 0)
>>>
>>> my question is then, how do i best query for documents which are
>>> masters OR match text but are not included in the matched set of
>>> masters?
>>>
>>> does this make sense?
>>
>>
>


where to get solr 1.4 nightly

2009-08-20 Thread Joe Calderon
I want to try out the improvements in 1.4 but the nightly site is down

http://people.apache.org/builds/lucene/solr/nightly/


is there a mirror for nightlies?


--joe


shingle filter

2009-08-24 Thread Joe Calderon
hello *, I'm currently faceting on a shingled field to obtain popular
phrases and it's working well; however I'd like to limit the number of
shingles that get created. solr.ShingleFilterFactory supports
maxShingleSize; can it be made to support a minimum as well? can
someone point me in the right direction?
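for reference, the relevant part of my analyzer is something along these
lines (sizes made up):

<filter class="solr.ShingleFilterFactory" maxShingleSize="3" outputUnigrams="true"/>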

thx much
--joe


extended documentation on analyzers

2009-08-27 Thread Joe Calderon
is there an online resource or a book that contains a thorough list of
the tokenizers and filters available and their functionality?

http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters

is very helpful, but I would like to go through additional filters to
make sure I'm not reinventing the wheel by adding my own

--joe


Re: non-exhaustive results?

2009-08-28 Thread Joe Calderon

facet.mincount=1, facet.limit=-1
On 08/28/2009 09:15 AM, Candide Kemmler wrote:

Sorry, I have misinterpreted my test results.
In fact, I can see that facets are missing in the original search.
So the question becomes: how is it possible that a search doesn't 
report all the facets of a specific result set?



On 28 Aug 2009, at 17:46, Candide Kemmler wrote:


Hi,

I have a faceted-search enabled index where I first do a simple 
search, e.g. by specifying the query term "nursing". Then, I'm trying 
to refine that result using facets.


I found out that the facet-filtered results contain titles which 
match "nursing" exactly that aren't in the original result set. How 
is that possible?






Re: Responses getting truncated

2009-08-28 Thread Joe Calderon
I had a similar issue with text from past requests showing up; this was 
on a 1.3 nightly. I switched to using the Lucid build of 1.3 and the 
problem went away. I'm using a nightly of 1.4 right now, also without 
problems. then again, your mileage may vary, as I also made a bunch of schema 
changes that might have had some effect; it wouldn't hurt to try though



On 08/28/2009 02:04 PM, Rupert Fiasco wrote:

Firstly, to everyone who has been helping me, thank you very much. All
this feedback is helping me narrow down these issues.

I deleted the index and re-indexed all the data from scratch and for a
couple of days we were OK, but now it seems to be erring again.

It happens on different input documents so what was broken before now
works (documents that were having issues before are OK now, after a
fresh re-index).

An issue we are seeing now is that an XML response from Solr will
contain the "tail" of an earlier response, for an example:

http://brockwine.com/solr2.txt

That is a response we are getting from Solr - using the web interface
for Solr in Firefox, Firefox freaks out because it tries to parse
that, and of course, its invalid XML, but I can retrieve that via
curl.

Anyone seeing this before?

In regards to earlier questions:

   

> i assume you are correct, but you listed several steps of transformation
> above, are you certain they all work correctly and produce valid UTF-8?
 

Yes, I have looked at the source and contacted the author of the
conversion library we are using and have verified that if UTF8 goes in
then UTF8 will come out and UTF8 is definitely going in.

I don't think sending over an actual input document would help because
it seems to change. Plus, this latest issue appears to be more an
issue of the last response buffer not clearing or something.

What's strange is that if I wait a few minutes and reload, then the
buffer is cleared and I get back a valid response, its intermittent,
but appears to be happening frequently.

If it matters, we started using LucidGaze for Solr about 10 days ago,
approximately when these issues started happening (but it's hard to say
if that's an issue because at this same time we switched from a PHP to
Java indexing client).

Thanks for your patience

-Rupert

On Tue, Aug 25, 2009 at 8:33 PM, Chris Hostetter wrote:

: We are running an instance of MediaWiki so the text goes through a
: couple of transformations: wiki markup ->  html ->  plain text.
: Its at this last step that I take a "snippet" and insert that into Solr.
...
: doc.addField("text_snippet_t", article.getSnippet(1000));

ok, well first off: that's not the field where you are having problems,
is it?  if i remember correctly from your previous posts, wasn't the
response getting aborted in the middle of the Contents field?

: and a maximum of 1K chars if its bigger. I initialized this String
: from the DB by using the String constructor where I pass in the
: charset/collation
:
: text = new String(textFromDB, "UTF-8");
:
: So to the best of my knowledge, accessing a substring of a UTF-8
: encoded string should not break up the UTF-8 code point. Is that an

i assume you are correct, but you listed several steps of transformation
above, are you certain they all work correctly and produce valid UTF-8?

this leads back to my suggestion before

:>  Can you put the orriginal (pre solr, pre solrj, raw untouched, etc...)
:>  file that this solr doc came from online somewhere?
:>
:>  What does your *indexing* code look like? ... Can you add some debuging to
:>  the SolrJ client when you *add* this doc to print out exactly what those
:>  1000 characters are?


-Hoss

 




Re: Responses getting truncated

2009-08-28 Thread Joe Calderon
yonik has a point; when I ran into this I also upgraded to the latest 
stable jetty. I'm using jetty 6.1.18


On 08/28/2009 04:07 PM, Rupert Fiasco wrote:

I deployed LucidWorks with my existing solrconfig / schema and
re-indexed my data into it and pushed it out to production, we'll see
how it stacks up over the weekend. Already queries that were breaking
on the prior Jetty/stock Solr setup are now working - but I have seen
it before where upon an initial re-index things work OK then a couple
of days later they break.

Keep y'all posted.

Thanks
-Rupert

On Fri, Aug 28, 2009 at 3:12 PM, Rupert Fiasco  wrote:
   

Yes, I am hitting the Solr server directly (medsolr1.colo:9007)

Versions / architectures:

Jetty(6.1.3)

o...@medsolr1 ~ $ uname -a
Linux medsolr1 2.6.18-xen-r12 #9 SMP Tue Mar 3 15:34:08 PST 2009
x86_64 Intel(R) Xeon(R) CPU L5420 @ 2.50GHz GenuineIntel GNU/Linux

o...@medsolr1 ~ $ java -version
java version "1.6.0_11"
Java(TM) SE Runtime Environment (build 1.6.0_11-b03)
Java HotSpot(TM) 64-Bit Server VM (build 11.0-b16, mixed mode)


I was thinking of trying LucidWorks for Solr (1.3.02) x64 - worth a try.

-Rupert

On Fri, Aug 28, 2009 at 3:08 PM, Yonik Seeley  wrote:
 

On Mon, Aug 24, 2009 at 6:30 PM, Rupert Fiasco  wrote:
   

If I run these through curl on the command its
truncated and if I run the search through the web-based admin panel
then I get an XML parse error.
 

Are you running curl directly against the solr server, or going
through a load balancer?  Cutting out the middle-men using curl was a
great idea - just make sure to go all the way.

At first I thought it could possibly be a FastWriter bug (internal
Solr class), but that's only used on the TextWriter (JSON, Python,
Ruby) based formats, not on the original XML format.

It really looks like you're hitting a lower-level IO buffering bug
(esp when you see a response starting off with the tail of another
response).  That doesn't look like it could be a Solr bug... but
rather smells like a thread safety bug in the servlet container.

What type of machine are you running on?  What JVM?
You could try upgrading your version of Jetty, the JVM, or try
switching to Tomcat.

-Yonik
http://www.lucidimagination.com


   

This appears to have just started recently and the only thing we have
done is change our indexer from a PHP one to a Java one, but
functionally they are identical.

Any thoughts? Thanks in advance.

- Rupert

 
   
 




score = sum of boosts

2009-09-02 Thread Joe Calderon
hello *, what would be the best approach to return the sum of boosts
as the score?

ex:
a dismax handler boosts matches to field1^100 and field2^50; a query
matches both fields, hence the score for that row would be 150



is this something I could do with a function query, or do I need to
hack up DisjunctionMaxScorer?

--joe


stemming plurals

2009-09-04 Thread Joe Calderon
I saw some posts regarding stemming plurals in the archives from 2008;
I was wondering if this was ever integrated or if custom hackery is
still needed. is there something like a plurals-only analyzer, or is the
kstemmer the closest thing?


thx much
--joe


Re: Geographic clustering

2009-09-08 Thread Joe Calderon
there are clustering libraries like
http://bonsai.ims.u-tokyo.ac.jp/~mdehoon/software/cluster/ that have
bindings to perl/python; you can preprocess your results and create
clusters for each zoom level

On Tue, Sep 8, 2009 at 8:08 AM, gwk wrote:
> Hi,
>
> I just completed a simple proof-of-concept clusterer component which
> naively clusters with a specified bounding box around each position,
> similar to what the javascript MarkerClusterer does. It's currently very
> slow as I loop over the entire docset and request the longitude and
> latitude of each document (Not to mention that my unfamiliarity with
> Lucene/Solr isn't helping the implementations performance any, most code
> is copied from grep-ing the solr source). Clustering a set of about
> 80.000 documents takes about 5-6 seconds. I'm currently looking into
> storing the Hilbert curve mapping in Solr and clustering using facet
> counts on numerical ranges of that mapping but I'm not sure it will pan out.
>
> Regards,
>
> gwk
>
> Grant Ingersoll wrote:
>>
>> Not directly related to geo clustering, but
>> http://issues.apache.org/jira/browse/SOLR-769 is all about a pluggable
>> interface to clustering implementations.  It currently has Carrot2
>> implemented, but the APIs are marked as experimental.  I would definitely be
>> interested in hearing your experience with implementing your clustering
>> algorithm in it.
>>
>> -Grant
>>
>> On Sep 8, 2009, at 4:00 AM, gwk wrote:
>>
>>> Hi,
>>>
>>> I'm working on a search-on-map interface for our website. I've created a
>>> little proof of concept which uses the MarkerClusterer
>>> (http://code.google.com/p/gmaps-utility-library-dev/) which clusters the
>>> markers nicely. But because sending tens of thousands of markers over Ajax
>>> is not quite as fast as I would like it to be, I'd prefer to do the
>>> clustering on the server side. I've considered a few options like storing
>>> the morton-order and throwing away precision to cluster, assigning all
>>> locations to a grid position. Or simply cluster based on country/region/city
>>> depending on zoom level by adding latitude on longitude fields for each zoom
>>> level (so that for smaller countries you have to be zoomed in further to get
>>> the next level of clustering).
>>>
>>> I was wondering if anybody else has worked on something similar and if so
>>> what their solutions are.
>>>
>>> Regards,
>>>
>>> gwk
>>
>> --
>> Grant Ingersoll
>> http://www.lucidimagination.com/
>>
>> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using
>> Solr/Lucene:
>> http://www.lucidimagination.com/search
>>
>
>
>


help with solr.PatternTokenizerFactory

2009-09-09 Thread Joe Calderon
hello *, I'm not sure what I'm doing wrong. I have this field defined in
schema.xml, and using admin/analysis.jsp it's working as expected:

[fieldType definition stripped by the list archive]
but when I try to update via the csv handler I get

Error 500 org.apache.solr.analysis.PatternTokenizerFactory$1 cannot be
cast to org.apache.lucene.analysis.Tokenizer

java.lang.ClassCastException:
org.apache.solr.analysis.PatternTokenizerFactory$1 cannot be cast to
org.apache.lucene.analysis.Tokenizer
at 
org.apache.solr.analysis.TokenizerChain.getStream(TokenizerChain.java:69)
at 
org.apache.solr.analysis.SolrAnalyzer.reusableTokenStream(SolrAnalyzer.java:74)
...



im using nightly of solr 1.4

thx much,
--joe


query parser question

2009-09-10 Thread Joe Calderon
I have a field called text_stem that has a kstemmer on it; I'm having
trouble matching wildcard searches on a word that got stemmed.

for example, I index the word "america's", which according to
analysis.jsp gets indexed after stemming as "america".

when matching, I do a query like myfield:(ame*) which matches the
indexed term. this all works fine until the query becomes
myfield:(america's*), at which point it doesn't match; however if I
remove the wildcard, like myfield:(america's), then it works again.

it's almost like the term doesn't get stemmed when using a wildcard.

I'm using a 1.4 nightly. is this the correct behaviour? is there
something I should do differently?

in the meantime I've added "americas" as a protected word in the
kstemmer, but I'm afraid of more edge cases that will come up.

--joe


Re: KStem download

2009-09-14 Thread Joe Calderon
is the source for the lucid kstemmer available? from the lucid solr
package I only found the compiled jars

On Mon, Sep 14, 2009 at 11:04 AM, Yonik Seeley
 wrote:
> On Mon, Sep 14, 2009 at 1:56 PM, darniz  wrote:
>> Pascal Dimassimo wrote:
>>>
>>> Hi,
>>>
>>> I want to try KStem. I'm following the instructions on this page:
>>> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters/Kstem
>>>
>>> ... but the download link doesn't work.
>>>
>>> Is anyone know the new location to download KStem?
>>>
>> I am stuck with the same issue;
>> the link has not been working for a long time
>>
>>
>> is there any alternate link
>> Please let us know
>
> *shrug* - looks like they changed their download structure (or just
> took it down).  I searched around their site a bit but couldn't find
> another one (and google wasn't able to find it either).
>
> The one from Lucid is functionally identical, free, and much, much
> faster though - I'd just use that.
>
> -Yonik
> http://www.lucidimagination.com
>


boost function for date as unix stamp

2009-09-25 Thread Joe Calderon
hello *, I read on the wiki about using recip(rord(...)...) to boost
recent documents with a date field; does anyone have a good function
for doing something similar with unix timestamps?
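e.g. assuming the timestamp is stored as epoch seconds in a numeric field
(here called mytimestamp, made up), would something along these lines work?

recip(sub(ms(NOW),product(mytimestamp,1000)),3.16e-11,1,1)

(product converts seconds to ms, sub gives the document age in ms, and
3.16e-11 is roughly 1/ms-per-year as in the wiki example)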

if not, is there a lot of overhead related to counting the number of
distinct values for rord()?


thx much

--joe


field collapsing sums

2009-09-30 Thread Joe Calderon
hello all, I have a question on the field collapsing patch: say I have
an integer field called "num_in_stock" and I collapse by some other
column. is it possible to sum up that integer field and return the
total in the output? if not, how would I go about extending the
collapsing component to support that?


thx much

--joe


changing dismax parser to not treat symbols differently

2009-09-30 Thread Joe Calderon
how would I go about modifying the dismax parser to treat +/- as regular text?


Re: field collapsing sums

2009-10-01 Thread Joe Calderon
hello martijn, thx for the tip. I tried that approach but ran into two
snags: 1. returning the fields makes collapsing a lot slower, but that
might just be the nature of iterating large results; 2. it seems like
only dupes of records on the first page are returned.

or is there a setting I'm missing? currently I'm only sending
collapse.field=brand and collapse.includeCollapseDocs.fl=num_in_stock

--joe

On Thu, Oct 1, 2009 at 1:14 AM, Martijn v Groningen
 wrote:
> Hi Joe,
>
> Currently the patch does not do that, but you can do something else
> that might help you in getting your summed stock.
>
> In the latest patch you can include fields of collapsed documents in
> the result per distinct field value.
> If you specify collapse.includeCollapseDocs.fl=num_in_stock in the
> request, and let's say you collapse on brand, then in the response you
> will receive the following xml:
> [XML example stripped by the list archive; it showed the num_in_stock
> values of the collapsed documents (e.g. 2 and 3) grouped per distinct
> brand value]
>
> On the client side you can do whatever you want with this data, for
> example sum it together. Although the patch does not sum for you, I
> think it will allow you to implement your requirement without too much
> hassle.
>
> Cheers,
>
> Martijn
>
> 2009/10/1 Matt Weber :
>> You might want to see how the stats component works with field collapsing.
>>
>> Thanks,
>>
>> Matt Weber
>>
>> On Sep 30, 2009, at 5:16 PM, Uri Boness wrote:
>>
>>> Hi,
>>>
>>> At the moment I think the most appropriate place to put it is in the
>>> AbstractDocumentCollapser (in the getCollapseInfo method). Though, it might
>>> not be the most efficient.
>>>
>>> Cheers,
>>> Uri
>>>
>>> Joe Calderon wrote:
>>>>
>>>> hello all, i have a question on the field collapsing patch, say i have
>>>> an integer field called "num_in_stock" and i collapse by some other
>>>> column, is it possible to sum up that integer field and return the
>>>> total in the output, if not how would i go about extending the
>>>> collapsing component to support that?
>>>>
>>>>
>>>> thx much
>>>>
>>>> --joe
>>>>
>>>>
>>
>>
>


Re: field collapsing sums

2009-10-01 Thread Joe Calderon
thx for the reply. I just want the number of dupes in the query
result, but it seems I don't get the correct totals.

for example, a non-collapsed dismax query for belgian beer returns X
results, but when I collapse and sum the number of docs under
collapse_counts, it's much less than X

it does seem to work when the collapsed results fit on one page (10
rows in my case)


--joe

> 2) It seems that you are using the parameters as intended. The
> collapsed documents will contain all documents (from the whole query
> result) that have been collapsed on a certain field value that occurs
> in the result set that is being displayed. That is how it should work.
> But if I'm understanding you correctly, you want to display all dupes
> from the whole query result set (also those whose collapse field value
> does not occur in the displayed result set)?


JVM OOM when using field collapse component

2009-10-01 Thread Joe Calderon
I've gotten two different out-of-memory errors while using the field
collapsing component, using the latest patch (2009-09-26) and the
latest nightly.

has anyone else encountered similar problems? my collection is 5
million documents, but I've gotten the error collapsing as little as a few
thousand

SEVERE: java.lang.OutOfMemoryError: Java heap space
at org.apache.lucene.util.ArrayUtil.grow(ArrayUtil.java:173)
at 
org.apache.lucene.util.OpenBitSet.ensureCapacityWords(OpenBitSet.java:749)
at org.apache.lucene.util.OpenBitSet.ensureCapacity(OpenBitSet.java:757)
at 
org.apache.lucene.util.OpenBitSet.expandingWordNum(OpenBitSet.java:292)
at org.apache.lucene.util.OpenBitSet.set(OpenBitSet.java:233)
at 
org.apache.solr.search.AbstractDocumentCollapser.addCollapsedDoc(AbstractDocumentCollapser.java:402)
at 
org.apache.solr.search.NonAdjacentDocumentCollapser.doCollapsing(NonAdjacentDocumentCollapser.java:115)
at 
org.apache.solr.search.AbstractDocumentCollapser.collapse(AbstractDocumentCollapser.java:208)
at 
org.apache.solr.handler.component.CollapseComponent.doProcess(CollapseComponent.java:98)
at 
org.apache.solr.handler.component.CollapseComponent.process(CollapseComponent.java:66)
at 
org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:195)
at 
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
at 
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
at 
org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1148)
at 
org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:387)
at 
org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
at 
org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
at 
org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:765)
at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:417)
at 
org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
at 
org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
at 
org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
at org.mortbay.jetty.Server.handle(Server.java:326)
at 
org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:534)
at 
org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:864)
at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:539)
at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
at 
org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:409)
at 
org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:520)

SEVERE: java.lang.OutOfMemoryError: Java heap space
at 
org.apache.solr.util.DocSetScoreCollector.<init>(DocSetScoreCollector.java:44)
at 
org.apache.solr.search.NonAdjacentDocumentCollapser.doQuery(NonAdjacentDocumentCollapser.java:68)
at 
org.apache.solr.search.AbstractDocumentCollapser.collapse(AbstractDocumentCollapser.java:205)
at 
org.apache.solr.handler.component.CollapseComponent.doProcess(CollapseComponent.java:98)
at 
org.apache.solr.handler.component.CollapseComponent.process(CollapseComponent.java:66)
at 
org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:195)
at 
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
at 
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
at 
org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1148)
at 
org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:387)
at 
org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
at 
org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
at 
org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:765)
at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:417)
at 
org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
at 
org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
at 
org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)

Re: JVM OOM when using field collapse component

2009-10-02 Thread Joe Calderon
heap space is 4gb, set to grow up to 8gb; usage is normally ~1-2gb. it
seems to happen within a few searches.

if it's just me I'll try to isolate it; it could be some other part of
my implementation

thx much

On Fri, Oct 2, 2009 at 1:18 AM, Martijn v Groningen
 wrote:
> No, I have not encountered OOM exceptions yet with the current field collapse patch.
> How large is your configured JVM heap space (-Xmx)? Field collapsing
> requires more memory than regular searches do. Does Solr run out of
> memory during the first search(es) or does it run out of memory after
> a while when it has performed quite a few field collapse searches?
>
> I see that you are also using the collapse.includeCollapsedDocs.fl
> parameter for your search. This feature will require more memory than
> a normal field collapse search.
>
> I normally give the Solr instance a heap space of 1024M when having an
> index of a few million.
>
> Martijn
>
> 2009/10/2 Joe Calderon :
>> i gotten two different out of memory errors while using the field
>> collapsing component, using the latest patch (2009-09-26) and the
>> latest nightly,
>>
>> has anyone else encountered similar problems? my collection is 5
>> million results but ive gotten the error collapsing as little as a few
>> thousand
>>
>> SEVERE: java.lang.OutOfMemoryError: Java heap space
>>        at org.apache.lucene.util.ArrayUtil.grow(ArrayUtil.java:173)
>>        at 
>> org.apache.lucene.util.OpenBitSet.ensureCapacityWords(OpenBitSet.java:749)
>>        at 
>> org.apache.lucene.util.OpenBitSet.ensureCapacity(OpenBitSet.java:757)
>>        at 
>> org.apache.lucene.util.OpenBitSet.expandingWordNum(OpenBitSet.java:292)
>>        at org.apache.lucene.util.OpenBitSet.set(OpenBitSet.java:233)
>>        at 
>> org.apache.solr.search.AbstractDocumentCollapser.addCollapsedDoc(AbstractDocumentCollapser.java:402)
>>        at 
>> org.apache.solr.search.NonAdjacentDocumentCollapser.doCollapsing(NonAdjacentDocumentCollapser.java:115)
>>        at 
>> org.apache.solr.search.AbstractDocumentCollapser.collapse(AbstractDocumentCollapser.java:208)
>>        at 
>> org.apache.solr.handler.component.CollapseComponent.doProcess(CollapseComponent.java:98)
>>        at 
>> org.apache.solr.handler.component.CollapseComponent.process(CollapseComponent.java:66)
>>        at 
>> org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:195)
>>        at 
>> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
>>        at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
>>        at 
>> org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
>>        at 
>> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
>>        at 
>> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1148)
>>        at 
>> org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:387)
>>        at 
>> org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
>>        at 
>> org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
>>        at 
>> org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:765)
>>        at 
>> org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:417)
>>        at 
>> org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
>>        at 
>> org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
>>        at 
>> org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
>>        at org.mortbay.jetty.Server.handle(Server.java:326)
>>        at 
>> org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:534)
>>        at 
>> org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:864)
>>        at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:539)
>>        at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
>>        at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
>>        at 
>> org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:409)
>>        at 
>> org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:520)
>>
>> SEVERE: java.lang.OutOfMemoryError: Java heap space
>>        at 
>> org.apache.solr.util.DocSetScoreCollector.<init>(DocSetScoreCollector.java:44)
>>        at 
>> org.apache.solr.search.NonAdjacentDocumentCollapser.doQuer

stats page slow in latest nightly

2009-10-06 Thread Joe Calderon
hello *, I've been noticing that /admin/stats.jsp is really slow in the
recent builds; has anyone else encountered this?


--joe


Re: stats page slow in latest nightly

2009-10-06 Thread Joe Calderon
thx much guys, no biggie for me; I just wanted to get to the bottom of
it in case I had screwed something else up.

--joe

On Tue, Oct 6, 2009 at 1:19 PM, Mark Miller  wrote:
> I was worried about that actually. I haven't tested how fast the RAM
> estimator is on huge String FieldCaches - it will be fast on everything
> else, but it checks the size of each String in the array.
>
> When I was working on it, I was actually going to default to not show
> the size, and make you click a link that added a param to get the sizes
> in the display too. But I foolishly didn't bring it up when Hoss made my
> life easier with his simpler patch.
>
> Yonik Seeley wrote:
>> Might be the new Lucene fieldCache stats stuff that was recently added?
>>
>> -Yonik
>> http://www.lucidimagination.com
>>
>>
>> On Tue, Oct 6, 2009 at 3:56 PM, Joe Calderon  wrote:
>>
>>> hello *, ive been noticing that /admin/stats.jsp is really slow in the
>>> recent builds, has anyone else encountered this?
>>>
>>>
>>> --joe
>>>
>
>
> --
> - Mark
>
> http://www.lucidimagination.com
>
>
>
>


concatenating tokens

2009-10-08 Thread Joe Calderon
hello *, I'm using a combination of tokenizers and filters that give me
the desired tokens; however, for a particular field I want to
concatenate these tokens back into a single string. is there a filter to
do that? if not, what are the steps needed to make my own filter to
concatenate tokens?

for example, I start with "Sprocket (widget) - Blue" and the analyzers
churn out the tokens [sprocket, widget, blue]; I want to end up with the
string "sprocket widget blue". this is a simple example, and in the
general case lowercasing and punctuation removal does not work, hence
why I'm looking to concatenate tokens.
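a rough sketch of the kind of filter I have in mind, against the lucene
2.9 attribute api (untested; it also drops offsets and positions):

import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

// buffers the whole token stream and emits it back as one space-joined token
public final class ConcatFilter extends TokenFilter {
  private final TermAttribute termAtt;
  private boolean done = false;

  public ConcatFilter(TokenStream input) {
    super(input);
    termAtt = addAttribute(TermAttribute.class);
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (done) return false;
    done = true;
    StringBuilder sb = new StringBuilder();
    while (input.incrementToken()) {
      if (sb.length() > 0) sb.append(' ');
      sb.append(termAtt.term());
    }
    if (sb.length() == 0) return false;
    clearAttributes();
    termAtt.setTermBuffer(sb.toString());
    return true;
  }

  @Override
  public void reset() throws IOException {
    super.reset();
    done = false;
  }
}

(a small TokenFilterFactory wrapper would also be needed to reference it
from schema.xml)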

--joe


Re: Solr 1.4 release candidate

2009-10-14 Thread Joe Calderon
maybe I'm just not familiar with the way version numbers work in
trunk, but when I build the latest nightly the jars have names like
*-1.5-dev.jar. is that normal?

On Wed, Oct 14, 2009 at 7:01 AM, Yonik Seeley
 wrote:
> Folks, we've been in code freeze since Monday and a test release
> candidate was created yesterday, however it already had to be updated
> last night due to a serious bug found in Lucene.
>
> For now you can use the latest nightly build to get any recent changes
> like this:
> http://people.apache.org/builds/lucene/solr/nightly/
>
> We'll probably release the final bits next week, so in the meantime,
> download the latest nightly build and give it a spin!
>
> -Yonik
> http://www.lucidimagination.com
>


how to get field contents out of Document object

2009-10-14 Thread Joe Calderon
hello *, sorry if this seems like a dumb question; I'm still fairly new
to working with lucene/solr internals.

given a Document object, what is the proper way to fetch an integer
value for a field called "num_in_stock"? it is both indexed and stored.
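is it as simple as something like this (variable names made up)?

Document doc = searcher.doc(docId);
String raw = doc.get("num_in_stock");   // stored values come back as Strings
int numInStock = (raw == null) ? 0 : Integer.parseInt(raw);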

thx much

--joe


lucene 2.9 bug

2009-10-16 Thread Joe Calderon
hello *, I've read in other threads that lucene 2.9 had a serious bug
in it, hence trunk moved to 2.9.1-dev. I'm wondering what the bug is, as
I've been using the 2.9.0 version for the past few weeks with no problems.
is it critical to upgrade?

--joe


max words/tokens

2009-10-20 Thread Joe Calderon
I have a pretty basic question: is there an existing analyzer that
limits the number of words/tokens indexed from a field? let's say I only
wanted to index the top 25 words...

thx much

--joe


Re: max words/tokens

2009-10-20 Thread Joe Calderon
cool, np. I just didn't want to duplicate code if it already existed.
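fwiw, the trivial version I have in mind is roughly this (untested sketch):

import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

// passes through at most maxTokens tokens, then signals end of stream;
// note this keeps the *first* N tokens, not the top N by frequency
public final class LimitTokenFilter extends TokenFilter {
  private final int maxTokens;
  private int count = 0;

  public LimitTokenFilter(TokenStream input, int maxTokens) {
    super(input);
    this.maxTokens = maxTokens;
  }

  @Override
  public boolean incrementToken() throws IOException {
    return count++ < maxTokens && input.incrementToken();
  }

  @Override
  public void reset() throws IOException {
    super.reset();
    count = 0;
  }
}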

On Tue, Oct 20, 2009 at 12:49 PM, Yonik Seeley
 wrote:
> On Tue, Oct 20, 2009 at 1:53 PM, Joe Calderon  wrote:
>> i have a pretty basic question, is there an existing analyzer that
>> limits the number of words/tokens indexed from a field? let say i only
>> wanted to index the top 25 words...
>
> It would be really easy to write one, but no there is not currently.
>
> -Yonik
> http://www.lucidimagination.com
>


boostQParser and dismax

2009-10-22 Thread Joe Calderon
hello *, I was just reading over the wiki function query page and
found this little gem for boosting recent docs that's much better than
what I was doing before:

recip(ms(NOW,mydatefield),3.16e-11,1,1)


my question is, at the bottom it says:
The most effective way to use such a boost is to multiply it with the
relevancy score, rather than add it in. One way to do this is with the
boost query parser.
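(the example given there looks something like
q={!boost b=recip(ms(NOW,mydatefield),3.16e-11,1,1)}foo, if I'm reading
it right, with foo standing in for the user query)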


how exactly do I use the boost query parser along with the dismax
parser? can someone post an example solrconfig snippet?


thx much

--joe


field collapsing bug (java.lang.ArrayIndexOutOfBoundsException)

2009-10-23 Thread Joe Calderon
seems to happen when sorting on anything besides strictly score; even
'score desc, num desc' triggers it. using the latest nightly and the 10/14 patch

Problem accessing /solr/core1/select. Reason:

4731592

java.lang.ArrayIndexOutOfBoundsException: 4731592
at 
org.apache.lucene.search.FieldComparator$StringOrdValComparator.copy(FieldComparator.java:660)
at 
org.apache.solr.search.NonAdjacentDocumentCollapser$DocumentComparator.compare(NonAdjacentDocumentCollapser.java:235)
at 
org.apache.solr.search.NonAdjacentDocumentCollapser$DocumentPriorityQueue.lessThan(NonAdjacentDocumentCollapser.java:173)
at 
org.apache.lucene.util.PriorityQueue.insertWithOverflow(PriorityQueue.java:158)
at 
org.apache.solr.search.NonAdjacentDocumentCollapser.doCollapsing(NonAdjacentDocumentCollapser.java:95)
at 
org.apache.solr.search.AbstractDocumentCollapser.collapse(AbstractDocumentCollapser.java:208)
at 
org.apache.solr.handler.component.CollapseComponent.doProcess(CollapseComponent.java:98)
at 
org.apache.solr.handler.component.CollapseComponent.process(CollapseComponent.java:66)
at 
org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:195)
at 
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
at 
org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:233)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
at 
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
at 
org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1148)
at 
org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:387)
at 
org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
at 
org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
at 
org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:765)
at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:417)
at 
org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
at 
org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
at 
org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
at org.mortbay.jetty.Server.handle(Server.java:326)
at 
org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:534)
at 
org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:864)
at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:539)
at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
at 
org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:409)
at 
org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:520)


profiling solr

2009-10-26 Thread Joe Calderon
as a curiosity id like to use a profiler to see where solr
queries spend most of their time; im curious what tools, if any, others
use for this type of task...

im using jetty as my servlet container so ideally id like a profiler
thats compatible with it

--joe
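
a couple of low-friction options, assuming a sun jdk: jvisualvm (ships
with jdk 6) can attach to the running jetty pid, or you can start jetty
with the built-in hprof sampler, e.g.:

java -Xrunhprof:cpu=samples,depth=10,file=solr.hprof.txt -jar start.jar

the dump gets written when the jvm exits.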


field collapsing exception

2009-10-26 Thread Joe Calderon
found another exception. i cant find specific steps to reproduce,
besides starting with an unfiltered result and then, given an int field
with values (1,2,3), filtering by 3 triggers it sometimes. this is in
an index with very frequent updates and deletes


--joe


java.lang.NullPointerException
at 
org.apache.solr.search.fieldcollapse.collector.FieldValueCountCollapseCollectorFactory$FieldValueCountCollapseCollector.getResult(FieldValueCountCollapseCollectorFactory.java:84)
at 
org.apache.solr.search.fieldcollapse.AbstractDocumentCollapser.getCollapseInfo(AbstractDocumentCollapser.java:191)
at 
org.apache.solr.handler.component.CollapseComponent.doProcess(CollapseComponent.java:179)
at 
org.apache.solr.handler.component.CollapseComponent.process(CollapseComponent.java:121)
at 
org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:195)
at 
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
at 
org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:233)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
at 
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
at 
org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1148)
at 
org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:387)
at 
org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
at 
org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
at 
org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:765)
at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:417)
at 
org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
at 
org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
at 
org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
at org.mortbay.jetty.Server.handle(Server.java:326)
at 
org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:534)
at 
org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:864)
at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:539)
at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212)
at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
at 
org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:409)
at 
org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:520)


faceting ordering

2009-10-28 Thread Joe Calderon
curious...is it possible to have faceted results ordered by score?

im having a problem where im faceting on a field while searching for
the same word twice, for example:

im searching for "the the" on a tokenized field and faceting by the
untokenized version, faceting returns records with "the the", but way
at the bottom since everything with a single "the" happens to be way
more frequent, i tried restricting my search to phrases with small
slop:

myfield:"token1 token2 token3"~3

but that affects other searches negatively, ideally i want to be as
loose as possible as these searches power an auto suggest feature

i figured if faceted results could be sorted by score, i could simply
boost phrases instead of restricting by them, thoughts?


--joe
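
e.g., the boost-instead-of-restrict version (slop and boost values made
up):

q=myfield:(token1 token2 token3 "token1 token2 token3"~3^10)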


tokenize after filters

2009-11-02 Thread Joe Calderon
is it possible to tokenize a field on whitespace after some filters
have been applied?

ex: "A + W Root Beer"
the field uses a keyword tokenizer to keep the string together, then
it gets converted to "aw root beer" by a custom filter ive made. i
now want to split that up into 3 tokens (aw, root, beer), but it seems
like you cant use a tokenizer after a filter ... so whats the best way
of accomplishing this?

thx much

--joe
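
a rough sketch of a filter that does the splitting itself, against the
lucene 2.9 attribute api (untested; offsets are left as those of the
whole original token, and youd need a trivial factory subclassing
BaseTokenFilterFactory to use it from schema.xml):

import java.io.IOException;
import java.util.LinkedList;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

// splits each incoming token on whitespace and emits the pieces
// as separate tokens
public final class WhitespaceSplitFilter extends TokenFilter {
  private final TermAttribute termAtt = addAttribute(TermAttribute.class);
  private final PositionIncrementAttribute posAtt =
      addAttribute(PositionIncrementAttribute.class);
  private final LinkedList<String> pending = new LinkedList<String>();

  public WhitespaceSplitFilter(TokenStream in) { super(in); }

  @Override
  public boolean incrementToken() throws IOException {
    if (!pending.isEmpty()) {
      // emit the queued pieces of the previous token, one position apart
      termAtt.setTermBuffer(pending.removeFirst());
      posAtt.setPositionIncrement(1);
      return true;
    }
    if (!input.incrementToken()) return false;
    String[] parts = termAtt.term().trim().split("\\s+");
    termAtt.setTermBuffer(parts[0]);
    for (int i = 1; i < parts.length; i++) pending.add(parts[i]);
    return true;
  }
}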


Re: apply a patch on solr

2009-11-03 Thread Joe Calderon
patch -p0 < /path/to/field-collapse-5.patch

On Tue, Nov 3, 2009 at 7:48 PM, michael8  wrote:
>
> Hmmm, perhaps I jumped the gun.  I just looked over the field collapse patch
> for SOLR-236 and each file listed in the patch has its own revision #.
>
> E.g. from field-collapse-5.patch:
> --- src/java/org/apache/solr/core/SolrConfig.java       (revision 824364)
> --- src/solrj/org/apache/solr/client/solrj/response/QueryResponse.java
> (revision 816372)
> --- src/solrj/org/apache/solr/client/solrj/SolrQuery.java       (revision 
> 823653)
> --- src/java/org/apache/solr/search/SolrIndexSearcher.java      (revision 
> 794328)
> --- src/java/org/apache/solr/search/DocSetHitCollector.java     (revision
> 794328)
>
> Unless there is a better way, it seems like I would need to do "svn up
> --revision ..." for each of the files to be patched and then apply the
> patch?  This seems error prone and tedious.  Am I missing something simpler
> here?
>
> Michael
>
>
> michael8 wrote:
>>
>> Perfect.  This is what I need to know instead of patching 'in the dark'.
>> Good thing SVN revision cuts across all files like a tag.
>>
>> Thanks Mike!
>>
>> Michael
>>
>>
>> cambridgemike wrote:
>>>
>>> You can see what revision the patch was written for at the top of the
>>> patch,
>>> it will look like this:
>>>
>>> Index: org/apache/solr/handler/MoreLikeThisHandler.java
>>> ===
>>> --- org/apache/solr/handler/MoreLikeThisHandler.java (revision 772437)
>>> +++ org/apache/solr/handler/MoreLikeThisHandler.java (working copy)
>>>
>>> now check out revision 772437 using the --revision switch in svn, patch
>>> away, and then svn up to make sure everything merges cleanly.  This is a
>>> good guide to follow as well:
>>> http://www.mail-archive.com/solr-user@lucene.apache.org/msg10189.html
>>>
>>> cheers,
>>> -mike
>>>
>>> On Mon, Nov 2, 2009 at 3:55 PM, michael8  wrote:
>>>

 Hi,

 First I like to pardon my novice question on patching solr (1.4).  What
 I
 like to know is, given a patch, like the one for collapse field, how
 would
 one go about knowing what solr source that patch is meant for since this
 is
 a source level patch?  Wouldn't the exact versions of a set of java
 files
 to
 be patched critical for the patch to work properly?

 So far what I have done is to pull the latest collapse field patch down
 from
 http://issues.apache.org/jira/browse/SOLR-236 (field-collapse-5.patch),
 and
 then svn up the latest trunk from
 http://svn.apache.org/repos/asf/lucene/solr/trunk/, then patch and
 build.
 Intuitively I was thinking I should be doing svn up to a specific
 revision/tag instead of just latest.  So far everything seems fine, but
 I
 just want to make sure I'm doing the right thing and not just being
 lucky.

 Thanks,
 Michael
 --
 View this message in context:
 http://old.nabble.com/apply-a-patch-on-solr-tp26157827p26157827.html
 Sent from the Solr - User mailing list archive at Nabble.com.


>>>
>>>
>>
>>
>
> --
> View this message in context: 
> http://old.nabble.com/apply-a-patch-on-solr-tp26157827p26190563.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>
>


Re: apply a patch on solr

2009-11-03 Thread Joe Calderon
sorry, got cut off. patch, then ant clean dist, will give you the
modified solr war file. if it doesnt apply cleanly (which i dont think
is currently the case), you can go back to the latest revision
referenced in the patch.
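
e.g., using the newest revision listed in the patch header:

svn checkout -r 824364 http://svn.apache.org/repos/asf/lucene/solr/trunk solr-trunk
cd solr-trunk
patch -p0 < /path/to/field-collapse-5.patch
ant clean dist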


On Tue, Nov 3, 2009 at 8:17 PM, Joe Calderon  wrote:
> patch -p0 < /path/to/field-collapse-5.patch
>


wildcard oddity

2009-12-15 Thread Joe Calderon
im trying to do a wild card search

"q":"item_title:(gets*)"returns no results
"q":"item_title:(gets)"returns results
"q":"item_title:(get*)"returns results


seems like * at the end of a token is requiring a character; instead
of being 0 or more its acting like 1 or more

the text im trying to match is "The Gang Gets Extreme: Home Makeover Edition"

the field uses the following analyzers

is anybody else having similar problems?


best,
--joe


Re: SOLR Performance Tuning: Pagination

2009-12-24 Thread Joe Calderon
fwiw, when implementing distributed search i ran into a similar
problem, but then i noticed even google doesnt let you go past page
1000; easier to just set a limit on start

On Thu, Dec 24, 2009 at 8:36 AM, Walter Underwood  wrote:
> When do users do a query like that? --wunder
>
> On Dec 24, 2009, at 8:09 AM, Fuad Efendi wrote:
>
>> I used pagination for a while till found this...
>>
>>
>> I have filtered query ID:[* TO *] returning 20 millions results (no
>> faceting), and pagination always seemed to be fast. However, fast only with
>> low values for start=12345. Queries like start=28838540 take 40-60 seconds,
>> and even cause OutOfMemoryException.
>>
>> I use highlight, faceting on nontokenized "Country" field, standard handler.
>>
>>
>> It even seems to be a bug...
>>
>>
>> Fuad Efendi
>> +1 416-993-2060
>> http://www.linkedin.com/in/liferay
>>
>> Tokenizer Inc.
>> http://www.tokenizer.ca/
>> Data Mining, Vertical Search
>>
>
>


boosting on string distance

2009-12-29 Thread Joe Calderon
hello *, i want to boost documents that match the query better.
currently i also index my field as a string and boost if i match the
string field.

but im wondering if its possible to boost with the bf parameter using a
formula with the function strdist(). i know one of the arguments would
be the field name, but how do i specify the user query as the other
parameter?

http://wiki.apache.org/solr/FunctionQuery#strdist


best,

--joe
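
for example, by substituting the query text yourself when building the
request (field name made up; the wiki's strdist example uses a quoted
literal the same way):

defType=dismax&q=amazing grace&bf=strdist("amazing grace",title_exact,jw)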


score = result of function query

2009-12-30 Thread Joe Calderon
how can i make the score be solely the output of a function query?

the function query wiki page details something like
 q=boxname:findbox+_val_:"product(product(x,y),z)"&fl=*,score


but that doesnt seem to work


--joe
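
the function query parser does exactly this, e.g.:

q={!func}product(product(x,y),z)&fq=boxname:findbox&fl=*,score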


analyzer type="query" with NGramTokenFilterFactory forces phrase query

2009-12-31 Thread Joe Calderon
Hello *, im trying to make an index to support spelling errors/fuzzy
matching. ive indexed my document titles with NGramFilterFactory
minGramSize=2 maxGramSize=3, and using the analysis page i can see the
common grams match between the indexed value and the query value.
however, when i try to do a query for it, ex. title_ngram:(family), the
debug output says the query is converted to the phrase query "f a m i l
y fa am mi il ly fam ami mil ily". if this is the expected behavior is
there a way to override it?

or should i scrap this approach and use title:(family) and boost on
strdist("family", title, ngram, 3) ?


Re: analyzer type="query" with NGramTokenFilterFactory forces phrase query

2009-12-31 Thread Joe Calderon
"if this is the expected behaviour is there a way to override it?"[1]

[1] me

On Thu, Dec 31, 2009 at 10:13 AM, AHMET ARSLAN  wrote:
>> Hello *, im trying to make an index
>> to support spelling errors/fuzzy
>> matching, ive indexed my document titles with
>> NGramFilterFactory
>> minGramSize=2 maxGramSize=3, using the analysis page i can
>> see the
>> common grams match between the indexed value and the query
>> value,
>> however when i try to do a query for it ex.
>> title_ngram:(family)  the
>> debug output says the query is converted to a phrase query
>> "f a m i l
>> y fa am mi il ly fam ami mil ily", if this is the expected
>> behavior is
>> there a way to override it?
>
> "If a single token is split into more tokens during the analysis phase, solr 
> will do a phrase query instead of a term query." [1]
>
> [1]http://www.mail-archive.com/solr-user@lucene.apache.org/msg30055.html
>
>
>
>
>


custom wildcarding in qparser

2010-01-08 Thread Joe Calderon
hello *, what do i need to do to make a query parser that works just
like the standard query parser but also runs analyzers/tokenizers on a
wildcarded term? specifically im looking to wildcard only the last
token

ive tried the edismax qparser and the prefix qparser and neither is
exactly what im looking for. the problem im trying to solve is
matching wildcards on terms that can be entered multiple ways; i have
a set of analyzers that generate the various terms, ex wildcarding on
stemmed fields etc


thx much

--joe
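
a rough sketch of the kind of override involved, against solr 1.4
(untested; class name made up, and note a stemming analyzer can mangle
the prefix, so this fits filters like lowercasing or accent folding
best):

import java.io.IOException;
import java.io.StringReader;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.search.Query;
import org.apache.solr.schema.IndexSchema;
import org.apache.solr.search.SolrQueryParser;

// runs the literal part of a trailing-wildcard term through the
// field's query analyzer before building the wildcard query
public class AnalyzingWildcardParser extends SolrQueryParser {
  private final IndexSchema schema;

  public AnalyzingWildcardParser(IndexSchema schema, String defaultField) {
    super(schema, defaultField);
    this.schema = schema;
  }

  @Override
  protected Query getWildcardQuery(String field, String termStr)
      throws ParseException {
    if (termStr.endsWith("*") && termStr.indexOf('*') == termStr.length() - 1
        && termStr.indexOf('?') < 0) {
      String prefix = termStr.substring(0, termStr.length() - 1);
      return super.getWildcardQuery(field, analyze(field, prefix) + "*");
    }
    return super.getWildcardQuery(field, termStr);
  }

  // takes the first token the analyzer produces, falls back to the raw prefix
  private String analyze(String field, String prefix) {
    try {
      TokenStream ts = schema.getQueryAnalyzer()
          .tokenStream(field, new StringReader(prefix));
      TermAttribute term = ts.addAttribute(TermAttribute.class);
      return ts.incrementToken() ? term.term() : prefix;
    } catch (IOException e) {
      return prefix;
    }
  }
}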


help implementing a couple of business rules

2010-01-11 Thread Joe Calderon
hello *, im looking for help on writing queries to implement a few
business rules.


1. given a set of fields, how do i return matches across them while
excluding results that only match one specific field? ex: im using a
dismax parser currently but i want to exclude any results that only
match against a field called 'description2'


2. given a set of fields, how do i return matches across them where one
specific field may only match as a phrase? ex: im using a dismax
parser currently but i want matches against a field called 'people' to
only match as a phrase


thx much,

--joe


Re: help implementing a couple of business rules

2010-01-11 Thread Joe Calderon
thx, but im not sure that covers all edge cases, to clarify
1. matching description2 is okay if other fields are matched too, but
results matching only to description2 should be omitted

2. its okay to not match against the people field, but matches against
the people field should only be phrase matches

sorry if  i was unclear

--joe
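
e.g., assuming the other searchable fields are title and description
(untested):

for (1), keep description2 in qf but add a filter that requires a hit
outside it:

fq={!dismax qf='title description'}the user query

for (2), drop people from qf and OR in a phrase query against it via a
nested query:

q=_query_:"{!dismax qf='title description'}the user query" OR people:"the user query"
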
On Mon, Jan 11, 2010 at 10:13 AM, Erik Hatcher  wrote:
>
> On Jan 11, 2010, at 12:56 PM, Joe Calderon wrote:
>>
>> 1. given a set of fields how to return matches that match across them
>> but not just one specific one, ex im using a dismax parser currently
>> but i want to exclude any results that only match against a field
>> called 'description2'
>
> One way could be to add an fq parameter to the request:
>
>   &fq=-description2:()
>
>> 2. given a set of fields how to return matches that match across them
>> but on one specific field match as a phrase only, ex im using a dismax
>> parser currently but i want matches against a field called 'people' to
>> only match as a phrase
>
> Doesn't setting pf=people accomplish this?
>
>        Erik
>
>


Re: Solr 1.4 Field collapsing - What are the steps for applying the SOLR-236 patch?

2010-01-11 Thread Joe Calderon
it seems to be in flux right now as the solr developers slowly make
improvements and ingest the various pieces into the solr trunk. i think
your best bet might be to use the 12/24 patch and fix any errors where
it doesnt apply cleanly


im using solr trunk r892336 with the 12/24 patch


--joe
On 01/11/2010 08:48 PM, Kelly Taylor wrote:

Hi,

Is there a step-by-step for applying the patch for SOLR-236 to enable field
collapsing in Solr 1.4?

Thanks,
Kelly
   




Re: question about date boosting

2010-01-12 Thread Joe Calderon

I think you need to use the new trieDateField
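
e.g. the trie-based date type from the 1.4 example schema, plus a field
using it (field name made up):

<fieldType name="tdate" class="solr.TrieDateField" omitNorms="true" precisionStep="6" positionIncrementGap="0"/>
<field name="mydatefield" type="tdate" indexed="true" stored="true"/>

ms() works against trie dates, so recip(ms(NOW,mydatefield),3.16e-11,1,1)
should stop complaining once the field is reindexed as tdate.
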
On 01/12/2010 07:06 PM, Daniel Higginbotham wrote:

Hello,

I'm trying to boost results based on date using the first example 
here: http://wiki.apache.org/solr/SolrRelevancyFAQ#How_can_I_boost_the_score_of_newer_documents



However, I'm getting an error that reads, "Can't use ms() function on 
non-numeric legacy date field"


The date field uses solr.DateField . What am I doing wrong?

Thank you!
Daniel Higginbotham




Re: Field collapsing patch error

2010-01-19 Thread Joe Calderon
this has come up before, my suggestions would be to use the 12/24
patch with trunk revision 892336

http://www.lucidimagination.com/search/document/797549d29e1810d9/solr_1_4_field_collapsing_what_are_the_steps_for_applying_the_solr_236_patch

2010/1/19 Licinio Fernández Maurelo :
> Hi folks,
>
> i've downloaded solr release 1.4 and tried to apply  latest field collapsing
> patchi've
> found . Found errors :
>
> d...@backend05:~/workspace/solr-release-1.4.0$ patch -p0 -i SOLR-236.patch
>
> patching file src/test/test-files/solr/conf/solrconfig-fieldcollapse.xml
> patching file src/test/test-files/solr/conf/schema-fieldcollapse.xml
> patching file src/test/test-files/solr/conf/solrconfig.xml
> patching file src/test/test-files/fieldcollapse/testResponse.xml
> patching file
> src/test/org/apache/solr/search/fieldcollapse/FieldCollapsingIntegrationTest.java
> patching file
> src/test/org/apache/solr/search/fieldcollapse/DistributedFieldCollapsingIntegrationTest.java
> patching file
> src/test/org/apache/solr/search/fieldcollapse/NonAdjacentDocumentCollapserTest.java
>
> patching file
> src/test/org/apache/solr/search/fieldcollapse/AdjacentCollapserTest.java
>
> patching file
> src/test/org/apache/solr/handler/component/CollapseComponentTest.java
>
> patching file
> src/test/org/apache/solr/client/solrj/response/FieldCollapseResponseTest.java
>
> patching file
> src/java/org/apache/solr/search/DocSetAwareCollector.java
>
> patching file
> src/java/org/apache/solr/search/fieldcollapse/CollapseGroup.java
>
> patching file
> src/java/org/apache/solr/search/fieldcollapse/DocumentCollapseResult.java
>
> patching file
> src/java/org/apache/solr/search/fieldcollapse/DocumentCollapser.java
>
> patching file
> src/java/org/apache/solr/search/fieldcollapse/collector/CollapseCollectorFactory.java
>
> patching file
> src/java/org/apache/solr/search/fieldcollapse/collector/DocumentGroupCountCollapseCollectorFactory.java
>
> patching file
> src/java/org/apache/solr/search/fieldcollapse/collector/aggregate/AverageFunction.java
>
> patching file
> src/java/org/apache/solr/search/fieldcollapse/collector/aggregate/MinFunction.java
>
> patching file
> src/java/org/apache/solr/search/fieldcollapse/collector/aggregate/SumFunction.java
>
> patching file
> src/java/org/apache/solr/search/fieldcollapse/collector/aggregate/MaxFunction.java
>
> patching file
> src/java/org/apache/solr/search/fieldcollapse/collector/aggregate/AggregateFunction.java
>
> patching file
> src/java/org/apache/solr/search/fieldcollapse/collector/CollapseContext.java
>
> patching file
> src/java/org/apache/solr/search/fieldcollapse/collector/DocumentFieldsCollapseCollectorFactory.java
> patching file
> src/java/org/apache/solr/search/fieldcollapse/collector/AggregateCollapseCollectorFactory.java
>
> patching file
> src/java/org/apache/solr/search/fieldcollapse/collector/CollapseCollector.java
>
> patching file
> src/java/org/apache/solr/search/fieldcollapse/collector/FieldValueCountCollapseCollectorFactory.java
>
> patching file
> src/java/org/apache/solr/search/fieldcollapse/collector/AbstractCollapseCollector.java
>
> patching file
> src/java/org/apache/solr/search/fieldcollapse/AbstractDocumentCollapser.java
>
> patching file
> src/java/org/apache/solr/search/fieldcollapse/NonAdjacentDocumentCollapser.java
>
> patching file
> src/java/org/apache/solr/search/fieldcollapse/AdjacentDocumentCollapser.java
>
> patching file
> src/java/org/apache/solr/search/fieldcollapse/util/Counter.java
>
> patching file
> src/java/org/apache/solr/search/SolrIndexSearcher.java
>
> patching file
> src/java/org/apache/solr/search/DocSetHitCollector.java
>
> patching file
> src/java/org/apache/solr/handler/component/CollapseComponent.java
>
> patching file
> src/java/org/apache/solr/handler/component/QueryComponent.java
>
> Hunk #1 FAILED at
> 522.
>
> 1 out of 1 hunk FAILED -- saving rejects to file
> src/java/org/apache/solr/handler/component/QueryComponent.java.rej
>
> patching file
> src/java/org/apache/solr/util/DocSetScoreCollector.java
>
> patching file
> src/common/org/apache/solr/common/params/CollapseParams.java
>
> patching file src/solrj/org/apache/solr/client/solrj/SolrQuery.java
> Hunk #1 FAILED at 17.
> Hunk #2 FAILED at 50.
> Hunk #3 FAILED at 76.
> Hunk #4 FAILED at 148.
> Hunk #5 FAILED at 197.
> Hunk #6 succeeded at 510 (offset -155 lines).
> Hunk #7 succeeded at 566 (offset -155 lines).
> 5 out of 7 hunks FAILED -- saving rejects to file
> src/solrj/org/apache/solr/client/solrj/SolrQuery.java.rej
> patching file
> src/solrj/org/apache/solr/client/solrj/response/QueryResponse.java
> Hunk #1 succeeded at 17 with fuzz 1.
> Hunk #2 FAILED at 42.
> Hunk #3 FAILED at 58.
> Hunk #4 succeeded at 117 with fuzz 2 (offset -8 lines).
> Hunk #5 succeeded at 315 with fuzz 2 (offset 17 lines).
> 2 out of 5 hunks FAILED -- saving rejects to file
> src/solrj/org/apache/solr/client/solrj/response/QueryResp

create requesthandler with default shard parameter for different query parser

2010-01-21 Thread Joe Calderon
hello *, what is the best way to create a requesthandler for
distributed search with a default shards parameter but that can use
different query parsers?

thus far i have

<requestHandler name="..." class="solr.SearchHandler">
  <lst name="defaults">
    <str name="fl">*,score</str>
    <str name="wt">json</str>
    <str name="shards">host0:8080/solr/core0,host1:8080/solr/core1,host2:8080/solr/core2,localhost:8080/solr/core3</str>
  </lst>
  <arr name="components">
    <str>query</str>
    <str>facet</str>
    <str>spellcheck</str>
    <str>debug</str>
  </arr>
</requestHandler>


which works as long as qt=standard; if i change it to dismax it doesnt
use the shards parameter anymore...


thx much

--joe


Re: create requesthandler with default shard parameter for different query parser

2010-01-21 Thread Joe Calderon
thx much, i see now; having request handlers with the same name as the
query parsers was confusing me. i do however have an additional
problem: if i use defType it does indeed use the right query parser,
but is there a way to not send all the query parameters in the url
(qf, pf, bf etc)? its the main reason im creating the new request
handler. or do i put them all as defaults under my new request handler
and let the query parser use whichever ones it supports?

On Thu, Jan 21, 2010 at 11:45 AM, Yonik Seeley
 wrote:
> On Thu, Jan 21, 2010 at 2:39 PM, Joe Calderon  wrote:
>> hello *, what is the best way to create a requesthandler for
>> distributed search with a default shards parameter but that can use
>> different query parsers
>>
>> thus far i have
>>
>>  
>>    
>>     
>>       *,score
>>       json
>>       > name="shards">host0:8080/solr/core0,host1:8080/solr/core1,host2:8080/solr/core2,localhost:8080/solr/core3
>>    
>>    
>>      query
>>      facet
>>      spellcheck
>>      debug
>>    
>>  
>>
>>
>> which works as long as qt=standard, if i change it to dismax it doenst
>> use the shards parameter anymore...
>
> Legacy terminology causing some confusion I think... qt does stand for
> "query type", but it actually picks the request handler.
> "defType" defines the default query parser to use, so you probably
> don't want to be using "qt" at all.
>
> So try something like:
> http://localhost:8983/solr/ds?defType=dismax&qf=text&q=foo
>
> -Yonik
> http://www.lucidimagination.com
>


Re: index of facet fields are not same as original string

2010-01-28 Thread Joe Calderon
facets are based off the indexed version of your field, not the stored
version; you probably have an analyzer thats removing punctuation.
most people index the same field multiple ways for different purposes:
matching, sorting, faceting etc...

index a copy of your field as string type and facet on that
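
e.g. (field names made up):

<field name="brand" type="text" indexed="true" stored="true"/>
<field name="brand_exact" type="string" indexed="true" stored="false"/>
<copyField source="brand" dest="brand_exact"/>

then filter with fq=brand_exact:apple (quote the value if it contains
spaces)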

On Thu, Jan 28, 2010 at 3:12 AM, Sergey Pavlikovskiy
 wrote:
> Hi,
>
> probably, it's because of stemming
> if you need unstemmed text you can use 'textgen' data type for the field
>
> Sergey
>
> On Thu, Jan 28, 2010 at 12:25 PM, Solr user wrote:
>
>>
>> Hi,
>>
>>  I am new to Solr. I found facets fields does not reflect the original
>> string in the record. For example,
>>
>> the returned xml is,
>>
>> - 
>>  G-EUPE
>> 
>> - 
>>  
>> - 
>> -       
>>  1
>> 
>>  
>> -  
>>  
>>
>> Here, "G-EUPE" is displayed under facet field as 'gupe' where it is not
>> capital and missing '-' from the original string. Is there any way we could
>> fix this to match the original text in record? Thanks in advance.
>>
>> Regards,
>> uma
>> --
>> View this message in context:
>> http://old.nabble.com/index-of-facet-fields-are-not-same-as-original-string-tp27353838p27353838.html
>> Sent from the Solr - User mailing list archive at Nabble.com.
>>
>>
>


distributed search and failed core

2010-01-29 Thread Joe Calderon
hello *, in distributed search when a shard goes down, an error is
returned and the search fails, is there a way to avoid the error and
return the results from the shards that are still up?

thx much

--joe


Re: Basic indexing question

2010-02-02 Thread Joe Calderon
by default solr will only search the default field; you have to
either query each field explicitly, ex field1:(ore) OR field2:(ore) OR
field3:(ore), or use a different query parser like dismax

On Tue, Feb 2, 2010 at 3:31 PM, Stefan Maric  wrote:
> I have got a basic configuration of Solr up and running and have loaded some 
> data to experiment with
>  When I run a query for 'ore' I get 3 results when I'm expecting 4
> Dataimport is pulling the expected number of rows in from my DB view
>
>  In my schema.xml I have
>   />
>   required="true" />
>  
>  
>
>  and  the defaults
>  multiValued="true"/>
> 
>
>  From an SQL point of view - I am expecting a search for 'ore' to retrieve 4 
> results (which the following does)
> select * from v_sm_search_sectors where description like '% ore%' or name 
> like '% ore%';
> 121 B0.010.010  Mining and quarrying  
> Mining of metal ore, stone, sand, clay, coal and other solid minerals
> 1000144 E0.030  Metal and metal ores wholesale   
> (null)
> 1000145 E0.030.010  Metal and metal ores wholesale   (null)
> 1000146 E0.030.020  Metal and metal ores wholesale agents   (null)
>
> From a Solr query for 'ore' - I get the following
> 
> -
>  
>  0
>  0
>  -
>  
>  10
>  0
>  on
>  ore
>  2.2
>  
>  
>  -
>  
>  -
>  
>  E0.030
>  1000144
>  Metal and metal ores wholesale
>  
>  -
>  
>  E0.030.010
>  1000145
>  Metal and metal ores wholesale
>  
>  -
>  
>  E0.030.020
>  1000146
>  Metal and metal ores wholesale agents
>  
>  
>  
>
>
>  So I don't retrieve the document where 'ore' is in the descritpion field 
> (and NOT the name field)
>
>  It would seem that Solr is ONLY returning me results based on what has 
> been put into the  dest="text"/>
>
>  Any hints as to what I've missed ??
>
>  Regards
>  Stefan Maric
>


Re: Basic indexing question

2010-02-02 Thread Joe Calderon
see http://wiki.apache.org/solr/SchemaXml#The_Default_Search_Field for
details on the default field. most people use the dismax handler when
handling queries from users, see
http://wiki.apache.org/solr/DisMaxRequestHandler for more details.
if you dont have many fields you can write your own query using the
lucene query parser as i mentioned before; the syntax can be found at
http://lucene.apache.org/java/2_9_1/queryparsersyntax.html
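
e.g. a dismax query across the two fields from your view would look
like (untested):

http://localhost:8983/solr/select?defType=dismax&qf=name+description&q=ore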

hope this helps


--joe
On Tue, Feb 2, 2010 at 3:59 PM, Stefan Maric  wrote:
> Thanks for the quick reply
> I will have to see if the default query mechanism will suffice for most of
> my needs
>
> I have skimmed through most of the Solr documentation and didn't see
> anything describing
>
> I can easily change my DB View so that I only source Solr with a single
> string plus my id field
> (as my application makng the search will have to collate associated
> information into a presentable screen anyhow - so I'm not too worried about
> info being returned by Solr as such)
>
> Would that be a reasonable way of using Solr
>
>
>
>
> -Original Message-
> From: Joe Calderon [mailto:calderon@gmail.com]
> Sent: 02 February 2010 23:42
> To: solr-user@lucene.apache.org
> Subject: Re: Basic indexing question
>
>
> by default solr will only search the default fields, you have to
> either query all fields field1:(ore) or field2:(ore) or field3:(ore)
> or use a different query parser like dismax
>
> On Tue, Feb 2, 2010 at 3:31 PM, Stefan Maric  wrote:
>> I have got a basic configuration of Solr up and running and have loaded
> some data to experiment with
>>  When I run a query for 'ore' I get 3 results when I'm expecting 4
>> Dataimport is pulling the expected number of rows in from my DB view
>>
>>  In my schema.xml I have
>>   required="true" />
>>   required="true" />
>>  
>>  
>>
>>  and  the defaults
>>  multiValued="true"/>
>> 
>>
>>  From an SQL point of view - I am expecting a search for 'ore' to retrieve
> 4 results (which the following does)
>> select * from v_sm_search_sectors where description like '% ore%' or name
> like '% ore%';
>> 121 B0.010.010      Mining and quarrying
> Mining of metal ore, stone, sand, clay, coal and other solid minerals
>> 1000144 E0.030              Metal and metal ores wholesale
> (null)
>> 1000145 E0.030.010      Metal and metal ores wholesale
> (null)
>> 1000146 E0.030.020      Metal and metal ores wholesale agents   (null)
>>
>> From a Solr query for 'ore' - I get the following
>> 
>> -
>>      
>>      0
>>      0
>>      -
>>      
>>      10
>>      0
>>      on
>>      ore
>>      2.2
>>      
>>      
>>      -
>>      
>>      -
>>      
>>      E0.030
>>      1000144
>>      Metal and metal ores wholesale
>>      
>>      -
>>      
>>      E0.030.010
>>      1000145
>>      Metal and metal ores wholesale
>>      
>>      -
>>      
>>      E0.030.020
>>      1000146
>>      Metal and metal ores wholesale agents
>>      
>>      
>>      
>>
>>
>>      So I don't retrieve the document where 'ore' is in the descritpion
> field (and NOT the name field)
>>
>>      It would seem that Solr is ONLY returning me results based on what
> has been put into the  dest="text"/>
>>
>>      Any hints as to what I've missed ??
>>
>>      Regards
>>      Stefan Maric
>>
> No virus found in this incoming message.
> Checked by AVG - www.avg.com
> Version: 8.5.435 / Virus Database: 271.1.1/2663 - Release Date: 02/02/10
> 07:35:00
>
>


Re: distributed search and failed core

2010-02-03 Thread Joe Calderon
thx guys, i ended up using a mix of code from the solr-1143 and
solr-1537 patches. now whenever there is an exception there is a
section in the results indicating the result is partial and also
listing the failed core(s). weve added some monitoring to check for
that output as well, to alert us when a shard has failed

On Wed, Feb 3, 2010 at 10:55 AM, Yonik Seeley
 wrote:
> On Fri, Jan 29, 2010 at 3:31 PM, Joe Calderon  wrote:
>> hello *, in distributed search when a shard goes down, an error is
>> returned and the search fails, is there a way to avoid the error and
>> return the results from the shards that are still up?
>
> The SolrCloud branch has load-balancing capabilities for distributed
> search amongst shard replicas.
> http://wiki.apache.org/solr/SolrCloud
>
> -Yonik
> http://www.lucidimagination.com
>


fuzzy matching / configurable distance function?

2010-02-04 Thread Joe Calderon
is it possible to configure the distance formula used by fuzzy
matching? i see there are others on the function query page under
strdist, but im wondering if they are applicable to fuzzy matching

thx much


--joe


source tree for lucene

2010-02-04 Thread Joe Calderon
i want to recompile lucene with
http://issues.apache.org/jira/browse/LUCENE-2230, but im not sure
which source tree to use. i tried using the implied trunk revision
from the admin/system page, but solr fails to build with the generated
jars, even if i exclude the patches from 2230...

im wondering if there is another lucene tree i should grab to use to build solr?


--joe


old wildcard highlighting behaviour

2010-02-05 Thread Joe Calderon
hello *, currently with hl.usePhraseHighlighter=true, a query for (joe
jack*) will highlight joe jackson. however, after reading the
archives, what im looking for is the old 1.1 behaviour, so that only
joe jack is highlighted; is this possible in solr 1.5?


thx  much
--joe


Re: old wildcard highlighting behaviour

2010-02-06 Thread Joe Calderon
when i set hl.highlightMultiTerm=false the term that matches the
wildcard is not highlighted at all. ideally id like a partial highlight
(the characters before the wildcard), but if not i can live without it

thx much for the help

--joe

On Fri, Feb 5, 2010 at 10:44 PM, Mark Miller  wrote:
> On iPhone so don't remember exact param I named it, but check wiki -
> something like hl.highlightMultiTerm - set it to false.
>
> - Mark
>
> http://www.lucidimagination.com (mobile)
>
> On Feb 6, 2010, at 12:00 AM, Joe Calderon  wrote:
>
>> hello *, currently with hl.usePhraseHighlighter=true, a query for (joe
>> jack*) will highlight joe jackson, however after reading the
>> archives, what im looking for is the old 1.1 behaviour so that only
>> joe jack is highlighted, is this possible in solr 1.5 ?
>>
>>
>> thx  much
>> --joe
>


analysing wild carded terms

2010-02-09 Thread Joe Calderon
hello *, quick question, what would i have to change in the query
parser to allow wildcarded terms to go through text analysis?


Re: question/suggestion for Solr-236 patch

2010-02-10 Thread Joe Calderon
you can do that very easily yourself in a post processing step after
you receive the solr response

On Wed, Feb 10, 2010 at 8:12 AM, gdeconto
 wrote:
>
> I have been able to apply and use the solr-236 patch (field collapsing)
> successfully.
>
> Very, very cool and powerful.
>
> My one comment/concern is that the collapseCount and aggregate function
> values in the collapse_counts list only represent the collapsed documents
> (ie the ones that are not shown in results).
>
> Are there any plans to include the non-collapsed (?) document in the
> collapseCount and aggregate function values (ie so that it includes ALL
> documents, not just the collapsed ones)?  Possibly via some parameter like
> collapse.includeAll?
>
> I think this would be a great addition to the collapse code (and solr
> functionality) via what I would think is a small change.
> --
> View this message in context: 
> http://old.nabble.com/question-suggestion-for-Solr-236-patch-tp27533695p27533695.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>
>


Re: analysing wild carded terms

2010-02-10 Thread Joe Calderon
sorry, what i meant to say is: apply text analysis to the part of the
query that is wildcarded. for example, if a term with latin1 diacritics
is wildcarded, id still like to run it through ISOLatin1AccentFilter

On Wed, Feb 10, 2010 at 4:59 AM, Fuad Efendi  wrote:
>> hello *, quick question, what would i have to change in the query
>> parser to allow wildcarded terms to go through text analysis?
>
> I believe it is illogical. "wildcarded terms" will go through terms
> enumerator.
>
>
>


Re: How to reindex data without restarting server

2010-02-11 Thread Joe Calderon
if you use the core model via solr.xml you can reload a core without
having to restart the servlet container:

http://wiki.apache.org/solr/CoreAdmin
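
e.g. (core name made up):

http://localhost:8983/solr/admin/cores?action=RELOAD&core=core0

note a schema change still usually means reindexing your documents; the
reload just picks up the new config without a restart.
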
On 02/11/2010 02:40 PM, Emad Mushtaq wrote:

Hi,

I would like to know if there is a way of reindexing data without restarting
the server. Lets say I make a change in the schema file. That would require
me to reindex data. Is there a solution to this ?

   




reloading sharedlib folder

2010-02-12 Thread Joe Calderon
when using solr.xml, you can specify a sharedLib directory to share
among cores. is it possible to reload the classes in this dir without
having to restart the servlet container? it would be useful to be able
to make changes to those classes on the fly or be able to drop in new
plugins


problem with edgengramtokenfilter and highlighter

2010-02-13 Thread Joe Calderon
i ran into a problem while using the edgengramtokenfilter: it seems to
report incorrect offsets when generating tokens. more specifically, all
the grams have start offset 0 and the gram length as the end offset,
which leads to goofy highlighting behavior when creating edge grams for
tokens beyond the first one. i created a small patch that takes the
start of the original token into account and adds that to the reported
start/end offsets.


Re: problem with edgengramtokenfilter and highlighter

2010-02-14 Thread Joe Calderon

lucene-2266 filed and patch posted.
On 02/13/2010 09:14 PM, Robert Muir wrote:

Joe, can you open a Lucene JIRA issue for this?

I just glanced at the code and it looks like a bug to me.

On Sun, Feb 14, 2010 at 12:07 AM, Joe Calderon wrote:

   

i ran into a problem while using the edgengramtokenfilter, it seems to
report incorrect offsets when generating tokens, more specifically all
the tokens have offset 0 and term length as start and end, this leads
to goofy highlighting behavior when creating edge grams for tokens
beyond the first one, i created a small patch that takes into account
the start of the original token and adds that to the reported
start/end offsets.

 



   




Re: defaultSearchField and DisMaxRequestHandler

2010-02-15 Thread Joe Calderon

no but you can set a default for the qf parameter with the same value
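
e.g., assuming your defaultSearchField is text:

<requestHandler name="dismax" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">dismax</str>
    <str name="qf">text</str>
  </lst>
</requestHandler>
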
On 02/15/2010 01:50 AM, Steve Radhouani wrote:

Hi there,
Can the defaultSearchField option be used by the DisMaxRequestHandler?
  Thanks,
-Steve

   




Re: Reindex after changing defaultSearchField?

2010-02-17 Thread Joe Calderon
no, youre just changing how youre querying the index, not the actual
index. you will need to restart the servlet container or reload the
core for the config changes to take effect tho

On 02/17/2010 10:04 AM, Frederico Azeiteiro wrote:

Hi,



If i change the "defaultSearchField" in the core schema, do I need to
recreate the index?



Thanks,

Frederico




   




Re: including 'the' dismax query kills results

2010-02-18 Thread Joe Calderon
use the common grams filter, itll create tokens for stop words and
their adjacent terms
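
e.g., in the field's analyzer chain (index side shown; theres a
matching CommonGramsQueryFilterFactory for the query side):

<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.CommonGramsFilterFactory" words="stopwords.txt" ignoreCase="true"/>

"the british open" then also produces tokens like the_british, so mm
can stay at 100% without stopwords killing the match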

On Thu, Feb 18, 2010 at 7:16 AM, Nagelberg, Kallin
 wrote:
> I've noticed some peculiar behavior with the dismax searchhandler.
>
> In my case I'm making the search "The British Open", and am getting 0 
> results. When I change it to "British Open" I get many hits. I looked at the 
> query analyzer and it should be broken down to "british" and "open" tokens 
> ('the' is a stopword). I imagine it is doing an 'and' type search, and by 
> setting the 'mm' parameter to 1 I once again get results for 'the british 
> open'. I would like mm to be 100% however, but just not care about stopwords. 
> Is there a way to do this?
>
> Thanks,
> -Kal
>


Re: Autosuggest/Autocomplete with solr 1.4 and EdgeNGrams

2010-02-24 Thread Joe Calderon
i had to create an autosuggest implementation not too long ago.
originally i was using faceting, where i would match wildcards on a
tokenized field and facet on an unaltered field; this had the
advantage that i could do everything from one index, though it was
also limited by the fact suggestions came through facets, and scoring
and highlighting went out the window


what i settled on was to create a separate core for suggest to use; i
analyze the fields i want to match against with a whitespace tokenizer
and an edgengram filter. this has multiple advantages:
- the query is run through text analysis, whereas wildcarded terms are not
- the highlighter will highlight only the text matched, not the expanded word
- scoring and boosts can be used to rank suggest results

i tokenize on whitespace so i can match out-of-order tokens, ex
q=family guy stewie and q=stewie family guy, etc; this is something
that prefix-based solutions wont be able to do

one small gotcha is that i recently submitted a patch to the edgengram
filter to fix highlighting behaviour; it has been committed to lucenes
trunk but its only available in versions 2.9.2 and up unless you patch
it yourself
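
a rough sketch of the kind of field type described above (gram sizes
are just examples):

<fieldType name="suggest" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25" side="front"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>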

On Wed, Feb 24, 2010 at 7:35 AM, Grant Ingersoll  wrote:
> You might also look at http://issues.apache.org/jira/browse/SOLR-1316
>
> On Feb 24, 2010, at 1:17 AM, Sachin wrote:
>
>>
>>
>> Hi All,
>>
>> I am trying to setup autosuggest using solr 1.4 for my site and needed some 
>> pointers on that. Basically, we provide autosuggest for user typed in 
>> characters in the searchbox. The autosuggest index is created with older 
>> user typed in search queries which returned > 0 results. We do some lazy 
>> writing to store this information into the db and then export it to solr on 
>> a nightly basis. As far as I know, there are 3 ways (apart from wild card 
>> search) of achieving autosuggest using solr 1.4:
>>
>> 1. Use EdgeNGrams
>> 2. Use shingles and prefix query.
>> 3. Use the new Terms component.
>>
>> I am for now more inclinded towards using the EdgeNGrams (no method to 
>> madness) and just wanted to know is there any recommended approach out of 
>> the 3 in terms of performance, since the user excepts the suggestions to be 
>> almost instantaneous? We do some heavy caching at our end to avoid hitting 
>> solr everytime but is any of these 3 approaches faster than the other?
>>
>> Also, I would also like to return the suggestion even if the user typed in 
>> query matches in between: for instance if I have the query "chicken pasta" 
>> in my index and the user types in "pasta", I would also like this query to 
>> be returned as part of the suggestion (ala Yahoo!). Below is my field 
>> definition:
>>
>>        > positionIncrementGap="100">
>>            
>>                
>>                
>>                > maxGramSize="50" />
>>            
>>            
>>                
>>                
>>            
>>        
>>
>>
>> I tried changing the KeywordTokenizerFactory with LetterTokenizerFactory, 
>> and though it works great for the above scenario (does a in-between match), 
>> it has the side-effect of removing everything which are not letters so if 
>> the user types in "123" he gets absolutely no suggestions. Is there anything 
>> that I'm missing in my configuration, is this even achievable by using 
>> EdgeNGrams or shall I look at using perhaps the TermsComponent after 
>> applying the regex patch from 1.5 and maybe do something like 
>> ".*user-typed-in-chars.*"?
>>
>> Thanks!
>>
>>
>>
>
>
>


Re: Changing term frequency according to value of one of the fields

2010-02-26 Thread Joe Calderon
extend the similarity class, compile it against the jars in lib, put it
in a path solr can find, and set your schema to use it

http://wiki.apache.org/solr/SolrPlugins#Similarity
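
the skeleton, for reference (class/package name made up). note tf()
only sees the raw term frequency, not the document, so a per-doc field
value like 'count' cant be read from inside this hook:

import org.apache.lucene.search.DefaultSimilarity;

public class MySimilarity extends DefaultSimilarity {
  @Override
  public float tf(float freq) {
    return (float) Math.sqrt(freq); // put your formula here
  }
}

and in schema.xml:

<similarity class="com.example.MySimilarity"/>
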
On 02/25/2010 10:09 PM, Pooja Verlani wrote:

Hi,
I want to modify Similarity class for my app like the following-
Right now tf is Math.sqrt(termFrequency)
I would like to modify it to
Math.sqrt(termFrequncy/solrDoc.getFieldValue("count"))
where count is one of the fields in the particular solr document.
Is it possible to do so? Can I import solrDocument class and take the
particular solrDoc for calculating tf in the similarity class?

Please suggest.

regards,
Pooja

   




Re: Solr 1.4 distributed search configuration

2010-02-26 Thread Joe Calderon
you can set a default shards parameter on the request handler doing
distributed search; you can set up two different request handlers, one
with a shards default and one without

On Thu, Feb 25, 2010 at 1:35 PM, Jeffrey Zhao
 wrote:
> Now I got it, just forgot put qt=search in query.
>
> By the way, in solr 1.3, I used shards.txt under conf directory and
> "distributed=true" in query for distributed search.  In that way,in my
> java application, I can hard code solr query with "distributed=true" and
> control the using of distributed search by  define shards.txt or not.
>
> In solr 1.4, it is more difficult to use distributed search dynamically.Is
> there a way I just change configuration  without changing query to make DS
> work?
>
> Thanks,
>
>
>
> From:   Mark Miller 
> To:     solr-user@lucene.apache.org
> Date:   25/02/2010 04:13 PM
> Subject:        Re: Solr 1.4 distributed search configuration
>
>
>
> Can you elaborate on "doesn't work" when you put it in the /search
> handler?
>
> You get an error in the logs? Nothing happens?
>
> On 02/25/2010 03:47 PM, Jeffrey Zhao wrote:
>> Hi Mark,
>>
>> Thanks for your reply. I did make a new handler as following, but it
> does
>> not work, anything wrong with my configuration?
>>
>> Thanks,
>>
>>   
>>        
>>         
>>           
>> name="shards">202.161.196.189:8080/solr,localhost:8080/solr
>>         
>>       
>>         query
>>         facet
>>         spellcheck
>>         debug
>>       
>> 
>>
>>
>>
>> From:   Mark Miller
>> To:     solr-user@lucene.apache.org
>> Date:   25/02/2010 03:41 PM
>> Subject:        Re: Solr 1.4 distributed search configuration
>>
>>
>>
>> On 02/25/2010 03:32 PM, Jeffrey Zhao wrote:
>>
>>> How do define a new search handler with a shards parameter?  I defined
>>>
>> as
>>
>>> following way but it doesn't work. If I put the shards parameter in
>>> default handler, it seems I got an infinite loop.
>>>
>>>
>>> >>
>> default="true">
>>
>>>       
>>>        
>>>          explicit
>>>        
>>>     
>>>
>>> 
>>>       
>>>        
>>>          >> name="shards">202.161.196.189:8080/solr,localhost:8080/solr
>>>        
>>>      
>>>        query
>>>        facet
>>>        spellcheck
>>>        debug
>>>      
>>>     
>>>
>>>
>>> Thanks,
>>>
>>>
>> Not seeing this on the wiki (it should be there), but you can't put the
>> shards param on the default search handler without causing an infinite
>> loop - you have to make a new request handler and put it on that.
>>
>>
>
>
> --
> - Mark
>
> http://www.lucidimagination.com
>
>
>
>
>
>


Re: Search Result differences Standard vs DisMax

2010-03-01 Thread Joe Calderon
what are you using for the mm parameter? if you set it to 1, only one
word has to match.
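
for reference, the example solrconfig ships with this dismax default,
which requires every term on short queries and relaxes as the query
gets longer:

<str name="mm">2&lt;-1 5&lt;-2 6&lt;90%</str>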

On 03/01/2010 05:07 PM, Steve Reichgut wrote:
***Sorry if this was sent twice. I had connection problems here and it 
didn't look like the first time it went out


I have been testing out results for some basic queries using both the 
Standard and DisMax query parsers. The results though aren't what I 
expected and am wondering if I am misundertanding how the DisMax query 
parser works.


For example, let's say I am doing a basic search for "Apache Solr" 
across a single field = Field 1 using the Standard parser. My results 
are exactly what I expected. Any document that includes either 
"Apache" or "Solr" or "Apache Solr" in Field 1 is listed with priority 
given to those that include both words.


Now, if I do the same search for "Apache Solr" across multiple fields 
- Field 1, Field 2 - using DisMax, I would expect basically the same 
results. The results should include any document that has one or both 
words in Field 1 or Field 2.


When I run that query in DisMax though, it only returns the documents 
that have BOTH words included which in my sample set only includes 1 
or 2 documents. I thought that, by default, DisMax should make both 
words optional so I am confused as to why I am only getting such a 
small subset.


Can anyone shed some light on what I am doing wrong or if I am 
misunderstanding how DisMax works.


Thanks,
Steve




Re: Issue on stopword list

2010-03-02 Thread Joe Calderon
or you can try the commongrams filter, which combines stopwords with their adjacent tokens

On Tue, Mar 2, 2010 at 6:56 AM, Walter Underwood  wrote:
> Don't remove stopwords if you want to search on them. --wunder
>
> On Mar 2, 2010, at 5:43 AM, Erick Erickson wrote:
>
>> This is a classic problem with Stopword removal. Have you tried
>> just removing stopwords from the indexing definition and the
>> query definition and reindexing?
>>
>> You can't search on them no matter what you do if they've
>> been removed, they just aren't there
>>
>> HTH
>> Erick
>>
>> On Tue, Mar 2, 2010 at 5:47 AM, Suram  wrote:
>>
>>>
>>> Hi,
>>>
>>> How can i search using stopword my query like this
>>>
>>> This             - 0 results becuase it is a stopword
>>> is                 - 0 results becuase it is a stopword
>>> that             - 0 results becuase it is a stopword
>>>
>>> if i search like  "This is that" - it must give the result
>>>
>>> for that i need to change anything in my schema file to get result "This is
>>> that"
>>> --
>>> View this message in context:
>>> http://old.nabble.com/Issue-on-stopword-list-tp27754434p27754434.html
>>> Sent from the Solr - User mailing list archive at Nabble.com.
>>>
>>>
>
>


Re: indexing a huge data

2010-03-05 Thread Joe Calderon
ive found the csv update to be exceptionally fast, though others enjoy
the flexibility of the data import handler
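
e.g.:

curl 'http://localhost:8983/solr/update/csv?commit=true' --data-binary @data.csv -H 'Content-type: text/plain; charset=utf-8'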

On Fri, Mar 5, 2010 at 10:21 AM, Mark N  wrote:
> what should be the fastest way to index a documents , I am indexing huge
> collection of data after extracting certain meta - data information
> for example author and filename of each files
>
> i am extracting these information and storing in XML format
>
> for example :    1abc 
> abc.doc
>                      2abc 
> abc1.doc
>
> I can not index these documents directly to solr as it is not in the format
> required by solr ( i can not change the format as its used in other modules)
>
> should converting these file to CSV will be better and faster approach
> compared to XML?
>
>
>
> please  suggest
>
>
>
>
> --
> Nipen Mark
>


Re: Highlighting

2010-03-09 Thread Joe Calderon
did u enable the highlighting component in solrconfig.xml? try setting
debugQuery=true to see if the highlighting component is even being
called...

On Tue, Mar 9, 2010 at 12:23 PM, Lee Smith  wrote:
> Hey All
>
> I have indexed a whole bunch of documents and now I want to search against 
> them.
>
> My search is going great all but highlighting.
>
> I have these items set
>
> hl=true
> hl.snippets=2
> hl.fl = attr_content
> hl.fragsize=100
>
> Everything works apart from the highlighted text found not being surrounded 
> with a <em>
>
> Am I missing a setting ?
>
> Lee


Re: Highlighting

2010-03-10 Thread Joe Calderon
just to make sure were on the same page, youre saying that the
highlight section of the response is empty right? the results section
is never highlighted but a separate section contains the highlighted
fields specified in hl.fl=

On Wed, Mar 10, 2010 at 5:23 AM, Ahmet Arslan  wrote:
>
>
>> Yes Content is stored and I get same
>> results adding that parameter.
>>
>> Still not highlighting the content :-(
>>
>> Any other ideas
>
> What is the field type of attr_content? And what is your query?
>
> Are you running your query on another field and then requesting snippets from
> attr_content?
>
> q:attr_content:somequery&hl=true&hl.fl=attr_content&hl.maxAnalyzedChars=-1 
> should return highlighting.
>
>
>
>


Re: Highlighting

2010-03-10 Thread Joe Calderon
no, thats not the case, see this example response in json format:
{
 "responseHeader":{
  "status":0,
  "QTime":0,
  "params":{
"indent":"on",
"q":"title_edge:fami",
"hl.fl":"title_edge",
"wt":"json",
"hl":"on",
"rows":"1"}},
 "response":{"numFound":18,"start":0,"docs":[
{
 "title_id":1581,
 "title_edge":"Family",
 "num":4}]
 },
 "highlighting":{
  "1581":{
"title_edge":["Family"]}}



see how the highlight info is separate from the results?

On Wed, Mar 10, 2010 at 7:44 AM, Lee Smith  wrote:
> Im getting results no problem with the query.
>
> But from what I believe it should wrap <em> around the text in the result.
>
> So if I search ie Andrew within the returned content I would have the
> contents with the word <em>Andrew</em>
>
> and hl.fl=attr_content
>
> Thank you for you help
>
> Begin forwarded message:
>
>> From: Joe Calderon 
>> Date: 10 March 2010 15:37:35 GMT
>> To: solr-user@lucene.apache.org
>> Subject: Re: Highlighting
>> Reply-To: solr-user@lucene.apache.org
>>
>> just to make sure were on the same page, youre saying that the
>> highlight section of the response is empty right? the results section
>> is never highlighted but a separate section contains the highlighted
>> fields specified in hl.fl=
>>
>> On Wed, Mar 10, 2010 at 5:23 AM, Ahmet Arslan  wrote:
>>>
>>>
>>>> Yes Content is stored and I get same
>>>> results adding that parameter.
>>>>
>>>> Still not highlighting the content :-(
>>>>
>>>> Any other ideas
>>>
>>> What is the field type of attr_content? And what is your query?
>>>
>>> Are you running your query on another field and then requesting snippets 
>>> from
>>> attr_content?
>>>
>>> q:attr_content:somequery&hl=true&hl.fl=attr_content&hl.maxAnalyzedChars=-1 
>>> should return highlighting.
>>>
>>>
>>>
>>>
>
>


Re: Need help in deploying the modified SOLR source code

2010-03-12 Thread Joe Calderon
do `ant clean dist` within the solr source and use the resulting war 
file, though in the future you might think about extending the built in 
parser and creating a parser plugin rather than modifying the actual sources


see http://wiki.apache.org/solr/SolrPlugins#QParserPlugin for more info
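
a rough sketch of that route for the leading-wildcard case (untested;
class names made up), registered in solrconfig.xml with
<queryParser name="mine" class="com.example.MyQParserPlugin"/> and used
via defType=mine or {!mine}:

import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.search.Query;
import org.apache.solr.common.params.SolrParams;
import org.apache.solr.common.util.NamedList;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.schema.IndexSchema;
import org.apache.solr.search.QParser;
import org.apache.solr.search.QParserPlugin;
import org.apache.solr.search.SolrQueryParser;

public class MyQParserPlugin extends QParserPlugin {
  public void init(NamedList args) {}

  public QParser createParser(String qstr, SolrParams localParams,
                              SolrParams params, SolrQueryRequest req) {
    return new QParser(qstr, localParams, params, req) {
      public Query parse() throws ParseException {
        IndexSchema schema = getReq().getSchema();
        SolrQueryParser p =
            new SolrQueryParser(schema, schema.getDefaultSearchFieldName());
        p.setAllowLeadingWildcard(true); // the change that otherwise needs a source edit
        return p.parse(qstr);
      }
    };
  }
}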

--joe
On 03/12/2010 07:34 PM, JavaGuy84 wrote:

Hi,

I had made some changes to solrqueryparser.java using Eclipse and I am able
to do a leading wildcard search using Jetty plugin (downloaded this plugin
for eclipse).. Now I am not sure how I can package this code and redploy it.
Can someone help me out please?

Thanks,
B
   




how to create this highlighter behaviour

2010-03-29 Thread Joe Calderon
Hello *, I've been using the highlighter and have been pretty happy with
its results; however, there's an edge case I'm not sure how to fix.

for query: amazing grace

the record matched and highlighted is:
<em>amazing</em> rendition of <em>amazing</em> <em>grace</em>

Is there any way to only highlight "amazing grace" without using phrase
queries? Can I modify the highlighter components to only use terms
once and to favor contiguous sections?

I don't want to enforce phrase queries, as sometimes I do want terms
highlighted out of order, but I only want each matched term highlighted
once.


Does this make sense?


highlighter issue

2010-04-02 Thread Joe Calderon
Hello *, I have a field that indexes the string "the
ex-girlfriend" as these tokens: [the, exgirlfriend, ex, girlfriend].
They are then passed through the edgengram filter; this allows me to match
different user spellings and allows for partial highlighting. However,
a token like 'ex' would get generated twice, which should be fine,
except the highlighter seems to highlight that token twice even though
it has the same offsets (4,6).

Is there a way to make the highlighter not highlight the same token
twice, or do I have to create a token filter that would drop tokens
with equal text and offsets?


Basically, what's happening now is: if I search

'the e', I get:
'Seinfeld <em>The</em> <em>E</em><em>E</em>x-Girlfriend'

for 'the ex', I get:
'Seinfeld <em>The</em> <em>Ex</em><em>Ex</em>-Girlfriend'

and so on


thx much

--joe


Re: highlighter issue

2010-04-02 Thread Joe Calderon
I had tried it earlier with no effect. When I looked at the source, it
doesn't look at offsets at all, just position increments, so short of
somebody finding a better way I'm going to create a similar filter that
compares offsets...
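
Something along these lines is what I have in mind -- an untested sketch
against the Lucene 2.9 / Solr 1.4 analysis API (class names made up; it
only catches back-to-back duplicates):

// DedupeOffsetsFilter.java
import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

public final class DedupeOffsetsFilter extends TokenFilter {
  private final TermAttribute termAtt = addAttribute(TermAttribute.class);
  private final OffsetAttribute offsetAtt = addAttribute(OffsetAttribute.class);
  private String lastTerm;
  private int lastStart = -1, lastEnd = -1;

  public DedupeOffsetsFilter(TokenStream input) {
    super(input);
  }

  public boolean incrementToken() throws IOException {
    while (input.incrementToken()) {
      String term = termAtt.term();
      int start = offsetAtt.startOffset();
      int end = offsetAtt.endOffset();
      // drop a token identical in text and offsets to the one before it
      if (term.equals(lastTerm) && start == lastStart && end == lastEnd) {
        continue;
      }
      lastTerm = term;
      lastStart = start;
      lastEnd = end;
      return true;
    }
    return false;
  }

  public void reset() throws IOException {
    super.reset();
    lastTerm = null;
    lastStart = lastEnd = -1;
  }
}

// DedupeOffsetsFilterFactory.java -- hooks the filter into a schema.xml chain
import org.apache.lucene.analysis.TokenStream;
import org.apache.solr.analysis.BaseTokenFilterFactory;

public class DedupeOffsetsFilterFactory extends BaseTokenFilterFactory {
  public TokenStream create(TokenStream input) {
    return new DedupeOffsetsFilter(input);
  }
}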

On Fri, Apr 2, 2010 at 2:07 PM, Erik Hatcher  wrote:
> Will adding the RemoveDuplicatesTokenFilter(Factory) do the trick here?
>
>        Erik
>
> On Apr 2, 2010, at 4:13 PM, Joe Calderon wrote:
>
>> Hello *, I have a field that indexes the string "the
>> ex-girlfriend" as these tokens: [the, exgirlfriend, ex, girlfriend].
>> They are then passed through the edgengram filter; this allows me to match
>> different user spellings and allows for partial highlighting. However,
>> a token like 'ex' would get generated twice, which should be fine,
>> except the highlighter seems to highlight that token twice even though
>> it has the same offsets (4,6).
>>
>> Is there a way to make the highlighter not highlight the same token
>> twice, or do I have to create a token filter that would drop tokens
>> with equal text and offsets?
>>
>>
>> Basically, what's happening now is: if I search
>>
>> 'the e', I get:
>> 'Seinfeld <em>The</em> <em>E</em><em>E</em>x-Girlfriend'
>>
>> for 'the ex', I get:
>> 'Seinfeld <em>The</em> <em>Ex</em><em>Ex</em>-Girlfriend'
>>
>> and so on
>>
>>
>> thx much
>>
>> --joe
>
>


synonym filter and offsets

2010-04-19 Thread Joe Calderon
Hello *, I'm having issues with the synonym filter altering token offsets.

My input text is
"saturday night live"
It is tokenized by the whitespace tokenizer, yielding 3 tokens:
[saturday, 0,8], [night, 9,14], [live, 15,19]

On indexing these are passed through a synonym filter that has this line:
saturday night live => snl, saturday night live


I now end up with four tokens:
[saturday, 0,19], [snl, 0,19], [night, 0,19], [live, 0,19]

What I want is:
[saturday, 0,8], [snl, 0,19], [night, 9,14], [live, 15,19]


When using the highlighter I want only the relevant part of the text
highlighted. How can I fix my filter chain?
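
For reference, a minimal field type reproducing the chain described above
(untested; Solr 1.4 factory names assumed, the text_syn name is made up):

<fieldType name="text_syn" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- expand multi-word synonyms at index time only -->
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  </analyzer>
</fieldType>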


thx much
--joe


Re: Field Collapsing: How to estimate total number of hits

2010-05-12 Thread Joe Calderon
I don't know if it's the best solution, but I have a field I facet on
called type (it's either 0 or 1); combined with collapse.facet=before I just
sum all the values of the facet field to get the total number found.

If you don't have such a field you can always add a field with a single value.

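The summing step in SolrJ -- an untested sketch that assumes the SOLR-236
field-collapsing patch is installed and borrows the manu_exact / type field
names from the examples above:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.FacetField;
import org.apache.solr.client.solrj.response.QueryResponse;

public class CollapseCount {
  public static void main(String[] args) throws Exception {
    CommonsHttpSolrServer server =
        new CommonsHttpSolrServer("http://localhost:8983/solr");

    SolrQuery q = new SolrQuery("*:*");
    q.set("collapse.field", "manu_exact");
    q.set("collapse.facet", "before"); // facet counts computed pre-collapse
    q.setFacet(true);
    q.addFacetField("type");

    QueryResponse rsp = server.query(q);

    // numFound is unreliable under collapsing, so sum the facet counts instead
    long total = 0;
    FacetField ff = rsp.getFacetField("type");
    if (ff != null && ff.getValues() != null) {
      for (FacetField.Count count : ff.getValues()) {
        total += count.getCount();
      }
    }
    System.out.println("total hits before collapsing: " + total);
  }
}
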
--joe

On Wed, May 12, 2010 at 10:41 AM, Sergey Shinderuk  wrote:
> Hi, fellows!
>
> I use field collapsing to collapse near-duplicate documents based on
> document fuzzy signature calculated at index time.
> The problem is that, when field collapsing is enabled, in query
> response numFound is equal to the number of rows requested.
>
> For instance, with the Solr example schema I can issue the following query:
>
> http://localhost:8983/solr/select?q=*:*&rows=3&collapse.field=manu_exact
>
> In response i get collapse_counts together with ordinary result list,
> but numFound equals 3.
> As far as I understand, this is due to the way field collapsing works.
>
> I want to show the total number of hits to the user and provide a
> pagination through the results.
>
> Any ideas?
>
> Regards,
> Sergey Shinderuk
>


Re: how to have "shards" parameter by default

2010-06-10 Thread Joe Calderon
You've created an infinite loop: the shard you query calls all the other
shards and itself, and so on. Create a separate requestHandler and
query that instead, e.g.:


 
<requestHandler name="distrib" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="shards">localhost:7500/solr,localhost:7501/solr,localhost:7502/solr,localhost:7503/solr,localhost:7504/solr,localhost:7505/solr,localhost:7506/solr</str>
  </lst>
  <arr name="last-components">
    <str>facet</str>
    <str>debug</str>
  </arr>
</requestHandler>

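Then hit that handler explicitly (assuming it's registered as name="distrib"
like in the sketch above):

http://localhost:8983/solr/select?qt=distrib&q=foo

If I remember right, the per-shard sub-requests go to the default handler
unless shards.qt says otherwise, and since the default no longer carries a
shards parameter, the request doesn't recurse.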

On Wed, Jun 9, 2010 at 9:10 PM, Scott Zhang  wrote:
> I tried putting "shards" into the default request handler.
> But now each time I search, Solr hangs forever.
> So what's the correct solution?
>
> Thanks.
>
> <requestHandler name="standard" class="solr.SearchHandler" default="true">
>   <lst name="defaults">
>     <str name="echoParams">explicit</str>
>     <int name="rows">10</int>
>     <str name="fl">*</str>
>     <str name="version">2.1</str>
>     <str name="shards">localhost:7500/solr,localhost:7501/solr,localhost:7502/solr,localhost:7503/solr,localhost:7504/solr,localhost:7505/solr,localhost:7506/solr</str>
>   </lst>
> </requestHandler>
>
> On Thu, Jun 10, 2010 at 11:48 AM, Scott Zhang wrote:
>
>> Hi. I am running distributed search on solr.
>> I have 70 solr instances. So each time I want to search I need to use
>> ?shards=localhost:7500/solr,localhost..7620/solr
>>
>> It is very long url.
>>
>> So how can I encode shards into the config file so I don't need to type it
>> each time?
>>
>>
>> thanks.
>> Scott
>>
>