Re: Posting pdf file and posting from remote

2010-02-11 Thread alendo

Thanks a lot: this tip was very important for me.
I tried with PHP curl, with the purpose of sending from Windows to Mac OS. After
one day I discovered that the @filename doesn't work on Windows; the error
was "26 failed creating formpost data", and the reason is that Windows PHP
curl (I don't know where the bug is) is not able to open the file passed as
@filename. PHP version 5.2.4.
I tried:
<?php
 $ch = curl_init('http://localhost:8010/solr/update/extract?literal.id=doc2&commit=true');
 curl_setopt ($ch, CURLOPT_POST, 1);
 curl_setopt ($ch, CURLOPT_POSTFIELDS, array('myfile'=>"@paper.pdf"));
 $result = curl_exec ($ch);
?>
and it works fine: I hope it'll work also from a remote Linux server.


Lance Norskog-2 wrote:
> 
> stream.file= means read a local file from the server that solr runs
> on. It has to be a complete path that works from that server. To load
> the file over HTTP you have to use @filename to have curl open it.
> This path has to work from the program you run curl on, and relative
> paths work.
> 
> Also, tika does not save the PDF binary, it only pulls words out of
> the PDF and stores those.
> 
> There's a tika example in solr/trunk/example/exampleDIH in the current
> solr trunk. (I don't remember if it's in the solr 1.4 release.) With
> this you can save the pdf binary in one field and save the extracted
> text in another field. I'm doing this now with html.
> 
> On Tue, Feb 9, 2010 at 2:08 AM, alendo wrote:
>>
>> OK, I'm making progress (maybe :).
>> I tried another curl command to send the file from remote:
>>
>> http://mysolr:/solr/update/extract?literal.id=8514&stream.file=files/attach-8514.pdf&stream.contentType=application/pdf
>>
>> and the behaviour has been changed: now I get an error in solr log file:
>>
>> HTTP Status 500 - files/attach-8514.pdf (No such file or directory)
>> java.io.FileNotFoundException: files/attach-8514.pdf (No such file or
>> directory) at java.io.FileInputStream.open(Native Method) at
>> java.io.FileInputStream.<init>(FileInputStream.java:106) at
>> org.apache.solr.common.util.ContentStreamBase$FileStream.getStream(ContentStreamBase.java:108)
>> at
>> org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:158)
>> at
>> org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54)
>> at
>> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
>> at
>> org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:233)
>> at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316) at
>>
>> etc etc...
>>
>> --
>> View this message in context:
>> http://old.nabble.com/Posting-pdf-file-and-posting-from-remote-tp27512455p27512952.html
>> Sent from the Solr - User mailing list archive at Nabble.com.
>>
>>
> 
> 
> 
> -- 
> Lance Norskog
> goks...@gmail.com
> 
> 

-- 
View this message in context: 
http://old.nabble.com/Posting-pdf-file-and-posting-from-remote-tp27512455p27543540.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: The Riddle of the Underscore and the Dollar Sign . . .

2010-02-11 Thread Ahmet Arslan
> 1) How can I get rid of underscores ('_') without using the
> WordDelimiterFilter (which gets rid of other syntax I need)?

Before the TokenizerFactory you can apply a <charFilter
class="solr.MappingCharFilterFactory" mapping="mapping.txt"/> that will replace
"_" with " " or "" depending on your needs.

mapping.txt will contain:

"_" => "" or 
"_" => " " 


  


continuously creating index packages for katta with solr

2010-02-11 Thread Thomas Koch
Hi,

I'd like to use SOLR to create indices for deployment with katta. I'd like to 
install a SOLR server on each crawler. The crawling script then sends the 
content directly to the local SOLR server. Every 5-10 minutes I'd like to take 
the current SOLR index, add it to katta and let SOLR start with an empty index 
again.
Does anybody have an idea how this could be achieved?

Thanks a lot,

Thomas Koch, http://www.koch.ro


Re: hl.maxAlternateFieldLength defaults in solrconfig.xml

2010-02-11 Thread Ahmet Arslan

> It appears the hl.maxAlternateFieldLength parameter default setting in
> solrconfig.xml does not take effect.

Where did you put it? In the <highlighting> ... </highlighting> section?

You need to put it into the defaults of your request handler:

<requestHandler name="standard" class="solr.SearchHandler" default="true">
  <lst name="defaults">
    <int name="hl.maxAlternateFieldLength">10</int>
  </lst>
</requestHandler>


  


Re: dismax and multi-language corpus

2010-02-11 Thread Claudio Martella
I'll try removing the '-'. I do need to search it now. The other option
would be to ask the user which language to query, but in my region we use
Italian and German in equal measure, so it would end up querying both
languages all the time. Or did you mean a more performant way of querying
both languages all the time? :)


Otis Gospodnetic wrote:
> Claudio - fields with '-' in them can be problematic.
>
> Side comment: do you really want to search across all languages at once?  If 
> not, maybe 3 different dismax configs would make your searches better.
>
>  Otis
> 
> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> Hadoop ecosystem search :: http://search-hadoop.com/
>
>
>
> - Original Message 
>   
>> From: Claudio Martella 
>> To: solr-user@lucene.apache.org
>> Sent: Wed, February 10, 2010 3:15:40 PM
>> Subject: dismax and multi-language corpus
>>
>> Hello list,
>>
>> I have a corpus with 3 languages, so I set up a text content field (with
>> no stemming) and 3 text-[en|it|de] fields with specific Snowball stemmers.
>> I copyField the text to my language-aware fields. So, I set up this dismax
>> searchHandler:
>>
>>
>>
>> <requestHandler name="dismax" class="solr.SearchHandler">
>>  <lst name="defaults">
>>   <str name="defType">dismax</str>
>>   <str name="qf">title^1.2 content-en^0.8 content-it^0.8 content-de^0.8</str>
>>   <str name="pf">title^1.2 content-en^0.8 content-it^0.8 content-de^0.8</str>
>>   <str name="bf">title^1.2 content-en^0.8 content-it^0.8 content-de^0.8</str>
>>   <float name="tie">0.1</float>
>>  </lst>
>> </requestHandler>
>>
>>
>>
>>
>> but i get this error:
>>
>> HTTP Status 400 - org.apache.lucene.queryParser.ParseException: Expected
>> ',' at position 7 in 'content-en'
>>
>> type Status report
>>
>> message org.apache.lucene.queryParser.ParseException: Expected ',' at
>> position 7 in 'content-en'
>>
>> description The request sent by the client was syntactically incorrect
>> (org.apache.lucene.queryParser.ParseException: Expected ',' at position
>> 7 in 'content-en').
>>
>> Any idea?
>>
>> TIA
>>
>> Claudio
>>
>> -- 
>> Claudio Martella
>> Digital Technologies
>> Unit Research & Development - Analyst
>>
>> TIS innovation park
>> Via Siemens 19 | Siemensstr. 19
>> 39100 Bolzano | 39100 Bozen
>> Tel. +39 0471 068 123
>> Fax  +39 0471 068 129
>> claudio.marte...@tis.bz.it http://www.tis.bz.it
>>
>> Short information regarding use of personal data. According to Section 13 of 
>> Italian Legislative Decree no. 196 of 30 June 2003, we inform you that we 
>> process your personal data in order to fulfil contractual and fiscal 
>> obligations 
>> and also to send you information regarding our services and events. Your 
>> personal data are processed with and without electronic means and by 
>> respecting 
>> data subjects' rights, fundamental freedoms and dignity, particularly with 
>> regard to confidentiality, personal identity and the right to personal data 
>> protection. At any time and without formalities you can write an e-mail to 
>> priv...@tis.bz.it in order to object the processing of your personal data 
>> for 
>> the purpose of sending advertising materials and also to exercise the right 
>> to 
>> access personal data and other rights referred to in Section 7 of Decree 
>> 196/2003. The data controller is TIS Techno Innovation Alto Adige, Siemens 
>> Street n. 19, Bolzano. You can find the complete information on the web site 
>> www.tis.bz.it.
>> 
>
>
>   


-- 
Claudio Martella
Digital Technologies
Unit Research & Development - Analyst

TIS innovation park
Via Siemens 19 | Siemensstr. 19
39100 Bolzano | 39100 Bozen
Tel. +39 0471 068 123
Fax  +39 0471 068 129
claudio.marte...@tis.bz.it http://www.tis.bz.it

Short information regarding use of personal data. According to Section 13 of 
Italian Legislative Decree no. 196 of 30 June 2003, we inform you that we 
process your personal data in order to fulfil contractual and fiscal 
obligations and also to send you information regarding our services and events. 
Your personal data are processed with and without electronic means and by 
respecting data subjects' rights, fundamental freedoms and dignity, 
particularly with regard to confidentiality, personal identity and the right to 
personal data protection. At any time and without formalities you can write an 
e-mail to priv...@tis.bz.it in order to object the processing of your personal 
data for the purpose of sending advertising materials and also to exercise the 
right to access personal data and other rights referred to in Section 7 of 
Decree 196/2003. The data controller is TIS Techno Innovation Alto Adige, 
Siemens Street n. 19, Bolzano. You can find the complete information on the web 
site www.tis.bz.it.




Re: Need a bit of help, Solr 1.4: type "text".

2010-02-11 Thread Sven Maurmann

Hi,

the parameter for WordDelimiterFilterFactory is catenateAll;
you should set it to 1.
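
For reference, a minimal sketch of the relevant filter line in schema.xml
(the other attribute values are just illustrative, not a recommendation):

<filter class="solr.WordDelimiterFilterFactory"
        generateWordParts="1" generateNumberParts="1"
        catenateWords="0" catenateNumbers="0"
        catenateAll="1" splitOnCaseChange="1"/>

With catenateAll="1", "13th" is indexed as "13" and "th" plus the re-joined
token "13th", so a query for "13th" can match again.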

Cheers,
Sven

--On Wednesday, 10 February 2010 16:37 -0800 Yu-Shan Fung 
 wrote:



Check out the configuration of WordDelimiterFilterFactory in your
schema.xml.

Depending on your settings, it's probably tokenizing 13th into "13" and
"th". You can also have them concatenated back into a single token, but I
can't remember the exact parameter. I think it could be catenateAll.



On Wed, Feb 10, 2010 at 4:32 PM, Dickey, Dan 
wrote:


I'm using the standard "text" type for a field, and part of the data
being indexed is "13th", as in "Friday the 13th".

I can't seem to get it to match when I'm querying for "Friday the 13th"
either quoted or not.

One thing that does match is "13 th" if I send the search query with a
space between...

Any suggestions?

I know this is short on detail, but it's been a long day... time to get
outta here.

Thanks for any and all help.

   -Dan




This message contains information which may be confidential and/or
privileged. Unless you are the intended recipient (or authorized to
receive for the intended recipient), you may not read, use, copy or
disclose to anyone the message or any information contained in the
message. If you have received the message in error, please advise the
sender by reply e-mail and delete the message and any attachment(s)
thereto without retaining any copies.





--
"When nothing seems to help, I go look at a stonecutter hammering away
at his rock perhaps a hundred times without as much as a crack showing in
it. Yet at the hundred and first blow it will split in two, and I know it
was not that blow that did it, but all that had gone before." — Jacob
Riis


Re: Cannot get like exact searching to work

2010-02-11 Thread Ahmet Arslan
> I am using SOLR 1.3 and my server is
> embedded and accessed using SOLRJ.
> I would like to setup my searches so that exact matches are
> the first
> results returned, followed by near matches, and finally
> token based
> matches.
> For example, if I have a summary field in schema which is
> created
> using copyField from a bunch of other fields:
> "My item title, keyword, other, stuff"
> 
> I want this search to match the item above first and
> foremost:
> 1) "My item title*"
> 
> Then this one:
> 2) "my item*"

Wildcards inside phrases are not supported by default. You can use SOLR-1604 
for that in solr 1.4.0. But I am not sure it will work with 1.3. Can you try?

> I tried creating a field to hold exact match data
> (summaryExact) which
> actually works if I paste in the precise text but stops
> working as
> soon as I add any wildcard to it. 

Your summaryExact field should use the string field type, which is not
analyzed/tokenized. The string definition is:

<fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/>

> I could not quite figure out which tokenizer to use if I
> don't want
> any tokens created but just want to trim and lowercase the
> string so
> let me know if you have ideas on this.

A KeywordTokenizerFactory + TrimFilterFactory + LowerCaseFilterFactory
combination can do that, but punctuation won't be removed between tokens.
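
For illustration, a sketch of such a field type (the name is arbitrary):

<fieldType name="string_lowercase" class="solr.TextField" sortMissingLast="true">
  <analyzer>
    <!-- KeywordTokenizer emits the entire input as one token -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.TrimFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>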
 
> Basically, I want something
> similar to DB "like" matching without case sensitivity and
> probably
> trimmed as well. I don't really want the field to be
> tokenized though.

Your examples suggest you want to search for something like startsWith? Can you
explain in more detail?

Also your <fieldtype> declaration is wrong. It should use
class="solr.TextField".


  


Re: implementing profanity detector

2010-02-11 Thread Alexey Serba
> - A TokenFilter would allow me to tap into the existing analysis pipeline so
> I get the tokens for free but I can't access the document.
https://issues.apache.org/jira/browse/SOLR-1536

On Fri, Jan 29, 2010 at 12:46 AM, Mike Perham  wrote:
> We'd like to implement a profanity detector for documents during indexing.
>  That is, given a file of profane words, we'd like to be able to mark a
> document as safe or not safe if it contains any of those words so that we
> can have something similar to google's safe search.
>
> I'm trying to figure out how best to implement this with Solr 1.4:
>
> - An UpdateRequestProcessor would allow me to dynamically populate a "safe"
> boolean field but requires me to pull out the content, tokenize it and run
> each token through my set of profanities, essentially running the analysis
> pipeline again.  That's a lot of overhead AFAIK.
>
> - A TokenFilter would allow me to tap into the existing analysis pipeline so
> I get the tokens for free but I can't access the document.
>
> Any suggestions on how to best implement this?
>
> Thanks in advance,
> mike
>


sorting

2010-02-11 Thread Claudio Martella
Hi,

I defined a requestHandler like this:

<requestHandler name="dismax" class="solr.SearchHandler">
 <lst name="defaults">
  <str name="defType">dismax</str>
  <str name="qf">title^1.2 contentEN^0.8 contentIT^0.8 contentDE^0.8</str>
  <str name="pf">title^1.2 contentEN^0.8 contentIT^0.8 contentDE^0.8</str>
  <str name="bf">title^1.2 contentEN^0.8 contentIT^0.8 contentDE^0.8</str>
  <float name="tie">0.1</float>
 </lst>
</requestHandler>




The content* fields are tokenized. The content comes from Nutch. As it is
now, Solr is complaining about some sorting issues on content* because they
are tokenized. From my perspective I have not overridden any scoring or
ordering. Have I?


As the content comes from Nutch with solrindex, what is the best way of
integrating result ordering based on the graph-based information, and
not only on the score based on query/content?

Thanks

-- 
Claudio Martella
Digital Technologies
Unit Research & Development - Analyst

TIS innovation park
Via Siemens 19 | Siemensstr. 19
39100 Bolzano | 39100 Bozen
Tel. +39 0471 068 123
Fax  +39 0471 068 129
claudio.marte...@tis.bz.it http://www.tis.bz.it

Short information regarding use of personal data. According to Section 13 of 
Italian Legislative Decree no. 196 of 30 June 2003, we inform you that we 
process your personal data in order to fulfil contractual and fiscal 
obligations and also to send you information regarding our services and events. 
Your personal data are processed with and without electronic means and by 
respecting data subjects' rights, fundamental freedoms and dignity, 
particularly with regard to confidentiality, personal identity and the right to 
personal data protection. At any time and without formalities you can write an 
e-mail to priv...@tis.bz.it in order to object the processing of your personal 
data for the purpose of sending advertising materials and also to exercise the 
right to access personal data and other rights referred to in Section 7 of 
Decree 196/2003. The data controller is TIS Techno Innovation Alto Adige, 
Siemens Street n. 19, Bolzano. You can find the complete information on the web 
site www.tis.bz.it.




Re: Cannot get like exact searching to work

2010-02-11 Thread Aaron Zeckoski
On Thu, Feb 11, 2010 at 8:39 AM, Ahmet Arslan  wrote:
>> I am using SOLR 1.3 and my server is
>> embedded and accessed using SOLRJ.
>> I would like to setup my searches so that exact matches are
>> the first
>> results returned, followed by near matches, and finally
>> token based
>> matches.
>> For example, if I have a summary field in schema which is
>> created
>> using copyField from a bunch of other fields:
>> "My item title, keyword, other, stuff"
>>
>> I want this search to match the item above first and
>> foremost:
>> 1) "My item title*"
>>
>> Then this one:
>> 2) "my item*"
>
> Wildcards inside phrases are not supported by default. You can use SOLR-1604 
> for that in solr 1.4.0. But I am not sure it will work with 1.3. Can you try?

I might be able to try this out though in general the project has a
policy about only using released code (no trunk/unstable).
https://issues.apache.org/jira/browse/SOLR-1604
It looks like the kind of searching I want to do is not really
supported in SOLR by default though. Is that correct?


>> I tried creating a field to hold exact match data
>> (summaryExact) which
>> actually works if I paste in the precise text but stops
>> working as
>> soon as I add any wildcard to it.
>
> Your summaryExact field should use the string field type, which is not
> analyzed/tokenized. The string definition is:
>
> <fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/>

I thought that was what my exact definition was doing except I also
want the exact field to be lowercased and trimmed (which I don't want
for all strings). Can you explain what is wrong with the current
definition so I can fix it?


>> I could not quite figure out which tokenizer to use if I
>> don't want
>> any tokens created but just want to trim and lowercase the
>> string so
>> let me know if you have ideas on this.
>
> A KeywordTokenizerFactory + TrimFilterFactory + LowerCaseFilterFactory
> combination can do that, but punctuation won't be removed between tokens.
>
>> Basically, I want something
>> similar to DB "like" matching without case sensitivity and
>> probably
>> trimmed as well. I don't really want the field to be
>> tokenized though.
>
> Your examples suggest you want to search for something like startsWith? Can you
> explain in more detail?

What I really want is the equivalent of a match like this along with
the normal tokenized matching (where the query has been lowercased and
trimmed as well):
select * from blah where lowercase(column) like '%query%';
I think this is called a phrase match or something like that. However,
wildcards cannot be used at the beginning of query so I guess I can
live with only being able to startsWith type matching until that is
fixed. For now I have tried to do that using this:
query = (summary:"my item" || summaryExact:"my item*"^3)
but I would do this if I could:
query = (summary:"my item" || summaryExact:"*my item*"^3)

The idea is that a "phrase" match would be boosted over the normal
token matches and would show up first in the listing. Let me know if
more examples would help. I am happy to provide them.


> Also your <fieldtype> declaration is wrong. It should use
> class="solr.TextField".

OK, I will see if I can figure out how to correct that.

Thanks for all the help so far
-AZ


-- 
Aaron Zeckoski (azeckoski (at) vt.edu)
Senior Research Engineer - CARET - University of Cambridge
https://twitter.com/azeckoski - http://www.linkedin.com/in/azeckoski
http://aaronz-sakai.blogspot.com/ - http://tinyurl.com/azprofile


Posting Concurrently to Solr

2010-02-11 Thread abhishes

Hello Everyone,

If I have a large data set which needs to be indexed, what strategy can I
take to build the index fast?

1. Split the input into multiple XML files and then open different shells
and post each of the split XML files? Will this work and help me build the
index faster than one large XML file?

2. What if I don't want to build the XML files at all? I want to write the
extraction logic in an ETL tool and then let the ETL tool send the commands
to Solr, then run my ETL tool in a multi-threaded manner where each thread
extracts the data from the backend and sends it to Solr for indexing.

3. Use the core feature and populate each core separately, then merge
the cores.

Any other approach?



-- 
View this message in context: 
http://old.nabble.com/Posting-Concurrently-to-Solr-tp27544311p27544311.html
Sent from the Solr - User mailing list archive at Nabble.com.



ExternalFileField

2010-02-11 Thread Julian Hille
Hi,

We're trying to implement another sort-by algorithm which is calculated
outside of our Solr server.
Is there a limit on the number of lines in that external file? Because we
sometimes have 1.5 million lines in some situations.
Also, is this a performance killer for 1.5 million rows?
Most of the other files have about 1000-3000 rows.

What happens if I'm writing that file while Solr tries to read it? Is there
some kind of timeout? Does it display nothing? Or does it wait until I
finish? Because sometimes the writing can take some time (especially in the
case when 1.5 million lines have to be written).

Last question about ExternalFileField: what happens if I start writing the
external file while Lucene tries to sort by the field? Any exception, or is
there a soft cache?



Thanks for your help,
Julian Hille


---
NetImpact KG
Altonaer Straße 8
20357 Hamburg

Tel: 040 / 6738363 2
Mail: jul...@netimpact.de

Managing Director: Tarek Müller



Re: Posting Concurrently to Solr

2010-02-11 Thread Vijayant Kumar
Why don't you try DIH (the DataImportHandler)?

http://wiki.apache.org/solr/DataImportHandler
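
For illustration, a minimal sketch of what a DIH data-config.xml can look
like (the driver, JDBC URL, table and column names are all assumptions made
up for this example):

<dataConfig>
  <dataSource driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost/mydb" user="user" password="pass"/>
  <document>
    <!-- one Solr document per row returned by the query -->
    <entity name="item" query="SELECT id, title FROM item">
      <field column="id" name="id"/>
      <field column="title" name="title"/>
    </entity>
  </document>
</dataConfig>

A full import is then triggered with /dataimport?command=full-import.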


Thank you,
Vijayant Kumar
Software Engineer
Website Toolbox Inc.
http://www.websitetoolbox.com
1-800-921-7803 x211
>
> Hello Everyone,
>
> If I have a large data set which needs to be indexed, what strategy I can
> take to build the index fast?
>
> 1. split the input into multiple xml files and then open different shells
> and post each of the split xml files? will this work and help me build the
> index faster than 1 large xml file?
>
> 2. What if I don't want to build the XML files at all. I want to write the
> extraction logic in an ETL tool and then let the ETL tool send the command
> to SOLR. then I run my ETL tool in a multi-threaded manner where each thread
> is extracting the data from the backend and sending it to Solr for indexing.
>
> 3. Use the Core Feature and then populate each core separately, then merge
> the cores.
>
> Any other approach?
>
>
>
> --
> View this message in context:
> http://old.nabble.com/Posting-Concurrently-to-Solr-tp27544311p27544311.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


-- 





Re: Question on Solr Scalability

2010-02-11 Thread abhishes

Thanks really useful article.

I am wondering about this statement in the article

"Keep in mind that Solr does not calculate universal term/doc frequencies.
At a large scale, its not likely  to matter that tf/idf is calculated at the
shard level - however, if your collection is heavily skewed in its
distribution across servers, you might take issue with the relevance
results. Its probably best to randomly distribute documents to your shards"

So if there is no universal tf/idf kept, then how does solr determine the
rank of two documents which came from different shards in a distributed
search query?

Regards,
Abhishek





Juan Pedro Danculovic-2 wrote:
> 
> To scale solr, take a look to this article
> 
> http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Scaling-Lucene-and-Solr
> 
> 
> 
> Juan Pedro Danculovic
> CTO - www.linebee.com
> 
> 
> On Thu, Feb 11, 2010 at 4:12 AM, abhishes  wrote:
> 
>>
>> Suppose I am indexing very large data (5 billion rows in a database)
>>
>> Now I want to use the Solr Core feature to split the index into
>> manageable
>> chunks.
>>
>> However I have two questions
>>
>>
>> 1. Can Cores reside on different physical servers?
>>
>> 2. when a query comes, will the query be answered by the index in 1 core
>> or the query will be sent to all the cores?
>>
>> My desire is to have a system which from outside appears as a single large
>> index... but inside it is multiple small indexes running on different
>> hardware machines.
>> --
>> View this message in context:
>> http://old.nabble.com/Question-on-Solr-Scalability-tp27543068p27543068.html
>> Sent from the Solr - User mailing list archive at Nabble.com.
>>
>>
> 
> 

-- 
View this message in context: 
http://old.nabble.com/Question-on-Solr-Scalability-tp27543068p27544436.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Copying dynamic fields into default text field messing up fieldNorm?

2010-02-11 Thread Jan Høydahl / Cominvent
This sounds like an ideal use case for payloads. You could attach a boost value 
to each term in your "keywords" field.
See 
http://www.lucidimagination.com/blog/2009/08/05/getting-started-with-payloads/
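
For reference, a sketch of a payload-capable field type, assuming the
DelimitedPayloadTokenFilterFactory that ships with Solr 1.4 (the field type
name is arbitrary):

<fieldType name="payloads" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- stores the part after '|' as a float payload on each token -->
    <filter class="solr.DelimitedPayloadTokenFilterFactory" encoder="float"/>
  </analyzer>
</fieldType>

The keywords would then be indexed as e.g. "liberal|4.0 politics|1.5 obama|2.0".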

Another common workaround is to create, say, nine multi-valued fields with boosts 
0.5, 1.0, 1.5, 2.0, 4.0, 8.0, 16.0, 32.0 and 64.0, and index each keyword into the 
kw-field whose boost is nearest to what you want. For your example, that 
could be:
kw05=, kw10=, kw15=politics;politicians, kw20=obama;barack, kw40=liberal, kw80= 
.

--
Jan Høydahl  - search architect
Cominvent AS - www.cominvent.com

On 10. feb. 2010, at 03.07, Yu-Shan Fung wrote:

> Hi All,
> 
> I'm trying to create an index of documents, where for each document, I am
> trying to associate with it a set of related keywords, each with individual
> boost values that I compute externally.
> 
> eg:
> Document Title: Democrats
>  related keywords:
>liberal: 4.0
>politics: 1.5
>obama: 2.0
>etc. (hundreds of related keywords)
> 
> Since boosts in solr is per field instead of per field-instance, I am trying
> to get around this by creating dynamic fields for each related keyword, and
> setting boost values accordingly. To be able to surface this document by
> searching the related keywords, I have the schema setup to copy these
> related keyword fields into the default text field.
> 
> But when I query any of these related keywords, I get back fieldNorms with
> the max value:
> 
>  1.5409492E10 = (MATCH) weight(text:liberal in 11), product of:
>0.8608541 = queryWeight(text:liberal), product of:
>  1.6840147 = idf(docFreq=109, maxDocs=218)
>  0.51119155 = queryNorm
>1.79002368E10 = (MATCH) fieldWeight(text:liberal in 11), product of:
>  1.4142135 = tf(termFreq(text:liberal)=2)
>  1.6840147 = idf(docFreq=109, maxDocs=218)
> 
> According to this email exchange between Koji and Mat Brown,
> 
> http://www.mail-archive.com/solr-user@lucene.apache.org/msg23759.html
> 
> The boost value from copyField's shouldn't be accumulated into the boost for
> the text field, can anyone else verify this? This seem to go against what
> I'm observing. When I turn off copyField, the fieldNorm goes back to normal
> (in the single digit range).
> 
> Any idea what could be causing this? I'm running Solr 1.4 in case that
> matters.
> 
> Any pointers/advice would be greatly appreciated! Thanks,
> Yu-Shan



Re: Question on Tokenizing email address

2010-02-11 Thread Jan Høydahl / Cominvent
My point is that I WANT the AT, DOT to be indexed, to avoid these being treated 
the same: foo-...@brown.fox and foo-bar.brown.fox
By using the LowerCaseFilterFactory before the replacements, you actually 
ensure that a search for email:at will not give a match because the query will 
be lower-cased and not match the indexed term "AT". For this reason I would not 
add the special tokens to stopword lists either, as you DO want them in the 
index.

--
Jan Høydahl  - search architect
Cominvent AS - www.cominvent.com

On 10. feb. 2010, at 08.34, abhishes wrote:

> 
> Thank you! it works very well.
> 
> I think that the field type suggested by you will index words like DOT, AT,
> com also
> 
> In order to prevent these words from getting indexed, I have changed the
> field type to
> 
> <fieldtype name="email" class="solr.TextField" positionIncrementGap="100">
>   <analyzer>
>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>     <filter class="solr.PatternReplaceFilterFactory" pattern="\." replacement=" DOT " replace="all" />
>     <filter class="solr.PatternReplaceFilterFactory" pattern="@" replacement=" AT " replace="all" />
>     <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="0"/>
>     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
>   </analyzer>
> </fieldtype>
> 
> I have added the words dot, com to the stoplist file (at was already there).
> 
> Is this correct?
> 
> -- 
> View this message in context: 
> http://old.nabble.com/Question-on-Tokenizing-email-address-tp27518673p27527033.html
> Sent from the Solr - User mailing list archive at Nabble.com.
> 



Re: Replication and querying

2010-02-11 Thread Jan Høydahl / Cominvent
Hi again,

I would still keep all fields in the original schema of the global Solr, just 
for the sake of simplicity.

For custom sort order, you can look at ExternalFileField which is a text file 
that you can add to your local Solr index independently of the pre-built index. 
However, this only supports float and cannot be returned in result.  
http://lucene.apache.org/solr/api/org/apache/solr/schema/ExternalFileField.html
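
For illustration, such a field can be declared along these lines in
schema.xml, following the stock example schema (the field name is arbitrary):

<fieldType name="file" class="solr.ExternalFileField" keyField="id"
           defVal="0" stored="false" indexed="false" valType="pfloat"/>
<field name="popularity" type="file"/>

The external file then lives in the index directory as external_popularity,
with one key=value line per document, e.g. doc1=12.5.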

The Solr replication does a binary copy of the index - i.e. no  way to change 
docs in input.
But if you instead replicate source XML feed from master to slaves, you can 
hook into that stream to modify/add fields (see 
http://wiki.apache.org/solr/SolrConfigXml#UpdateRequestProcessorChain_section). 
But then you need to index locally of course. 1.5 mill docs isn't that much, so 
why not?

--
Jan Høydahl  - search architect
Cominvent AS - www.cominvent.com

On 10. feb. 2010, at 10.22, Julian Hille wrote:

> Hi,
> 
> It would be possible to add that to the main Solr, but the problem is:
> Let's face it (example):
> We have kind of 1.5 million documents in the solr master. These Documents are 
> books.
> These books have fields like title, ids, numbers and authors and more.
> This solr is global.
> 
> Now: The slave solr is for a local library which has all these books, but 
> want to sort in another way,
> and wants to add their own fields. For sorting and output (these fields 
> doesnt need to be indexed or searched through).
> 
> So we try to replicate the whole database but have a slightly different
> schema.xml in the slaves.
> 
> 
> Secondly, for another project we need to know if it's possible to change
> data "on insert" / "on update",
> so that the replicated data gets edited before it is really inserted. Is
> there some kind of hook?
> As an example, let's take the book example from the top:
> On replication the slave gets an updated document set. But before it is
> updated in the slave's DB,
> we would like to add fields which come from another database, or we would
> like to replace strings in some fields and such things.
> 
> Is that possible?
> 
> Thanks for any answers.
> 
> 
> 
> Am 09.02.2010 um 16:53 schrieb Jan Høydahl / Cominvent:
> 
>> Hi,
>> 
>> Index replication in Solr makes an exact copy of the original index.
>> Is it not possible to add the 6 extra fields to both instances?
>> An alternative to replication is to feed two independent Solr instances -> 
>> full control :)
>> Please elaborate on your specific use case if this is not useful answer to 
>> you.
>> 
>> --
>> Jan Høydahl  - search architect
>> Cominvent AS - www.cominvent.com
>> 
>> On 9. feb. 2010, at 13.21, Julian Hille wrote:
>> 
>>> Hi,
>>> 
>>> I'd like to know if it's possible to have a Solr server with a schema and
>>> let's say 10 fields indexed.
>>> I now want to replicate this whole index to another Solr server which has
>>> a slightly different schema.
>>> There are additional 6 fields these fields change the sort order for a 
>>> product which base is our solr database.
>>> 
>>> Is this kind of replication possible?
>>> 
>>> Is there another way to interact with data in solr? We'd like to calculate 
>>> some fields when they will be added.
>>> I can't seem to find good documentation about the possible calls in the
>>> query itself, nor documentation about queries/calculations which should be
>>> done on update.
>>> 
>>> 
>>> so far,
>>> Julian Hille
>>> 
>>> 
>>> ---
>>> NetImpact KG
>>> Altonaer Straße 8
>>> 20357 Hamburg
>>> 
>>> Tel: 040 / 6738363 2
>>> Mail: jul...@netimpact.de
>>> 
>>> Managing Director: Tarek Müller
>>> 
> 
> Best regards,
> Julian Hille
> 
> 
> ---
> NetImpact KG
> Altonaer Straße 8
> 20357 Hamburg
> 
> Tel: 040 / 6738363 2
> Mail: jul...@netimpact.de
> 
> Managing Director: Tarek Müller
> 



Re: spellcheck

2010-02-11 Thread Jan Høydahl / Cominvent
Can you show us how you configured spell check?
--
Jan Høydahl  - search architect
Cominvent AS - www.cominvent.com

On 10. feb. 2010, at 11.48, michaelnazaruk wrote:

> 
> Hello, all!
> I have some problem with spellcheck! I downloaded, built and connected a
> dictionary (~500,000 words)! It works fine! But I get suggestions for any
> word (even a correct word)! Is it possible to get suggestions only for a
> wrong word?
> -- 
> View this message in context: 
> http://old.nabble.com/spellcheck-tp27527425p27527425.html
> Sent from the Solr - User mailing list archive at Nabble.com.
> 



Re: hl.maxAlternateFieldLength defaults in solrconfig.xml

2010-02-11 Thread Mark Miller
Yao Ge wrote:
> It appears the hl.maxAlternateFieldLength parameter default setting in
> solrconfig.xml does not take effect. I can only get it to work by explicitly
> sending the parameter via the client request. It is not a big deal but it
> appears to be a bug.
>   
Are you sure? It's handled the same way the other hl params are - so I
can't see how it would be the only one that's broken ...

If you are positive this is the case, could you make a JIRA issue?

-- 
- Mark

http://www.lucidimagination.com





Re: Getting max/min dates from solr index

2010-02-11 Thread Jan Høydahl / Cominvent
How about a field indextime_dt filled with "NOW". Then do a facet query to get 
the monthly stats for the last 12 months:
http://localhost:8983/solr/select/?q=*:*&rows=0&facet=true&facet.date=indextime_dt&facet.date.start=NOW/MONTH-12MONTHS&facet.date.end=NOW/MONTH%2B1MONTH&facet.date.gap=%2B1MONTH

To get min date, why not do a query sorted by index time and pull the timestamp 
from first hit?
http://localhost:8983/solr/select/?q=*:*&rows=1&fl=indextime_dt&sort=indextime_dt+asc
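
For illustration, the field itself can be declared along these lines in
schema.xml (assuming the stock date type; the name indextime_dt is just the
one used above):

<field name="indextime_dt" type="date" indexed="true" stored="true" default="NOW"/>

With default="NOW", every document is stamped with its index time automatically.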

--
Jan Høydahl  - search architect
Cominvent AS - www.cominvent.com

On 10. feb. 2010, at 14.12, Mark N wrote:

> How can we get the max and min date from the Solr index ? I would need these
> dates to draw a graph ( for example timeline graph )
> 
> 
> Also can we use date faceting to show how many documents are indexed every
> month  .
> Consider I need to draw a timeline graph for current year to show how many
> records are indexed for every month  .So i will have months in X axis and no
> of document in Y axis.
> 
> What should be the better approach to design a schema to achieve this
> functionality ?
> 
> 
> Any suggestions would be appreciated
> 
> thanks
> 
> 
> -- 
> Nipen Mark



Re: Question on Solr Scalability

2010-02-11 Thread Erik Hatcher
There is already a patch available to address that short-coming in  
distributed search:


http://issues.apache.org/jira/browse/SOLR-1632


On Feb 11, 2010, at 6:56 AM, abhishes wrote:



Thanks really useful article.

I am wondering about this statement in the article

"Keep in mind that Solr does not calculate universal term/doc  
frequencies.
At a large scale, its not likely  to matter that tf/idf is  
calculated at the

shard level - however, if your collection is heavily skewed in its
distribution across servers, you might take issue with the relevance
results. Its probably best to randomly distribute documents to your  
shards"


So if there is no universal tf/idf kept, then how does solr  
determine the
rank of two documents which came from different shards in a  
distributed

search query?

Regards,
Abhishek





Juan Pedro Danculovic-2 wrote:


To scale solr, take a look to this article

http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Scaling-Lucene-and-Solr



Juan Pedro Danculovic
CTO - www.linebee.com


On Thu, Feb 11, 2010 at 4:12 AM, abhishes  wrote:



Suppose I am indexing very large data (5 billion rows in a database)

Now I want to use the Solr Core feature to split the index into
manageable
chunks.

However I have two questions


1. Can Cores reside on different physical servers?

2. when a query comes, will the query be answered by the index in 1 core or
the query will be sent to all the cores?

My desire is to have a system which from outside appears as a single large
index... but inside it is multiple small indexes running on different
hardware machines.
--
View this message in context:
http://old.nabble.com/Question-on-Solr-Scalability-tp27543068p27543068.html
Sent from the Solr - User mailing list archive at Nabble.com.







--
View this message in context: 
http://old.nabble.com/Question-on-Solr-Scalability-tp27543068p27544436.html
Sent from the Solr - User mailing list archive at Nabble.com.





Re: spellcheck

2010-02-11 Thread Markus Jelsma
Hi,


Did you add spellcheck.extendedResults=true to your query? This will a.o. tell 
you if Solr thinks it has been spelled correctly or not. However, if you have 
specified spellcheck.onlyMorePopular=true, you may get suggestions even if it 
has been spelled correctly.

Don't let the onlyMorePopular directive fool you, it has caught many users off 
guard before :)


Cheers,

Markus Jelsma - Technisch Architect - Buyways BV
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350



Re: Faceting

2010-02-11 Thread Jan Høydahl / Cominvent
Regarding hi-jacking, that was a false alarm. Apple Mail fooled me to believe 
it was part of another thread. Sorry Jose.

I think the "properties" field approach is clean. It relies on index-time 
classification which is where such heavy-lifting should preferrably be done. 
Faceting on a multi-valued string field should work very well for this.

--
Jan Høydahl  - search architect
Cominvent AS - www.cominvent.com

On 11. feb. 2010, at 01.47, Chris Hostetter wrote:

> 
> : NOTE: Please start a new email thread for a new topic (See 
> : http://en.wikipedia.org/wiki/User:DonDiego/Thread_hijacking)
> 
> FWIW: I'm the most nit-picky person i know about Thread-Hijacking, but i 
> don't see any MIME headers to indicate that Jose did that).
> 
> : > If i follow this path can i then facet on "email" and/or "link" ? For
> : > example combining facet field with facet value params?
> 
> Any indexed field can be faceted on ... it's hard to be sure what exactly 
> your goal is, but if you ultimately want to be able to have a list of 
> search results, and then display facet info like "Number of results 
> containing an email address" and "Number of results containing a URL" then 
> yes: as long as you have a way of extracting that metadata and including 
> it in an indexed field, you can facet on it ... you could use Field 
> Faceting on something like a "properities: field (where all the indexed 
> values are "contains_email" and "containes_url", etc...) or you could use 
> facet queries to check arbitrary criteria (ie: facet.query=has_email:true 
> & facet.query=urls:[* TO *], etc...
> 
> 
> 
> -Hoss



Re: Cannot get like exact searching to work

2010-02-11 Thread Ahmet Arslan
> I might be able to try this out though in general the
> project has a
> policy about only using released code (no trunk/unstable).
> https://issues.apache.org/jira/browse/SOLR-1604
> It looks like the kind of searching I want to do is not
> really
> supported in SOLR by default though. Is that correct?

* or ? operators inside phrases are not supported in Solr by default. In Lucene 
contrib package there is a ComplexPhraseQueryParser that allows them.

> I thought that was what my exact definition was doing
> except I also
> want the exact field to be lowercased and trimmed (which I
> don't want
> for all strings). Can you explain what is wrong with the
> current
> definition so I can fix it?

in your attachment the definition is:

 

I have never seen a field type definition without a
charfilter/tokenizer/tokenfilter chain. Usually the string type is used for
exact match.


> What I really want is the equivalent of a match like this
> along with
> the normal tokenized matching (where the query has been
> lowercased and
> trimmed as well):
> select * from blah where lowercase(column) like '%query%';
> I think this is called a phrase match or something like
> that. 

Can your query consist of more than one word?

> However, wildcards cannot be used at the beginning of query so I
> guess I can live with only being able to startsWith type matching until
> that is fixed. 

With solr.ReversedWildcardFilterFactory it is possible. But it is in 1.4.0.
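
For reference, a sketch of an index-time analyzer using it (the attribute
values are the ones suggested in the 1.4 example schema):

<fieldType name="text_rev" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- additionally indexes reversed tokens so leading-wildcard
         queries can be rewritten into prefix queries -->
    <filter class="solr.ReversedWildcardFilterFactory" withOriginal="true"
            maxPosAsterisk="3" maxPosQuestion="2" maxFractionAsterisk="0.33"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

No extra filter is needed on the query side; the query parser detects the
reversed terms automatically.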

> For now I have tried to do that using this:
> query = (summary:"my item" || summaryExact:"my item*"^3)
> but I would do this if I could:
> query = (summary:"my item" || summaryExact:"*my item*"^3)

If you use the string type for summaryExact you can run the query
summaryExact:my\ item* It will bring you all documents that begin with "my item".
 
> The idea is that a "phrase" match would be boosted over the
> normal
> token matches and would show up first in the listing. Let
> me know if
> more examples would help. I am happy to provide them.

More examples would be great, because boosting a phrase match on a tokenized
field can be achieved by something like "my item"^5 my item
I didn't understand the need for the * operator.
Also this query will retrieve the documents below:

something my item something
my something item something

We can say that it already behaves %like% query.


  


Re: How to add SpellCheckResponse to Solritas?

2010-02-11 Thread Erik Hatcher
Let me understand the issue... Have you added spellchecking parameters  
to the /itas mapping in solrconfig.xml?   If so, you should be able to  
do /itas?q=mispeled&wt=xml and see the suggestions in the response.   
If you've gotten that far you'll be able to navigate to them using the  
object navigation of $response in the templates.


The output of $response, just to be clear, isn't really JSON, it's a  
toString() that looks similar though.  Or did you convert it to JSON  
in some other fashion?  /itas?q=mispeled&wt=json should also show the  
spelling suggestions.


Erik

On Feb 9, 2010, at 7:30 PM, Jan Høydahl / Cominvent wrote:


Hi,

I'm using the /itas requestHandler, and would like to add spell- 
check suggestions to the output.
I'm having spell-check configured and working in the XML response  
writer, but nothing is output in Velocity. Debugging the JSON  
$response object, I cannot find any representation of spellcheck  
response in there.


Where do I plug that in?

--
Jan Høydahl  - search architect
Cominvent AS - www.cominvent.com





Re: Question on Solr Scalability

2010-02-11 Thread Yonik Seeley
On Thu, Feb 11, 2010 at 6:56 AM, abhishes  wrote:
>
> Thanks really useful article.
>
> I am wondering about this statement in the article
>
> "Keep in mind that Solr does not calculate universal term/doc frequencies.
> At a large scale, its not likely  to matter that tf/idf is calculated at the
> shard level - however, if your collection is heavily skewed in its
> distribution across servers, you might take issue with the relevance
> results. Its probably best to randomly distribute documents to your shards"
>
> So if there is no universal tf/idf kept, then how does solr determine the
> rank of two documents which came from different shards in a distributed
> search query?

tf is per document, so it's the same distributed or non-distributed.
idf (inverse document frequency) is the measure of the rareness of a term.
Scoring in distributed search only considers the term rareness within
the shard.  Solr still orders documents from different shards by this
score.

Even after we integrate distributed idf, it will be optional because
it comes with a cost and is often unnecessary.

-Yonik
http://www.lucidimagination.com


Re: Solr and UIMA

2010-02-11 Thread JCodina

Things are done :-)

We have now finished the UIMA CAS consumer for Solr;
we are making it public, more news soon.

We have also been developing some filters based on payloads.
One of the filters removes words whose payloads are in the list; the
other one keeps only those tokens with payloads in the list. It works
the same way as the StopFilterFactory.

You can find it at my page:
http://www.barcelonamedia.org/personal/joan.codina/en

-- 
View this message in context: 
http://old.nabble.com/Solr-and-UIMA-tp24567504p27544646.html
Sent from the Solr - User mailing list archive at Nabble.com.



RE: Need a bit of help, Solr 1.4: type "text".

2010-02-11 Thread Dickey, Dan
Sven & Yu-Shan - thank you for your advice.

It doesn't seem to work for me for some reason however,
this is what I was trying to get working last night before sending
my message out.

I'll try to explain in more detail what my setup is like.

I use a multiValued text field as a sort of holder for everything else.
Let's call this field "euts" (acronym for everything under the sun).
Nothing is directly stored into this field.
I use about 20 or so copyField's to put everything else into it.
One of the fields I use is "Description", a "text" field.
In this field I'm storing "Friday the 13th", along with other potential
text.
I have a copyField like:
<copyField source="Description" dest="euts"/>
The field for euts is:
<field name="euts" type="text" indexed="true" multiValued="true"/>
The field for Description is:
<field name="Description" type="text" indexed="true"/>

The definition for the text type is straight out of the Solr tarball
from the example/solr/conf directory.  I tried setting catenateAll="1"
and reindexing, but that didn't work.

Btw - My search query effectively looks like "euts:(Friday the 13th)".
I'm just running this through the solr admin page using the Full
Interface.
(no quotes of course).  This does not match a document that has the
String "Friday the 13th" in its Description.  I've tried it with setting
catenateAll to 1, and the original value of 0.  This is on the index
analyzer.

I've also tried it both ways with the query analyzer (at least I think I
have).  I'm less sure of how the options for the query analyzer should
be
Set.

Also, in the wiki - I found another option for the
WordDelimiterFilterFactory - preserveOriginal.
I tried setting this to 1 with similar results - no match.

And yes, I'm aware that "the" is a stop word and gets thrown away.
That's fine.
After each of these schema.xml changes, I've re-indexed my documents.
It doesn't take long as I'm just working with a small set of about 180
docs
right now.

Again, any and all help would be greatly appreciated!  Thanks.
-Dan Dickey

-Original Message-
From: Sven Maurmann [mailto:sven.maurm...@kippdata.de] 
Sent: Thursday, February 11, 2010 2:36 AM
To: solr-user@lucene.apache.org
Subject: Re: Need a bit of help, Solr 1.4: type "text".

Hi,

the parameter for WordDelimiterFilterFactory is catenateAll;
you should set it to 1.

Cheers,
 Sven

--On Wednesday, 10 February 2010 16:37 -0800 Yu-Shan Fung 
 wrote:

> Check out the configuration of WordDelimiterFilterFactory in your
> schema.xml.
>
> Depending on your settings, it's probably tokenizing 13th into "13" and
> "th". You can also have them concatenated back into a single token, but I
> can't remember the exact parameter. I think it could be catenateAll.
>
>
>
> On Wed, Feb 10, 2010 at 4:32 PM, Dickey, Dan 
> wrote:
>
>> I'm using the standard "text" type for a field, and part of the data
>> being indexed is "13th", as in "Friday the 13th".
>>
>> I can't seem to get it to match when I'm querying for "Friday the 13th"
>> either quoted or not.
>>
>> One thing that does match is "13 th" if I send the search query with a
>> space between...
>>
>> Any suggestions?
>>
>> I know this is short on detail, but it's been a long day... time to get
>> outta here.
>>
>> Thanks for any and all help.
>>
>>-Dan
>>
>>
>>
>>
>> This message contains information which may be confidential and/or
>> privileged. Unless you are the intended recipient (or authorized to
>> receive for the intended recipient), you may not read, use, copy or
>> disclose to anyone the message or any information contained in the
>> message. If you have received the message in error, please advise the
>> sender by reply e-mail and delete the message and any attachment(s)
>> thereto without retaining any copies.
>
>
>
>
> --
> "When nothing seems to help, I go look at a stonecutter hammering away
> at his rock perhaps a hundred times without as much as a crack showing in
> it. Yet at the hundred and first blow it will split in two, and I know it
> was not that blow that did it, but all that had gone before." - Jacob
> Riis

This message contains information which may be confidential and/or privileged. 
Unless you are the intended recipient (or authorized to receive for the 
intended recipient), you may not read, use, copy or disclose to anyone the 
message or any information contained in the message. If you have received the 
message in error, please advise the sender by reply e-mail and delete the 
message and any attachment(s) thereto without retaining any copies.


help with facets and searchable fields

2010-02-11 Thread adeelmahmood

Hi there,
I am trying to get familiar with Solr while setting it up on my local PC and
indexing and retrieving some sample data. A couple of things I am having
trouble with:
1 - In my schema, if I don't use copyField to copy data from some fields
to the text field, they are not searchable. So if I just have an id and a
title field with searchable and sortable attributes set to true, I can't
search for them, but as soon as I add them to the 'text' field with the
copyField functionality, they become searchable.

2 - I have another field called category which can be something like
category 1, category 2, ...
Again, the same idea here: the category field by itself isn't searchable until
I copy it into the text field, and as I understand it, all data is simply
appended in the text field. So after that, if I search for some field with
a title matching my query and then facet with the category field,
and if my result let's say was in category 1, the facet count is then
returned as 2, with "Category" being the first thing and "2" being the
second thing. How can I make it so that it considers 'Category 2' as a
single thing?

any help is appreciated
-- 
View this message in context: 
http://old.nabble.com/help-with-facets-and-searchable-fields-tp27545136p27545136.html
Sent from the Solr - User mailing list archive at Nabble.com.



RE: The Riddle of the Underscore and the Dollar Sign . . .

2010-02-11 Thread Christopher Ball
Unfortunately, the underscore is being quite resilient =(

I tried the solr.MappingCharFilterFactory and know the mapping is working as
I am changing "c" => "q" just fine. But the underscore refuses to go!

I am baffled . . .



-Original Message-
From: Ahmet Arslan [mailto:iori...@yahoo.com] 
Sent: Thursday, February 11, 2010 3:11 AM
To: solr-user@lucene.apache.org
Subject: Re: The Riddle of the Underscore and the Dollar Sign . . .

> 1) How can I get rid of underscores ('_') without using the
> WordDelimiterFilter (which gets rid of other syntax I need)?

Before the TokenizerFactory you can apply a <charFilter
class="solr.MappingCharFilterFactory" mapping="mapping.txt"/> that will
replace "_" with " " or "" depending on your needs.

mapping.txt will contain:

"_" => "" or 
"_" => " " 


  




Re: spellcheck

2010-02-11 Thread michaelnazaruk

Here is a simple query:
http://estyledesign:8983/request/select?q=popular&spellcheck=true&qt=keyrequest&spellcheck.extendedResults=true
result:
populars! But popular is a correct word! Maybe I must change some properties
in solrconfig! Here are my configs for keyrequest:

<requestHandler name="keyrequest" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">dismax</str>
    <str name="spellcheck">true</str>
    <str name="spellcheck.onlyMorePopular">false</str>
    <str name="spellcheck.extendedResults">true</str>
    <str name="spellcheck.dictionary">external</str>
  </lst>
  <arr name="components">
    <str>query</str>
    <str>spellcheck</str>
    <str>mlt</str>
  </arr>
</requestHandler>

and search component:

<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
  <str name="queryAnalyzerFieldType">textSpell</str>
  <lst name="spellchecker">
    <str name="classname">org.apache.solr.spelling.FileBasedSpellChecker</str>
    <str name="name">external</str>
    <str name="sourceLocation">spellings.txt</str>
    <str name="characterEncoding">UTF-8</str>
    <str name="spellcheckIndexDir">spellcheckerfile</str>
  </lst>
</searchComponent>
-- 
View this message in context: 
http://old.nabble.com/spellcheck-tp27527425p27547926.html
Sent from the Solr - User mailing list archive at Nabble.com.






RE: Need a bit of help, Solr 1.4: type "text".

2010-02-11 Thread Dickey, Dan
Hmm... I think I'm onto something.
It may be the stop word removal of "the".
When I changed my query analyzer for "text" to set
enablePositionIncrements="false" instead of true,
the query seems to find what I'm expecting.  I'll keep
looking into this.

Is there any information available on what Position Increments are?
-Dan

-Original Message-
From: Dickey, Dan [mailto:dan.dic...@savvis.net] 
Sent: Thursday, February 11, 2010 8:29 AM
To: solr-user@lucene.apache.org
Subject: RE: Need a bit of help, Solr 1.4: type "text".

Sven & Yu-Shan - thank you for your advice.

It doesn't seem to work for me for some reason however,
this is what I was trying to get working last night before sending
my message out.

I'll try to explain in more detail what my setup is like.

I use a multiValued text field as a sort of holder for everything else.
Let's call this field "euts" (acronym for everything under the sun).
Nothing is directly stored into this field.
I use about 20 or so copyField's to put everything else into it.
One of the fields I use is "Description", a "text" field.
In this field I'm storing "Friday the 13th", along with other potential
text.
I have a copyField like:
<copyField source="Description" dest="euts"/>
The field for euts is:
<field name="euts" type="text" indexed="true" multiValued="true"/>
The field for Description is:
<field name="Description" type="text" indexed="true"/>

The definition for the text type is straight out of the Solr tarball
from the example/solr/conf directory.  I tried setting catenateAll="1"
and reindexing, but that didn't work.

Btw - my search query effectively looks like "euts:(Friday the 13th)".
I'm just running this through the Solr admin page using the Full Interface
(no quotes of course).  This does not match a document that has the
string "Friday the 13th" in its Description.  I've tried it with setting
catenateAll to 1, and the original value of 0.  This is on the index
analyzer.

I've also tried it both ways with the query analyzer (at least I think I
have).  I'm less sure of how the options for the query analyzer should be
set.

Also, in the wiki - I found another option for the
WordDelimiterFilterFactory - preserveOriginal.
I tried setting this to 1 with similar results - no match.

And yes, I'm aware that "the" is a stop word and gets thrown away.
That's fine.
After each of these schema.xml changes, I've re-indexed my documents.
It doesn't take long as I'm just working with a small set of about 180
docs
right now.

Again, any and all help would be greatly appreciated!  Thanks.
-Dan Dickey

-Original Message-
From: Sven Maurmann [mailto:sven.maurm...@kippdata.de] 
Sent: Thursday, February 11, 2010 2:36 AM
To: solr-user@lucene.apache.org
Subject: Re: Need a bit of help, Solr 1.4: type "text".

Hi,

the parameter for WordDelimiterFilterFactory is catenateAll;
you should set it to 1.

Cheers,
 Sven

--On Wednesday, 10 February 2010 16:37 -0800 Yu-Shan Fung 
 wrote:

> Check out the configuration of WordDelimiterFilterFactory in your
> schema.xml.
>
> Depending on your settings, it's probably tokenizing 13th into "13" and
> "th". You can also have them concatenated back into a single token, but I
> can't remember the exact parameter. I think it could be catenateAll.
>
>
>
> On Wed, Feb 10, 2010 at 4:32 PM, Dickey, Dan 
> wrote:
>
>> I'm using the standard "text" type for a field, and part of the data
>> being indexed is "13th", as in "Friday the 13th".
>>
>> I can't seem to get it to match when I'm querying for "Friday the 13th"
>> either quoted or not.
>>
>> One thing that does match is "13 th" if I send the search query with a
>> space between...
>>
>> Any suggestions?
>>
>> I know this is short on detail, but it's been a long day... time to get
>> outta here.
>>
>> Thanks for any and all help.
>>
>>-Dan
>>
>>
>>
>>
>> This message contains information which may be confidential and/or
>> privileged. Unless you are the intended recipient (or authorized to
>> receive for the intended recipient), you may not read, use, copy or
>> disclose to anyone the message or any information contained in the
>> message. If you have received the message in error, please advise the
>> sender by reply e-mail and delete the message and any attachment(s)
>> thereto without retaining any copies.
>
>
>
>
> --
> "When nothing seems to help, I go look at a stonecutter hammering away
> at his rock perhaps a hundred times without as much as a crack showing
in
> it. Yet at the hundred and first blow it will split in two, and I know
it
> was not that blow that did it, but all that had gone before." - Jacob
> Riis


term frequency vector access?

2010-02-11 Thread Mike Perham
In an UpdateRequestProcessor (processing an AddUpdateCommand), I have
a SolrInputDocument with a field 'content' that has termVectors="true"
in schema.xml.  Is it possible to get access to that field's term
vector in the URP?


Re: Cannot get like exact searching to work

2010-02-11 Thread Aaron Zeckoski
On Thu, Feb 11, 2010 at 1:52 PM, Ahmet Arslan  wrote:
>> What I really want is the equivalent of a match like this
>> along with
>> the normal tokenized matching (where the query has been
>> lowercased and
>> trimmed as well):
>> select * from blah where lowercase(column) like '%query%';
>> I think this is called a phrase match or something like
>> that.
>
> Can your query consist of more than one words?

Yes, and I expect it almost always will (the query string is coming
from a search box on a website).


>> However, wildcards cannot be used at the beginning of query so I
>> guess I can live with only being able to startsWith type matching until
>> that is fixed.
>
> With solr.ReversedWildcardFilterFactory it is possible. But it is in 1.4.0.

OK, so I may need to seriously look at SOLR 1.4 if I want to do a
"*stuff*" search.


>> For now I have tried to do that using this:
>> query = (summary:"my item" || summaryExact:"my item*"^3)
>> but I would do this if I could:
>> query = (summary:"my item" || summaryExact:"*my item*"^3)
>
> If you use string type for summaryExact you can run this query 
> summaryExact:my\ item* It will bring you all documents begins with my item.

Actually it won't. The data I am indexing has extra spaces in front
and is capitalized. I really need to be able to filter it through the
lowercase and trim filter without tokenizing it.
Is there a way to apply filters to the string type (I am pretty sure
there is not)?


>> The idea is that a "phrase" match would be boosted over the
>> normal
>> token matches and would show up first in the listing. Let
>> me know if
>> more examples would help. I am happy to provide them.
>
> More examples will be great. Because boosting phrase match on a tokenized 
> field can be achieved by something like "my item"^5 my item
> I didn't understand need of * operator.
> Also this query will retrieve documents below:
>
> something my item something
> my something item something
>
> We can say that it already behaves %like% query.

This doesn't seem to align with the results I am seeing when I do
searches. Are you saying that if I do a search like this it will boost
the phrase matches while still doing token matches?
q=summary:"my item"^5

Or do I have to not use my summary field (the one I copy the other fields into)?

-AZ


-- 
Aaron Zeckoski (azeckoski (at) vt.edu)
Senior Research Engineer - CARET - University of Cambridge
https://twitter.com/azeckoski - http://www.linkedin.com/in/azeckoski
http://aaronz-sakai.blogspot.com/ - http://tinyurl.com/azprofile


Re: "after flush: fdx size mismatch" on query durring writes

2010-02-11 Thread Acadaca

Thanks for the help!

Yes, we are doing a commit following the update. We will try
IndexWriter.setInfoStream.
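
In Solr 1.4 this can also be switched on from solrconfig.xml instead of code;
a sketch, assuming the stock example config layout:

<indexDefaults>
  <!-- ... other index settings ... -->
  <infoStream file="INFOSTREAM.txt">true</infoStream>
</indexDefaults>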

Below are the environments we are testing on:

Server A

Ubuntu Hardy, Kernel 2.6.16-xenU i386
Amazon EC2, US East Region
Embedded Jetty
Java 1.6.0_16
Solr 1.4

Server B

Ubuntu Hardy, Kernel 2.6.16-xenU i386
Apache Tomcat 6.0.24
Java 1.6.0_07
Solr 1.4

Michael McCandless-2 wrote:
> 
> Yes, more details would be great...
> 
> Is this easily repeated?
> 
> The exists?=false is particularly spooky.
> 
> It means, somehow, a new segment was being flushed, containing 1285
> docs, but then after closing the doc stores, the stored fields index
> file (_X.fdx) had been deleted.
> 
> Can you turn on IndexWriter.setInfoStream, get this error to happen
> again, and then post the output?  Thanks.
> 
> Mike
> 
> On Wed, Feb 10, 2010 at 12:59 AM, Lance Norskog  wrote:
>> We need more information. How big is the index in disk space? How many
>> documents? How many fields? What's the schema? What OS? What Java
>> version?
>>
>> Do you run this on a local hard disk or is it over an NFS mount?
>>
>> Does this software commit before shutting down?
>>
>> If you run with asserts on, do you get errors before this happens?
>>    -ea:org.apache.lucene... as a JVM argument
>>
>> On Tue, Feb 9, 2010 at 5:08 PM, Acadaca  wrote:
>>>
>>> We are using Solr 1.4 in a multi-core setup with replication.
>>>
>>> Whenever we write to the master we get the following exception:
>>>
>>> java.lang.RuntimeException: after flush: fdx size mismatch: 1285 docs vs
>>> 0
>>> length in bytes of _gqg.fdx file exists?=false
>>> at
>>> org.apache.lucene.index.StoredFieldsWriter.closeDocStore(StoredFieldsWriter.java:97)
>>> at
>>> org.apache.lucene.index.DocFieldProcessor.closeDocStore(DocFieldProcessor.java:50)
>>>
>>> Has anyone had any success debugging this one?
>>>
>>> thx.
>>> --
>>> View this message in context:
>>> http://old.nabble.com/%22after-flush%3A-fdx-size-mismatch%22-on-query-durring-writes-tp27524755p27524755.html
>>> Sent from the Solr - User mailing list archive at Nabble.com.
>>>
>>>
>>
>>
>>
>> --
>> Lance Norskog
>> goks...@gmail.com
>>
> 
> 

-- 
View this message in context: 
http://old.nabble.com/%22after-flush%3A-fdx-size-mismatch%22-on-query-durring-writes-tp27524755p27549906.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: spellcheck

2010-02-11 Thread Markus Jelsma
Hi,


Check my earlier reply. You have explicitly set onlyMorePopular to true, thus 
you will most likely always get suggestions even if the term was spelled 
correctly. You'll only get no suggestions if the term is spelled correctly and 
it is the most `popular` term.

You can opt for keeping onlyMorePopular set to true but it is then wise to 
enable extendedResults so you can check the correctlySpelled boolean.
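
With extendedResults on, the flag appears in the spellcheck section of the
response, roughly like this (a sketch of the XML output):

<lst name="spellcheck">
  <lst name="suggestions">
    <!-- per-term suggestion entries -->
    <bool name="correctlySpelled">true</bool>
  </lst>
</lst>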


cheers,


>here is a simple query:
>http://estyledesign:8983/request/select?q=popular&spellcheck=true&qt
>=keyrequest&spellcheck.extendedResults=true
>result: populars! But popular is a correct word! Maybe I must change some
>properties in solrconfig! Here are my configs for keyrequest:
>
>
><requestHandler name="keyrequest" class="solr.SearchHandler">
>  <lst name="defaults">
>    <str name="defType">dismax</str>
>    <str name="spellcheck.onlyMorePopular">true</str>
>    <str name="spellcheck.extendedResults">false</str>
>    <str name="spellcheck">true</str>
>    <str name="spellcheck.dictionary">external</str>
>  </lst>
>  <arr name="components">
>    <str>query</str>
>    <str>spellcheck</str>
>    <str>mlt</str>
>  </arr>
></requestHandler>
>
>and search component:
>
><searchComponent name="spellcheck" class="solr.SpellCheckComponent">
>  <str name="queryAnalyzerFieldType">textSpell</str>
>  <lst name="spellchecker">
>    <str name="classname">org.apache.solr.spelling.FileBasedSpellChecker</str>
>    <str name="name">external</str>
>    <str name="sourceLocation">spellings.txt</str>
>    <str name="characterEncoding">UTF-8</str>
>    <str name="spellcheckIndexDir">spellcheckerfile</str>
>  </lst>
></searchComponent>
>--
>View this message in context:
> http://old.nabble.com/spellcheck-tp27527425p27548078.html Sent from the
> Solr - User mailing list archive at Nabble.com.

Markus Jelsma - Technisch Architect - Buyways BV
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350



Re: term frequency vector access?

2010-02-11 Thread Koji Sekiguchi

Mike Perham wrote:

In an UpdateRequestProcessor (processing an AddUpdateCommand), I have
a SolrInputDocument with a field 'content' that has termVectors="true"
in schema.xml.  Is it possible to get access to that field's term
vector in the URP?

  

You cannot get the term vector info of a document before indexing it.
But I think you can index it to a RAM-based index (InstantiatedIndex, for
instance), and then you can get them.

Koji

--
http://www.rondhuit.com/en/



Re: term frequency vector access?

2010-02-11 Thread Andrzej Bialecki

On 2010-02-11 17:04, Mike Perham wrote:

In an UpdateRequestProcessor (processing an AddUpdateCommand), I have
a SolrInputDocument with a field 'content' that has termVectors="true"
in schema.xml.  Is it possible to get access to that field's term
vector in the URP?


No, term vectors are created much later, during the process of adding 
the document to a Lucene index (deep inside Lucene IndexWriter & co). 
That's the whole point of SOLR-1536 - certain features become available 
only when the tokenization actually occurs.


Another reason to use SOLR-1536 is when tokenization and analysis are 
costly, e.g. when doing named entity recognition, POS tagging or 
lemmatization. Theoretically you could run the TokenizerChain twice - 
once in URP, so that you can discover and capture features and modify 
the input document accordingly, and then again inside Lucene - but in 
practice this may be too costly.


--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



RE: The Riddle of the Underscore and the Dollar Sign . . .

2010-02-11 Thread Vauthrin, Laurent
We use the PatternTokenizerFactory.  We have the following in our
schema:

 

And to get rid of '_' we just remove it from the pattern.
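
A sketch of that kind of definition (the pattern shown here is an assumption,
not our exact one). With group="0" the pattern describes what a token looks
like, so removing the underscore from the character class makes '_' act as a
separator:

<fieldType name="text_pattern" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.PatternTokenizerFactory" pattern="[A-Za-z0-9_]+" group="0"/>
  </analyzer>
</fieldType>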

-Original Message-
From:
solr-user-return-32434-laurent.vauthrin=disney@lucene.apache.org
[mailto:solr-user-return-32434-laurent.vauthrin=disney@lucene.apache
.org] On Behalf Of Christopher Ball
Sent: Thursday, February 11, 2010 6:35 AM
To: solr-user@lucene.apache.org
Cc: 'Ahmet Arslan'
Subject: RE: The Riddle of the Underscore and the Dollar Sign . . .

Unfortunately, the underscore is being quite resilient =(

I tried the solr.MappingCharFilterFactory and know the mapping is
working as
I am changing "c" => "q" just fine. But the underscore refuses to go!

I am baffled . . .



-Original Message-
From: Ahmet Arslan [mailto:iori...@yahoo.com] 
Sent: Thursday, February 11, 2010 3:11 AM
To: solr-user@lucene.apache.org
Subject: Re: The Riddle of the Underscore and the Dollar Sign . . .

> 1) How can I get rid of underscores('_') without using the
> wordDelimiter
> Filter (which gets rid of other syntax I need)?

Before the TokenizerFactory you can apply a MappingCharFilterFactory that will
replace "_" with " " or "" depending on your needs.

mapping.txt will contain:

"_" => "" or 
"_" => " " 


  




Re: Posting Concurrently to Solr

2010-02-11 Thread Jan Høydahl / Cominvent
You did not say how frequently you need to update the index, whether this is a 
batch type of operation, or whether you also have some real-time requirements 
after the initial load.

Your ETL could use SolrJ and the StreamingUpdateSolrServer for high throughput.
You could try multiple threads pushing in parallel if your bottleneck is on 
the client side.
If that's not enough you can split your index into multiple cores/shards to get 
more parallel indexing power.
You don't need to merge them at the end, you can query using the shards 
parameter.
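
For example (host names hypothetical):

http://localhost:8983/solr/select?q=foo&shards=host1:8983/solr,host2:8983/solr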

For extreme power for batch indexing, you can look at a map-reduce strategy: 
http://wiki.apache.org/solr/HadoopIndexing

--
Jan Høydahl  - search architect
Cominvent AS - www.cominvent.com

On 11. feb. 2010, at 11.33, abhishes wrote:

> 
> Hello Everyone,
> 
> If I have a large data set which needs to be indexed, what strategy I can
> take to build the index fast?
> 
> 1. split the input into multiple xml files and then open different shells
> and post each of the split xml file? will this work and help me build index
> faster than 1 large xml file?
> 
> 2. What if I don't want to build the XML files at all. I want to write the
> extraction logic in an ETL tool and then let the ETL tool send the command
> to SOLR. then I run my ETL tool in a multi-threaded manner where each thread
> is extracting the data from the backed and send it to Solr for indexing.
> 
> 3. Use the Core Feature and then populate each core separately, then merge
> the cores.
> 
> Any other approach?
> 
> 
> 
> -- 
> View this message in context: 
> http://old.nabble.com/Posting-Concurrently-to-Solr-tp27544311p27544311.html
> Sent from the Solr - User mailing list archive at Nabble.com.
> 



Re: Posting Concurrently to Solr

2010-02-11 Thread abhishes
I will run update index once a day.

Regards,
Abhishek

--Original Message--
From: Jan Høydahl / Cominvent
To: solr-user@lucene.apache.org
ReplyTo: solr-user@lucene.apache.org
Subject: Re: Posting Concurrently to Solr
Sent: Feb 11, 2010 22:17

You did not say how frequently you need to update the index, whether this is a 
batch type of operation, or whether you also have some real-time requirements 
after the initial load.

Your ETL could use SolrJ and the StreamingUpdateSolrServer for high throughput.
You could try multiple threads pushing in parallel if your bottleneck is on 
the client side.
If that's not enough you can split your index into multiple cores/shards to get 
more parallel indexing power.
You don't need to merge them at the end, you can query using the shards 
parameter.

For extreme power for batch indexing, you can look at a map-reduce strategy: 
http://wiki.apache.org/solr/HadoopIndexing

--
Jan Høydahl  - search architect
Cominvent AS - www.cominvent.com

On 11. feb. 2010, at 11.33, abhishes wrote:

> 
> Hello Everyone,
> 
> If I have a large data set which needs to be indexed, what strategy I can
> take to build the index fast?
> 
> 1. split the input into multiple xml files and then open different shells
> and post each of the split xml file? will this work and help me build index
> faster than 1 large xml file?
> 
> 2. What if I don't want to build the XML files at all. I want to write the
> extraction logic in an ETL tool and then let the ETL tool send the command
> to SOLR. then I run my ETL tool in a multi-threaded manner where each thread
> is extracting the data from the backed and send it to Solr for indexing.
> 
> 3. Use the Core Feature and then populate each core separately, then merge
> the cores.
> 
> Any other approach?
> 
> 
> 
> -- 
> View this message in context: 
> http://old.nabble.com/Posting-Concurrently-to-Solr-tp27544311p27544311.html
> Sent from the Solr - User mailing list archive at Nabble.com.
> 



Sent from BlackBerry® on Airtel

Re: Distributed search and haproxy and connection build up

2010-02-11 Thread Tim Underwood
Have you played around with the "option httpclose" or the "option
forceclose" configuration options in HAProxy (both documented here:
http://haproxy.1wt.eu/download/1.3/doc/configuration.txt)?

-Tim

On Wed, Feb 10, 2010 at 10:05 AM, Ian Connor  wrote:
> Thanks,
>
> I bypassed haproxy as a test and it did reduce the number of connections -
> but it did not seem as though these connections were hurting anything.
>
> Ian.
>
> On Tue, Feb 9, 2010 at 11:01 PM, Lance Norskog  wrote:
>
>> This goes through the Apache Commons HTTP client library:
>> http://hc.apache.org/httpclient-3.x/
>>
>> We used 'balance' at another project and did not have any problems.
>>
>> On Tue, Feb 9, 2010 at 5:54 AM, Ian Connor  wrote:
>> > I have been using distributed search with haproxy but noticed that I am
>> > suffering a little from tcp connections building up waiting for the OS
>> level
>> > closing/time out:
>> >
>> > netstat -a
>> > ...
>> > tcp6       1      0 10.0.16.170%34654:53789 10.0.16.181%363574:8893
>> > CLOSE_WAIT
>> > tcp6       1      0 10.0.16.170%34654:43932 10.0.16.181%363574:8890
>> > CLOSE_WAIT
>> > tcp6       1      0 10.0.16.170%34654:43190 10.0.16.181%363574:8895
>> > CLOSE_WAIT
>> > tcp6       0      0 10.0.16.170%346547:8984 10.0.16.181%36357:53770
>> > TIME_WAIT
>> > tcp6       1      0 10.0.16.170%34654:41782 10.0.16.181%363574:
>> > CLOSE_WAIT
>> > tcp6       1      0 10.0.16.170%34654:52169 10.0.16.181%363574:8890
>> > CLOSE_WAIT
>> > tcp6       1      0 10.0.16.170%34654:55947 10.0.16.181%363574:8887
>> > CLOSE_WAIT
>> > tcp6       0      0 10.0.16.170%346547:8984 10.0.16.181%36357:54040
>> > TIME_WAIT
>> > tcp6       1      0 10.0.16.170%34654:40030 10.0.16.160%363574:8984
>> > CLOSE_WAIT
>> > ...
>> >
>> > Digging a little into the haproxy documentation, it seems that they do
>> not
>> > support persistent connections.
>> >
>> > Does solr normally persist the connections between shards (would this
>> > problem happen even without haproxy)?
>> >
>> > Ian.
>> >
>>
>>
>>
>> --
>> Lance Norskog
>> goks...@gmail.com
>>
>
>
>
> --
> Regards,
>
> Ian Connor
>


Re: implementing profanity detector

2010-02-11 Thread Grant Ingersoll

On Jan 28, 2010, at 4:46 PM, Mike Perham wrote:

> We'd like to implement a profanity detector for documents during indexing.
> That is, given a file of profane words, we'd like to be able to mark a
> document as safe or not safe if it contains any of those words so that we
> can have something similar to google's safe search.
> 
> I'm trying to figure out how best to implement this with Solr 1.4:
> 
> - An UpdateRequestProcessor would allow me to dynamically populate a "safe"
> boolean field but requires me to pull out the content, tokenize it and run
> each token through my set of profanities, essentially running the analysis
> pipeline again.  That's a lot of overhead AFAIK.
> 
> - A TokenFilter would allow me to tap into the existing analysis pipeline so
> I get the tokens for free but I can't access the document.
> 
> Any suggestions on how to best implement this?
> 


TeeSinkTokenFilter (Lucene only) would do the trick if you're up for some 
hardcoding b/c it isn't supported in Solr (patch welcome) all that well.  A 
one-off solution shouldn't be too hard to wedge in, but it will involve 
hardcoding some field names in your analyzer, I think.  

Otherwise, I'd do it via copy fields.  Your first field is your main field and 
is analyzed as before.  Your second field does the profanity detection and 
simply outputs a single token at the end, safe/unsafe.
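
A sketch of that wiring, where ProfanityFlagFilterFactory stands in for the
custom filter you would have to write yourself:

<field name="content" type="text" indexed="true" stored="true"/>
<field name="safe" type="profanityFlag" indexed="true" stored="false"/>
<copyField source="content" dest="safe"/>

<fieldType name="profanityFlag" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- hypothetical filter: consumes all tokens and emits a single
         "safe" or "unsafe" token based on a word list -->
    <filter class="com.example.ProfanityFlagFilterFactory" words="profanities.txt"/>
  </analyzer>
</fieldType>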

How long are your documents?  The extra copy field is extra work, but in this 
case it should be fast as you should be able to create a pretty streamlined 
analyzer chain for the second task.

Short term, I'd do the copy field approach while maybe, depending on its 
importance to you, working on the first approach.

-Grant




Re: Distributed search and haproxy and connection build up

2010-02-11 Thread Ian Connor
Not yet - but thanks for the link.

I think that the OS also has a timeout that keeps a connection around even after
this event, and with heavy traffic I have seen this build up. Having said all
this, the performance impact after testing was negligible for us, but I
thought I would post that haproxy can cause large numbers of connections on
a busy site. Going directly to the shards does cut the number of connections
down a lot, if someone else finds this to be a problem.

I am looking forward to distribution under 1.5 where the "|" option allows
redundancy in the request. This will solve the persistence problem while
still allowing failover for the shard requests.

Even after 1.5, I would still advocate haproxy between ruby (or your
http stack) and solr. It is just that when Solr is sharding the request, it can
keep its connections open and save some resources there.

Ian.


On Thu, Feb 11, 2010 at 11:49 AM, Tim Underwood wrote:

> Have you played around with the "option httpclose" or the "option
> forceclose" configuration options in HAProxy (both documented here:
> http://haproxy.1wt.eu/download/1.3/doc/configuration.txt)?
>
> -Tim
>
> On Wed, Feb 10, 2010 at 10:05 AM, Ian Connor  wrote:
> > Thanks,
> >
> > I bypassed haproxy as a test and it did reduce the number of connections
> -
> > but it did not seem as though these connections were hurting anything.
> >
> > Ian.
> >
> > On Tue, Feb 9, 2010 at 11:01 PM, Lance Norskog 
> wrote:
> >
> >> This goes through the Apache Commons HTTP client library:
> >> http://hc.apache.org/httpclient-3.x/
> >>
> >> We used 'balance' at another project and did not have any problems.
> >>
> >> On Tue, Feb 9, 2010 at 5:54 AM, Ian Connor 
> wrote:
> >> > I have been using distributed search with haproxy but noticed that I
> am
> >> > suffering a little from tcp connections building up waiting for the OS
> >> level
> >> > closing/time out:
> >> >
> >> > netstat -a
> >> > ...
> >> > tcp6   1  0 10.0.16.170%34654:53789 10.0.16.181%363574:8893
> >> > CLOSE_WAIT
> >> > tcp6   1  0 10.0.16.170%34654:43932 10.0.16.181%363574:8890
> >> > CLOSE_WAIT
> >> > tcp6   1  0 10.0.16.170%34654:43190 10.0.16.181%363574:8895
> >> > CLOSE_WAIT
> >> > tcp6   0  0 10.0.16.170%346547:8984 10.0.16.181%36357:53770
> >> > TIME_WAIT
> >> > tcp6   1  0 10.0.16.170%34654:41782 10.0.16.181%363574:
> >> > CLOSE_WAIT
> >> > tcp6   1  0 10.0.16.170%34654:52169 10.0.16.181%363574:8890
> >> > CLOSE_WAIT
> >> > tcp6   1  0 10.0.16.170%34654:55947 10.0.16.181%363574:8887
> >> > CLOSE_WAIT
> >> > tcp6   0  0 10.0.16.170%346547:8984 10.0.16.181%36357:54040
> >> > TIME_WAIT
> >> > tcp6   1  0 10.0.16.170%34654:40030 10.0.16.160%363574:8984
> >> > CLOSE_WAIT
> >> > ...
> >> >
> >> > Digging a little into the haproxy documentation, it seems that they do
> >> not
> >> > support persistent connections.
> >> >
> >> > Does solr normally persist the connections between shards (would this
> >> > problem happen even without haproxy)?
> >> >
> >> > Ian.
> >> >
> >>
> >>
> >>
> >> --
> >> Lance Norskog
> >> goks...@gmail.com
> >>
> >
> >
> >
> > --
> > Regards,
> >
> > Ian Connor
> >
>


Re: spellcheck

2010-02-11 Thread michaelnazaruk

I changed the config, but I get the same result! 



<requestHandler name="keyrequest" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">dismax</str>
    <str name="spellcheck.onlyMorePopular">false</str>
    <str name="spellcheck.extendedResults">false</str>
    <str name="spellcheck">true</str>
    <str name="spellcheck.dictionary">external</str>
  </lst>
  <arr name="components">
    <str>query</str>
    <str>spellcheck</str>
    <str>mlt</str>
  </arr>
</requestHandler>



-- 
View this message in context: 
http://old.nabble.com/spellcheck-tp27527425p27550755.html
Sent from the Solr - User mailing list archive at Nabble.com.



parabolic type function centered on a date

2010-02-11 Thread Nagelberg, Kallin
Hi everyone,

I'm trying to enhance a more like this search I'm conducting by boosting the 
documents that have a date close to the original. I would like to do something 
like a parabolic function centered on the date (would make tuning a little more 
effective), though a linear function would probably suffice. Has anyone 
attempted this? If so I'd love to hear your strategy and results!

Thanks,
Kallin Nagelberg


Re: How to add SpellCheckResponse to Solritas?

2010-02-11 Thread Jan Høydahl / Cominvent
My problem was that spellcheck component was missing from /itas handler.

With that in place, I could use 
$response.response.spellcheck.suggestions.collation (no idea why I needed 
$response.response?) to pick up the spellcheck.

Now it works quite well: 
http://ec2-79-125-69-12.eu-west-1.compute.amazonaws.com:8983/solr/itas?q=conector

Perhaps I should submit this as my first patch to the Solr project :)

--
Jan Høydahl  - search architect
Cominvent AS - www.cominvent.com

On 11. feb. 2010, at 15.06, Erik Hatcher wrote:

> Let me understand the issue... Have you added spellchecking parameters to the 
> /itas mapping in solrconfig.xml?   If so, you should be able to do 
> /itas?q=mispeled&wt=xml and see the suggestions in the response.  If you've 
> gotten that far you'll be able to navigate to them using the object 
> navigation of $response in the templates.
> 
> The output of $response, just to be clear, isn't really JSON, it's a 
> toString() that looks similar though.  Or did you convert it to JSON in some 
> other fashion?  /itas?q=mispeled&wt=json should also show the spelling 
> suggestions.
> 
>   Erik
> 
> On Feb 9, 2010, at 7:30 PM, Jan Høydahl / Cominvent wrote:
> 
>> Hi,
>> 
>> I'm using the /itas requestHandler, and would like to add spell-check 
>> suggestions to the output.
>> I'm having spell-check configured and working in the XML response writer, 
>> but nothing is output in Velocity. Debugging the JSON $response object, I 
>> cannot find any representation of spellcheck response in there.
>> 
>> Where do I plug that in?
>> 
>> --
>> Jan Høydahl  - search architect
>> Cominvent AS - www.cominvent.com
>> 
> 



How to setup solr with Struts framework (oc4j servlet)?

2010-02-11 Thread Ching Zheng
thanks.


Re: spellcheck

2010-02-11 Thread Markus Jelsma
Hi,


I see you use an `external` dictionary. I've no idea what that is and how it 
works, but it looks like the dictionary believes `populars!` is a term, which 
obviously is not equal to `popular`. If this is an external index under your 
manual control, how about adding `popular` to the dictionary? And why is 
`populars!` in a spellcheck dictionary; it sounds like some weird term for a 
dictionary ;)


I know this is not a schema change but perhaps the following might help:
a) remove the old index
b) restart your application server
c) reindex your data
d) rebuild your spellcheck index






>I changed the config, but I get the same result!
>
><requestHandler name="keyrequest" class="solr.SearchHandler">
>  <lst name="defaults">
>    <str name="defType">dismax</str>
>    <str name="spellcheck.onlyMorePopular">false</str>
>    <str name="spellcheck.extendedResults">false</str>
>    <str name="spellcheck">true</str>
>    <str name="spellcheck.dictionary">external</str>
>  </lst>
>  <arr name="components">
>    <str>query</str>
>    <str>spellcheck</str>
>    <str>mlt</str>
>  </arr>
></requestHandler>
>
>

Markus Jelsma - Technisch Architect - Buyways BV
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350



Re: Cannot get like exact searching to work

2010-02-11 Thread Ahmet Arslan

> > If you use string type for summaryExact you can run
> this query summaryExact:my\ item* It will bring you all
> documents begins with my item.
> 
> Actually it won't. The data I am indexing has extra spaces
> in front
> and is capitalized. I really need to be able to filter it
> through the
> lowercase and trim filter without tokenizing it.
> Is there a way to apply filters to the string type (I am
> pretty sure
> there is not)?

As you said you cannot apply. But you can mimic it with KeywordTokenizer + 
TrimFilter + LowercaseFilter combination.
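
A minimal sketch of such a field type (the name is arbitrary):

<fieldType name="string_lc_trim" class="solr.TextField" omitNorms="true">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.TrimFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>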

> This doesn't seem to align with the results I am seeing
> when I do searches. Are you saying that if I do a search like this it
> will boost the phrase matches while still doing token matches?
> q=summary:"my item"^5

Exactly. However, you need to add the individual terms also:
&q=summary:("my item"^5 my item)
Can you test this query? I think this is what you are looking for.

Just for your information: for multi-valued fields there is a 
positionIncrementGap value that puts space between the multiple values of a 
field on the same document, with the purpose of preventing false phrase matches 
across values.


  


Realtime search and facets with very frequent commits

2010-02-11 Thread Janne Majaranta
Hello,

I have a log-search-like application which requires indexed log events to be
searchable within a minute,
and uses facets and the StatsComponent.

Some stats:
- The log events are indexed every 10 seconds with a "commitWithin" of 60
seconds.
- 1M events / day (~75% are updates to previous events).
- Faceting over 14 fields ( strings ). Usually TOP5 by numdocs but facets
for all 14 fields at the same time.
- Heavy use of StatsComponent ( stats over facets of ~36M documents ).


The application is running a single Solr instance. All updates and queries
are sent to the same instance.
Faceting and the StatsComponent are both amazingly fast with that amount of
documents *when* the caches are warm.

The problem I'm now facing is that keeping the caches warm is too heavy
compared to the frequency of updates.
It takes over 60 seconds to warmup the caches to the level where facets and
stats are returned in milliseconds.

I have tested putting a second solr instance on the same server and sending
the updates to that new instance.
Warming up the new small instance is very fast while the large instance has
very hot caches.

I also put a third (empty) solr instance on the same server which passes the
queries to the two instances with the
"shards" parameters. This is mainly because the client app really doesn't
have to know anything about the shards.

The setup was easy to configure and responses are back in milliseconds and
the updates are visible in seconds.
That is, responses in milliseconds over 40M documents and an update frequency
of 15 seconds on a single physical server.
The (lab) server has 16g RAM and it is running Win2k3.

Also, what I found out is that using the sharded setup I only need half the
memory for the large instance.
When indexing to the large instance the memory usage goes very fast up to
the maximum allocated heap size and never goes down.

My question is, is there a magic switch in SOLR to have that kind of update
frequency while having the caches on fire ?
Or is it just impossible to achieve facet counts and queries in milliseconds
while updating the index every minute ?

The second question is, the setup with an empty SOLR as a "coordinating"
instance, a large SOLR instance with hot caches and a small SOLR instance
with immediate updates,
all on the same physical server, does it sound like a durable solution
(until the small instance gets big) or is it something braindead ?

And the third question is, would it be a good idea to merge the small and
the large index periodically so that a fresh and empty small instance would
be available
after the merge ?

Any ideas ?

Best Regards,

Janne Majaranta


Re: help with facets and searchable fields

2010-02-11 Thread Jan Høydahl / Cominvent
Can you show us your field definitions and the exact query string you are 
using, and what you expect to see?

--
Jan Høydahl  - search architect
Cominvent AS - www.cominvent.com

On 11. feb. 2010, at 15.31, adeelmahmood wrote:

> 
> hi there
> i am trying to get familiar with solr while setting it up on my local pc and
> indexing and retrieving some sample data .. a couple of things i am having
> trouble with
> 1 - in my schema if i dont use the copyField to copy data from some fields
> to the text field .. they are not searchable .. so if i just have an id and
> title field with searchable and sortable attributes set to true .. i cant
> search for them but as soon as I add them to the 'text' field with the
> copyField functionality .. they become searchable .. 
> 
> 2 - i have another field called category which can be something like
> category 1, category 2, ...
> again same idea here .. cateogory field by itself isnt searchable until i
> copy it into the text field .. and as i understand all data is simply
> appended in the text field .. so after that if i search for some field with
> title matching to my query and then if i facet with the category field ..
> and if my result lets say was in category 1 .. the facet count is then
> returned as 2 .. with Category being the first thing and 2 being the second
> thing .. how can i make it so that it considers 'Category 2' as a single
> thing
> 
> any help is appreciated
> -- 
> View this message in context: 
> http://old.nabble.com/help-with-facets-and-searchable-fields-tp27545136p27545136.html
> Sent from the Solr - User mailing list archive at Nabble.com.
> 



RE: The Riddle of the Underscore and the Dollar Sign . . .

2010-02-11 Thread Ahmet Arslan

> Unfortunately, the underscore is
> being quite resilient =(
> 
> I tried the solr.MappingCharFilterFactory and know the
> mapping is working as
> I am changing "c" => "q" just fine. But the underscore
> refuses to go!
> 
> I am baffled . . .

I just activated the "textCharNorm" fieldtype in the example schema.xml and added 
"_" => "xxx" to mapping-ISOLatin1Accent.txt.
I verified from http://localhost:8983/solr/admin/analysis.jsp that replacement 
is done without problems. Can you also test analysis.jsp?

Maybe your documents have underscores with different Unicode values. I know 
three different Unicode characters that all look like "-".
If that's the case you need to find their Unicode values and write them into 
mappings.txt.
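
For reference, the textCharNorm type in the example schema is roughly the
following (quoted from memory, so treat it as a sketch):

<fieldType name="textCharNorm" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  </analyzer>
</fieldType>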



  


Has anyone done request logging with Solr-Ruby for use in Rails?

2010-02-11 Thread Ian Connor
The idea is that in the log is currently like:

Completed in 1290ms (View: 152, DB: 75) | 200 OK [
http://localhost:3000/search?q=nik+gene+cluster&view=2]

I want to extend it to also track the Solr query times and time spent in
solr-ruby like:

Completed in 1290ms (View: 152, DB: 75, Solr: 334) | 200 OK [
http://localhost:3000/search?q=nik+gene+cluster&view=2]

Has anyone done such a plug-in or extension already?

-- 
Regards,

Ian Connor


Re: Has anyone done request logging with Solr-Ruby for use in Rails?

2010-02-11 Thread Mat Brown
On Thu, Feb 11, 2010 at 13:07, Ian Connor  wrote:
> The idea is that in the log is currently like:
>
> Completed in 1290ms (View: 152, DB: 75) | 200 OK [
> http://localhost:3000/search?q=nik+gene+cluster&view=2]
>
> I want to extend it to also track the Solr query times and time spent in
> solr-ruby like:
>
> Completed in 1290ms (View: 152, DB: 75, Solr: 334) | 200 OK [
> http://localhost:3000/search?q=nik+gene+cluster&view=2]
>
> Has anyone done such a plug-in or extension already?
>
> --
> Regards,
>
> Ian Connor
>

Here's a module in Sunspot::Rails that does that. It's written against
RSolr, which is an alternative to solr-ruby, but the concept is the
same:
http://github.com/outoftime/sunspot/blob/master/sunspot_rails/lib/sunspot/rails/solr_logging.rb


Re: How to add SpellCheckResponse to Solritas?

2010-02-11 Thread Erik Hatcher


On Feb 11, 2010, at 12:21 PM, Jan Høydahl / Cominvent wrote:
With that in place, I could use  
$response.response.spellcheck.suggestions.collation (no idea why I  
needed $response.response?) to pick up the spellcheck.


$response.response is needed in the Velocity templates because  
$response is a SolrResponse (most often QueryResponse).  To get just  
the response you have to navigate another level down,  
QueryResponse.getResponse().


Making these context variables a bit less confusing has been something  
I've pondered for a while.  Suggestions welcome for improvement.



Perhaps I should submit this as my first patch to the Solr project :)


Not a bad idea!  Contributions more than welcome.

Erik



"Overwriting" cores with the same core name

2010-02-11 Thread Thomas Koch
Hi,

I'm currently evaluating the following solution: My crawler sends all docs to 
a SOLR core named "WHATEVER". Every 5 minutes a new SOLR core with the same 
name WHATEVER is created, but with a new datadir. The datadir contains a 
timestamp in its name.
Now I can check for datadirs that are older than the newest one, and all these 
can be picked up for submission to katta.
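
For reference, creating the new core over the existing name is a CoreAdmin
call along these lines (paths and timestamp format are made up here, and it
assumes your Solr version supports the dataDir parameter on CREATE):

http://localhost:8983/solr/admin/cores?action=CREATE&name=WHATEVER&instanceDir=whatever&dataDir=data-20100211-1205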

Now there remain two questions:

- When the old core is closed, will there be an implicit commit?
- How to be sure, that no more work is in progress on an old core datadir?

Thanks,

Thomas Koch, http://www.koch.ro


Re: Has anyone done request logging with Solr-Ruby for use in Rails?

2010-02-11 Thread Ian Connor
This seems to allow you to log each query - which is a good start.

I was thinking of something that would add all the ms together and report it
in the "completed at" line so you can get a higher level view of which
requests take the time and where.

Ian.

On Thu, Feb 11, 2010 at 1:13 PM, Mat Brown  wrote:

> On Thu, Feb 11, 2010 at 13:07, Ian Connor  wrote:
> > The idea is that in the log is currently like:
> >
> > Completed in 1290ms (View: 152, DB: 75) | 200 OK [
> > http://localhost:3000/search?q=nik+gene+cluster&view=2]
> >
> > I want to extend it to also track the Solr query times and time spent in
> > solr-ruby like:
> >
> > Completed in 1290ms (View: 152, DB: 75, Solr: 334) | 200 OK [
> > http://localhost:3000/search?q=nik+gene+cluster&view=2]
> >
> > Has anyone done such a plug-in or extension already?
> >
> > --
> > Regards,
> >
> > Ian Connor
> >
>
> Here's a module in Sunspot::Rails that does that. It's written against
> RSolr, which is an alternative to solr-ruby, but the concept is the
> same:
>
> http://github.com/outoftime/sunspot/blob/master/sunspot_rails/lib/sunspot/rails/solr_logging.rb
>



-- 
Regards,

Ian Connor
1 Leighton St #723
Cambridge, MA 02141
Call Center Phone: +1 (714) 239 3875 (24 hrs)
Fax: +1(770) 818 5697
Skype: ian.connor


Re: Has anyone done request logging with Solr-Ruby for use in Rails?

2010-02-11 Thread Mat Brown
Oh - indeed - sorry, didn't read your email closely enough : )

Yeah that would probably involve some pretty crufty monkey patching /
use of globals...

On Thu, Feb 11, 2010 at 13:22, Ian Connor  wrote:
> This seems to allow you to log each query - which is a good start.
>
> I was thinking of something that would add all the ms together and report it
> in the "completed at" line so you can get a higher level view of which
> requests take the time and where.
>
> Ian.
>
> On Thu, Feb 11, 2010 at 1:13 PM, Mat Brown  wrote:
>
>> On Thu, Feb 11, 2010 at 13:07, Ian Connor  wrote:
>> > The idea is that in the log is currently like:
>> >
>> > Completed in 1290ms (View: 152, DB: 75) | 200 OK [
>> > http://localhost:3000/search?q=nik+gene+cluster&view=2]
>> >
>> > I want to extend it to also track the Solr query times and time spent in
>> > solr-ruby like:
>> >
>> > Completed in 1290ms (View: 152, DB: 75, Solr: 334) | 200 OK [
>> > http://localhost:3000/search?q=nik+gene+cluster&view=2]
>> >
>> > Has anyone done such a plug-in or extension already?
>> >
>> > --
>> > Regards,
>> >
>> > Ian Connor
>> >
>>
>> Here's a module in Sunspot::Rails that does that. It's written against
>> RSolr, which is an alternative to solr-ruby, but the concept is the
>> same:
>>
>> http://github.com/outoftime/sunspot/blob/master/sunspot_rails/lib/sunspot/rails/solr_logging.rb
>>
>
>
>
> --
> Regards,
>
> Ian Connor
> 1 Leighton St #723
> Cambridge, MA 02141
> Call Center Phone: +1 (714) 239 3875 (24 hrs)
> Fax: +1(770) 818 5697
> Skype: ian.connor
>


Re: Has anyone done request logging with Solr-Ruby for use in Rails?

2010-02-11 Thread Ian Connor
...and probably break stuff - that might be why it hasn't been done.

On Thu, Feb 11, 2010 at 1:28 PM, Mat Brown  wrote:

> Oh - indeed - sorry, didn't read your email closely enough : )
>
> Yeah that would probably involve some pretty crufty monkey patching /
> use of globals...
>
> On Thu, Feb 11, 2010 at 13:22, Ian Connor  wrote:
> > This seems to allow you to log each query - which is a good start.
> >
> > I was thinking of something that would add all the ms together and report
> it
> > in the "completed at" line so you can get a higher level view of which
> > requests take the time and where.
> >
> > Ian.
> >
> > On Thu, Feb 11, 2010 at 1:13 PM, Mat Brown  wrote:
> >
> >> On Thu, Feb 11, 2010 at 13:07, Ian Connor  wrote:
> >> > The idea is that in the log is currently like:
> >> >
> >> > Completed in 1290ms (View: 152, DB: 75) | 200 OK [
> >> > http://localhost:3000/search?q=nik+gene+cluster&view=2]
> >> >
> >> > I want to extend it to also track the Solr query times and time spent
> in
> >> > solr-ruby like:
> >> >
> >> > Completed in 1290ms (View: 152, DB: 75, Solr: 334) | 200 OK [
> >> > http://localhost:3000/search?q=nik+gene+cluster&view=2]
> >> >
> >> > Has anyone done such a plug-in or extension already?
> >> >
> >> > --
> >> > Regards,
> >> >
> >> > Ian Connor
> >> >
> >>
> >> Here's a module in Sunspot::Rails that does that. It's written against
> >> RSolr, which is an alternative to solr-ruby, but the concept is the
> >> same:
> >>
> >>
> http://github.com/outoftime/sunspot/blob/master/sunspot_rails/lib/sunspot/rails/solr_logging.rb
> >>
> >
> >
> >
> > --
> > Regards,
> >
> > Ian Connor
> > 1 Leighton St #723
> > Cambridge, MA 02141
> > Call Center Phone: +1 (714) 239 3875 (24 hrs)
> > Fax: +1(770) 818 5697
> > Skype: ian.connor
> >
>



-- 
Regards,

Ian Connor
1 Leighton St #723
Cambridge, MA 02141
Call Center Phone: +1 (714) 239 3875 (24 hrs)
Fax: +1(770) 818 5697
Skype: ian.connor


Re: delete via DIH

2010-02-11 Thread Lukas Kahwe Smith

On 10.02.2010, at 16:41, Lukas Kahwe Smith wrote:

> There is a solution to update via DIH, but is there also a way to define a 
> query that fetches id's for documents that should be removed?


Or to phrase the question a bit more openly: I have a file with id's of documents 
to delete (one per line). Actually some of the id's need to get a constant 
offset added to them. My plan was to load them into the DB, because that is 
what I do with the rest of the data (which I need to "massage" a bit before 
sending it to Solr). I could also do the offset computing in the middleware and 
dump out a CSV file or whatever. Should I rather load the list of id's into my 
middleware and then issue single or bulk lists of documents to delete? In case 
the list of items to delete is really long, should I chunk the lists in the 
middleware and issue several separate delete commands?

regards,
Lukas Kahwe Smith
m...@pooteeweet.org





Re: Realtime search and facets with very frequent commits

2010-02-11 Thread Jason Rutherglen
Janne,

I usually just turn the caches nearly off for frequent commits.

Jason

On Thu, Feb 11, 2010 at 9:35 AM, Janne Majaranta
 wrote:
> Hello,
>
> I have a log search like application which requires indexed log events to be
> searchable within a minute
> and uses facets and the statscomponent.
>
> Some stats:
> - The log events are indexed every 10 seconds with a "commitWithin" of 60
> seconds.
> - 1M events / day (~75% are updates to previous events).
> - Faceting over 14 fields ( strings ). Usually TOP5 by numdocs but facets
> for all 14 fields at the same time.
> - Heavy use of StatsComponent ( stats over facets of ~36M documents ).
>
>
> The application is running a single Solr instance. All updates and queries
> are sent to the same instance.
> Faceting and the StatsComponent are both amazingly fast with that amount of
> documents *when* the caches are warm.
>
> The problem I'm now facing is that keeping the caches warm is too heavy
> compared to the frequency of updates.
> It takes over 60 seconds to warmup the caches to the level where facets and
> stats are returned in milliseconds.
>
> I have tested putting a second solr instance on the same server and sending
> the updates to that new instance.
> Warming up the new small instance is very fast while the large instance has
> very hot caches.
>
> I also put a third (empty) solr instance on the same server which passes the
> queries to the two instances with the
> "shards" parameters. This is mainly because the client app really doesn't
> have to know anything about the shards.
>
> The setup was easy to configure and responses are back in milliseconds and
> the updates are visible in seconds.
> That is, responses in milliseconds over 40M documents and a update frequency
> of 15 seconds on a single physical server.
> The (lab) server has 16g RAM and it is running Win2k3.
>
> Also, what I found out is that using the sharded setup I only need half the
> memory for the large instance.
> When indexing to the large instance the memory usage goes very fast up to
> the maximum allocated heap size and never goes down.
>
> My question is, is there a magic switch in SOLR to have that kind of update
> frequency while having the caches on fire ?
> Or is it just impossible to achieve facet counts and queries in milliseconds
> while updating the index every minute ?
>
> The second question is, the setup with a empty SOLR as a "coordinating"
> instance, a large SOLR instance with hot caches and a small SOLR instance
> with immediate updates,
> all on the same physical server, does it sound like a durable solution
> > (until the small instance gets big) or is it something braindead ?
>
> And the third question is, would it be a good idea to merge the small and
> the large index periodically so that a fresh and empty small instance would
> be available
> after the merge ?
>
> Any ideas ?
>
> Best Regards,
>
> Janne Majaranta
>


Re: Solr/Drupal Integration - Query Question

2010-02-11 Thread jaybytez

So I got it to work by running the drupal cron.php.

I was originally trying to use the exampledocs, indexing that content, and
making that index available to the Drupal solr.

But it might just be that they are different indexes? And that's why I
wasn't getting responses.

One quick question, the Drupal/Solr Facets are awesome, the only thing is
the URLs are escaped and seem to cause problems when I click the link.  Is
this most likely an encoding issue or something in Solr that is causing
these links to be created poorly?

For instance:

http://localhost:8080/search/apachesolr_search/drupal?filters=tid%3A1%20tid%3A3%20%28nodeaccess_all%3A0%20OR%20hash%3Ac13a544eb3ac%29

This returns no results and produces the following error in Solr (is this
error related to http://issues.apache.org/jira/browse/SOLR-1231):

Feb 11, 2010 12:18:58 PM org.apache.solr.common.SolrException log
SEVERE: org.apache.solr.common.SolrException:
org.apache.lucene.queryParser.ParseException: Cannot parse
'hash:c13a544eb3ac)': Encountered " ")" ") "" at line 1, column 17.
Was expecting one of:

<AND> ...
<OR> ...
<NOT> ...
"+" ...
"-" ...
"(" ...
"*" ...
"^" ...
<QUOTED> ...
<TERM> ...
<FUZZY_SLOP> ...
<PREFIXTERM> ...
<WILDTERM> ...
"[" ...
"{" ...
<NUMBER> ...

at
org.apache.solr.handler.component.QueryComponent.prepare(QueryComponent.java:108)
at
org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:174)
at
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
at
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
at
org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1089)
at
org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:365)
at
org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
at
org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
at
org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:712)
at
org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:405)
at
org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:211)
at
org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
at
org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:139)
at org.mortbay.jetty.Server.handle(Server.java:285)
at
org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:502)
at
org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:821)
at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:513)
at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:208)
at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:378)
at
org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:226)
at
org.mortbay.thread.BoundedThreadPool$PoolThread.run(BoundedThreadPool.java:442)
Caused by: org.apache.lucene.queryParser.ParseException: Cannot parse
'hash:c13a544eb3ac)': Encountered " ")" ") "" at line 1, column 17.
Was expecting one of:

<AND> ...
<OR> ...
<NOT> ...
"+" ...
"-" ...
"(" ...
"*" ...
"^" ...
<QUOTED> ...
<TERM> ...
<FUZZY_SLOP> ...
<PREFIXTERM> ...
<WILDTERM> ...
"[" ...
"{" ...
<NUMBER> ...

at
org.apache.lucene.queryParser.QueryParser.parse(QueryParser.java:205)
at
org.apache.solr.search.LuceneQParser.parse(LuceneQParserPlugin.java:78)
at org.apache.solr.search.QParser.getQuery(QParser.java:131)
at
org.apache.solr.handler.component.QueryComponent.prepare(QueryComponent.java:103)
... 22 more
Caused by: org.apache.lucene.queryParser.ParseException: Encountered " ")"
") "" at line 1, column 17.
Was expecting one of:

<AND> ...
<OR> ...
<NOT> ...
"+" ...
"-" ...
"(" ...
"*" ...
"^" ...
<QUOTED> ...
<TERM> ...
<FUZZY_SLOP> ...
<PREFIXTERM> ...
<WILDTERM> ...
"[" ...
"{" ...
<NUMBER> ...

at
org.apache.lucene.queryParser.QueryParser.generateParseException(QueryParser.java:1846)
at
org.apache.lucene.queryParser.QueryParser.jj_consume_token(QueryParser.java:1728)
at
org.apache.lucene.queryParser.QueryParser.TopLevelQuery(QueryParser.java:1255)
at
org.apache.lucene.queryParser.QueryParser.parse(QueryParser.java:200)
... 25 more

Feb 11, 2010 12:18:58 PM org.apache.solr.core.SolrCore execute
INFO: [] webapp=/solr path=/select
params={spellcheck=true&f.changed.facet.date.start=2010-02-09T07:01:14Z/HOUR&facet=true&facet.limit=20&spellcheck.q=drupal&hl.simple.pre=&hl=&version=1.2&fl=id,nid,title,comment_count,type,created,changed,score,path,url,uid,name,ss_image_relative&f.created.facet.date.start=2010-02-09T07:01:14Z/HOUR&bf=reci

Re: Realtime search and facets with very frequent commits

2010-02-11 Thread Janne Majaranta
Hey Jason,

Do you use faceting with frequent commits ?
And by turning off the caches you mean setting autowarmcount to zero ?

I did try to turn off autowarming with a 36M documents instance but getting
facets over those documents takes over 10 seconds.
With a warm cache it takes 200ms ...

-Janne


2010/2/11 Jason Rutherglen 

> Janne,
>
> I usually just turn the caches nearly off for frequent commits.
>
> Jason
>
> On Thu, Feb 11, 2010 at 9:35 AM, Janne Majaranta
>  wrote:
> > Hello,
> >
> > I have a log search like application which requires indexed log events to
> be
> > searchable within a minute
> > and uses facets and the statscomponent.
> >
> > Some stats:
> > - The log events are indexed every 10 seconds with a "commitWithin" of 60
> > seconds.
> > - 1M events / day (~75% are updates to previous events).
> > - Faceting over 14 fields ( strings ). Usually TOP5 by numdocs but facets
> > for all 14 fields at the same time.
> > - Heavy use of StatsComponent ( stats over facets of ~36M documents ).
> >
> >
> > The application is running a single Solr instance. All updates and
> queries
> > are sent to the same instance.
> > Faceting and the StatsComponent are both amazingly fast with that amount
> of
> > documents *when* the caches are warm.
> >
> > The problem I'm now facing is that keeping the caches warm is too heavy
> > compared to the frequency of updates.
> > It takes over 60 seconds to warmup the caches to the level where facets
> and
> > stats are returned in milliseconds.
> >
> > I have tested putting a second solr instance on the same server and
> sending
> > the updates to that new instance.
> > Warming up the new small instance is very fast while the large instance
> has
> > very hot caches.
> >
> > I also put a third (empty) solr instance on the same server which passes
> the
> > queries to the two instances with the
> > "shards" parameters. This is mainly because the client app really doesn't
> > have to know anything about the shards.
> >
> > The setup was easy to configure and responses are back in milliseconds
> and
> > the updates are visible in seconds.
> > That is, responses in milliseconds over 40M documents and a update
> frequency
> > of 15 seconds on a single physical server.
> > The (lab) server has 16g RAM and it is running Win2k3.
> >
> > Also, what I found out is that using the sharded setup I only need half
> the
> > memory for the large instance.
> > When indexing to the large instance the memory usage goes very fast up to
> > the maximum allocated heap size and never goes down.
> >
> > My question is, is there a magic switch in SOLR to have that kind of
> update
> > frequency while having the caches on fire ?
> > Or is it just impossible to achieve facet counts and queries in
> milliseconds
> > while updating the index every minute ?
> >
> > The second question is, the setup with a empty SOLR as a "coordinating"
> > instance, a large SOLR instance with hot caches and a small SOLR instance
> > with immediate updates,
> > all on the same physical server, does it sound like a durable solution
> > (until the small instance gets big) or is it something braindead ?
> >
> > And the third question is, would it be a good idea to merge the small and
> > the large index periodically so that a fresh and empty small instance
> would
> > be available
> > after the merge ?
> >
> > Any ideas ?
> >
> > Best Regards,
> >
> > Janne Majaranta
> >
>


Re: Realtime search and facets with very frequent commits

2010-02-11 Thread Yonik Seeley
On Thu, Feb 11, 2010 at 3:21 PM, Janne Majaranta
 wrote:
> Hey Jason,
>
> Do you use faceting with frequent commits ?
> And by turning off the caches you mean setting autowarmcount to zero ?
>
> I did try to turn off autowarming with a 36M documents instance but getting
> facets over those documents takes over 10 seconds.
> With a warm cache it takes 200ms ...

You can turn off autowarming and do a single static warming query that
does the typical facet request.
If that takes 10 seconds to execute (and populates the caches in the
meantime), you can still commit every minute (or better, use
commitWithin when updating to prevent unnecessary commits).
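
Such a static warming query goes into solrconfig.xml as a listener; a sketch
(the field names here are hypothetical):

<listener event="newSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <lst>
      <str name="q">*:*</str>
      <str name="facet">true</str>
      <str name="facet.field">field1</str>
      <str name="facet.field">field2</str>
    </lst>
  </arr>
</listener>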

-Yonik
http://www.lucidimagination.com


Re: Dynamic fields with more than 100 fields inside

2010-02-11 Thread gdeconto


Xavier Schepler wrote:
> 
> for example, "concept_user_*", and I will have maybe more than 200 users 
> using this feature.
> 

I've done tests with many hundred dynamically created fields (ie foo_1 thru
foo_400).  Generally speaking, I haven't noticed any performance
issues from having that many fields.

The only exception: I have noticed performance issues if you try to query 
large numbers of fields (ie q=(foo_1:123 AND foo_2:234 ... AND foo_400:912)).
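
For reference, such fields come from a single dynamicField declaration in
schema.xml, something like this (the attributes are an assumption):

<dynamicField name="foo_*" type="text" indexed="true" stored="false"/>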
-- 
View this message in context: 
http://old.nabble.com/Dynamic-fields-with-more-than-100-fields-inside-tp27502271p27554402.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Dynamic fields with more than 100 fields inside

2010-02-11 Thread Mat Brown
On Thu, Feb 11, 2010 at 15:41, gdeconto  wrote:
>
>
> Xavier Schepler wrote:
>>
>> for example, "concept_user_*", and I will have maybe more than 200 users
>> using this feature.
>>
>
> I've done tests with many hundred dynamically created fields (ie foo_1 thru
> foo_400).  Generally speaking, I haven't noticed any performance
> issues from having that many fields
>
> the only exception: I have noticed performance issues if you try to query
> large numbers of fields (ie q=(foo_1:123 AND foo_2:234 ... AND foo_400:912)
> --
> View this message in context: 
> http://old.nabble.com/Dynamic-fields-with-more-than-100-fields-inside-tp27502271p27554402.html
> Sent from the Solr - User mailing list archive at Nabble.com.

Since Lucene is schema-less, my assumption would be that using dynamic
fields wouldn't have any effect on performance within the search stack
itself - presumably the only penalty would be matching a given field
name against the appropriate dynamic field (which I would assume would
be cached/memoized).

Lots of assumptions there, of course. Feel free to correct me if I'm wrong.


Re: Realtime search and facets with very frequent commits

2010-02-11 Thread Otis Gospodnetic
Janne,

The answers to your last 2 questions are both yes.  I've seen that done a few 
times and it works.  I don't have the answer to the always-hot cache question.


Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Hadoop ecosystem search :: http://search-hadoop.com/



- Original Message 
> From: Janne Majaranta 
> To: solr-user@lucene.apache.org
> Sent: Thu, February 11, 2010 12:35:20 PM
> Subject: Realtime search and facets with very frequent commits
> 
> Hello,
> 
> I have a log search like application which requires indexed log events to be
> searchable within a minute
> and uses facets and the statscomponent.
> 
> Some stats:
> - The log events are indexed every 10 seconds with a "commitWithin" of 60
> seconds.
> - 1M events / day (~75% are updates to previous events).
> - Faceting over 14 fields ( strings ). Usually TOP5 by numdocs but facets
> for all 14 fields at the same time.
> - Heavy use of StatsComponent ( stats over facets of ~36M documents ).
> 
> 
> The application is running a single Solr instance. All updates and queries
> are sent to the same instance.
> Faceting and the StatsComponent are both amazingly fast with that amount of
> documents *when* the caches are warm.
> 
> The problem I'm now facing is that keeping the caches warm is too heavy
> compared to the frequency of updates.
> It takes over 60 seconds to warmup the caches to the level where facets and
> stats are returned in milliseconds.
> 
> I have tested putting a second solr instance on the same server and sending
> the updates to that new instance.
> Warming up the new small instance is very fast while the large instance has
> very hot caches.
> 
> I also put a third (empty) solr instance on the same server which passes the
> queries to the two instances with the
> "shards" parameters. This is mainly because the client app really doesn't
> have to know anything about the shards.
> 
> The setup was easy to configure and responses are back in milliseconds and
> the updates are visible in seconds.
> That is, responses in milliseconds over 40M documents and an update frequency
> of 15 seconds on a single physical server.
> The (lab) server has 16g RAM and it is running win2k3.
> 
> Also, what I found out is that using the sharded setup I only need half the
> memory for the large instance.
> When indexing to the large instance the memory usage goes very fast up to
> the maximum allocated heap size and never goes down.
> 
> My question is, is there a magic switch in SOLR to have that kind of update
> frequency while having the caches on fire ?
> Or is it just impossible to achieve facet counts and queries in milliseconds
> while updating the index every minute ?
> 
> The second question is, the setup with a empty SOLR as a "coordinating"
> instance, a large SOLR instance with hot caches and a small SOLR instance
> with immediate updates,
> all on the same physical server, does it sound like a durable solution
> (until the small instance gets big) or is it something braindead?
> 
> And the third question is, would it be a good idea to merge the small and
> the large index periodically so that a fresh and empty small instance would
> be available
> after the merge ?
> 
> Any ideas ?
> 
> Best Regards,
> 
> Janne Majaranta
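
For reference, the "coordinating instance" setup described above is driven
entirely by the shards request parameter; a sketch, with illustrative
host/core locations:

  http://localhost:8983/solr/select?q=type:error&shards=localhost:8983/solr/big,localhost:8984/solr/small

The empty instance merges the responses from the large and small instances,
so the client only ever sees one endpoint.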



Re: dismax and multi-language corpus

2010-02-11 Thread Otis Gospodnetic
I don't know, but the other day I did see an NPE related to fields with '-'.  In 
the Distributed Search context at least, fields with '-' were causing an NPE.


Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Hadoop ecosystem search :: http://search-hadoop.com/



- Original Message 
> From: Jason Rutherglen 
> To: solr-user@lucene.apache.org
> Sent: Thu, February 11, 2010 12:36:00 AM
> Subject: Re: dismax and multi-language corpus
> 
> > Claudio - fields with '-' in them can be problematic.
> 
> Why's that?
> 
> On Wed, Feb 10, 2010 at 2:38 PM, Otis Gospodnetic
> wrote:
> > Claudio - fields with '-' in them can be problematic.
> >
> > Side comment: do you really want to search across all languages at once?  
> > If 
> not, maybe 3 different dismax configs would make your searches better.
> >
> >  Otis
> > 
> > Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> > Hadoop ecosystem search :: http://search-hadoop.com/
> >
> >
> >
> > - Original Message 
> >> From: Claudio Martella 
> >> To: solr-user@lucene.apache.org
> >> Sent: Wed, February 10, 2010 3:15:40 PM
> >> Subject: dismax and multi-language corpus
> >>
> >> Hello list,
> >>
>> I have a corpus with 3 languages, so I set up a text content field (with
>> no stemming) and 3 text-[en|it|de] fields with specific snowball stemmers.
>> I copyField the text to my language-aware fields. So, I set up this dismax
>> searchHandler:
> >>
> >>
> >>
> >>   dismax
> >>   title^1.2 content-en^0.8 content-it^0.8
> >> content-de^0.8
> >>   title^1.2 content-en^0.8 content-it^0.8
> >> content-de^0.8
> >>   title^1.2 content-en^0.8 content-it^0.8
> >> content-de^0.8
> >>   0.1
> >>
> >>
> >>
> >>
> >> but i get this error:
> >>
> >> HTTP Status 400 - org.apache.lucene.queryParser.ParseException: Expected
> >> ',' at position 7 in 'content-en'
> >>
> >> type Status report
> >>
> >> message org.apache.lucene.queryParser.ParseException: Expected ',' at
> >> position 7 in 'content-en'
> >>
> >> description The request sent by the client was syntactically incorrect
> >> (org.apache.lucene.queryParser.ParseException: Expected ',' at position
> >> 7 in 'content-en').
> >>
> >> Any idea?
> >>
> >> TIA
> >>
> >> Claudio
> >>
> >> --
> >> Claudio Martella
> >> Digital Technologies
> >> Unit Research & Development - Analyst
> >>
> >> TIS innovation park
> >> Via Siemens 19 | Siemensstr. 19
> >> 39100 Bolzano | 39100 Bozen
> >> Tel. +39 0471 068 123
> >> Fax  +39 0471 068 129
> >> claudio.marte...@tis.bz.it http://www.tis.bz.it
> >>
> >
> >



Re: Realtime search and facets with very frequent commits

2010-02-11 Thread Janne Majaranta
Ok,

Thanks Yonik and Otis.
I already had static warming queries with facets turned on and autowarming
at zero.
There were a lot of other optimizations after that however, so I'll try with
zero autowarming and static warming queries again.

If that doesn't work, I'll go with 3 instances on the same server.

BTW, does it sound normal that, when running updates every minute to a
36M-document index, it takes all the available heap size after about 5 commits,
although not a single query has been executed against the index and autowarming
is set to zero? Just curious.

-Janne
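
For reference, static warming queries of that kind live in solrconfig.xml as a
newSearcher listener; a minimal sketch, with an illustrative facet field:

  <listener event="newSearcher" class="solr.QuerySenderListener">
    <arr name="queries">
      <lst>
        <str name="q">*:*</str>
        <str name="rows">0</str>
        <str name="facet">true</str>
        <str name="facet.field">host</str>
      </lst>
    </arr>
  </listener>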


2010/2/11 Otis Gospodnetic 

> Janne,
>
> The answers to your last 2 questions are both yes.  I've seen that done a
> few times and it works.  I don't have the answer to the always-hot cache
> question.
>
>
> Otis
> 
> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> Hadoop ecosystem search :: http://search-hadoop.com/
>
>
>


Re: sorting

2010-02-11 Thread Otis Gospodnetic
Claudio,

If I understand correctly, the problem is that you are trying to sort on a 
tokenized text field.  That won't work, and for something like a "content" field 
that corresponds to the content of a web page, it doesn't even make much sense.

What you may want to do is create another *string* field and in it put only the 
first N characters from a field like content.  Then sort on that string 
(untokenized) field.  If N is large enough, you should achieve the same effect 
as sorting on the content field.
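
A minimal sketch of that setup in schema.xml - the field names are
illustrative, and copyField's maxChars attribute does the truncation:

  <field name="content_sort" type="string" indexed="true" stored="false"/>
  <copyField source="content" dest="content_sort" maxChars="100"/>

Sorting is then sort=content_sort asc.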


Ciao,
Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Hadoop ecosystem search :: http://search-hadoop.com/



- Original Message 
> From: Claudio Martella 
> To: solr-user@lucene.apache.org
> Sent: Thu, February 11, 2010 4:17:58 AM
> Subject: sorting
> 
> Hi,
> 
> i defined a requestHandler like this:
> 
> 
> 
>   dismax
>   title^1.2 contentEN^0.8 contentIT^0.8 contentDE^0.8
>   title^1.2 contentEN^0.8 contentIT^0.8 contentDE^0.8
>   title^1.2 contentEN^0.8 contentIT^0.8 contentDE^0.8
>   0.1
> 
> 
> 
> 
> content* fields are tokenized. The content comes from nutch. As it is
> now, solr is complaining about some sorting issues on content* as they
> are tokenized. From my perspective i have not overridden any scoring or
> ordering. Have I?
> 
> 
> As the content comes from nutch with solrindex, what is the best way of
> integrating result ordering based on the graph-based information, and
> not only on the score based on query/content?
> 
> Thanks
> 
> -- 
> Claudio Martella
> Digital Technologies
> Unit Research & Development - Analyst
> 
> TIS innovation park
> Via Siemens 19 | Siemensstr. 19
> 39100 Bolzano | 39100 Bozen
> Tel. +39 0471 068 123
> Fax  +39 0471 068 129
> claudio.marte...@tis.bz.it http://www.tis.bz.it
> 



Site search upsells & boosting by content type

2010-02-11 Thread Brandon Konkle
Good afternoon!

I was in the IRC room earlier this morning with a problem, and I'm still having 
difficulty with it.  I'm trying to do a site search upsell so that sponsored 
results can be highlighted and boosted to the top of the results. I need to 
have my default operator set to AND, because if it is set to OR I get rather 
unpredictable results.  For example, out of an index of 300k items, a search 
for "old 97's" yields about 116k because they have the words "old" or "97" in 
them.  With the default operator set to AND, I get 54 results, which would be 
the expected behaviour for the user who wants to find articles and events about 
that band.

Unfortunately, I can't boost certain queries with the default operator set to 
AND, because it adds those terms as a required clause to the search.  I need 
the boosted terms to be an optional clause. I'm trying to do what the docs talk 
about here: 
http://wiki.apache.org/solr/SolrRelevancyCookbook#Boosting_Ranking_Terms

So, my example search is "+(old 97's) id:events.event.88468^100" - which should 
search for the old 97's and optionally boost that individual event if it is 
part of the search results. When I run that search with the default operator 
set to AND, it is parsed into '+(+text:old +PhraseQuery(text:"97 s")) 
+id:events.event.88468^100.0' - making the particular event a required 
component of the search and returning only that 1 result.

When I alter my search to "+(old 97's) || id:events.event.88468^100", it parses 
that to "(+text:old +text:"97 s") id:events.event.88468^100.0", which at first 
appeared to do what I wanted.  With just "old 97's", I get 54 results.  With  
"+(old 97's) || id:events.event.88468^100", I get 54 results with that 
particular event on top.  However, if I try to boost another term, such as 
"+(old 97's) || granada^100" - I get over 300 results because it adds in all of 
the matches for the word "granada".  This is not what I want.  Instead of AND 
or OR, I want AND MAYBE.

This is supported by the Xapian backend that I'm switching from, so I'm really 
hoping there's a way to do this in Solr.  Thank you very much for any help you 
can provide!

-Brandon
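
One avenue worth trying, though not confirmed in this thread: dismax's bq
(boost query) parameter adds optional clauses that only influence scoring,
which is effectively AND MAYBE. A sketch, keeping all user terms required:

  q=old 97's&defType=dismax&mm=100%&bq=id:events.event.88468^100

The bq clause boosts matching documents to the top without restricting or
expanding the result set.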

Re: dismax and multi-language corpus

2010-02-11 Thread Otis Gospodnetic
Claudio,

Ah, through multilingual indexing/search work (with 
http://www.sematext.com/products/multilingual-indexer/index.html ) I learned 
that cross-language search often doesn't really make sense, unless the search 
involves "universal terms" (e.g. Fiat, BMW, Mercedes, Olivetti, Tomi de Paola, 
Alberto Tomba...).  If the search involves natural language-specific terms, 
then searching in the "foreign" language doesn't work so well and doesn't make 
a ton of sense.  Imagine a search for "ciao ragazzi".  I have no idea what the 
Italian stemmer does with that, but say it turns it into "cia raga" (it doesn't, 
but just imagine).  If this was done with Italian docs at index time, you will 
find the matching docs.  But what happens if "ciao ragazzi" was analyzed by some 
German analyzer?  Different tokens will be created and indexed, so a "ciao 
ragazzi" search won't work.  And which analyzer would you use to analyze that 
query anyway?  Italian or German?

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Hadoop ecosystem search :: http://search-hadoop.com/



- Original Message 
> From: Claudio Martella 
> To: solr-user@lucene.apache.org
> Sent: Thu, February 11, 2010 3:21:32 AM
> Subject: Re: dismax and multi-language corpus
> 
> I'll try removing the '-'. I do need to search it now. The other option
> would be to ask the user which language to query, but in my region we
> use Italian and German in equal measure, so it would turn out that we
> query both languages all the time anyway. Or did you mean a more performant
> way of querying both languages all the time? :)
> 
> 
> Otis Gospodnetic wrote:
> > Claudio - fields with '-' in them can be problematic.
> >
> > Side comment: do you really want to search across all languages at once?  
> > If 
> not, maybe 3 different dismax configs would make your searches better.
> >
> >  Otis
> > 
> > Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> > Hadoop ecosystem search :: http://search-hadoop.com/
> >
> >
> >
> >
> >
> >  
> 
> 
> -- 
> Claudio Martella
> Digital Technologies
> Unit Research & Development - Analyst
> 
> TIS innovation park
> Via Siemens 19 | Siemensstr. 19
> 39100 Bolzano | 39100 Bozen
> Tel. +39 0471 068 123
> Fax  +39 0471 068 129
> clau

Re: question/suggestion for Solr-236 patch

2010-02-11 Thread Otis Gospodnetic
Gerald,
Your suggestion will likely get lost in the piles of solr-user email.  You 
should add your comments to JIRA-236 directly.

Otis
-
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Hadoop ecosystem search :: http://search-hadoop.com/



- Original Message 
> From: gdeconto 
> To: solr-user@lucene.apache.org
> Sent: Wed, February 10, 2010 11:12:03 AM
> Subject: question/suggestion for Solr-236 patch
> 
> 
> I have been able to apply and use the solr-236 patch (field collapsing)
> successfully.
> 
> Very, very cool and powerful.
> 
> My one comment/concern is that the collapseCount and aggregate function
> values in the collapse_counts list only represent the collapsed documents
> (ie the ones that are not shown in results).
> 
> Are there any plans to include the non-collapsed (?) document in the
> collapseCount and aggregate function values (ie so that it includes ALL
> documents, not just the collapsed ones)?  Possibly via some parameter like
> collapse.includeAll?
> 
> I think this would be a great addition to the collapse code (and solr
> functionality) via what I would think is a small change.
> -- 
> View this message in context: 
> http://old.nabble.com/question-suggestion-for-Solr-236-patch-tp27533695p27533695.html
> Sent from the Solr - User mailing list archive at Nabble.com.



Re: Getting max/min dates from solr index

2010-02-11 Thread Otis Gospodnetic
Mark,

Yes, date faceting will give you that information.  For min/max, use the StatsComponent.  See 
http://www.search-lucene.com/?q=StatsComponent
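
A sketch of both requests (the field name "timestamp" is illustrative; if your
version's StatsComponent doesn't handle date fields, index the date as a
sortable numeric as well):

  http://localhost:8983/solr/select?q=*:*&rows=0&stats=true&stats.field=timestamp
  http://localhost:8983/solr/select?q=*:*&rows=0&facet=true&facet.date=timestamp&facet.date.start=2010-01-01T00:00:00Z&facet.date.end=2011-01-01T00:00:00Z&facet.date.gap=%2B1MONTH

The %2B is a URL-encoded '+'. The second query returns one bucket per month,
which maps directly onto the X axis of the timeline graph.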

 Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Hadoop ecosystem search :: http://search-hadoop.com/



- Original Message 
> From: Mark N 
> To: solr-user@lucene.apache.org
> Sent: Wed, February 10, 2010 8:12:43 AM
> Subject: Getting max/min dates from solr index
> 
> How can we get the max and min date from the Solr index ? I would need these
> dates to draw a graph ( for example timeline graph )
> 
> 
> Also, can we use date faceting to show how many documents are indexed every
> month?
> Consider that I need to draw a timeline graph for the current year to show how
> many records are indexed for every month. So I will have months on the X axis
> and the number of documents on the Y axis.
> 
> What should be the better approach to design a schema to achieve this
> functionality ?
> 
> 
> Any suggestions would be appreciated
> 
> thanks
> 
> 
> -- 
> Nipen Mark



Problem with Spatial Search

2010-02-11 Thread Emad Mushtaq
Hello,

I have a question related to local solr. For certain locations (latitude,
longitude), the spatial search does not work. Here is the query I try to
make, which gives me no results:

q=*&qt=geo&sort=geo_distance asc&lat=33.718151&long=73.060547&radius=450

However, if I make the same query with radius=449, it gives me results.

Here is the part of my solrconfig.xml containing startTier and endTier:

<updateRequestProcessorChain>
  <processor class="com.pjaol.search.solr.update.LocalUpdateProcessorFactory">
    <str name="latField">latitude</str>
    <str name="lngField">longitude</str>
    <int name="startTier">9</int>
    <int name="endTier">17</int>
  </processor>
</updateRequestProcessorChain>

What do I need to do to fix this problem?

Muhammad Emad Mushtaq
http://www.emadmushtaq.com/
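
A guess at the cause, not confirmed in this thread: in LocalSolr the
startTier/endTier range bounds the grid granularities that get indexed, and a
very large radius needs a coarser tier than startTier=9 provides, so the 450
radius falls outside the indexed range. Lowering startTier (and reindexing)
would be the thing to try, e.g.:

  <int name="startTier">5</int>
  <int name="endTier">17</int>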


Re: Collating results from multiple indexes

2010-02-11 Thread Otis Gospodnetic
Minor correction re Attivio - their stuff runs on top of Lucene, not Solr.  I 
*think* they are trying to patent this.

 Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Hadoop ecosystem search :: http://search-hadoop.com/



- Original Message 
> From: Jan Høydahl / Cominvent 
> To: solr-user@lucene.apache.org
> Sent: Mon, February 8, 2010 3:33:41 PM
> Subject: Re: Collating results from multiple indexes
> 
> Hi,
> 
> There is no JOIN functionality in Solr. The common solution is either to 
> accept 
> the high volume update churn, or to add client side code to build a "join" 
> layer 
> on top of the two indices. I know that Attivio (www.attivio.com) have built 
> some 
> kind of JOIN functionality on top of Solr in their AIE product, but do not 
> know 
> the details or the actual performance.
> 
> Why not open a JIRA issue, if there is none already, to request this as a 
> feature?
> 
> --
> Jan Høydahl  - search architect
> Cominvent AS - www.cominvent.com
> 
> On 25. jan. 2010, at 22.01, Aaron McKee wrote:
> 
> > 
> > Is there any somewhat convenient way to collate/integrate fields from 
> > separate 
> indices during result writing, if the indices use the same unique keys? 
> Basically, some sort of cross-index JOIN?
> > 
> > As a bit of background, I have a rather heavyweight dataset of every US 
> business (~25m records, an on-disk index footprint of ~30g, and 5-10 hours to 
fully index on a decent box). Given the size and relative stability of the 
> dataset, I generally only update this monthly. However, I have separate 
> advertising-related datasets that need to be updated either hourly or daily 
> (e.g. today's coupon, click revenue remaining, etc.) . These advertiser feeds 
> reference the same keyspace that I use in the main index, but are otherwise 
> significantly lighter weight. Importing and indexing them discretely only 
> takes 
> a couple minutes. Given that Solr/Lucene doesn't support field updating, 
> without 
> having to drop and re-add an entire document, it doesn't seem practical to 
> integrate this data into the main index (the system would be under a constant 
> state of churn, if we did document re-inserts, and the performance impact 
> would 
> probably be debilitating). It may be nice if this data could participate in 
> filtering (e.g. only show advertisers), but it doesn't need to participate in 
> scoring/ranking.
> > 
> > I'm guessing that someone else has had a similar need, at some point?  I 
> > can 
> have our front-end query the smaller indices separately, using the keys 
> returned 
> by the primary index, but would prefer to avoid the extra sequential 
> roundtrips. 
> I'm hoping to also avoid a coding solution, if only to avoid the maintenance 
> overhead as we drop in new builds of Solr, but that's also feasible.
> > 
> > Thank you for your insight,
> > Aaron
> > 



Re: Faceting

2010-02-11 Thread Otis Gospodnetic
Note that UIMA doesn't do NER itself, but instead relies on 
GATE or OpenNLP or OpenCalais, AFAIK :)

Those interested in UIMA and living close to New York should go to 
http://www.meetup.com/NYC-Search-and-Discovery/calendar/12384559/


Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Hadoop ecosystem search :: http://search-hadoop.com/



- Original Message 
> From: Jan Høydahl / Cominvent 
> To: solr-user@lucene.apache.org
> Sent: Tue, February 9, 2010 9:57:26 AM
> Subject: Re: Faceting
> 
> NOTE: Please start a new email thread for a new topic (See 
> http://en.wikipedia.org/wiki/User:DonDiego/Thread_hijacking)
> 
> Your strategy could work. You might want to look into dedicated entity 
> extraction frameworks like
> http://opennlp.sourceforge.net/
> http://nlp.stanford.edu/software/CRF-NER.shtml
> http://incubator.apache.org/uima/index.html
> 
> Or if that is too much work, look at 
> http://issues.apache.org/jira/browse/SOLR-1725 for a way to plug in your 
> entity 
> extraction code into Solr itself using a scripting language.
> 
> --
> Jan Høydahl  - search architect
> Cominvent AS - www.cominvent.com
> 
> On 5. feb. 2010, at 20.10, José Moreira wrote:
> 
> > Hello,
> > 
> > I'm planning to index a 'content' field for search, and from that
> > field's text content I would like to facet (probably) according to whether
> > the content has e-mails, urls and, within urls, urls to pictures,
> > videos and others.
> > 
> > As I'm a relatively new user to Solr, my plan was to regexp the
> > content in my application and add tags to a Solr field according to
> > the content, so for example the content "m...@email.com
> > http://www.site.com" would have the tags "email, link".
> > 
> > If i follow this path can i then facet on "email" and/or "link" ? For
> > example combining facet field with facet value params?
> > 
> > Best
> > 
> > -- 
> > http://pt.linkedin.com/in/josemoreira
> > josemore...@irc.freenode.net
> > http://djangopeople.net/josemoreira/
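
That tagging plan maps onto a multiValued string field plus ordinary faceting
and filtering; a sketch, with illustrative names:

  <field name="tags" type="string" indexed="true" stored="true" multiValued="true"/>

  http://localhost:8983/solr/select?q=*:*&facet=true&facet.field=tags
  http://localhost:8983/solr/select?q=*:*&fq=tags:email&fq=tags:link

The facet.field call gives counts per tag; multiple fq filters combine facet
values with AND.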



Re: dismax and multi-language corpus

2010-02-11 Thread Sven Maurmann

Hi,

this is correct. Usually one does not know how a stemmer - or
other language-specific filters - behaves in the context of a
foreign language.

But there is an exception that sometimes comes to the rescue:
if one has a stable dictionary of terms in all the languages
of interest, then one might put these terms in a synonym list
and also into a list of protected words for the stemmers. Then
searches for one of those terms in any language will return the
documents regardless of their own language.

Of course this does not solve the general problem of cross-language
search, but it helps in certain circumstances.

Cheers,
   Sven

--On Thursday, 11 February 2010 13:45 -0800 Otis Gospodnetic 
 wrote:



Claudio,

Ah, through multilingual indexing/search work (with
http://www.sematext.com/products/multilingual-indexer/index.html ) I
learned that cross-language search often doesn't really make sense,
unless the search involves "universal terms" (e.g. Fiat, BMW, Mercedes,
Olivetti, Tomi de Paola, Alberto Tomba...).  If the search involves
natural language-specific terms, then searching in the "foreign" language
doesn't work so well and doesn't make a ton of sense.  Imagine a search for
"ciao ragazzi".  I have no idea what the Italian stemmer does with that, but
say it turns it into "cia raga" (it doesn't, but just imagine).  If this
was done with Italian docs at index time, you will find the matching
docs.  But what happens if "ciao ragazzi" was analyzed by some German
analyzer?  Different tokens will be created and indexed, so a "ciao
ragazzi" search won't work.  And which analyzer would you use to analyze
that query anyway?  Italian or German?

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Hadoop ecosystem search :: http://search-hadoop.com/




How to reindex data without restarting server

2010-02-11 Thread Emad Mushtaq
Hi,

I would like to know if there is a way of reindexing data without restarting
the server. Let's say I make a change in the schema file. That would require
me to reindex data. Is there a solution to this?

-- 
Muhammad Emad Mushtaq
http://www.emadmushtaq.com/


Re: How to reindex data without restarting server

2010-02-11 Thread Sven Maurmann

Hi,

restarting the Solr server wouldn't help. If you want to re-index
your data you have to pipe it through the whole process again.

In your case it might be a good idea to consider having several
cores holding the different schema definitions. This will not save
you from getting the original data and doing the analysis once
again, but at least you do not end up with a schema that is
inconsistent with the data in the index.

If you have a way to find and access the original data from the
unique id in your index, you may create a small program that reads
the data belonging to the id and sends it to the new core for
indexing (just rough thoughts, depending on the nature of your
situation).

Cheers,
Sven
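
A minimal solr.xml for the multi-core setup described above (names and paths
illustrative):

  <solr persistent="true">
    <cores adminPath="/admin/cores">
      <core name="schema_v1" instanceDir="schema_v1"/>
      <core name="schema_v2" instanceDir="schema_v2"/>
    </cores>
  </solr>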

--On Friday, 12 February 2010 03:40 +0500 Emad Mushtaq 
 wrote:



Hi,

I would like to know if there is a way of reindexing data without
restarting the server. Lets say I make a change in the schema file. That
would require me to reindex data. Is there a solution to this ?

--
Muhammad Emad Mushtaq
http://www.emadmushtaq.com/


Re: How to reindex data without restarting server

2010-02-11 Thread Emad Mushtaq
Thanks for responding to my question.

Let me just put forward a situation that might arise in the future. I decide to
add a new field to the schema. So if I have understood you correctly, "piping it
through the whole process" would mean that I delete records one by one and
add the same records again. Basically, if I change my schema, the new records
that I add would be indexed against the new schema? And that wouldn't require a
server restart?

On Fri, Feb 12, 2010 at 3:49 AM, Sven Maurmann wrote:

> Hi,
>
> restarting the Solr server wouldn't help. If you want to re-index
> your data you have to pipe it through the whole process again.
>
> In your case it might be a good idea to consider having several
> cores holding the different schema definitions. This will not save
> you from getting the original data and doing the analysis once
> again, but at least you do not have a schema not being consistent
> with the data in the index.
>
> If you have a way to find and access the original data from the
> unique id in your index, you may create a small program that reads
> the data belonging to the id and sends it to the new core for
> indexing (just rough toughts depending of the nature of your
> situation).
>
> Cheers,
>Sven
>
>
> --On Freitag, 12. Februar 2010 03:40 +0500 Emad Mushtaq <
> emad.mush...@sigmatec.com.pk> wrote:
>
>  Hi,
>>
>> I would like to know if there is a way of reindexing data without
>> restarting the server. Lets say I make a change in the schema file. That
>> would require me to reindex data. Is there a solution to this ?
>>
>> --
>> Muhammad Emad Mushtaq
>> http://www.emadmushtaq.com/
>>
>


-- 
Muhammad Emad Mushtaq
http://www.emadmushtaq.com/


Re: sorting

2010-02-11 Thread Claudio Martella
Hi,

thanks for your answer. This is driving me crazy.

No, I did not define any sorting or scoring explicitly, but Solr isn't
working with my requestHandler. It complains about sorting on the
content field.
I agree with you: sorting on content wouldn't make much sense. In my
first post I quoted my requestHandler, and I don't think it overrides the
scoring. I just want the default behavior, which is sorting by score. If I
can - and I should, as I'm using Nutch - I'd add the contribution of link
analysis from Nutch.

I don't understand what is wrong with my dismax requestHandler.


Otis Gospodnetic wrote:
> Claudio,
>
> If I understand correctly, the problem is that you are trying to sort on a 
> tokenized text field.  That won't work and for something like "content" field 
> that corresponds to the content of a web page, it doesn't even make much 
> sense.
>
> What you may want to do is create another *string* field and in it put only the 
> first N characters from the field like content.  Then sort on that string 
> (untokenized) field.  If N is large enough, you should achieve the same 
> effect as sorting on the content field.
>
>
> Ciao,
> Otis
> 
> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> Hadoop ecosystem search :: http://search-hadoop.com/
>
>
>
> - Original Message 
>   
>> From: Claudio Martella 
>> To: solr-user@lucene.apache.org
>> Sent: Thu, February 11, 2010 4:17:58 AM
>> Subject: sorting
>>
>> Hi,
>>
>> i defined a requestHandler like this:
>>
>>
>>
>>   dismax
>>   title^1.2 contentEN^0.8 contentIT^0.8 contentDE^0.8
>>   title^1.2 contentEN^0.8 contentIT^0.8 contentDE^0.8
>>   title^1.2 contentEN^0.8 contentIT^0.8 contentDE^0.8
>>   0.1
>>
>>
>>
>>
>> content* fields are tokenized. The content comes from nutch. As it is
>> now, solr is complaining about some sorting issues on content* as they
>> are tokenized. From my perspective i have not overridden any scoring or
>> ordering. Have I?
>>
>>
>> As the content comes from nutch with solrindex, what is the best way of
>> integrating result ordering based on the graph-based information, and
>> not only on the score based on query/content?
>>
>> Thanks
>>
>> -- 
>> Claudio Martella
>> Digital Technologies
>> Unit Research & Development - Analyst
>>
>> TIS innovation park
>> Via Siemens 19 | Siemensstr. 19
>> 39100 Bolzano | 39100 Bozen
>> Tel. +39 0471 068 123
>> Fax  +39 0471 068 129
>> claudio.marte...@tis.bz.it http://www.tis.bz.it
>>
>> 
>
>
>   


-- 
Claudio Martella
Digital Technologies
Unit Research & Development - Analyst

TIS innovation park
Via Siemens 19 | Siemensstr. 19
39100 Bolzano | 39100 Bozen
Tel. +39 0471 068 123
Fax  +39 0471 068 129
claudio.marte...@tis.bz.it http://www.tis.bz.it





Dismax phrase queries

2010-02-11 Thread Jason Rutherglen
I'd like to boost an exact phrase match such as q="video poker" over
q=video poker.  How would I do this using dismax?

I tried pre-processing video poker into: video poker "video poker".
However, that just gets munged by dismax into "video poker video
poker"... which is wrong.

Cheers!
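
For what it's worth, dismax's pf (phrase fields) parameter is built for this
case: it takes the whole query as a phrase and adds it as an optional boosting
clause, so no pre-processing is needed. A sketch, with an illustrative field
and boost:

  q=video poker&defType=dismax&qf=text&pf=text^10

Documents containing the exact phrase "video poker" score higher; documents
matching only the individual terms still match.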


Re: How to reindex data without restarting server

2010-02-11 Thread Joe Calderon
if you use the core model via solr.xml you can reload a core without 
having to restart the servlet container:

http://wiki.apache.org/solr/CoreAdmin
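
The reload itself is a single HTTP call (core name illustrative):

  http://localhost:8983/solr/admin/cores?action=RELOAD&core=core0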
On 02/11/2010 02:40 PM, Emad Mushtaq wrote:

Hi,

I would like to know if there is a way of reindexing data without restarting
the server. Lets say I make a change in the schema file. That would require
me to reindex data. Is there a solution to this ?

   




RE: The Riddle of the Underscore and the Dollar Sign . . .

2010-02-11 Thread Christopher Ball
I think I am making some progress - the key suggestion was to look at the
analysis.jsp which I foolishly had forgotten =(.

I think it is actually a bug in the ShingleFilterFactory when it is used
subsequent to another Filter which removes tokens, e.g. StopFilterFactory or
WordDelimiterFactory. The Analyzer clearly shows that anytime a token is dropped
the ShingleFilterFactory picks up a mysterious '_'.

For example, I enter "w'w oa". The WordDelimiterFactory removes the "w'w"
token, but then the ShingleFilterFactory shows "_ oa". Drop the apostrophe
to create "ww oa" and the ShingleFilterFactory shows "oa". The same occurs if I
have the StopFilterFactory remove tokens.

Be grateful if anyone else can replicate this behavior.
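
A minimal field type that should reproduce it, assuming "the" is listed in
stopwords.txt:

  <fieldType name="shingle_test" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.StopFilterFactory" words="stopwords.txt"
              enablePositionIncrements="true"/>
      <filter class="solr.ShingleFilterFactory" maxShingleSize="2"
              outputUnigrams="true"/>
    </analyzer>
  </fieldType>

Feeding "the oa" through analysis.jsp with this type should show the "_ oa"
shingle.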

Christopher

-Original Message-
From: Ahmet Arslan [mailto:iori...@yahoo.com] 
Sent: Thursday, February 11, 2010 12:40 PM
To: solr-user@lucene.apache.org
Subject: RE: The Riddle of the Underscore and the Dollar Sign . . .


> Unfortunately, the underscore is
> being quite resilient =(
> 
> I tried the solr.MappingCharFilterFactory and know the
> mapping is working as
> I am changing "c" => "q" just fine. But the underscore
> refuses to go!
> 
> I am baffled . . .

I just activated name="textCharNorm" in example schema.xml and added 
"_" => "xxx" to mapping-ISOLatin1Accent.txt
I verified from http://localhost:8983/solr/admin/analysis.jsp that
replacement is done without problems. Can you also test analysis.jsp?

Maybe your documents have underscores with different Unicode values. I
know three different Unicode characters that all look like "-".
If that's the case, you need to find their Unicode values and write them into
mappings.txt.



  




Re: dismax and multi-language corpus

2010-02-11 Thread Jason Rutherglen
That's a bug, IMO...

On Thu, Feb 11, 2010 at 1:30 PM, Otis Gospodnetic
 wrote:
> I don't know, but the other day I did see a NPE related to fields with '-'.  
> In Distributed Search context at least, fields with '-' were causing a NPE.
>
>
> Otis
> 
> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> Hadoop ecosystem search :: http://search-hadoop.com/
>
>
>


Re: Indexing / querying multiple data types

2010-02-11 Thread Lance Norskog
I gave you bad advice about qt=. Erik Hatcher kindly corrected me:

>> Actually qt selects the request handler.  defType selects the query parser.  
>> qt may implicitly select a query parser of course, but that would depend on 
>> the request handler definition.

On Wed, Feb 10, 2010 at 1:10 PM, Stefan Maric  wrote:
> Lance
>
> after a bit more reading and cleaning up my configuration (case sensitivity 
> corrected, though it didn't appear to be affecting the indexing, and I don't 
> use the atomID field for querying anyhow),
>
> I've added a docType field when I index my data and now use the fq parameter 
> to filter on that new field
>
>
>
>
>
> -Original Message-
> From: Lance Norskog [mailto:goks...@gmail.com]
> Sent: 10 February 2010 03:28
> To: solr-user@lucene.apache.org
> Subject: Re: Indexing / querying multiple data types
>
>
> A couple of minor problems:
>
> The qt parameter (Que Tee) selects the parser for the q (Q for query)
> parameter. I think you mean 'qf':
>
> http://wiki.apache.org/solr/DisMaxRequestHandler#qf_.28Query_Fields.29
>
> Another problem with atomID, atomId, atomid: Solr field names are
> case-sensitive. I don't know how this plays out.
>
> Now, to the main part: the <entity name="name1"> part does not create
> a column named name1.
> The two queries only populate the same namespace of four fields: id,
> atomID, name, description.
>
> If you want data from each entity to have a constant field
> distinguishing it, you have to create a new field with a constant
> value. You do this with the TemplateTransformer.
>
> http://wiki.apache.org/solr/DataImportHandler#TemplateTransformer
>
> Add this as an entity attribute to both entities:
>    transformer="TemplateTransformer"
> and add this as a column to each entity:
>    <field column="docType" template="name1"/> and then "name2".
>
> You may have to do something else for these to appear in the document.
>
> On Tue, Feb 9, 2010 at 12:41 AM,   wrote:
>> Sven
>>
>> In my data-config.xml I have the following
>>        <document>
>>                <entity name="name1" query="...">...</entity>
>>                <entity name="name2" query="...">...</entity>
>>        </document>
>>
>> In my schema.xml I have
>>   > required="true" />
>>   
>>   > required="true" />
>>   
>>
>> And in my solrconfig.xml I have
>>  >        class="org.apache.solr.handler.dataimport.DataImportHandler">
>>    
>>                data-config.xml
>>    
>>  
>>
>>        
>>                
>>                        dismax
>>                        explicit
>>                        0.01
>>                        name^1.5 description^1.0
>>                
>>        
>>
>>        
>>                
>>                        dismax
>>                        explicit
>>                        0.01
>>                        name^1.5 description^1.0
>>                
>>        
>>
>> And the standard requestHandler
>> has been untouched.
>>
>> So when I run
>> http://localhost:7001/solr/select/?q=food&qt=name1
>> I was expecting to get results from the data that had been indexed by the
>> first entity (name1).
>>
>> Regards
>> Stefan Maric
>>
>
>
>
> --
> Lance Norskog
> goks...@gmail.com
>
>
>



-- 
Lance Norskog
goks...@gmail.com
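
In data-config.xml terms, the docType approach looks roughly like this (the SQL
is elided; the template values follow the suggestion quoted above):

  <entity name="name1" transformer="TemplateTransformer" query="...">
    <field column="docType" template="name1"/>
  </entity>

Queries can then filter with fq=docType:name1 instead of relying on qt to
select a per-entity handler.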


RE: The Riddle of the Underscore and the Dollar Sign . . .

2010-02-11 Thread Steven A Rowe
Hi Christopher,

ShingleFilter(Factory), by design, inserts underscores for empty positions, so 
that you don't get shingles created from non-contiguous tokens.
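
For example, assuming "the" is removed as a stopword with position increments
preserved, the input "the quick fox" leaves a gap before quick/fox, and the
bigram shingles come out as:

  _ quick
  quick fox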

It would probably be better to treat empty positions as edges, like an 
end-of-stream followed by a beginning-of-stream, and only output meaningful 
token n-grams instead of these underscore-things - I can't imagine what use they 
are.  There should probably also be an option to ignore position gaps and 
generate shingles as if the tokens were really contiguous.

Anybody have a different opinion?

Steve

On 02/11/2010 at 10:03 PM, Christopher Ball wrote:
> I think I am making some progress - the key suggestion was to look at
> the analysis.jsp which I foolishly had forgotten =(.
> 
> I think it is actually a bug in the ShingleFilterFactory when it is used
> subsequent to another Filter which removes tokens, e.g.
> StopFilterFactory or WordDelimiterFactory. The Analyzer clearly shows
> that anytime a token is dropped the ShingleFilterFactory picks up a
> mysterious '_'.
> 
> For example, I enter "w'w oa". The WordDelimiterFactory removes the
> "w'w" token, but then the ShingleFilterFactory shows "_ oa". Drop the
> apostrophe to create "ww oa" and the ShingleFilterFactory shows "oa".
> The same occurs if I have the StopFilterFactory remove tokens.
> 
> Be grateful if anyone else can replicate this behavior.
> 
> Christopher
> 




Re: dismax and multi-language corpus

2010-02-11 Thread Otis Gospodnetic
I agree.  I just didn't have the chance to look at it closely to get enough 
details for filing in JIRA.

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Hadoop ecosystem search :: http://search-hadoop.com/



- Original Message 
> From: Jason Rutherglen 
> To: solr-user@lucene.apache.org
> Sent: Thu, February 11, 2010 10:47:03 PM
> Subject: Re: dismax and multi-language corpus
> 
> That's a bug, IMO...
> 
> On Thu, Feb 11, 2010 at 1:30 PM, Otis Gospodnetic
> wrote:
> > I don't know, but the other day I did see a NPE related to fields with '-'. 
>  In Distributed Search context at least, fields with '-' were causing a NPE.