Re: Snipets Solr/nutch

2008-04-10 Thread khirb7

Hello everybody,

Just one other question. To analyse and modify Solr's snippets, I want to
know whether org.apache.solr.util.HighlightingUtils is the class that
generates them, and which method does it. Could you please explain how the
snippets are generated in that class and where exactly to modify it? I want
to return a highlighted word other than the first one encountered, because
of the problem I explained in my previous messages.

Cheers
-- 
View this message in context: 
http://www.nabble.com/Snipets-Solr-nutch-tp16537216p16603642.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Human Powered Search Module

2008-04-10 Thread Mathieu Lecarme

Sushan Rungta wrote:

> Hello Everybody,
>
> I am a newbie in Lucene and I am from India, currently working on a
> search module for our classified website, clickindia.com. I have
> implemented the basic functionality of Solr/Lucene and am pretty happy
> with the results.
>
> Search in India has its own share of nuances. 'Maruti' is spelt as
> 'Maruthi' in most of South India. People often spell 'Naukri' as
> 'Naukari'; a loan request would simply appear in the query as
> 'need money'. These and many more such intricacies are typical of
> Indian users and require a special kind of module.
>
> Is there any ready-made solution for this? Can I get access to a list
> of words such as those mentioned above, as used in India, so that I
> could implement it?

Synonyms are easy to handle, but semantic analysis is a bit trickier.
Weka may help you: http://weka.sourceforge.net


M.


How to custom solr sort?

2008-04-10 Thread shawnliu

I have derived a new class from org.apache.solr.schema.StrField and
implemented a custom sort algorithm through the SortComparatorSource
interface. I then exported the jar file to the Solr lib directory and
configured the schema.xml file. But when I test the new feature, it doesn't
work at all. Can you give some suggestions? Thanks.
-- 
View this message in context: 
http://www.nabble.com/How-to-custom-solr-sort--tp16607351p16607351.html



HTMLStripReader and script tags

2008-04-10 Thread Walter Ferrara
I've noticed that passing HTML to a field that uses
HTMLStripWhitespaceTokenizerFactory ends up indexing some JavaScript too.

For example, using an analyzer like:

 
   
 


with a text such as:

title

pre

 var time = new Date();
 ordval= (time.getTime());

post 



Analysis.jsp produces these tokens:
title
pre
var
time
=
new
Date();
ordval=
(time.getTime());
post

If the script in the page is commented out, however, everything works fine.
Is this a design choice? Shouldn't scripts be removed in both cases?
(Solr Implementation Version: 2008-03-24_09-57-01 ${svnversion} - hudson 
- 2008-03-24 09:59:40)


Walter



help on caching and index files of Solr

2008-04-10 Thread Sagar Khetkade

Hello,

I have hands-on experience with both Lucene and Solr. The differences
between these two search engines are explained to some extent, but I still
have some questions about them:

1. What is the difference between the caching of Lucene and Solr index
files?

2. As Solr is built on Lucene, is the index file of Solr similar to that of
Lucene?

3. Is there any provision in Solr to index a complete repository/directory,
as there is in Lucene?

Thanks in advance.

 


Re: HTMLStripReader and script tags

2008-04-10 Thread Yonik Seeley
It was the intention to remove script.
I developed HTMLStripReader by just looking at a bunch of real-world HTML.
I hadn't run across script in uppercase, so I didn't do a case
insensitive check.

The code is currently:
if (name.equals("script") || name.equals("style")) {

Should be easy enough to change unless there is a good reason not to.
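
A minimal sketch of what a case-insensitive version of that check could look
like, assuming the tag name is available as a plain String. The class and
method names (TagNameCheck, isSkippedTag) are hypothetical and are not taken
from HTMLStripReader itself:

```java
class TagNameCheck {
    // Hypothetical helper: returns true when the content of the tag should
    // be dropped. equalsIgnoreCase lets "SCRIPT", "Script" and "script" all
    // match; putting the literal first also makes a null name return false.
    static boolean isSkippedTag(String name) {
        return "script".equalsIgnoreCase(name) || "style".equalsIgnoreCase(name);
    }

    public static void main(String[] args) {
        System.out.println(isSkippedTag("SCRIPT"));
        System.out.println(isSkippedTag("title"));
    }
}
```

Lower-casing the name once before comparing would work as well; the point is
only that the comparison should not depend on how the page author wrote the
tag.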

-Yonik

On Thu, Apr 10, 2008 at 5:05 AM, Walter Ferrara <[EMAIL PROTECTED]> wrote:
> I've noticed that passing html to a field using
> HTMLStripWhitespaceTokenizerFactory, ends up in having some javascripts too.
>  For example, using a analyzer like:
>  
>  
>
>  
>  
>
>  with a text such as:
>  
>  title
>  
>  pre
>  
>  var time = new Date();
>  ordval= (time.getTime());
>  
>  post 
>  
>  
>
>  Analysis.jsp turns out those tokens:
>  title
>  pre
>  var
>  time
>  =
>  new
>  Date();
>  ordval=
>  (time.getTime());
>  post
>
>  While if the script in the page is commented, everything works fine.
>  Is this due to design choice? Shouldn't scripts be removed in both cases?
>  (Solr Implementation Version: 2008-03-24_09-57-01 ${svnversion} - hudson -
> 2008-03-24 09:59:40)
>
>  Walter
>
>


Re: HTMLStripReader and script tags

2008-04-10 Thread Yonik Seeley
I've just committed a change to ignore case when comparing tag names.
-Yonik

On Thu, Apr 10, 2008 at 9:03 AM, Yonik Seeley <[EMAIL PROTECTED]> wrote:
> It was the intention to remove script.
>  I developed HTMLStripReader by just looking at a bunch of real-world HTML.
>  I hadn't run across script in uppercase, so I didn't do a case
>  insensitive check.
>
>  The code is currently:
> if (name.equals("script") || name.equals("style")) {
>
>  Should be easy enough to change unless there is a good reason not to.
>
>  -Yonik
>


Re: Snipets Solr/nutch(maxFragSize?)

2008-04-10 Thread khirb7



khirb7 wrote:
> 
> hello every body
>  
> just one other question, to analyse and modify Solr's snippet, I want to
> know if  org.apache.solr.util.HighlightingUtils
> is the class generating the snippet and which methode generate them, and
> could you please explain me how are they generated in that class and where
> exactly to modify it. all that in order to not return the first word
> encountered highlighted but to return an other one because of the problem
> I explained  in my previous messages
> 
> Cheers
> 
I searched further and found that Lucene provides this method,
getBestFragments:
highlighter.getBestFragments(tokenStream, text, maxNumFragment, "...");

With this method we can tell Lucene to return up to maxNumFragment
fragments (with the highlighted word), each of fragsize characters, but
there is no maxFragSize parameter in Solr. This would be useful in my case
if I want to highlight not only the first occurrence of a searched word but
up to 1 occurrence of the same word.
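
To make the idea of "return up to maxNumFragment fragments of fragsize
characters" concrete, here is a toy sketch in plain Java. It is NOT Lucene's
Highlighter (which works on token streams with scorers and fragmenters); it
simply cuts the text into fixed-size chunks and keeps the chunks with the
most hits. All names here (ToyFragments, countHits, bestFragments) are
hypothetical:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class ToyFragments {
    // Counts case-insensitive, non-overlapping occurrences of term in text.
    static int countHits(String text, String term) {
        String haystack = text.toLowerCase(Locale.ROOT);
        String needle = term.toLowerCase(Locale.ROOT);
        if (needle.isEmpty()) {
            return 0; // avoid looping forever on an empty search term
        }
        int hits = 0;
        for (int i = haystack.indexOf(needle); i >= 0;
                 i = haystack.indexOf(needle, i + needle.length())) {
            hits++;
        }
        return hits;
    }

    // Cuts text into consecutive chunks of at most fragSize characters,
    // keeps the chunks that contain the term, and returns up to
    // maxNumFragments of them, highest hit count first.
    static List<String> bestFragments(String text, String term,
                                      int fragSize, int maxNumFragments) {
        if (fragSize <= 0) {
            throw new IllegalArgumentException("fragSize must be positive");
        }
        List<String> candidates = new ArrayList<>();
        for (int start = 0; start < text.length(); start += fragSize) {
            String frag = text.substring(start, Math.min(text.length(), start + fragSize));
            if (countHits(frag, term) > 0) {
                candidates.add(frag);
            }
        }
        // List.sort is stable, so fragments with equal scores keep text order.
        candidates.sort((x, y) -> Integer.compare(countHits(y, term), countHits(x, term)));
        int n = Math.max(0, Math.min(maxNumFragments, candidates.size()));
        return new ArrayList<>(candidates.subList(0, n));
    }

    public static void main(String[] args) {
        String text = "solr one. no match. solr solr.";
        // Join the selected fragments with "..." as in the call quoted above.
        System.out.println(String.join("...", bestFragments(text, "solr", 10, 2)));
    }
}
```

Because the chunks are cut at fixed character offsets, a term that straddles
a chunk boundary is missed; a real highlighter fragments on token boundaries
instead.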

cheers




-- 
View this message in context: 
http://www.nabble.com/Snipets-Solr-nutch-tp16537216p16608806.html



Re: Snipets Solr/nutch

2008-04-10 Thread Mike Klaas

On 10-Apr-08, at 12:26 AM, khirb7 wrote:


> hello every body
>
> just one other question, to analyse and modify Solr's snippet, I want to
> know if  org.apache.solr.util.HighlightingUtils
> is the class generating the snippet and which methode generate them, and
> could you please explain me how are they generated in that class and where
> exactly to modify it. all that in order to not return the first word
> encountered highlighted but to return an other one because of the problem I
> explained  in my previous messages


Unfortunately I am not familiar with Nutch's snippet generation.

Solr's highlighting is located in org.apache.solr.util.HighlightingUtils
in version 1.2; in the current (trunk) version, it is located in the
org.apache.solr.highlight.* package.

Your use case is a little tricky.  The best way to deal with it, in my
opinion, is to strip out the header before sending the data to Solr.
This will improve your highlighting _and_ your search relevance.


-Mike


Re: Multicore Issue with nightly build

2008-04-10 Thread kirk beers
Hi Ryan,

I still can't seem to get my Solr cores, core0 and core1, to accept new
documents. I changed the appropriate code in the Perl client to accommodate
the core, as you mentioned in the previous email.  I am able to delete
docs.  Is there anything I might be missing in the basic core schema.xml?
I tried to copy the contents of solr/config/schema.xml into
solr/core0/conf/schema.xml. I added the core0 name and noticed that the
Default Search Field was different, but I couldn't see any other
differences that stood out. Once I did this, neither core could be queried,
but the single core could.
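
For reference, the per-core addressing described in the quoted message from
Ryan (Solr installed at http://host/context, each core reached at
http://host/context/coreX) can be sketched as below. This is only an
illustration in Java rather than Perl; the class and method names are
hypothetical, and the "/update" suffix is an assumption that each core
exposes its update handler under its own base URL:

```java
class CoreUrls {
    // Base URL of one core: the Solr context URL followed by the core name.
    static String coreBase(String contextUrl, String coreName) {
        String base = contextUrl.endsWith("/")
                ? contextUrl.substring(0, contextUrl.length() - 1)
                : contextUrl;
        return base + "/" + coreName;
    }

    // Assumption: updates for a core are posted to "/update" under its base URL.
    static String updateUrl(String contextUrl, String coreName) {
        return coreBase(contextUrl, coreName) + "/update";
    }

    public static void main(String[] args) {
        // One client configuration per core, not one shared client.
        System.out.println(updateUrl("http://host/context", "core0"));
        System.out.println(updateUrl("http://host/context", "core1"));
    }
}
```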

What is the relationship between these 3 schemas? Do they rely on one
another, or are they each independent of one another and perform their own
specific independent functions?

Thanks

K

On Tue, Apr 8, 2008 at 3:11 PM, Ryan McKinley <[EMAIL PROTECTED]> wrote:

> from the client side, multicore should behave exactly the same as multi
> single core servers running next to each other.
>
> I'm not familiar with the perl client, but it will need to be configured
> for each core -- rather then one client that talks to multiple cores.
>
> while you install solr at:
> http://host/context
>
> you will access each core at:
> http://host/context/coreX
> http://host/context/coreY
>
> ryan
>
>
>
> On Apr 8, 2008, at 9:51 AM, kirk beers wrote:
>
> > Hello again,
> >
> > I finally managed to add/update solr single core by using Perl CPAN Solr
> > by
> > Timothy Garafola. But I am unable to actually update or add anything to
> > a
> > multicore environment !
> >
> > I was wondering if I am doing something incorrectly or if there is an
> > issue
> > at this point? Should I be editing the schema.xml for the specific core
> > ?
> >
> > Thank you
> >
> > K
> >
> >
> > On Mon, Apr 7, 2008 at 12:54 PM, kirk beers <[EMAIL PROTECTED]> wrote:
> >
> >  Which schema.xml are you referring to ? The core0 schema.xml or the
> > > main
> > > schema.xml ? Because I get the following error when I use :
> > >
> > > camera
> > >
> > > I get this error:
> > >
> > > org.apache.solr.common.SolrException: ERROR:unknown
> > > field 'cat'
> > >   at
> > >
> > > org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:245)
> > >   at
> > >
> > > org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:66)
> > >   at
> > >
> > > org.apache.solr.handler.XmlUpdateRequestHandler.processUpdate(XmlUpdateRequestHandler.java:196)
> > >   at
> > >
> > > org.apache.solr.handler.XmlUpdateRequestHandler.doLegacyUpdate(XmlUpdateRequestHandler.java:386)
> > >   at
> > >
> > > org.apache.solr.servlet.SolrUpdateServlet.doPost(SolrUpdateServlet.java:65)
> > >   at javax.servlet.http.HttpServlet.service(HttpServlet.java:710)
> > >   at javax.servlet.http.HttpServlet.service(HttpServlet.java:803)
> > >   at
> > >
> > > org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:269)
> > >   at
> > >
> > > org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:188)
> > >   at
> > >
> > > org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:320)
> > >   at
> > >
> > > org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:215)
> > >   at
> > >
> > > org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:188)
> > >   at
> > >
> > > org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:213)
> > >   at
> > >
> > > org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:174)
> > >   at
> > >
> > > org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
> > >   at
> > >
> > > org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:117)
> > >   at
> > >
> > > org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:108)
> > >   at
> > >
> > > org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:151)
> > >   at
> > >
> > > org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:874)
> > >   at
> > >
> > > org.apache.coyote.http11.Http11BaseProtocol$Http11ConnectionHandler.processConnection(Http11BaseProtocol.java:665)
> > >   at
> > >
> > > org.apache.tomcat.util.net.PoolTcpEndpoint.processSocket(PoolTcpEndpoint.java:528)
> > >   at
> > >
> > > org.apache.tomcat.util.net.LeaderFollowerWorkerThread.runIt(LeaderFollowerWorkerThread.java:81)
> > >   at
> > >
> > > org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:689)
> > >   at java.lang.Thread.run(Thread.java:619)
> > >
> > > =
> > >
> > >
> > >
> > > On Mon, Apr 7, 2008 at 11:50 AM, Thomas Arni <[EMAIL PROTECTED]>
> > > wrote:
> > >
> > >  Please make sure that you do NOT have a field called "category" i

Re: Solr + Complex Legacy Schema -- Best Practices?

2008-04-10 Thread Tkach
I realize this is a really vague sort of question with a lot of what-ifs, so 
feel free to just say we'll just have to try implementing one version, test, 
and see if the results are acceptable. :)

Well, our searches are really more along the lines of searching on product 
"details" (brand/key words/names), so that part's fairly straightforward.  I 
guess the main complication is how product keywords/details are mapped onto 
products through several different kinds of "groups", which makes it harder to 
be able to just flatten them out easily since not all products have records in 
all (or even many) places in the groups.

Basically we have users enter some text in our search box on the site.  It 
could be anything from a product name (such as Diet Coke) to a brand name (such 
as Kraft) to words describing the product (such as cheese).  From there we do a 
search (see details below), and come up with a list of either some categories 
to "drill down into" (such as Diet Sodas or Coke) or else we might have a small 
enough result here to just list the products outright.  Among other things 
we're looking to use Solr/Lucene to tighten things up here a little, letting 
the search engine deal with "grouping" (as in the implicit groups formed when 
you search on something like "pepsi") and taking some of these "category" 
tables out.  We're looking into faceting and spell-checking too, but those are 
more secondary concerns.

There are 6 tables, I'll call them prod_detail, prod_store, prod_dict, 
prod_sku, prod_grp, and prod_brand.  

prod_detail has a series of records (unique across all stores) on each product 
such as size, name, and a couple of text fields for words describing the 
product (we'd be searching on these desc fields and the name).  prod_detail 
does have a nice, neat unique integer key, product_id. It also has an integer 
field that is a foreign key (figuratively, not an actual DB constraint) 
pointing to a record in prod_brand.

prod_brand is a table that associates a brand with a group of products.  It has 
a product_group, store_id, brand_id, and a desc field of words describing the 
brand that we'd want to be able to search.  However it's only unique across the 
combination of product_group, brand_id, and store_id.  The brand_id here 
does correspond to a brand_id in the prod_detail table.  prod_brand has a 
many-to-many mapping with prod_detail (one brand can correspond to many 
products and one product can be in many brands).

prod_store associates a product detail record with details for a given store 
(mostly pricing).  It's keyed off of a product_id and store_id combination.  
This one has no fields we'd be wanting to search directly.  It has a 
many-to-one mapping with prod_detail (many prod_store records use one 
prod_detail).

prod_dict associates some key words/phrases with a group of products based on a 
combination of store_id and product_group_id.  We'd be wanting to search the 
key words here too.

prod_sku associates a product detail record with a product group for a given 
store.  There are no fields here we'd be searching on.  It's just used as a 
lookup for group <-> prod mappings.

Finally, prod_grp associates a product group with a couple of key words for a 
given store.  We'd be wanting to search these key words as well.

Hopefully that makes some sense.  We plan to just do a dump of the information 
regularly (say, daily) and index that.  The question is, given the constraints 
of the data, would we be better off doing a multi-core setup and dealing with 
multiple (potential) hits to the server for lookups, or making one big join 
(including all sorts of duplicate information) and having the server/client 
code sort out what goes where?  I'm certainly open to researching 
this more on my own (I know it's a huge, loaded question for a mailing list) if 
anyone can suggest someplace to get a better picture of efficient information 
retrieval.

The more I look at this, the more I think it makes sense to do a 
multi-core setup, since doing a huge join just takes us from nice, neat data to a 
big join with all sorts of duplicates/nulls and back to neat data again, but that's 
just a guess at this point.

I'll have to be sure to check out this DataImportHandler.  It definitely does 
sound close to the sort of thing we're looking at.

- Original Message -
From: "Chris Hostetter" <[EMAIL PROTECTED]>
To: "solr-user" 
Sent: Wednesday, April 9, 2008 1:00:33 AM GMT -06:00 US/Canada Central
Subject: Re: Solr + Complex Legacy Schema -- Best Practices?


: I just was wondering, has anybody dealt with trying to "translate" the 
: data from a big, legacy DB schema to a Solr installation?  What I mean 

there's really no general answer to that question -- it all comes down to 
what you want to query on, and what kinds of results you want to get 
out... if you want your queries to result in lists of "products" then you 
should have one Document per product -- if

Searching for popular phrases or words

2008-04-10 Thread Edwin Koome
Gentlemen,

I am new to Solr, and this may have been answered before.

How can I search for popular phrases or words, with an
option to include only, for example, technical terms
(e.g. "Oracle database") rather than common English
phrases?

Please point me in the right direction.

regards,
Eric




Re: Return the result only field A or field B is non-zero?

2008-04-10 Thread Chris Hostetter

If every document will definitely have a value for both fields, you can 
do...
q = query
&   fq = -(+fieldA:0 +fieldB:0)

...it's more complicated if some docs don't have any value for one or both 
fields: if the fields are integers (and not floats) then the easiest thing 
to do is use some range queries...

  fq = fieldA:[* TO -1] fieldA:[1 TO *] fieldB:[* TO -1] fieldB:[1 TO *]

...for floats you would have to get really creative with your set logic 
... i'm certain it's possible, i'm just not sure off the top of my head 
what the fq needs to look like.
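
A small sketch of how a client might build these two filter queries and
URL-encode them, assuming the request is sent as a plain HTTP query string.
The class and method names and the "ipod" example query are made up for
illustration. Encoding matters here because a raw "+" in a URL is read as a
space:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class NonZeroFilter {
    // Case 1: every document has a value in both fields.
    // Exclude the documents where both fields are 0.
    static String bothPresentFq(String a, String b) {
        return "-(+" + a + ":0 +" + b + ":0)";
    }

    // Case 2: integer fields that may be missing.
    // Match any value below 0 or above 0 in either field.
    static String integerRangeFq(String a, String b) {
        return a + ":[* TO -1] " + a + ":[1 TO *] "
             + b + ":[* TO -1] " + b + ":[1 TO *]";
    }

    // Builds "q=...&fq=..." with both values URL-encoded as UTF-8.
    static String toQueryString(String q, String fq) {
        try {
            return "q=" + URLEncoder.encode(q, "UTF-8")
                 + "&fq=" + URLEncoder.encode(fq, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required on every Java platform, so this is unexpected.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(toQueryString("ipod", bothPresentFq("fieldA", "fieldB")));
        System.out.println(toQueryString("ipod", integerRangeFq("fieldA", "fieldB")));
    }
}
```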

-Hoss



Re: help on caching and index files of Solr

2008-04-10 Thread Chris Hostetter

Solr is an application that uses the Lucene Java library -- everything 
that exists in Lucene exists in Solr; Solr just adds on top of it.  The raw 
Lucene index is in the data directory, Lucene's (minimal) caching is still 
used, and additional Solr-specific caching is added on top (see the wiki for 
more info).

BTW: If you have more questions please start a new thread...

: Subject: help on caching and index files of Solr
: In-Reply-To: <[EMAIL PROTECTED]>

http://people.apache.org/~hossman/#threadhijack

Thread Hijacking on Mailing Lists

When starting a new discussion on a mailing list, please do not reply to 
an existing message, instead start a fresh email.  Even if you change the 
subject line of your email, other mail headers still track which thread 
you replied to and your question is "hidden" in that thread and gets less 
attention.   It makes following discussions in the mailing list archives 
particularly difficult.
See Also:  http://en.wikipedia.org/wiki/Thread_hijacking



-Hoss



chaching and indexes in Solr

2008-04-10 Thread Sagar Khetkade
Hello,

I have hands-on experience with both Lucene and Solr. The differences
between these two search engines are explained to some extent, but I still
have some questions about them:

1. What is the difference between the caching of Lucene and Solr index
files?

2. As Solr is built on Lucene, is the index file of Solr similar to that of
Lucene?

3. Is there any provision in Solr to index a complete repository/directory,
as there is in Lucene?

Thanks in advance.