Re: Snipets Solr/nutch
Hello everybody, just one other question: to analyse and modify Solr's snippets, I want to know whether org.apache.solr.util.HighlightingUtils is the class that generates the snippets, which method generates them, how they are generated in that class, and where exactly to modify it. All of that is so the snippet does not highlight the first word encountered but a different one, because of the problem I explained in my previous messages. Cheers -- View this message in context: http://www.nabble.com/Snipets-Solr-nutch-tp16537216p16603642.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Human Powered Search Module
Sushan Rungta wrote: Hello everybody, I am a newbie in Lucene, from India, currently working on a search module for our classified website, clickindia.com. I have implemented the basic functionality of Solr/Lucene and am pretty happy with the results. Search in India has its own share of nuances: 'Maruti' is spelt 'Maruthi' in most of South India; people often spell 'Naukri' as 'Naukari'; a loan request might simply arrive as the query 'need money'. These and many more such intricacies are typical of Indian usage and require special handling. Is there any ready-made solution for this? Can I get access to lists of such word variants as used in India, so that I could implement it? Synonyms are easy to handle, but semantic analysis is a bit trickier. Weka may help you: http://weka.sourceforge.net M.
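Spelling variants like Maruti/Maruthi can often be handled at analysis time with Solr's SynonymFilterFactory. A minimal sketch of a field type (the file name synonyms-hindi.txt and this exact analyzer chain are illustrative assumptions, not a shipped configuration):

```xml
<!-- Hypothetical field type; synonyms-hindi.txt is an assumed file name -->
<fieldType name="text_syn" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms-hindi.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

where synonyms-hindi.txt would contain comma-separated variant groups, one per line, such as `maruti, maruthi` and `naukri, naukari`. With expand="true", each variant indexes (and matches) all the others in its group.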
How to custom solr sort?
I have written a new class that inherits from org.apache.solr.schema.StrField and implemented a custom sort algorithm via the SortComparatorSource interface. I then exported the jar file to the Solr lib directory and configured the schema.xml file. But when I test the new feature, it doesn't work at all. Can you give me some suggestions? Thanks.
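It can help to verify the comparison logic on its own, outside Solr, before wiring it into the field type. A self-contained sketch with plain java.util.Comparator semantics (the numeric-suffix ordering here is an assumed example rule, not the poster's actual algorithm; in Lucene 2.x, the same compare logic would live in the ScoreDocComparator returned by your SortComparatorSource implementation):

```java
import java.util.Arrays;

// Standalone sketch of a custom string ordering: compare by trailing
// number when both values end in digits, otherwise lexicographically.
public class CustomOrder {
    static int compareValues(String a, String b) {
        Integer na = trailingNumber(a), nb = trailingNumber(b);
        if (na != null && nb != null) return na.compareTo(nb);
        return a.compareTo(b);
    }

    // Returns the trailing run of digits as an Integer, or null if none.
    static Integer trailingNumber(String s) {
        int i = s.length();
        while (i > 0 && Character.isDigit(s.charAt(i - 1))) i--;
        return i == s.length() ? null : Integer.valueOf(s.substring(i));
    }

    public static void main(String[] args) {
        String[] vals = {"item10", "item2", "apple"};
        Arrays.sort(vals, CustomOrder::compareValues);
        System.out.println(Arrays.toString(vals)); // [apple, item2, item10]
    }
}
```

If sorting in place works here but not in Solr, the problem is more likely in how the jar is deployed or how the field type is declared in schema.xml than in the comparator itself.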
HTMLStripReader and script tags
I've noticed that passing HTML to a field using HTMLStripWhitespaceTokenizerFactory ends up keeping some JavaScript too. For example, using an analyzer like: with a text such as: title pre var time = new Date(); ordval= (time.getTime()); post Analysis.jsp turns out these tokens: title pre var time = new Date(); ordval= (time.getTime()); post Whereas if the script in the page is commented out, everything works fine. Is this a design choice? Shouldn't scripts be removed in both cases? (Solr Implementation Version: 2008-03-24_09-57-01 ${svnversion} - hudson - 2008-03-24 09:59:40) Walter
help on caching and index files of Solr
Hello, I have hands-on experience with both Lucene and Solr. The differences between these two have been explained to some extent, but I still have some questions. I need to know: 1. Some information on the difference between the caching of Lucene and Solr index files. 2. As Solr is built on Lucene, is Solr's index file format the same as Lucene's? 3. Is there any provision in Solr to index a complete repository/directory, as there is with Lucene? Thanks in advance.
Re: HTMLStripReader and script tags
It was the intention to remove script. I developed HTMLStripReader by just looking at a bunch of real-world HTML; I hadn't run across script in uppercase, so I didn't do a case-insensitive check. The code is currently: if (name.equals("script") || name.equals("style")) { Should be easy enough to change unless there is a good reason not to. -Yonik
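For illustration, a self-contained sketch of the case-insensitive version of that check (not the actual HTMLStripReader source, which has more context around this line): switching to equalsIgnoreCase makes SCRIPT and Style match as well.

```java
public class TagCheck {
    // Case-insensitive version of the tag-name test quoted above.
    static boolean isScriptOrStyle(String name) {
        return name.equalsIgnoreCase("script") || name.equalsIgnoreCase("style");
    }

    public static void main(String[] args) {
        System.out.println(isScriptOrStyle("SCRIPT")); // true
        System.out.println(isScriptOrStyle("Style"));  // true
        System.out.println(isScriptOrStyle("div"));    // false
    }
}
```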
Re: HTMLStripReader and script tags
I've just committed a change to ignore case when comparing tag names. -Yonik
Re: Snipets Solr/nutch (maxFragSize?)
I have searched further and found that Lucene provides this method: getBestFragments — highlighter.getBestFragments(tokenStream, text, maxNumFragments, "..."); — so with this method we can tell Lucene to return up to maxNumFragments fragments (each with a highlighted word) of fragsize characters. But there is no maxFragSize parameter in Solr. This would be useful in my case, since I want to highlight not only the first occurrence of a searched word but further occurrences of the same word as well. Cheers
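Lucene's Highlighter picks fragments roughly like this: break the text into fragments of about fragsize characters, score each against the query, and return the best maxNumFragments of them. A self-contained toy version of that idea (assumed term-count scoring; this is not Lucene's actual implementation):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class BestFragments {
    // Toy analogue of highlighter.getBestFragments(tokenStream, text,
    // maxNumFragments, "..."): fixed-width fragments ranked by how many
    // occurrences of the query term they contain.
    static List<String> bestFragments(String text, String term,
                                      int fragSize, int maxNumFragments) {
        List<String> frags = new ArrayList<>();
        for (int i = 0; i < text.length(); i += fragSize)
            frags.add(text.substring(i, Math.min(i + fragSize, text.length())));
        // Sort fragments by descending term count (the "score").
        frags.sort(Comparator.comparingInt((String f) -> -count(f, term)));
        return frags.subList(0, Math.min(maxNumFragments, frags.size()));
    }

    static int count(String frag, String term) {
        int n = 0;
        for (int i = frag.indexOf(term); i >= 0; i = frag.indexOf(term, i + 1)) n++;
        return n;
    }

    public static void main(String[] args) {
        String text = "solr is fast. solr scales. lucene underneath.";
        System.out.println(bestFragments(text, "solr", 15, 2));
    }
}
```

Raising maxNumFragments is what lets more than one occurrence of the searched word come back highlighted, which is the behavior the poster is after.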
Re: Snipets Solr/nutch
Unfortunately I am not familiar with Nutch's snippet generation. Solr's highlighting is located in org.apache.solr.util.HighlightingUtils in version 1.2; in the current (trunk) version it lives in the org.apache.solr.highlight.* package. Your use case is a little tricky. The best way to deal with it, in my opinion, is to strip out the header before sending the data to Solr. This will improve your highlighting _and_ your search relevance. -Mike
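If the header follows a predictable convention, stripping it client-side before indexing can be a one-liner. A hedged sketch, assuming the header is separated from the body by the first blank line (that convention is an assumption about the poster's documents, not something from the thread):

```java
import java.util.regex.Pattern;

public class HeaderStrip {
    // Assumed convention: documents start with a boilerplate header
    // terminated by a blank line. Drop everything up to and including
    // the first blank line before posting the text to Solr.
    private static final Pattern HEADER =
        Pattern.compile("\\A.*?\\n\\s*\\n", Pattern.DOTALL);

    static String stripHeader(String doc) {
        return HEADER.matcher(doc).replaceFirst("");
    }

    public static void main(String[] args) {
        String doc = "ACME corp internal memo\nPage 1 of 3\n\nActual body text here.";
        System.out.println(stripHeader(doc)); // Actual body text here.
    }
}
```

Documents with no blank line are left untouched, so the filter is safe to run over a mixed corpus.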
Re: Multicore Issue with nightly build
Hi Ryan, I still can't seem to get my Solr cores, core0 and core1, to accept new documents. I changed the appropriate code in the Perl client to accommodate the core as you mentioned in the previous email. I am able to delete docs. Is there anything I might be missing in the basic core schema.xml ? I tried to copy the contents of solr/config/schema.xml into solr/core0/conf/schema.xml I added the core0 name and noticed that the Default Search Field was different but I couldn't notice any other differences that stood out. Once I did this, neither core could be queried but the single core could. What is the relationship between these 3 schemas ? Do they rely on one another ? Or are they each independent of one another and perform their own specific independent functions? Thanks K On Tue, Apr 8, 2008 at 3:11 PM, Ryan McKinley <[EMAIL PROTECTED]> wrote: > from the client side, multicore should behave exactly the same as multi > single core servers running next to each other. > > I'm not familiar with the perl client, but it will need to be configured > for each core -- rather than one client that talks to multiple cores. > > while you install solr at: > http://host/context > > you will access each core at: > http://host/context/coreX > http://host/context/coreY > > ryan > > > > On Apr 8, 2008, at 9:51 AM, kirk beers wrote: > > > Hello again, > > > > I finally managed to add/update solr single core by using Perl CPAN Solr > > by > > Timothy Garafola. But I am unable to actually update or add anything to > > a > > multicore environment ! > > > > I was wondering if I am doing something incorrectly or if there is an > > issue > > at this point? Should I be editing the schema.xml for the specific core > > ? > > > > Thank you > > > > K > > > > > > On Mon, Apr 7, 2008 at 12:54 PM, kirk beers <[EMAIL PROTECTED]> wrote: > > > > Which schema.xml are you referring to ? The core0 schema.xml or the > > > main > > > schema.xml ? 
Because I get the following error when I use : > > > > > > camera > > > > > > I get this error: > > > > > > org.apache.solr.common.SolrException: ERROR:unknown > > > field 'cat' > > > at > > > > > > org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:245) > > > at > > > > > > org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:66) > > > at > > > > > > org.apache.solr.handler.XmlUpdateRequestHandler.processUpdate(XmlUpdateRequestHandler.java:196) > > > at > > > > > > org.apache.solr.handler.XmlUpdateRequestHandler.doLegacyUpdate(XmlUpdateRequestHandler.java:386) > > > at > > > > > > org.apache.solr.servlet.SolrUpdateServlet.doPost(SolrUpdateServlet.java:65) > > > at javax.servlet.http.HttpServlet.service(HttpServlet.java:710) > > > at javax.servlet.http.HttpServlet.service(HttpServlet.java:803) > > > at > > > > > > org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:269) > > > at > > > > > > org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:188) > > > at > > > > > > org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:320) > > > at > > > > > > org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:215) > > > at > > > > > > org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:188) > > > at > > > > > > org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:213) > > > at > > > > > > org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:174) > > > at > > > > > > org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127) > > > at > > > > > > org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:117) > > > at > > > > > > org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:108) > > > at > > > > > > 
org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:151) > > > at > > > > > > org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:874) > > > at > > > > > > org.apache.coyote.http11.Http11BaseProtocol$Http11ConnectionHandler.processConnection(Http11BaseProtocol.java:665) > > > at > > > > > > org.apache.tomcat.util.net.PoolTcpEndpoint.processSocket(PoolTcpEndpoint.java:528) > > > at > > > > > > org.apache.tomcat.util.net.LeaderFollowerWorkerThread.runIt(LeaderFollowerWorkerThread.java:81) > > > at > > > > > > org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:689) > > > at java.lang.Thread.run(Thread.java:619) > > > > > > = > > > > > > > > > > > > On Mon, Apr 7, 2008 at 11:50 AM, Thomas Arni <[EMAIL PROTECTED]> > > > wrote: > > > > > > Please make sure that you do NOT have a field called "category" i
Re: Solr + Complex Legacy Schema -- Best Practices?
I realize this is a really vague sort of question with a lot of what-ifs, so feel free to just say we'll just have to try implementing one version, test, and see if the results are acceptable. :) Well, our searches are really more along the lines of searching on product "details" (brand/key words/names), so that part's fairly straightforward. I guess the main complication is how product keywords/details are mapped onto products through several different kinds of "groups", which makes it harder to be able to just flatten them out easily since not all products have records in all (or even many) places in the groups. Basically we have users enter some text in our search box on the site. It could be anything from a product name (such as Diet Coke) to a brand name (such as Kraft) to words describing the product (such as cheese). From there we do a search (see details below), and come up with a list of either some categories to "drill down into" (such as Diet Sodas or Coke) or else we might have a small enough result here to just list the products outright. Among other things we're looking to use Solr/Lucene to tighten things up here a little, letting the search engine deal with "grouping" (as in the implicit groups formed when you search on something like "pepsi") and taking some of these "category" tables out. We're looking into faceting and spell-checking too, but those are more secondary concerns. There are 6 tables, I'll call them prod_detail, prod_store, prod_dict, prod_sku, prod_grp, and prod_brand. prod_detail has a series of records (unique across all stores) on each product such as size, name, and a couple of text fields for words describing the product (we'd be searching on these desc fields and the name). prod_detail does have a nice, neat unique integer key, product_id. It also has an integer field that is a foreign key (figuratively, not an actual DB constraint) pointing to a record in prod_brand. 
prod_brand is a table that associates a brand with a group of products. It has a product_group, store_id, brand_id, and a desc field of words describing the brand that we'd want to be able to search. However, it's only unique across the combination of product_group, brand_id, and store_id. The brand_id here does correspond to a brand_id in the prod_detail table. prod_brand has a many-to-many mapping with prod_detail (one brand can correspond to many products and one product can be in many brands). prod_store associates a product detail record with details for a given store (mostly pricing). It's keyed off of a product_id and store_id combination. This one has no fields we'd be wanting to search directly. It has a many-to-one mapping with prod_detail (many prod_store records use one prod_detail). prod_dict associates some key words/phrases with a group of products based on a combination of store_id and product_group_id. We'd be wanting to search the key words here too. prod_sku associates a product detail record with a product group for a given store. There are no fields here we'd be searching on. It's just used as a lookup for group <-> prod mappings. Finally, prod_grp associates a product group with a couple of key words for a given store. We'd be wanting to search these key words as well. Hopefully that makes some sense. We plan to just do a dump of the information regularly (say, daily) and index that. The question is, given the constraints of the data, would we be better off doing a multi-core setup and dealing with multiple (potential) hits to the server for lookups, or making one big join (including all sorts of duplicated information) and letting the server/client code sort out what goes where? I'm certainly open to researching this more on my own (I know it's a huge, loaded question for a mailing list) if anyone can suggest someplace to get a better picture of efficient information retrieval. 
The more I look at this, I'm kind of thinking it makes more sense to do a multi-core setup, since doing a huge join just makes us go from nice, neat data to a big join with all sorts of duplicates/nulls, back to neat data again, but that's just a guess at this point. I'll have to be sure to check out this DataImportHandler. It definitely does sound close to the sort of thing we're looking at. - Original Message - From: "Chris Hostetter" <[EMAIL PROTECTED]> To: "solr-user" Sent: Wednesday, April 9, 2008 1:00:33 AM GMT -06:00 US/Canada Central Subject: Re: Solr + Complex Legacy Schema -- Best Practices? : I just was wondering, has anybody dealt with trying to "translate" the : data from a big, legacy DB schema to a Solr installation? What I mean there's really no general answer to that question -- it all comes down to what you want to query on, and what kinds of results you want to get out... if you want your queries to result in lists of "products" then you should have one Document per product -- if
Searching for popular phrases or words
Gentlemen, I am new to Solr and this may have been answered before. How can I search for popular phrases or words, with an option to include only, for example, technical terms such as "Oracle database" rather than common English phrases? Please point me in the right direction. regards, Eric
Re: Return the result only field A or field B is non-zero?
If every document will definitely have a value for both fields, you can do... q = query & fq = -(+fieldA:0 +fieldB:0) ...it's more complicated if some docs don't have any value for one or both fields: if the fields are integers (and not floats) then the easiest thing to do is use some range queries... fq = fieldA:[* TO -1] fieldA:[1 TO *] fieldB:[* TO -1] fieldB:[1 TO *] ...for floats you would have to get really creative with your set logic ... i'm certain it's possible, i'm just not sure off the top of my head what the fq needs to look like. -Hoss
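The set logic of that first fq can be sanity-checked outside Solr: fq = -(+fieldA:0 +fieldB:0) keeps a document unless both fields are zero. A toy predicate expressing the same thing (hypothetical field values, not Solr code):

```java
public class NonZeroFilter {
    // Equivalent predicate to fq = -(+fieldA:0 +fieldB:0):
    // keep a document unless BOTH fields are zero.
    static boolean matches(int fieldA, int fieldB) {
        return !(fieldA == 0 && fieldB == 0);
    }

    public static void main(String[] args) {
        System.out.println(matches(0, 0));  // false
        System.out.println(matches(0, 5));  // true
        System.out.println(matches(-3, 0)); // true
    }
}
```

The range-query form for missing values works the same way: the union fieldA:[* TO -1] fieldA:[1 TO *] matches exactly the documents where fieldA exists and is nonzero.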
Re: help on caching and index files of Solr
Solr is an application that uses the Lucene Java library -- everything that exists in Lucene exists in Solr, Solr just adds on top of it, the raw Lucene index is in the data directory, Lucene's (minimal) caching is still used, additional Solr specific caching is added on top (see the wiki for more info) BTW: If you have more questions please start a new thread... : Subject: help on caching and index files of Solr : In-Reply-To: <[EMAIL PROTECTED]> http://people.apache.org/~hossman/#threadhijack Thread Hijacking on Mailing Lists When starting a new discussion on a mailing list, please do not reply to an existing message, instead start a fresh email. Even if you change the subject line of your email, other mail headers still track which thread you replied to and your question is "hidden" in that thread and gets less attention. It makes following discussions in the mailing list archives particularly difficult. See Also: http://en.wikipedia.org/wiki/Thread_hijacking -Hoss
caching and indexes in Solr
Hello, I have hands-on experience with both Lucene and Solr. The differences between these two search engines have been explained to some extent, but I still have some questions. I need to know: 1. The difference between the caching of Lucene and Solr index files. 2. As Solr is built on Lucene, is Solr's index file format the same as Lucene's? 3. Is there any provision in Solr to index a complete repository/directory, as there is with Lucene? Thanks in advance.