Solr 3.3 Grouping vs Deduplication, and a Deduplication Use Case
Solr 3.3 has a feature called "Grouping". Is it practically the same as deduplication? Here is my use case for duplicate removal - we have many documents with very similar (up to 99%) content. For some search queries, almost all of them come up in the first page of results. Of all these documents, essentially one is the original and the others are duplicates. We are able to identify the original on the basis of a number of factors - who uploaded it, when, and how many viral shares it has. It is also possible that the duplicates are uploaded earlier (and hence already exist in the search index) while the original is uploaded later (and gets added to the index later). AFAIK, Deduplication works at index time. Is there a way I can specify the original which should be returned, and keep the duplicates from coming up? *Pranav Prakash* "temet nosce" Twitter <http://twitter.com/pranavprakash> | Blog <http://blog.myblive.com> | Google <http://www.google.com/profiles/pranny>
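A rough illustration of the difference, since both features come up here: Deduplication (SignatureUpdateProcessorFactory) computes a signature per document at index time and can either overwrite duplicates or merely tag them, while Result Grouping collapses documents that share a field value at query time, so the choice of which copy to show can be made per query. A minimal query sketch, assuming the documents carry a signature field populated by the dedup processor with overwriteDupes=false (the field name and URL are placeholders):

  http://localhost:8983/solr/select?q=some+keyword&group=true&group.field=signature&group.limit=1

All copies stay in the index; grouping returns one document per signature value, and the one returned is the best-ranked document within its group, so the ranking (or, where supported, a within-group sort) can be tuned to favour the "original" by uploader, upload date or share count.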
How To Implement Sweet Spot Similarity?
I was wondering if there is *any* article on the web that provides implementation details and some sort of analysis of Sweet Spot Similarity. Google shows me all the JIRA commits and comments, but no article about an actual implementation. What are the various configurations that can be done? What are good approaches for figuring out the sweet spots? Can a combination of multiple Similarity classes be used? Any information would be much appreciated. *Pranav Prakash* "temet nosce" Twitter <http://twitter.com/pranavprakash> | Blog <http://blog.myblive.com> | Google <http://www.google.com/profiles/pranny>
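For what it's worth, the class itself lives in Lucene's misc contrib (org.apache.lucene.misc.SweetSpotSimilarity), and in Solr the global Similarity is swapped in at the bottom of schema.xml. A minimal sketch, assuming the lucene-misc jar is available on Solr's classpath:

  <!-- at the end of schema.xml, replacing the default similarity -->
  <similarity class="org.apache.lucene.misc.SweetSpotSimilarity"/>

Out of the box it behaves much like DefaultSimilarity; to get the actual "sweet spot" behaviour you would normally extend the class in Java, call its setters (baseline tf factors, length-norm min/max/steepness) with values tuned to what a typical good document in your corpus looks like, and reference the subclass here instead. Combining multiple Similarity implementations in one index is not possible in Solr 3.x - the similarity is global, not per-field.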
java.io.CharConversionException While Indexing in Solr 3.4
Hi List, I tried Solr 3.4.0 today, and while indexing I got the error java.lang.RuntimeException: [was class java.io.CharConversionException] Invalid UTF-8 middle byte 0x73 (at char #66611, byte #65289) My earlier version was Solr 1.4, and this same document went into the index successfully. Looking around, I see issue https://issues.apache.org/jira/browse/SOLR-2381 which seems to address this. I thought that patch was already applied in Solr 3.4.0. Is there something I am missing? Is there anything else I should provide - logs, my document details, etc.? *Pranav Prakash* "temet nosce" Twitter <http://twitter.com/pranavprakash> | Blog <http://blog.myblive.com> | Google <http://www.google.com/profiles/pranny>
Re: java.io.CharConversionException While Indexing in Solr 3.4
Just in case someone might be interested, here is the log:

SEVERE: java.lang.RuntimeException: [was class java.io.CharConversionException] Invalid UTF-8 middle byte 0x73 (at char #66641, byte #65289)
  at com.ctc.wstx.util.ExceptionUtil.throwRuntimeException(ExceptionUtil.java:18)
  at com.ctc.wstx.sr.StreamScanner.throwLazyError(StreamScanner.java:731)
  at com.ctc.wstx.sr.BasicStreamReader.safeFinishToken(BasicStreamReader.java:3657)
  at com.ctc.wstx.sr.BasicStreamReader.getText(BasicStreamReader.java:809)
  at org.apache.solr.handler.XMLLoader.readDoc(XMLLoader.java:287)
  at org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:146)
  at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:77)
  at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:67)
  at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
  at org.apache.solr.core.SolrCore.execute(SolrCore.java:1368)
  at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:356)
  at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:252)
  at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
  at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
  at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
  at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
  at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
  at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
  at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
  at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
  at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
  at org.mortbay.jetty.Server.handle(Server.java:326)
  at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
  at org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:945)
  at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:756)
  at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:218)
  at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
  at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:228)
  at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)
Caused by: java.io.CharConversionException: Invalid UTF-8 middle byte 0x73 (at char #66641, byte #65289)
  at com.ctc.wstx.io.UTF8Reader.reportInvalidOther(UTF8Reader.java:313)
  at com.ctc.wstx.io.UTF8Reader.read(UTF8Reader.java:204)
  at com.ctc.wstx.io.MergedReader.read(MergedReader.java:101)
  at com.ctc.wstx.io.ReaderSource.readInto(ReaderSource.java:84)
  at com.ctc.wstx.io.BranchingReaderSource.readInto(BranchingReaderSource.java:57)
  at com.ctc.wstx.sr.StreamScanner.loadMore(StreamScanner.java:992)
  at com.ctc.wstx.sr.BasicStreamReader.readTextSecondary(BasicStreamReader.java:4628)
  at com.ctc.wstx.sr.BasicStreamReader.readCoalescedText(BasicStreamReader.java:4126)
  at com.ctc.wstx.sr.BasicStreamReader.finishToken(BasicStreamReader.java:3701)
  at com.ctc.wstx.sr.BasicStreamReader.safeFinishToken(BasicStreamReader.java:3649)
  ... 26 more

Also, is there a setting by which I can change the depth of the backtrace? That would be helpful for showing the complete stack instead of "... 26 more".
*Pranav Prakash* "temet nosce" Twitter <http://twitter.com/pranavprakash> | Blog <http://blog.myblive.com> | Google <http://www.google.com/profiles/pranny> On Mon, Sep 19, 2011 at 14:16, Pranav Prakash wrote: > > Hi List, > > I tried Solr 3.4.0 today and while indexing I got the error > java.lang.RuntimeException: [was class java.io.CharConversionException] > Invalid UTF-8 middle byte 0x73 (at char #66611, byte #65289) > > My earlier version was Solr 1.4 and this same document went into index > successfully. Looking around, I see issue > https://issues.apache.org/jira/browse/SOLR-2381 which seems to fix the > issue. I thought this patch is already applied to Solr 3.4.0. Is there > something I am missing? > > Is there anything else I need to mention? Logs/ My document details etc.? > > *Pranav Prakash* > > "temet nosce" > > Twitter <http://twitter.com/pranavprakash> | Blog<http://blog.myblive.com> | > Google <http://www.google.com/profiles/pranny> >
Re: java.io.CharConversionException While Indexing in Solr 3.4
I managed to resolve this issue. It turns out that the problem was a faulty XML file being generated by the ruby-solr gem. I installed libxml-ruby and rsolr, and used the rsolr gem instead of ruby-solr. Also, if you face this kind of issue, the test-utf8.sh file included in exampledocs is a good way to test Solr's behaviour towards UTF-8 chars. Great work Solr team, and special thanks to Erik Hatcher. *Pranav Prakash* "temet nosce" Twitter <http://twitter.com/pranavprakash> | Blog <http://blog.myblive.com> | Google <http://www.google.com/profiles/pranny> On Mon, Sep 19, 2011 at 15:54, Pranav Prakash wrote: > Just in case someone might be interested, here is the log > > SEVERE: java.lang.RuntimeException: [was class java.io.CharConversionException] Invalid UTF-8 middle byte 0x73 (at char #66641, byte #65289) > [rest of the quoted stack trace trimmed - see the previous message in this thread]
Re: Stemming and other tokenizers
I have a similar use case, but slightly more flexible and straight forward. In my case, I have a field "language" which stores 'en', 'es' or whatever the language of the document is. Then the field 'transcript' stores the actual content which is in the language as described in language field. Following up with the conversation, is this how I am supposed to proceed: 1. Create one field type in my schema per supported language. This would cause me to create ~30 fields. 2. Since, I already know the language of my content, I can skip SOLR-1979 (which is expected in Solr 3.5) The point where I am unclear is, how do I specify at Index time, to use a certain field for a certain language? *Pranav Prakash* "temet nosce" Twitter <http://twitter.com/pranavprakash> | Blog <http://blog.myblive.com> | Google <http://www.google.com/profiles/pranny> On Mon, Sep 12, 2011 at 20:55, Jan Høydahl wrote: > Hi, > > Do they? Can you explain the layout of the documents? > > There are two ways to handle multi lingual docs. If all your docs have both > an English and a Norwegian version, you may either split these into two > separate documents, each with the "language" field filled by LangId - which > then also lets you filter by language. Or you may assign a title_en and > title_no to the same document (expand with more fields if you have more > languages per document), and keep it as one document. Your client will then > be adapted to search the language(s) that the user wants. > > If one document has multiple languages within the same field, e.g. "body", > say one paragraph of English and the next is Norwegian, then we currently do > not have any capability in Solr to apply different analysis (tokenization, > stemming etc) to each paragraph. > > -- > Jan Høydahl, search solution architect > Cominvent AS - www.cominvent.com > Solr Training - www.solrtraining.com > > On 12. sep. 2011, at 11:37, Manish Bafna wrote: > > > What is single document has multiple languages? > > > > On Mon, Sep 12, 2011 at 2:23 PM, Jan Høydahl > wrote: > > > >> Hi > >> > >> Everybody else use dedicated field per language, so why can't you? > >> Please explain your use case, and perhaps we can better help understand > >> what you're trying to do. > >> Do you always know the query language in advance? > >> > >> -- > >> Jan Høydahl, search solution architect > >> Cominvent AS - www.cominvent.com > >> Solr Training - www.solrtraining.com > >> > >> On 12. sep. 2011, at 08:28, Patrick Sauts wrote: > >> > >>> I can't create one field per language, that is the problem but I'll dig > >> into > >>> it following your indications. > >>> I let you know what I could come out with. > >>> > >>> Patrick. > >>> > >>> 2011/9/11 Jan Høydahl > >>> > >>>> Hi, > >>>> > >>>> You'll not be able to detect language and change stemmer on the same > >> field > >>>> in one go. You need to create one fieldType in your schema per > language > >> you > >>>> want to use, and then use LanguageIdentification (SOLR-1979) to do the > >> magic > >>>> of detecting language and renaming the field. If you set > >>>> langid.override=false, languid.map=true and populate your "language" > >> field > >>>> with the known language, you will probably get the desired effect. > >>>> > >>>> -- > >>>> Jan Høydahl, search solution architect > >>>> Cominvent AS - www.cominvent.com > >>>> Solr Training - www.solrtraining.com > >>>> > >>>> On 10. sep. 
2011, at 03:24, Patrick Sauts wrote: > >>>> > >>>>> Hello, > >>>>> > >>>>> > >>>>> > >>>>> I want to implement some king of AutoStemming that will detect the > >>>> language > >>>>> of a field based on a tag at the start of this field like #en# my > field > >>>> is > >>>>> stored on disc but I don't want this tag to be stored. Is there a way > >> to > >>>>> avoid this field to be stored ? > >>>>> > >>>>> To me all the filters and the tokenizers interact only with the > indexed > >>>>> field and not the stored one. > >>>>> > >>>>> Am I wrong ? > >>>>> > >>>>> Is it possible to you to do such a filter. > >>>>> > >>>>> > >>>>> > >>>>> Patrick. > >>>>> > >>>> > >>>> > >> > >> > >
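Since the language of each document is known up front, the index-time routing can be done entirely by the indexing client: define one analysed field per language in the schema and have the client write the transcript into the field that matches the stored language code. A sketch with illustrative names (the actual fieldType definitions would carry the per-language stemmers and stopword lists):

  <!-- schema.xml: one field (and fieldType) per supported language -->
  <field name="transcript_en" type="text_en" indexed="true" stored="true"/>
  <field name="transcript_es" type="text_es" indexed="true" stored="true"/>
  <!-- ...roughly 30 of these, one per language -->

At index time the client populates transcript_en when the language field is 'en', transcript_es when it is 'es', and so on; at query time the application searches the field(s) matching the user's language (for example via qf in a dismax handler). SOLR-1979's detection and field mapping is only needed when the language is not already known.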
StopWords coming in Top 10 terms despite using StopFilterFactory
Hi List, I included StopFilterFactory and I can see it taking action in the Analyzer Interface. However, when I go to Schema Analyzer, I see those stop words in the top 10 terms. Is this normal? *Pranav Prakash* "temet nosce" Twitter <http://twitter.com/pranavprakash> | Blog <http://blog.myblive.com> | Google <http://www.google.com/profiles/pranny>
Re: StopWords coming in Top 10 terms despite using StopFilterFactory
> You've got CommonGramsFilterFactory and StopFilterFactory both using
> stopwords.txt, which is a confusing configuration. Normally you'd want one
> or the other, not both ... but if you did legitimately have both, you'd want
> them to each use a different wordlist.

Maybe I am wrong, but my intention in using both of them is this - first, I want to use phrase queries, so I used CommonGramsFilterFactory. Secondly, I don't want those stopwords in my index, so I have used StopFilterFactory to remove them.

> The commongrams filter turns each found occurrence of a word in the file
> into two tokens - one prepended with the token before it, one appended with
> the token after it. If it's the first or last term in a field, it only
> produces one token. When it gets to the stopfilter, the combined terms no
> longer match what's in stopwords.txt, so no action is taken.
>
> If I had to guess, what you are seeing in the top 10 terms is the
> concatenation of your most common stopword with another word. If it were
> English, I would guess that to be "of_the" or something similar. If my
> guess is wrong, then I'm not sure what's going on, and some cut/paste of
> what you're actually seeing might be in order.

Here are the top terms I am actually seeing:

term    frequency
to      26164
and     25804
the     25566
of      25022
a       24918
in      24590
for     23646
n       23588
with    23055
is      22510

> Did you do delete and do a full reindex after you changed your schema?

Yup, I did that a couple of times.

> Thanks,
> Shawn

*Pranav Prakash* "temet nosce" Twitter <http://twitter.com/pranavprakash> | Blog <http://blog.myblive.com/> | Google <http://www.google.com/profiles/pranny>
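For reference, the usual way to get fast phrase queries on stopwords without also keeping the bare stopwords prominent is to let CommonGrams do both jobs - CommonGramsFilterFactory at index time and CommonGramsQueryFilterFactory at query time - rather than stacking a StopFilter on top. A minimal fieldType sketch (the surrounding analyzer details are assumptions, not taken from the original schema):

  <fieldType name="text_cg" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.CommonGramsFilterFactory" words="stopwords.txt" ignoreCase="true"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.CommonGramsQueryFilterFactory" words="stopwords.txt" ignoreCase="true"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

With this chain the single stopword tokens still exist in the index alongside the "word_word" grams (which is why they show up in the top terms); if they really must be dropped, the StopFilter has to run after CommonGrams so the grams are already formed before the bare stopwords are removed.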
Can't use ms() function on non-numeric legacy date field
Hi, I have been trying to boost my recent documents, using what is described here: http://wiki.apache.org/solr/FunctionQuery#Date_Boosting My date field looks like [the schema snippet was stripped by the mailing list]. However, upon trying to do ms(NOW, created_at), it shows the error: Can't use ms() function on non-numeric legacy date field created_at *Pranav Prakash* "temet nosce" Twitter <http://twitter.com/pranavprakash> | Blog <http://blog.myblive.com> | Google <http://www.google.com/profiles/pranny>
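That error message indicates the created_at field uses the older solr.DateField type; the ms() function needs a Trie-based date field. A hedged schema sketch of the usual fix (the type and field names are assumed to mirror the original schema):

  <fieldType name="tdate" class="solr.TrieDateField" precisionStep="6" positionIncrementGap="0"/>
  <field name="created_at" type="tdate" indexed="true" stored="true"/>

Changing the field type requires a full reindex, after which recip(ms(NOW,created_at),...) style boosts work as described on the wiki page.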
Suggestions on how to perform infrastructure migration from 1.4 to 3.4?
Hi List, Our production search infrastructure is - 1 indexing master and 2 identical serving slaves. They are all Solr 1.4 beasts. Apart from this, we have 1 beast on Solr 3.4, which we have benchmarked against our production setup (for performance and relevancy), and we would now like to upgrade our production setup. Something like this has not happened before in our organization. I'd like the community's opinions on the ways in which this migration can be performed. Will there be any downtime, and if so, for roughly how long? What are some of the common issues that might come up along the way? *Pranav Prakash* "temet nosce" Twitter <http://twitter.com/pranavprakash> | Blog <http://blog.myblive.com> | Google <http://www.google.com/profiles/pranny>
How to achieve Indexing @ 270GiB/hr
Greetings, While going through the article 265% indexing speedup with Lucene's concurrent flushing <http://java.dzone.com/news/265-indexing-speedup-lucenes?mz=33057-solr_lucene>, I was stunned by the possibilities for increasing indexing speed. I'd like to take inputs from everyone here as to how to achieve this kind of speed. As far as I understand, there are two broad ways of feeding data to Solr - 1. Using DataImportHandler 2. Using HTTP to POST docs to Solr. The speeds the article describes seem too much to expect from the second approach. Or is it possible with multiple instances feeding docs to Solr? My current setup does the following - 1. Execute SQL queries to create the set of documents that needs to be fed. 2. Go through the columns one by one, create XML for them, and send it over to Solr in batches of at most 500 docs. Even when using DataImportHandler, in what ways could this be optimized? If I am able to solve the problem of indexing data in our current setup, my life would become a lot easier. *Pranav Prakash* "temet nosce" Twitter <http://twitter.com/pranavprakash> | Blog <http://blog.myblive.com> | Google <http://www.google.com/profiles/pranny>
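Since the current pipeline already builds XML by hand and POSTs it, here is a small sketch of the batching approach using the rsolr gem (the connection URL, batch size and the docs array are assumptions, not taken from the original script):

  require 'rsolr'

  solr = RSolr.connect :url => 'http://localhost:8983/solr'

  # docs is an array of hashes, one per document, keyed by Solr field name
  docs.each_slice(500) do |batch|
    solr.add(batch)        # one HTTP request per 500 docs
  end
  solr.commit              # commit once at the end instead of per batch

Sending larger batches, indexing from several worker processes in parallel, and committing only once at the end (or relying on a generous autoCommit) usually matters far more for throughput than the choice of transport.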
Painfully slow indexing
Hi guys, I have set up a Solr instance and, upon attempting to index documents, the whole process is painfully slow. I will try to put as much info as I can in this mail. Please feel free to ask me for anything else that might be required. I am sending documents in batches not exceeding 2,000. The size of each batch varies but is usually around 10-15MiB. My indexing script tells me that Solr took T seconds to add N documents of size S. For the same data, the Solr log add QTime is QT. Some sample data:

N         | S                | T     | QT
390 docs  | 3,478,804 Bytes  | 14.5s | 2297
852 docs  | 6,039,535 Bytes  | 25.3s | 4237
1345 docs | 11,147,512 Bytes | 47s   | 8543
1147 docs | 9,457,717 Bytes  | 44s   | 2297
1096 docs | 13,058,204 Bytes | 54.3s | 8782

The time T includes the time to convert an array of Hash objects into XML, POST it to Solr, and receive the acknowledgement from Solr. Clearly, there is a huge difference between T and QT. After a lot of effort, I have no clue why these times do not match. The server has 16 cores and 48GiB RAM. JVM options are -Xms5000M -Xmx5000M -XX:+UseParNewGC. I believe my indexing is getting slow. The relevant portion of my configuration is as follows. On a related note, every document has one dynamic field. At this rate, it takes me ~30hrs to do a full index of my database. I would really appreciate the kindness of the community in getting this indexing faster. [The indexing config snippet was stripped by the mailing list; only the bare values survive: false 10 10 2048 2147483647 300 1000 5 256 10 false true true 1 0 false 10] *Pranav Prakash* "temet nosce" Twitter <http://twitter.com/pranavprakash> | Blog <http://blog.myblive.com> | Google <http://www.google.com/profiles/pranny>
Re: Painfully slow indexing
Hey guys, Your responses are welcome, but I still haven't gained a lot of improvements *Are you posting through HTTP/SOLRJ?* I am using RSolr gem, which internally uses Ruby HTTP lib to POST document to Solr *Your script time 'T' includes time between sending POST request -to- the response fetched after successful response right??* Correct. It also includes the time taken to convert all those documents from a Ruby Hash to XML. *generate the ready-for-indexing XML documents on a file system* Alain, I have somewhere 6m documents for Indexing. You mean to say that I should convert all of it into one XML file and then index? *are you calling commit after your batches or do an optimize by any chance?* I am not optimizing, but I am performing an autocommit every 10 docs. *Pranav Prakash* "temet nosce" Twitter <http://twitter.com/pranavprakash> | Blog <http://blog.myblive.com> | Google <http://www.google.com/profiles/pranny> On Fri, Oct 21, 2011 at 16:32, Simon Willnauer < simon.willna...@googlemail.com> wrote: > On Wed, Oct 19, 2011 at 3:58 PM, Pranav Prakash wrote: > > Hi guys, > > > > I have set up a Solr instance and upon attempting to index document, the > > whole process is painfully slow. I will try to put as much info as I can > in > > this mail. Pl. feel free to ask me anything else that might be required. > > > > I am sending documents in batches not exceeding 2,000. The size of each > of > > them depends but usually is around 10-15MiB. My indexing script tells me > > that Solr took T seconds to add N documents of size S. For the same data, > > the Solr Log add QTime is QT. Some of the sample data are: > > > > N ST QT > > - > > 390 docs | 3,478,804 Bytes | 14.5s| 2297 > > 852 docs | 6,039,535 Bytes | 25.3s| 4237 > > 1345 docs | 11,147,512 Bytes | 47s | 8543 > > 1147 docs | 9,457,717 Bytes | 44s | 2297 > > 1096 docs | 13,058,204 Bytes | 54.3s | 8782 > > > > The time T includes the time of converting an array of Hash objects into > > XML, POSTing it to Solr and response acknowledged from Solr. Clearly, > there > > is a huge difference between both the time T and QT. After a lot of > efforts, > > I have no clue why these times do not match. > > > > The Server has 16 cores, 48GiB RAM. JVM options are -Xms5000M -Xmx5000M > > -XX:+UseParNewGC > > > > I believe my Indexing is getting slow. Relevant portion from my schema > file > > are as follows. On a related note, every document has one dynamic field. > > Based on this rate, it takes me ~30hrs to do a full index of my database. > > I would really appreciate kindness of community in order to get this > > indexing faster. > > > > > > > > false > > > > > > > > 10 > > > > 10 > > > > > > > > 2048 > > > > 2147483647 > > > > 300 > > > > 1000 > > > > 5 > > > > 256 > > > > 10 > > > > false > > > > > > > > > > > > > > > > true > > > > true > > > > > > > > 1 > > > > 0 > > > > > > > > false > > > > > > > > > > > > > > > > 10 > > > > > > > > > > > > > > *Pranav Prakash* > > > > "temet nosce" > > > > Twitter <http://twitter.com/pranavprakash> | Blog < > http://blog.myblive.com> | > > Google <http://www.google.com/profiles/pranny> > > > > hey, > > are you calling commit after your batches or do an optimize by any chance? > > I would suggest you to stream your documents to solr and try to commit > only if you really need to. Set your RAM Buffer to something between > 256 and 320 MB and remove the maxBufferedDocs setting completely. You > can also experiment with your merge settings a little and 10 merging > threads seem to be a lot. 
I know you have lots of CPU but IO will be > the bottleneck here. > > simon >
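To put Simon's advice into config form, a rough Solr 3.x solrconfig.xml sketch (the element names are the stock ones; the values are illustrative assumptions, not the poster's actual settings):

  <indexDefaults>
    <!-- let RAM usage drive segment flushes; omit maxBufferedDocs entirely -->
    <ramBufferSizeMB>320</ramBufferSizeMB>
    <mergeFactor>10</mergeFactor>
  </indexDefaults>

Beyond that, committing rarely (or letting a generous autoCommit handle it) and streaming documents in large batches usually closes most of the gap between the client-side time T and the server-side QTime.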
How to programmatically check if the index is optimized or not?
Hi, After the commit, my optimize usually takes 20 minutes. The thing is that I need to know programmatically whether the optimization has completed or not. Is there an API call through which I can know the status of the optimize? *Pranav Prakash* "temet nosce" Twitter <http://twitter.com/pranavprakash> | Blog <http://blog.myblive.com> | Google <http://www.google.com/profiles/pranny>
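One low-tech possibility, assuming the stock admin handlers are enabled (host, port and core path are placeholders): poll the Luke request handler, which reports whether the index is optimized.

  http://localhost:8983/solr/admin/luke?numTerms=0&wt=json

The index section of the response includes an optimized flag (along with version and document counts), so a script can issue the optimize and then poll this URL until optimized is true. Note also that the optimize call itself blocks until completion unless waitSearcher/waitFlush are set to false, so simply waiting for the HTTP response of the optimize request is often enough.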
Highlighting uses lots of memory and eventually slows down Solr
Hi Group, I would like to have highlighting for search, and I have the fields indexed with the following schema (Solr 3.4): [the schema snippet was stripped by the mailing list] And the following highlighter config (the element names were stripped; only the bare values survive: 100, 20, 0.5, [-\w ,/\n\"']{20,200}). The problem is that when I turn on highlighting, I face memory issues. The memory usage on the system goes higher and higher until it consumes almost all the memory (I don't receive OOM errors; there is always around 300 MB free). The total memory I have is 48GiB. My index size is 138GiB and there are about 10m documents in the index. I also get the following warning, but I am not sure how to address it: WARNING: Deprecated syntax found. [the element names were stripped; the warning is about moving the old highlighting config into the newer searchComponent form] My Solr log with highlighting turned on looks something like this: [core0] webapp=/solr path=/select params={mm=3<90%25&qf=title^2&hl.simple.pre=&hl.fl=title,transcript,transcript_en&wt=ruby&hl=true&rows=12&defType=dismax&fl=id,title,description&debugQuery=false&start=0&q=asdfghjkl&bf=recip(ms(NOW,created_at),1.88e-11,1,1)&hl.simple.post=&ps=50} Any help on this would be greatly appreciated. Thanks in advance!! *Pranav Prakash* "temet nosce" Twitter <http://twitter.com/pranavprakash> | Blog <http://blog.myblive.com> | Google <http://www.google.com/profiles/pranny>
Re: Highlighting uses lots of memory and eventually slows down Solr
No response!! Bumping it up. *Pranav Prakash* "temet nosce" Twitter <http://twitter.com/pranavprakash> | Blog <http://blog.myblive.com> | Google <http://www.google.com/profiles/pranny> On Fri, Dec 9, 2011 at 14:11, Pranav Prakash wrote: > Hi Group, > > I would like to have highlighting for search and I have the fields indexed with the following schema (Solr 3.4) > [rest of the quoted message trimmed - see the original post above]
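Two things that commonly help with highlighter memory and CPU on large text fields (a hedged sketch - the field name transcript and the sizes below are assumptions): cap how much of each field is analysed per snippet, and give the highlighter term vectors so it does not have to re-analyse the stored text.

  <!-- schema.xml: store term vectors for the highlighted field (requires a reindex) -->
  <field name="transcript" type="text" indexed="true" stored="true"
         termVectors="true" termPositions="true" termOffsets="true"/>

  <!-- request parameters (or handler defaults) -->
  &hl.maxAnalyzedChars=51200&hl.useFastVectorHighlighter=true

hl.useFastVectorHighlighter needs all three termVector* attributes above; hl.maxAnalyzedChars alone already bounds the per-document highlighting work if re-indexing 138GiB is not an option.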
Something like "featured results" in solr response?
Hi, I believe there is a feature in Solr which allows returning a set of "featured" documents for a query. I read about it a couple of months back, and now that I have decided to work on it, I somehow can't find the reference. Here is the description - for a search keyword, apart from the results generated by Solr (which are based on relevancy score), there is another set of documents which just comes up. It is very similar to the "sponsored results" feature of Google. Can you guys point me to the appropriate resources for this? *Pranav Prakash* "temet nosce" Twitter <http://twitter.com/pranavprakash> | Blog <http://blog.myblive.com> | Google <http://www.google.com/profiles/pranny>
Re: Something like "featured results" in solr response?
Thanks a lot :-) This is exactly what I had read back then. However, going through it now, it seems that everytime a document needs to be elevated, it has to be in the config file. Which means that Solr should be restarted. This does not make a lot of sense for a production environment, where Solr restarts are as infrequent as config changes. What could be a sound way to implement this? *Pranav Prakash* "temet nosce" Twitter <http://twitter.com/pranavprakash> | Blog <http://blog.myblive.com> | Google <http://www.google.com/profiles/pranny> 2012/1/30 Rafał Kuć > Hello! > > Please look at http://wiki.apache.org/solr/QueryElevationComponent. > > -- > Regards, > Rafał Kuć > Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch > > > Hi, > > > I believe, there is a feature in Solr, which allows to return a set of > > "featured" documents for a query. I did read it couple of months back, > and > > now when I have decided to work on it, I somehow can't find it's > reference. > > > Here is the description - For a search keyword, apart from the results > > generated by Solr (which is based on relevancy, score), there is another > > set of documents which just comes up. It is very much similar to the > > "sponsored results" feature of Google. > > > Can you guys point me to the appropriate resources for the same? > > > > *Pranav Prakash* > > > "temet nosce" > > > Twitter <http://twitter.com/pranavprakash> | Blog < > http://blog.myblive.com> | > > Google <http://www.google.com/profiles/pranny> > > > > >
Re: Something like "featured results" in solr response?
Wow, this looks interesting. *Pranav Prakash* "temet nosce" Twitter <http://twitter.com/pranavprakash> | Blog <http://blog.myblive.com> | Google <http://www.google.com/profiles/pranny> On Mon, Jan 30, 2012 at 21:16, Erick Erickson wrote: > There's the tricky line: > "If the file exists in the /conf/ directory it will be loaded once at > start-up. If it exists in the data directory, it will be reloaded for > each IndexReader." > > on the page: http://wiki.apache.org/solr/QueryElevationComponent > > Which basically means that if your config file is in the right directory, > it'll be reloaded whenever the index changes, i.e. when a replication > happens in a master/slave setup or when a commit happens on > a single machine used for both indexing and searching. > > Best > Erick > > On Mon, Jan 30, 2012 at 8:31 AM, Pranav Prakash wrote: > > Thanks a lot :-) This is exactly what I had read back then. However, > going > > through it now, it seems that everytime a document needs to be elevated, > it > > has to be in the config file. Which means that Solr should be restarted. > > This does not make a lot of sense for a production environment, where > Solr > > restarts are as infrequent as config changes. > > > > What could be a sound way to implement this? > > > > *Pranav Prakash* > > > > "temet nosce" > > > > Twitter <http://twitter.com/pranavprakash> | Blog < > http://blog.myblive.com> | > > Google <http://www.google.com/profiles/pranny> > > > > > > 2012/1/30 Rafał Kuć > > > >> Hello! > >> > >> Please look at http://wiki.apache.org/solr/QueryElevationComponent. > >> > >> -- > >> Regards, > >> Rafał Kuć > >> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch > >> > >> > Hi, > >> > >> > I believe, there is a feature in Solr, which allows to return a set of > >> > "featured" documents for a query. I did read it couple of months back, > >> and > >> > now when I have decided to work on it, I somehow can't find it's > >> reference. > >> > >> > Here is the description - For a search keyword, apart from the results > >> > generated by Solr (which is based on relevancy, score), there is > another > >> > set of documents which just comes up. It is very much similar to the > >> > "sponsored results" feature of Google. > >> > >> > Can you guys point me to the appropriate resources for the same? > >> > >> > >> > *Pranav Prakash* > >> > >> > "temet nosce" > >> > >> > Twitter <http://twitter.com/pranavprakash> | Blog < > >> http://blog.myblive.com> | > >> > Google <http://www.google.com/profiles/pranny> > >> > >> > >> > >> > >> >
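For reference, the setup Erick describes looks roughly like this (a sketch - the component name and file placement follow the usual convention, and the doc ids are made up). Keeping elevate.xml in the data directory rather than conf/ is what makes it reload on each new IndexReader:

  <!-- solrconfig.xml -->
  <searchComponent name="elevator" class="solr.QueryElevationComponent">
    <str name="queryFieldType">string</str>
    <str name="config-file">elevate.xml</str>
  </searchComponent>

  <!-- elevate.xml (placed in the data dir, re-read on commit/replication) -->
  <elevate>
    <query text="solr tutorial">
      <doc id="DOC1"/>
      <doc id="DOC2" exclude="true"/>
    </query>
  </elevate>

The "elevator" component also has to be added to the search handler's last-components list so it runs as part of normal queries.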
Typical Cache Values
Based on the hit ratios of my caches, they seem to be pretty low. Here they are. What are typical values in your production setups? What are some of the things that can be done to improve the ratios?

queryResultCache
  lookups : 3234602
  hits : 496
  hitratio : 0.00
  inserts : 3234239
  evictions : 3230143
  size : 4096
  warmupTime : 8886
  cumulative_lookups : 3465734
  cumulative_hits : 526
  cumulative_hitratio : 0.00
  cumulative_inserts : 3465208
  cumulative_evictions : 3457151

documentCache
  lookups : 17647360
  hits : 11935609
  hitratio : 0.67
  inserts : 5711851
  evictions : 5707755
  size : 4096
  warmupTime : 0
  cumulative_lookups : 19009142
  cumulative_hits : 12813630
  cumulative_hitratio : 0.67
  cumulative_inserts : 6195512
  cumulative_evictions : 6187460

fieldValueCache
  lookups : 0
  hits : 0
  hitratio : 0.00
  inserts : 0
  evictions : 0
  size : 0
  warmupTime : 0
  cumulative_lookups : 0
  cumulative_hits : 0
  cumulative_hitratio : 0.00
  cumulative_inserts : 0
  cumulative_evictions : 0

filterCache
  lookups : 30059278
  hits : 28813869
  hitratio : 0.95
  inserts : 1245744
  evictions : 1245232
  size : 512
  warmupTime : 28005
  cumulative_lookups : 32155745
  cumulative_hits : 30845811
  cumulative_hitratio : 0.95
  cumulative_inserts : 1309934
  cumulative_evictions : 1309245

*Pranav Prakash* "temet nosce" Twitter <http://twitter.com/pranavprakash> | Blog <http://blog.myblive.com> | Google <http://www.google.com/profiles/pranny>
Re: Typical Cache Values
> > * > * > This is not unusual, but there's also not much reason to give this much > memory in your case. This is the cache that is hit when a user pages > through result set. Your numbers would seem to indicate one of two things: > 1> your window is smaller than 2 pages, see solrconfig.xml, > > or > 2> your users are rarely going to the next page. > > this cache isn't doing you much good, but then it's also not using that > much in the way of resources. > > True it is. Although the queryResultWindowSize is 30, I will be reducing it to 4 or so. And yes, we have observed that mostly people don't go beyond the first page > > documentCache > > > > lookups : 17647360 > > hits : 11935609 > > hitratio : 0.67 > > inserts : 5711851 > > evictions : 5707755 > > size : 4096 > > warmupTime : 0 > > cumulative_lookups : 19009142 > > cumulative_hits : 12813630 > > cumulative_hitratio : 0.67 > > cumulative_inserts : 6195512 > > cumulative_evictions : 6187460 > > > > Again, this is actually quite reasonable. This cache > is used to hold document data, and often doesn't have > a great hit ratio. It is necessary though, it saves quite > a bit of disk seeks when servicing a single query. > > > > > fieldValueCache > > > > lookups : 0 > > hits : 0 > > hitratio : 0.00 > > inserts : 0 > > evictions : 0 > > size : 0 > > warmupTime : 0 > > cumulative_lookups : 0 > > cumulative_hits : 0 > > cumulative_hitratio : 0.00 > > cumulative_inserts : 0 > > cumulative_evictions : 0 > > > > Not doing much in the way of faceting, are you? > > No. We don't facet results > > > > filterCache > > > > lookups : 30059278 > > hits : 28813869 > > hitratio : 0.95 > > inserts : 1245744 > > evictions : 1245232 > > size : 512 > > warmupTime : 28005 > > cumulative_lookups : 32155745 > > cumulative_hits : 30845811 > > cumulative_hitratio : 0.95 > > cumulative_inserts : 1309934 > > cumulative_evictions : 1309245 > > > > > > Not a bad hit ratio here, this is where > fq filters are stored. One caution here; > it is better to break out your filter > queries where possible into small chunks. > Rather than write fq=field1:val1 AND field2:val2, > it's better to write fq=field1:val1&fq=field2:val2 > Think of this cache as a map with the query > as the key. If you write the fq the first way above, > subsequent fqs for either half won't use the cache. > That was a great advise. We do use the former approach but going forward we would stick to the latter one. Thanks, Pranav
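Erick's point about splitting filters, illustrated with two hypothetical fields (the names are made up):

  # cached as a single entry; a later query filtering only on lang=en cannot reuse it
  fq=lang:en AND type:video

  # cached as two independent entries; each can be reused by other queries
  fq=lang:en&fq=type:video

Each distinct fq string becomes its own filterCache key, which is why the second form gives better cache reuse, at the cost of one extra cache entry per filter.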
Deduplication in MLT
I have an implementation of Deduplication as mentioned at http://wiki.apache.org/solr/Deduplication. It is helpful in grouping search results. I would like to achieve the same functionality in my MLT queries, where the result set should include grouped documents. What is a good way to do the same? *Pranav Prakash* "temet nosce"
Questions about Solr MLT Handler, performance, Indexes
Hi folks, I am new to Solr and using it for a web application. I have been experimenting with it and have a couple of doubts which I was unable to resolve via Google. Our portal allows users to upload content, and the fields we use are - title, description, transcript, tags. Each piece of content also has counters - hits, downloads, favorites - and an auto-calculated value - rating. We have a master/slave configuration (1 master, 2 slaves). Solr version: 1.4.0 Java version "1.6.0_16" Java(TM) SE Runtime Environment (build 1.6.0_16-b01) Java HotSpot(TM) 64-Bit Server VM (build 14.2-b01, mixed mode) 32GiB RAM and 8 cores Index size: ~100 GiB One of my use cases is to find related documents given a document ID. I have been using the MoreLikeThis handler to generate related documents, using a DisMax query. Now, I have to filter out certain content from the results Solr gives me. So, if for a document id X Solr returns a list of 20 related documents, I want to apply a filter so that these 20 documents do not contain "black listed words". This is fairly straightforward in a direct query using the NOT operator. How is it possible to implement similar behavior in MoreLikeThisHandler? Every week we perform a full index of all the documents, plus a nightly incremental indexing. This is done by a script which reads data from MySQL and updates it into Solr. Sometimes the script fails after updating 60% of the documents; a commit has not been performed at this stage. The next cron executes, adds some more documents, and commits them. Will this commit include the current update as well as the earlier uncommitted updates? Are those uncommitted changes (which are stored in a temp file) deleted after some time? Is there a way to clean uncommitted changes? Of late, Solr has started to perform slowly. When Solr is started it responds to requests in ~100ms. Gradually (very gradually) it gets to a point where the average response time of the last 10 queries goes beyond 5000ms, and that is when requests start to pile up. As I compose this mail, an optimize command is being executed, which I hope should help, but to what extent I will need to see. Finally, what happens if the schema of master and slave are different (there exists a field in master which does not exist in slave)? I thought that replication would show me some kind of error, but it went on successfully. Thanks, Pranav
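On the blacklist question: the MoreLikeThis handler generally accepts filter queries (fq) just as the standard handler does - worth verifying on 1.4.0 - so one option is a negative fq. A sketch with made-up host, field and terms:

  http://localhost:8983/solr/mlt?q=id:X&mlt.fl=title,description,transcript,tags&mlt.mintf=1&mlt.mindf=2&fq=-transcript:(badword1 OR badword2)&rows=20

If the handler version in use does not honor fq, a fallback is to over-fetch (say rows=40) and drop blacklisted documents on the application side.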
Removing duplicate documents from search results
How can I remove very similar documents from search results? My scenario is that there are documents in the index which are almost identical (people submitting the same stuff multiple times, sometimes different people submitting the same stuff). Now, when a search is performed for a "keyword", the same document quite frequently comes up multiple times in the top N results. I want to remove those duplicate (or probable duplicate) documents - very similar to what Google does when it says "In order to show you the most relevant results, duplicates have been removed". How can I achieve this functionality using Solr? Does Solr have anything built in, or a plugin, which could help me with it? *Pranav Prakash* "temet nosce" Twitter <http://twitter.com/pranavprakash> | Blog <http://blog.myblive.com> | Google <http://www.google.com/profiles/pranny>
Re: Removing duplicate documents from search results
This approach would definitely work if the two documents are *exactly* the same. But this is very fragile - even if one extra space has been added, the whole hash would change. What I am really looking for is some percentage similarity between documents, so I can remove those documents which are more than 95% similar. *Pranav Prakash* "temet nosce" Twitter <http://twitter.com/pranavprakash> | Blog <http://blog.myblive.com> | Google <http://www.google.com/profiles/pranny> On Thu, Jun 23, 2011 at 15:16, Omri Cohen wrote: > What you need to do, is to calculate some HASH (using any message digest > algorithm you want, md5, sha-1 and so on), then do some reading on solr > field collapse capabilities. Should not be too complicated.. > > *Omri Cohen* > > Co-founder @ yotpo.com | o...@yotpo.com | +972-50-7235198 | +972-3-6036295 > > [rest of the quoted signature and the forwarded copy of the original question trimmed]
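For near-duplicate rather than exact matching, Solr's Deduplication support can use a fuzzy signature. A hedged solrconfig.xml sketch following the wiki's example (the field list and chain name are placeholders for whatever fits the actual schema):

  <updateRequestProcessorChain name="dedupe">
    <processor class="solr.processor.SignatureUpdateProcessorFactory">
      <bool name="enabled">true</bool>
      <str name="signatureField">signature</str>
      <!-- keep near-duplicates in the index and collapse them at query time instead -->
      <bool name="overwriteDupes">false</bool>
      <str name="fields">title,description,transcript</str>
      <str name="signatureClass">org.apache.solr.update.processor.TextProfileSignature</str>
    </processor>
    <processor class="solr.LogUpdateProcessorFactory"/>
    <processor class="solr.RunUpdateProcessorFactory"/>
  </updateRequestProcessorChain>

TextProfileSignature is the fuzzy option (it tolerates small differences such as extra whitespace); with overwriteDupes=true it would drop or overwrite near-duplicates at index time instead. The chain also has to be attached to the update request handler - the parameter name differs between versions (update.processor in older releases, update.chain in newer ones).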
Re: how to index data in solr form database automatically
Cron is a time-based job scheduler in Unix-like computer operating systems. en.wikipedia.org/wiki/Cron *Pranav Prakash* "temet nosce" Twitter <http://twitter.com/pranavprakash> | Blog <http://blog.myblive.com> | Google <http://www.google.com/profiles/pranny> On Fri, Jun 24, 2011 at 12:26, Romi wrote: > Yeah i am using data-import to get data from database and indexing it. but > what is cron can you please provide a link for it > > - > Thanks & Regards > Romi > -- > View this message in context: > http://lucene.472066.n3.nabble.com/how-to-index-data-in-solr-form-database-automatically-tp3102893p3103072.html > Sent from the Solr - User mailing list archive at Nabble.com. >
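As a concrete example (the paths, host and schedule are placeholders), a crontab entry that fires a DataImportHandler delta-import every night at 2am, assuming a /dataimport handler is configured:

  0 2 * * * /usr/bin/curl -s 'http://localhost:8983/solr/dataimport?command=delta-import&clean=false&commit=true' > /dev/null

command=full-import (usually with clean=true) would be the periodic full-rebuild variant of the same idea.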
Custom Handler support in Solr-ruby
Hi, I found the solr-ruby gem (http://wiki.apache.org/solr/solr-ruby) really inflexible in terms of specifying the request handler. The Solr::Request::Select class defines the handler as "select", and all the other request classes inherit from it. And since the methods in Solr::Connection use one of the classes from Solr::Request, I don't see a direct way to use a custom handler (which I have made for MoreLikeThis). Currently, the approach I am using is to build the query URL, do a curl, parse the response and return it. Even if I were to extend the classes, I'd end up making a new Solr::Request::CustomSelect, which would be similar to Solr::Request::Select except for the flexibility to let the user provide a handler (defaulting to 'select'), and then creating separate classes for DisMax and the rest, derived from Solr::Request::CustomSelect. Isn't this too much overhead? Or am I missing something? Also, where can I file bugs against solr-ruby? *Pranav Prakash* "temet nosce" Twitter <http://twitter.com/pranavprakash> | Blog <http://blog.myblive.com> | Google <http://www.google.com/profiles/pranny>
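A lighter-weight possibility - a sketch only, which assumes (as described above) that Select exposes the handler name through an overridable method; the exact method name in the installed gem version should be checked - is to subclass Select once and override just the handler:

  require 'solr'

  # Hypothetical subclass: same behaviour as Select, but pointed at the
  # custom MoreLikeThis handler registered in solrconfig.xml.
  class Solr::Request::MoreLikeThis < Solr::Request::Select
    def handler
      'mlt'
    end
  end

It would then be constructed and sent the same way the existing Select/DisMax requests are; whether Solr::Connection accepts an arbitrary Request subclass like this depends on the gem internals, so treat it as a starting point rather than a drop-in.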
Index Version and Epoch Time?
Hi, I am not sure what the index version value is. It looks like an epoch time, but in my case it points to one month back; however, I can see documents which were added last week in the index. Even after I did a commit, the index version did not change. Isn't it supposed to change on every commit? If not, is there a way to look up the last index time? Also, this page http://wiki.apache.org/solr/SolrReplication#Replication_Dashboard shows a Replication Dashboard. How is this dashboard invoked? Is there a URL which needs to be called? *Pranav Prakash* "temet nosce" Twitter <http://twitter.com/pranavprakash> | Blog <http://blog.myblive.com> | Google <http://www.google.com/profiles/pranny>
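If the ReplicationHandler is configured, it can also be queried directly rather than through the dashboard (host, port and core name below are placeholders):

  http://localhost:8983/solr/core0/replication?command=indexversion
  http://localhost:8983/solr/core0/replication?command=details

indexversion returns the version and generation of the latest replicatable commit, and details returns a fuller report - essentially the same data the Replication Dashboard at /solr/core0/admin/replication/index.jsp renders.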
Re: Removing duplicate documents from search results
I found the deduplication thing really useful. Although I have not yet started to work on it, as there are some other low hanging fruits I've to capture. Will share my thoughts soon. *Pranav Prakash* "temet nosce" Twitter <http://twitter.com/pranavprakash> | Blog <http://blog.myblive.com> | Google <http://www.google.com/profiles/pranny> 2011/6/28 François Schiettecatte > Maybe there is a way to get Solr to reject documents that already exist in > the index but I doubt it, maybe someone else with can chime here here. You > could do a search for each document prior to indexing it so see if it is > already in the index, that is probably non-optimal, maybe it is easiest to > check if the document exists in your Riak repository, it no add it and index > it, and drop if it already exists. > > François > > On Jun 28, 2011, at 8:24 AM, Mohammad Shariq wrote: > > > I am making the Hash from URL, but I can't use this as UniqueKey because > I > > am using UUID as UniqueKey, > > Since I am using SOLR as index engine Only and using Riak(key-value > > storage) as storage engine, I dont want to do the overwrite on duplicate. > > I just need to discard the duplicates. > > > > > > > > 2011/6/28 François Schiettecatte > > > >> Create a hash from the url and use that as the unique key, md5 or sha1 > >> would probably be good enough. > >> > >> Cheers > >> > >> François > >> > >> On Jun 28, 2011, at 7:29 AM, Mohammad Shariq wrote: > >> > >>> I also have the problem of duplicate docs. > >>> I am indexing news articles, Every news article will have the source > URL, > >>> If two news-article has the same URL, only one need to index, > >>> removal of duplicate at index time. > >>> > >>> > >>> > >>> On 23 June 2011 21:24, simon wrote: > >>> > >>>> have you checked out the deduplication process that's available at > >>>> indexing time ? This includes a fuzzy hash algorithm . > >>>> > >>>> http://wiki.apache.org/solr/Deduplication > >>>> > >>>> -Simon > >>>> > >>>> On Thu, Jun 23, 2011 at 5:55 AM, Pranav Prakash > >> wrote: > >>>>> This approach would definitely work is the two documents are > *Exactly* > >>>> the > >>>>> same. But this is very fragile. Even if one extra space has been > added, > >>>> the > >>>>> whole hash would change. What I am really looking for is some %age > >>>>> similarity between documents, and remove those documents which are > more > >>>> than > >>>>> 95% similar. > >>>>> > >>>>> *Pranav Prakash* > >>>>> > >>>>> "temet nosce" > >>>>> > >>>>> Twitter <http://twitter.com/pranavprakash> | Blog < > >>>> http://blog.myblive.com> | > >>>>> Google <http://www.google.com/profiles/pranny> > >>>>> > >>>>> > >>>>> On Thu, Jun 23, 2011 at 15:16, Omri Cohen wrote: > >>>>> > >>>>>> What you need to do, is to calculate some HASH (using any message > >> digest > >>>>>> algorithm you want, md5, sha-1 and so on), then do some reading on > >> solr > >>>>>> field collapse capabilities. Should not be too complicated.. > >>>>>> > >>>>>> *Omri Cohen* > >>>>>> > >>>>>> > >>>>>> > >>>>>> Co-founder @ yotpo.com | o...@yotpo.com | +972-50-7235198 | > >>>> +972-3-6036295 > >>>>>> > >>>>>> > >>>>>> > >>>>>> > >>>>>> My profiles: [image: LinkedIn] <http://www.linkedin.com/in/omric> > >>>> [image: > >>>>>> Twitter] <http://www.twitter.com/omricohe> [image: > >>>>>> WordPress]<http://omricohen.me> > >>>>>> Please consider your environmental responsibility. Before printing > >> this > >>>>>> e-mail message, ask yourself whether you really need a hard copy. 
> >>>>>> IMPORTANT: The contents of this email and any attachments are > >>>> confidential. > >>>>>> They are intended for the named recipien
Re: Index Version and Epoch Time?
Hi, I am facing multiple issues with Solr and I am not sure what happens in each case. I am quite new to Solr and there are some scenarios I'd like to discuss with you. We have a huge volume of documents to be indexed - somewhere around 5 million. We have a full indexer script, which essentially picks up all the documents from the database and updates them into Solr, and an incremental script which adds new documents to Solr. The relevant areas of my config file look like this: [the replication config snippet was stripped by the mailing list; it used the ${enable.master:false} and ${enable.slave:false} properties, replicated after startup and commit, and pointed the slave at http://hostname:port/solr/core0/replication] Sometimes the full indexer script breaks while adding documents to Solr. The script adds the documents and then commits the operation, so when the script breaks, we have a huge amount of data which has been updated but not committed. Next, the incremental index script executes, figures out all the new entries, and adds them to Solr. It works successfully and commits the operation. - Will the commit by the incremental indexer script also commit the previously uncommitted changes made by the full indexer script before it broke? Sometimes during execution, Solr's avg response time (avg resp time for the last 10 requests, read from the log file) goes as high as 9000ms (I am still unclear why - any ideas on how to start hunting for the problem?), so a watchdog process restarts Solr (because otherwise requests queue up at the application server, which causes the app server to crash). In my local environment, I performed an experiment by adding docs to Solr, killing the process and restarting it. I found that the uncommitted changes were applied and searchable, even though the updates were never committed. Could you explain to me how this is happening, or is there a configuration that can be adjusted for this? Also, what would the index state be if, after restarting Solr, a commit is or is not applied? I'd be happy to provide any other information that might be needed. *Pranav Prakash* "temet nosce" Twitter <http://twitter.com/pranavprakash> | Blog <http://blog.myblive.com> | Google <http://www.google.com/profiles/pranny> On Tue, Jun 28, 2011 at 20:55, Shalin Shekhar Mangar wrote: > On Tue, Jun 28, 2011 at 4:18 PM, Pranav Prakash wrote: > > > > > I am not sure what is the index number value? It looks like an epoch > time, > > but in my case, this points to one month back. However, i can see > documents > > which were added last week, to be in the index. > > > > The index version shown on the dashboard is the time at which the most > recent index segment was created. I'm not sure why it has a value older > than > a month if a commit has happened after that time. > > > > > Even after I did a commit, the index number did not change? Isn't it > > supposed to change on every commit? If not, is there a way to look into > the > > last index time? > > > > Yeah, it changes after every commit which added/deleted a document. > > > > Also, this page > > http://wiki.apache.org/solr/SolrReplication#Replication_Dashboard shows > a > > Replication Dashboard. How is this dashboard invoked? Is there any URL > > which > > needs to be called? > > > > > If you have configured replication correctly, the admin dashboard should > show a "Replication" link right next to the "Schema Browser" link. The path > should be /admin/replication/index.jsp > > -- > Regards, > Shalin Shekhar Mangar. >
Dealing with keyword stuffing
I guess most of you have already dealt with - and many of you might still be dealing with - keyword stuffing. Here is my scenario. We have a huge index containing about 6m docs (not sure if that counts as huge :-) ), and every document contains title, description, tags and content (textual data). People have been doing keyword stuffing on the documents, so when a search is made for a "query term", the first results are always the over-optimized ones. So, instead of getting relevant results, people get spam content (highly optimized, keyword-stuffed content) as the first few results. I have tried a couple of things, like giving different boosts to different fields, but almost everything seems to fail. I'd like to know how you guys fixed this. *Pranav Prakash* "temet nosce" Twitter <http://twitter.com/pranavprakash> | Blog <http://blog.myblive.com> | Google <http://www.google.com/profiles/pranny>
Re: Dealing with keyword stuffing
On Thu, Jul 28, 2011 at 08:31, Chris Hostetter wrote: > > : Presumably, they are doing this by increasing tf (term frequency), > : i.e., by repeating keywords multiple times. If so, you can use a custom > : similarity class that caps term frequency, and/or ensures that the > scoring > : increases less than linearly with tf. Please see > In some cases, yes they are repeating keywords multiple times. Stuffing different combinations - Solr, Solr Lucene, Solr Search, Solr Apache, Solr Guide. > > in paticular, using something like SweetSpotSimilarity tuned to know what > values make sense for "good" content in your domain can be useful because > it can actaully penalize docsuments that are too short/long or have term > freqs that are outside of a reasonble expected range. > I am not a Solr expert, But I was thinking in this direction. The ratio of tokens/total_length would be nearer to 1 for a stuffed document, while it would be nearer to 0 for a bogus document. Somewhere between the two lies documents that are more likely to be meaningful. I am not sure how to use SweetSpotSimilarity. I am googling on this, but any useful insights are so much appreciated.
Re: Index
Every indexed document has to have a unique ID associated with it. You may do a search by ID something like http://localhost:/solr/select?q=id:X If you see a result, then the document has been indexed and is searchable. You might also want to check Luke (http://code.google.com/p/luke) to gain more insight about the index. *Pranav Prakash* "temet nosce" Twitter <http://twitter.com/pranavprakash> | Blog <http://blog.myblive.com> | Google <http://www.google.com/profiles/pranny> On Fri, Jul 29, 2011 at 03:40, GAURAV PAREEK wrote: > Yes NICK you are correct ? > how can you check whether it has been indexed by solr, and is searchable? > > On Fri, Jul 29, 2011 at 3:27 AM, Nicholas Chase >wrote: > > > Do you mean, how can you check whether it has been indexed by solr, and > is > > searchable? > > > > Nick > > > > > > On 7/28/2011 5:45 PM, GAURAV PAREEK wrote: > > > >> Hi All, > >> > >> How we can check the particular;ar file is not INDEX in solr ? > >> > >> Regards, > >> Gaurav > >> > >> >
Re: Dealing with keyword stuffing
Cool. So I used SweetSpotSimilarity with default params and I see some improvements. However, I can still see some of the 'stuffed' documents coming up in the results. I feel that SweetSpotSimilarity alone is not enough. Going through http://trec.nist.gov/pubs/trec16/papers/ibm-haifa.mq.final.pdf I figured out that there are other things - pivoted length normalization and term frequency normalization - that need fine tuning too. Should I create a custom Similarity class that overrides all the default behavior? I guess that should help me get more relevant results. Where should I begin with it? Pl. do not assume less obvious things, I am still learning !! :-) *Pranav Prakash* "temet nosce" Twitter <http://twitter.com/pranavprakash> | Blog <http://blog.myblive.com> | Google <http://www.google.com/profiles/pranny> On Thu, Jul 28, 2011 at 17:03, Gora Mohanty wrote: > On Thu, Jul 28, 2011 at 3:48 PM, Pranav Prakash wrote: > [...] > > I am not sure how to use SweetSpotSimilarity. I am googling on this, but > > any useful insights are much appreciated. > > Replace the existing DefaultSimilarity class in schema.xml (look towards > the bottom of the file) with the SweetSpotSimilarity class, e.g., have a > line > like: > > > Regards, > Gora >
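The example line in Gora's reply was stripped by the archive; presumably it is something like the following, placed near the bottom of schema.xml (assuming the Lucene misc contrib jar that contains SweetSpotSimilarity is on the classpath; tuning its length-norm and tf factors in Solr 3.x generally means wrapping it in a small subclass, since this element takes no parameters):

<similarity class="org.apache.lucene.misc.SweetSpotSimilarity"/>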
Re: Solr Incremental Indexing
There could be multiple ways of getting this done, and the exact one depends a lot on factors like: what system are you using? How quickly does the change have to be reflected back into the system? How is the indexing/replication done? Usually, in cases where the tolerance is about 6 hrs (i.e. your DB change won't be reflected in the Solr index for up to 6 hrs), you can set up a cron job to be triggered every 6 hrs. It will pick up all the changes made since the last run, update the index, and commit. In cases where there is a more real-time requirement, there could be a trigger in the application (and not at the DB level), which forks a process to update Solr about the change by means of a delayed task. If using this approach, it is suggested to use autocommit every N documents, where N could be anything depending on your app. *Pranav Prakash* "temet nosce" Twitter <http://twitter.com/pranavprakash> | Blog <http://blog.myblive.com> | Google <http://www.google.com/profiles/pranny> On Sun, Jul 31, 2011 at 02:32, Alexei Martchenko < ale...@superdownloads.com.br> wrote: > I always have a field in my databases called datelastmodified, so whenever > I > update that record, i set it to getdate() - mssql func - and then get all > latest records order by that field. > > 2011/7/29 Mohammed Lateef Hussain > > > Hi > > > > Need some help in Solr incremental indexing approach. > > > > I have built my Solr index using SolrJ API and now want to update the > index > > whenever any changes has been made in > > database. My requirement is not to use DB triggers to call any update > > events. > > > > I want to update my index on the fly whenever my application updates any > > record in database. > > > > Note: My indexing logic to get the required data from DB is some what > > complex and involves many tables. > > > > Please suggest me how can I proceed here. > > > > Thanks > > Lateef > > > > > > -- > > *Alexei Martchenko* | *CEO* | Superdownloads > ale...@superdownloads.com.br | ale...@martchenko.com.br | (11) > 5083.1018/5080.3535/5080.3533 >
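A minimal sketch of the autocommit suggestion above, in the updateHandler section of solrconfig.xml (the thresholds are placeholders to tune per application; whichever limit is hit first triggers the commit):

<updateHandler class="solr.DirectUpdateHandler2">
  <autoCommit>
    <maxDocs>10000</maxDocs>  <!-- commit after this many pending docs -->
    <maxTime>60000</maxTime>  <!-- or after this many milliseconds -->
  </autoCommit>
</updateHandler>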
Re: Solr 3.3 crashes after ~18 hours?
What do you mean by it just crashes? Does the process stop execution? Does it take too long to respond, which might result in lots of 503s in your application? Does the system run out of resources? Are you indexing and serving from the same server? It happened once with us that Solr was performing a commit and then an optimize while the load from the app server was at its peak. This caused slow responses from the search server, which caused requests to get stacked up at the app server, causing 503s. Could you check whether you have similar symptoms? *Pranav Prakash* "temet nosce" Twitter <http://twitter.com/pranavprakash> | Blog <http://blog.myblive.com> | Google <http://www.google.com/profiles/pranny> On Tue, Aug 2, 2011 at 15:31, alexander sulz wrote: > Hello folks, > > I'm using the latest stable Solr release -> 3.3 and I encounter strange > phenomena with it. > After about 19 hours it just crashes, but I can't find anything in the > logs, no exceptions, no warnings, > no suspicious info entries.. > > I have an index-job running from 6am to 8pm every 10 minutes. After each > job there is a commit. > An optimize-job is done twice a day at 12:15pm and 9:15pm. > > Does anyone have an idea what could possibly be wrong or where to look for > further debug info? > > regards and thank you > alex >
Re: PivotFaceting in solr 3.3
From what I know, this is a feature in Solr 4.0, tracked as SOLR-792 in JIRA. Is this what you are looking for? https://issues.apache.org/jira/browse/SOLR-792 *Pranav Prakash* "temet nosce" Twitter <http://twitter.com/pranavprakash> | Blog <http://blog.myblive.com> | Google <http://www.google.com/profiles/pranny> On Wed, Aug 3, 2011 at 10:16, Isha Garg wrote: > Hi All! > > Can anyone tell which patch I should apply to solr 3.3 to enable pivot > faceting in it? > > Thanks in advance! > Isha garg >
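Once on a build that includes SOLR-792, pivot faceting is requested through the facet.pivot parameter; a sketch with hypothetical field names:

http://localhost:8983/solr/select?q=*:*&facet=true&facet.pivot=category,subcategory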
Re: Is optimize needed on slaves if it replicates from optimized master?
That is not true. Replication is roughly a copy of the diff between the >> master and the slave's index. > > In my case, during replication the entire index is copied from master to slave, during which the size of the index grows to a little over double. Then it shrinks back to its original size. Am I doing something wrong? How can I get the master to serve only a delta instead of the whole index, with the slaves merging the new and old index? *Pranav Prakash*
How come this query string starts with wildcard?
While going through my Solr error logs, I found that a user had fired the query - jawapan ujian bulanan thn 4 (bahasa melayu). This was converted to the following for autosuggest purposes - jawapan?ujian?bulanan?thn?4?(bahasa?melayu)* - by the JavaScript code. Solr threw the exception: Cannot parse 'jawapan?ujian?bulanan?thn?4?(bahasa?melayu)*': '*' or '?' not allowed as first character in WildcardQuery. How come this query string begins with a wildcard character? When I changed the query to remove the brackets, everything went smoothly. There were no results, probably because my search index didn't have any. *Pranav Prakash* "temet nosce" Twitter <http://twitter.com/pranavprakash> | Blog <http://blog.myblive.com> | Google <http://www.google.com/profiles/pranny>
Re: Is optimize needed on slaves if it replicates from optimized master?
Very well explained. Thanks. Yes, we do optimize Index before replication. I am not particularly worried about disk space usage. I was more curious of that behavior. *Pranav Prakash* "temet nosce" Twitter <http://twitter.com/pranavprakash> | Blog <http://blog.myblive.com> | Google <http://www.google.com/profiles/pranny> On Wed, Aug 10, 2011 at 19:55, Erick Erickson wrote: > This is expected behavior. You might be optimizing > your index on the master after every set of changes, > in which case the entire index is copied. During this > period, the space on disk will at least double, there's no > way around that. > > If you do NOT optimize, then the slave will only copy changed > segments instead of the entire index. Optimizing isn't > usually necessary except periodically (daily, perhaps weekly, > perhaps never actually). > > All that said, depending on how merging happens, you will always > have the possibility of the entire index being copied sometimes > because you'll happen to hit a merge that merges all segments > into one. > > There are some advanced options that can control some parts > of merging, but you need to get to the bottom of why the whole > index is getting copied every time before you go there. I'd bet > you're issuing an optimize. > > Best > Erick > > On Wed, Aug 10, 2011 at 5:30 AM, Pranav Prakash wrote: > > That is not true. Replication is roughly a copy of the diff between the > >>> master and the slave's index. > >> > >> > > In my case, during replication entire index is copied from master to > slave, > > during which the size of index goes a little over double. Then it shrinks > to > > its original size. Am I doing something wrong? How can I get the master > to > > serve only delta index instead of serving whole index and the slaves > merging > > the new and old index? > > > > *Pranav Prakash* > > >
OOM due to JRE Issue (LUCENE-1566)
Hi, This has probably been discussed a long time back, but I got this error recently on one of my production slaves. SEVERE: java.lang.OutOfMemoryError: OutOfMemoryError likely caused by the Sun VM Bug described in https://issues.apache.org/jira/browse/LUCENE-1566; try calling FSDirectory.setReadChunkSize with a a value smaller than the current chunk size (2147483647) I am currently using Solr 1.4. Going through the JIRA issue comments, I found that this patch applies to 2.9 or above. We are also planning an upgrade to Solr 3.3. Is this patch included in 3.3, so that I don't have to apply it manually? What are the other workarounds for the problem? Thanks in adv. *Pranav Prakash* "temet nosce" Twitter <http://twitter.com/pranavprakash> | Blog <http://blog.myblive.com> | Google <http://www.google.com/profiles/pranny>
Re: OOM due to JRE Issue (LUCENE-1566)
> > > AFAIK, solr 1.4 is on Lucene 2.9.1 so this patch is already applied to > the version you are using. > maybe you can provide the stacktrace and more details about your > problem and report back? > Unfortunately, I have only this much information with me. However, the following are my specifications, in case they are helpful: /usr/bin/java -d64 -Xms5000M -Xmx5000M -XX:+UseParallelGC -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:$GC_LOGFILE -XX:+CMSPermGenSweepingEnabled -Dsolr.solr.home=multicore -Denable.slave=true -jar start.jar 32GiB RAM Any thoughts? Will a switch to a concurrent GC help in any way?
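On the concurrent-GC question: the usual flags to swap the parallel collector for CMS are shown below, as a hedged sketch (heap sizing left as above). Note that the LUCENE-1566 OOM comes from the JRE's handling of very large single reads rather than from garbage-collection pressure, so a collector change alone may not make that particular error go away.

-XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly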
Top 5 high freq words - UpdateProcessorChain or DIH Script?
Hi, I want to store the top 5 highest-frequency non-stopword words. I use DIH to import data. Now I have two approaches - 1. Use a DIH JavaScript transformer to find the top 5 most frequent words and put them in a copy field. The copy field will then stem them and remove stop words using the appropriate tokenizers. 2. Write a custom function for the same and add it to the UpdateRequestProcessor chain. Which of the two would be better suited? I find the first approach rather simple, but the issue is that I won't have access to stop words/synonyms etc. at DIH time. In the second approach, if I add it to the UpdateRequestProcessor chain and insert the function after StopWordsFilterFactory and DuplicateRemoveFilterFactory, would that be a good way of doing this? -- *Pranav Prakash* "temet nosce"
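For the second approach, the custom step would be registered as an update request processor chain in solrconfig.xml. A rough sketch with a hypothetical factory class and parameter names (note that stop word and stemming filters belong to the field analysis chain, so the processor would have to run the field type's analyzer itself, or work on already-cleaned input):

<updateRequestProcessorChain name="top-terms">
  <processor class="com.example.TopTermsUpdateProcessorFactory">
    <str name="sourceField">content</str>
    <str name="destField">top_terms</str>
    <int name="count">5</int>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>

The chain is then selected on the update handler (for example via its defaults) so that it runs for every add.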
DIH XML configs for multi environment
The DIH XML config file has to specify a dataSource. In my case, and possibly for many others, the logon credentials as well as the MySQL server paths differ between environments (dev, stag, prod). I don't want to end up with three different DIH config files, three different handlers and so on. What is a good way to deal with this? *Pranav Prakash* "temet nosce"
Re: DIH XML configs for multi environment
That's cool. Is there something similar for Jetty as well? We use Jetty! *Pranav Prakash* "temet nosce" On Wed, Jul 11, 2012 at 1:49 PM, Rahul Warawdekar < rahul.warawde...@gmail.com> wrote: > Hi Pranav, > > If you are using Tomcat to host Solr, you can define your data source in > context.xml file under tomcat configuration. > You have to refer to this datasource with the same name in all the 3 > environments from DIH data-config.xml. > This context.xml file will vary across 3 environments having different > credentials for dev, stag and prod. > > eg > DIH data-config.xml will refer to the datasource as listed below > type="JdbcDataSource" readOnly="true" /> > > context.xml file which is located under "//conf" folder will > have the resource entry as follows >type="" username="X" password="X" > driverClassName="" > url="" > maxActive="8" > /> > > On Wed, Jul 11, 2012 at 1:31 PM, Pranav Prakash wrote: > > > The DIH XML config file has to be specified dataSource. In my case, and > > possibly with many others, the logon credentials as well as mysql server > > paths would differ based on environments (dev, stag, prod). I don't want > to > > end up coming with three different DIH config files, three different > > handlers and so on. > > > > What is a good way to deal with this? > > > > > > *Pranav Prakash* > > > > "temet nosce" > > > > > > -- > Thanks and Regards > Rahul A. Warawdekar >
How To apply transformation in DIH for multivalued numeric field?
I have a multivalued integer field and a multivalued string field defined in my schema as The DIH entity and field definition for the same goes as The value for the field community_tags comes through correctly as an array of strings. However, the value of the field community_tag_ids is not proper; it comes out as [B@390c0a18 (which looks like a raw Java byte array). I tried chaining NumberFormatTransformer with formatStyle="number" but that throws DataImportHandlerException: Failed to apply NumberFormat on column. Could it be due to NULL values from the database, or because the value is not proper? How do we handle NULL in this case? *Pranav Prakash* "temet nosce"
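The entity and field definitions were stripped from the mail above; judging from the SQL quoted later in the thread, the shape was roughly the following. With the GROUP_CONCAT approach, the RegexTransformer's splitBy attribute is what turns each concatenated string back into multiple values, and it needs to be set on both columns (this reconstruction is an assumption, not the poster's exact config):

<entity name="tags" dataSource="app" transformer="RegexTransformer"
        query="SELECT group_concat(a.id SEPARATOR ',') AS community_tag_ids,
                      group_concat(a.title SEPARATOR ',') AS community_tags
               FROM tags a JOIN tag_dets b ON a.id = b.tag_id
               WHERE b.doc_id = ${document.id}">
  <field column="community_tag_ids" splitBy=","/>
  <field column="community_tags" splitBy=","/>
</entity>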
Re: DIH XML configs for multi environment
That approach would work for core dependent parameters. In my case, the params are environment dependent. I think a simpler approach would be to pass the url param as JVM options, and these XMLs get it from there. I haven't tried it yet. *Pranav Prakash* "temet nosce" On Tue, Jul 17, 2012 at 5:09 PM, Markus Klose wrote: > Hi > > There is one more approach using the property mechanism. > > You could specify the datasource like this: > > > And you can specifiy the properties in the solr.xml in your core > configuration like this: > > > > > > > > Viele Grüße aus Augsburg > > Markus Klose > SHI Elektronische Medien GmbH > > > Adresse: Curt-Frenzel-Str. 12, 86167 Augsburg > > Tel.: 0821 7482633 26 > Tel.: 0821 7482633 0 (Zentrale) > Mobil:0176 56516869 > Fax: 0821 7482633 29 > > E-Mail: markus.kl...@shi-gmbh.com > Internet: http://www.shi-gmbh.com > > Registergericht Augsburg HRB 17382 > Geschäftsführer: Peter Spiske > USt.-ID: DE 182167335 > > > > > > -Ursprüngliche Nachricht- > Von: Rahul Warawdekar [mailto:rahul.warawde...@gmail.com] > Gesendet: Mittwoch, 11. Juli 2012 11:21 > An: solr-user@lucene.apache.org > Betreff: Re: DIH XML configs for multi environment > > http://wiki.eclipse.org/Jetty/Howto/Configure_JNDI_Datasource > http://docs.codehaus.org/display/JETTY/DataSource+Examples > > > On Wed, Jul 11, 2012 at 2:30 PM, Pranav Prakash wrote: > > > That's cool. Is there something similar for Jetty as well? We use Jetty! > > > > *Pranav Prakash* > > > > "temet nosce" > > > > > > > > On Wed, Jul 11, 2012 at 1:49 PM, Rahul Warawdekar < > > rahul.warawde...@gmail.com> wrote: > > > > > Hi Pranav, > > > > > > If you are using Tomcat to host Solr, you can define your data > > > source in context.xml file under tomcat configuration. > > > You have to refer to this datasource with the same name in all the 3 > > > environments from DIH data-config.xml. > > > This context.xml file will vary across 3 environments having > > > different credentials for dev, stag and prod. > > > > > > eg > > > DIH data-config.xml will refer to the datasource as listed below > > > > > type="JdbcDataSource" readOnly="true" /> > > > > > > context.xml file which is located under "//conf" folder > > > will have the resource entry as follows > > >> > type="" username="X" password="X" > > > driverClassName="" > > > url="" > > > maxActive="8" > > > /> > > > > > > On Wed, Jul 11, 2012 at 1:31 PM, Pranav Prakash > > wrote: > > > > > > > The DIH XML config file has to be specified dataSource. In my > > > > case, and possibly with many others, the logon credentials as well > > > > as mysql > > server > > > > paths would differ based on environments (dev, stag, prod). I > > > > don't > > want > > > to > > > > end up coming with three different DIH config files, three > > > > different handlers and so on. > > > > > > > > What is a good way to deal with this? > > > > > > > > > > > > *Pranav Prakash* > > > > > > > > "temet nosce" > > > > > > > > > > > > > > > > -- > > > Thanks and Regards > > > Rahul A. Warawdekar > > > > > > > > > -- > Thanks and Regards > Rahul A. Warawdekar >
Re: How To apply transformation in DIH for multivalued numeric field?
I had tried with splitBy for numeric field, but that also did not worked for me. However I got rid of group_concat and it was all good to go. Thanks a lot!! I really had a difficult time understanding this behavior. *Pranav Prakash* "temet nosce" On Thu, Jul 19, 2012 at 1:34 AM, Dyer, James wrote: > Don't you want to specify "splitBy" for the integer field too? > > Actually though, you shouldn't need to use GROUP_CONCAT and > RegexTransformer at all. DIH is designed to handle "1>many" relations > between parent and child entities by populating all the child fields as > multi-valued automatically. I guess your approach leads to a lot fewer > rows getting sent from your db to Solr though. > > James Dyer > E-Commerce Systems > Ingram Content Group > (615) 213-4311 > > > -Original Message- > From: Pranav Prakash [mailto:pra...@gmail.com] > Sent: Wednesday, July 18, 2012 2:38 PM > To: solr-user@lucene.apache.org > Subject: How To apply transformation in DIH for multivalued numeric field? > > I have a multivalued integer field and a multivalued string field defined > in my schema as > > type="integer" > indexed="true" > stored="true" > multiValued="true" > omitNorms="true" /> > type="text" > indexed="true" > termVectors="true" > stored="true" > multiValued="true" > omitNorms="true" /> > > > The DIH entity and field defn for the same goes as > >dataSource="app" > onError="skip" > transformer="RegexTransformer" > query="..."> > > transformer="RegexTransformer" > query="SELECT > group_concat(a.id SEPARATOR ',') AS community_tag_ids, > group_concat(a.title SEPARATOR ',') AS community_tags > FROM tags a JOIN tag_dets b ON a.id = b.tag_id > WHERE b.doc_id = ${document.id}" > > > > > > > > The value for field community_tags comes correctly as an array of strings. > However the value of field community_tag_ids is not proper > > > [B@390c0a18 > > > I tried chaining NumberFormatTransformer with formatStyle="number" but that > throws DataImportHandlerException: Failed to apply NumberFormat on column. > Could it be due to NULL values from database or because the value is not > proper? How do we handle NULL in this case? > > > *Pranav Prakash* > > "temet nosce" > >
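For completeness, the GROUP_CONCAT-free route James describes leans on DIH's built-in handling of one-to-many child entities: a child entity that returns several rows fills the mapped fields as multi-valued automatically. A hedged sketch reusing the tables from the query above:

<entity name="tags" dataSource="app"
        query="SELECT a.id AS community_tag_ids, a.title AS community_tags
               FROM tags a JOIN tag_dets b ON a.id = b.tag_id
               WHERE b.doc_id = ${document.id}"/>

As James notes, this sends more rows from the database to Solr than the GROUP_CONCAT version.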
Re: can solr admin tab statistics be customized... how can this be achived.
You can check out the Solr source code, do the patch work in the admin JSP files, and use it as your custom Solr instance. *Pranav Prakash* "temet nosce" On Fri, Jul 20, 2012 at 12:14 PM, yayati wrote: > > > Hi, > > I want to compute my own stats in addition to solr default stats. How can i > enhance statistics in solr? How can this be achieved? Solr computes > stats cumulatively; is there any way to get per-instant stats? > > Thanks... waiting for good replies.. > > > > > > -- > View this message in context: > http://lucene.472066.n3.nabble.com/can-solr-admin-tab-statistics-be-customized-how-can-this-be-achived-tp3996128.html > Sent from the Solr - User mailing list archive at Nabble.com. >
Re: DIH XML configs for multi environment
Jerry, Glad it worked for you. I will also do the same thing. This seems easier for me, as I have a solr start shell script, which sets the JVM params for master/slave, Xmx and so on according to the environment. Setting a jdbc connect url in the start script is convenient than changing the configs. *Pranav Prakash* "temet nosce" On Tue, Jul 24, 2012 at 1:17 AM, jerry.min...@gmail.com < jerry.min...@gmail.com> wrote: > Pranav, > > Sorry, I should have checked my response a little better as I > misspelled your name and, mentioned that I tried what Marcus suggested > then described something totally different. > I didn't try using the property mechanism as Marcus suggested as I am > not using a solr.xml file. > > What you mentioned in your post on Wed, Jul 18, 2012 at 3:46 PM will > work as I have done it successfully. > That is I created a JVM variable to contain the connect URLs for each > of my environments and one of those to set the URL parameter of the > dataSource entity > in my data config files. > > Best, > Jerry > > > On Mon, Jul 23, 2012 at 3:34 PM, jerry.min...@gmail.com > wrote: > > Pranay, > > > > I tried two similar approaches to resolve this in my system which is > > Solr 4.0 running in Tomcat 7.x on Ubuntu 9.10. > > > > My preference was to use an alias for each of my database environments > > as a JVM parameter because it makes more sense to me that the database > > connection be stored in the data config file rather than in a Tomcat > > configuration or startup file. > > Because of preference, I first attempted the following: > > 1. Set a JVM environment variable 'solr.dbEnv' to the represent the > > database environment that should be accessed. For example, in my dev > > environment, the JVM environment variable was set as -Dsolr.dbEnv=dev. > > 2. In the data config file I had 3 data sources. Each data source had > > a name that matched one of the database environment aliases. > > 3. In the entity of my data config file "dataSource" parameter was set > > as follows dataSource=${solr.dbEnv}. > > > > Unfortunately, this fails to work. Setting "dataSource" parameter in > > the data config file does not override the default. The default > > appears to be the first data source defined in the data config file. > > > > Second, I tried what Marcus suggested. > > > > That is, I created a JVM variable to contain the connect URLs for each > > of my environments. > > I use that variable to set the URL parameter of the dataSource entity > > in the data config file. > > > > This works well. > > > > > > Best, > > Jerry Mindek > > > > Unfortunately, the first option did not work. It seemed as though > > On Wed, Jul 18, 2012 at 3:46 PM, Pranav Prakash > wrote: > >> That approach would work for core dependent parameters. In my case, the > >> params are environment dependent. I think a simpler approach would be to > >> pass the url param as JVM options, and these XMLs get it from there. > >> > >> I haven't tried it yet. > >> > >> *Pranav Prakash* > >> > >> "temet nosce" > >> > >> > >> > >> On Tue, Jul 17, 2012 at 5:09 PM, Markus Klose wrote: > >> > >>> Hi > >>> > >>> There is one more approach using the property mechanism. > >>> > >>> You could specify the datasource like this: > >>> > >>> > >>> And you can specifiy the properties in the solr.xml in your core > >>> configuration like this: > >>> > >>> > >>> > >>> > >>> > >>> > >>> > >>> Viele Grüße aus Augsburg > >>> > >>> Markus Klose > >>> SHI Elektronische Medien GmbH > >>> > >>> > >>> Adresse: Curt-Frenzel-Str. 
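Since Jerry's actual config isn't shown in the thread, here is a hedged sketch of one way to wire a JVM property through to DIH (the property and parameter names are invented): start Solr with -Ddb.url=jdbc:mysql://prod-host:3306/app_prod, expose it to the import handler as a default request parameter in solrconfig.xml, and reference that parameter in data-config.xml:

<!-- solrconfig.xml: system property substitution works here -->
<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">data-config.xml</str>
    <str name="dburl">${db.url:jdbc:mysql://localhost:3306/app_devel}</str>
  </lst>
</requestHandler>

<!-- data-config.xml: handler defaults are visible to DIH as request parameters -->
<dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver"
            url="${dataimporter.request.dburl}" user="..." password="..."/>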
12, 86167 Augsburg > >>> > >>> Tel.: 0821 7482633 26 > >>> Tel.: 0821 7482633 0 (Zentrale) > >>> Mobil:0176 56516869 > >>> Fax: 0821 7482633 29 > >>> > >>> E-Mail: markus.kl...@shi-gmbh.com > >>> Internet: http://www.shi-gmbh.com > >>> > >>> Registergericht Augsburg HRB 17382 > >>> Geschäftsführer: Peter Spiske > >>> USt.-ID: DE 182167335 > >>> > >>> > >>> > >>> > >>> > >>> -Ursprüngliche Nachricht- > >>> Von: Rahul War
Exact match on few fields, fuzzy on others
Hi Folks, I am using Solr 3.4 and my document schema has the attributes title, transcript, and author_name. Presently, I am using DisMax to search for a user query across transcript. I would also like to do an exact search on author_name, so that for a query "Albert Einstein" I get all the documents which contain Albert or Einstein in transcript, and also those documents whose author_name is exactly 'Albert Einstein'. Can we do this with the dismax query parser? The schema for both fields is below: -- *Pranav Prakash* "temet nosce"
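The schema snippet was stripped from the mail; a common pattern for this kind of requirement is to keep the analyzed field for dismax and add an untokenized copy for exact matching. A sketch with hypothetical field and type names:

<field name="author_name" type="text" indexed="true" stored="true"/>
<field name="author_name_exact" type="string" indexed="true" stored="false"/>
<copyField source="author_name" dest="author_name_exact"/>

The dismax qf would then stay on transcript (and author_name if desired), while the exact requirement is expressed as a separate clause on author_name_exact, for example as a boost query or by OR-ing in author_name_exact:"Albert Einstein" from the client, depending on whether those documents should merely rank higher or must be included.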
Re: Importing of unix date format from mysql database and dates of format 'Thu, 06 Sep 2012 22:32:33 +0000' in Solr 4.0
I am experiencing a similar problem related to encoding. In my case, characters like " (double quote) are also garbled. I believe this is because the encoding in my MySQL table is latin1, while the JDBC connection is being specified as UTF-8. Is there a way to specify the latin1 charset in JDBC? That would probably resolve this. *Pranav Prakash* "temet nosce" On Sat, Sep 8, 2012 at 3:16 AM, Shawn Heisey wrote: > On 9/6/2012 6:54 PM, kiran chitturi wrote: > >> The error i am getting is 'org.apache.solr.common.SolrException: >> Invalid >> Date String: '1345743552'. >> >> I think it was being saved as a string in DB, so i will use the >> DateFormatTransformer. >> > > To go along with all the other replies that you have gotten: I import > from MySQL with a unix format date field. It's a bigint, not a string, but > a quick test on MySQL 5.1 shows that the function works with strings too. > This is how my SELECT handles that field - I have MySQL convert it before > it gets to Solr: > > from_unixtime(`d`.`post_date`) AS `pd` > > When it comes to the character set issues, this is how I have defined the > driver in the dataimport config. The character set in the database is utf8. > >driver="com.mysql.jdbc.Driver" > encoding="UTF-8" > url="jdbc:mysql://${dataimporter.request.dbHost}:3306/${dataimporter.request.dbSchema}?zeroDateTimeBehavior=convertToNull" > batchSize="-1" > user="" > password=""/> > > Thanks, > Shawn > >
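If the bytes in the table really are latin1, Connector/J can be told to read them as such. A hedged sketch of the dataSource line (ISO8859_1 is the Java encoding name corresponding to MySQL's latin1; whether it is the right target depends on what was actually written into the table):

<dataSource type="JdbcDataSource"
            driver="com.mysql.jdbc.Driver"
            url="jdbc:mysql://localhost/dbname"
            useUnicode="true"
            characterEncoding="ISO8859_1"
            user="..." password="..."/>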
Re: Importing of unix date format from mysql database and dates of format 'Thu, 06 Sep 2012 22:32:33 +0000' in Solr 4.0
The character is actually - “ and not " *Pranav Prakash* "temet nosce" On Mon, Sep 10, 2012 at 2:45 PM, Pranav Prakash wrote: > I am experiencing similar problem related to encoding. In my case, the > char like " (double quote) > is also garbaled. > > I believe this is because the encoding in my MySQL table is latin1 and in > the JDBC it is being specified as UTF-8. Is there a way to specify latin1 > charset in JDBC? probably that would resolve this. > > > *Pranav Prakash* > > "temet nosce" > > > > > On Sat, Sep 8, 2012 at 3:16 AM, Shawn Heisey wrote: > >> On 9/6/2012 6:54 PM, kiran chitturi wrote: >> >>> The error i am getting is 'org.apache.solr.common.**SolrException: >>> Invalid >>> Date String: '1345743552'. >>> >>> I think it was being saved as a string in DB, so i will use the >>> DateFormatTransformer. >>> >> >> To go along with all the other replies that you have gotten: I import >> from MySQL with a unix format date field. It's a bigint, not a string, but >> a quick test on MySQL 5.1 shows that the function works with strings too. >> This is how my SELECT handles that field - I have MySQL convert it before >> it gets to Solr: >> >> from_unixtime(`d`.`post_date`) AS `pd` >> >> When it comes to the character set issues, this is how I have defined the >> driver in the dataimport config. The character set in the database is utf8. >> >> > driver="com.mysql.jdbc.Driver" >> encoding="UTF-8" >> url="jdbc:mysql://${**dataimporter.request.dbHost}:** >> 3306/${dataimporter.request.**dbSchema}?**zeroDateTimeBehavior=** >> convertToNull" >> batchSize="-1" >> user="" >> password=""/> >> >> Thanks, >> Shawn >> >> >
Re: DIH import from MySQL results in garbage text for special chars
I am seeing the garbage text in browser, Luke Index Toolbox and everywhere it is the same. My servlet container is Jetty which is the out-of-box one. Many other special chars are getting indexed and stored properly, only few characters causes pain. *Pranav Prakash* "temet nosce" On Fri, Sep 14, 2012 at 6:36 PM, Erick Erickson wrote: > Is your _browser_ set to handle the appropriate character set? Or whatever > you're using to inspect your data? How about your servlet container? > > > > Best > Erick > > On Mon, Sep 10, 2012 at 7:47 AM, Pranav Prakash wrote: > > Hi Folks, > > > > I am attempting to import documents to Solr from MySQL using DIH. One of > > the field contains the text - “Future of Mobile Value Added Services > (VAS) > > in Australia” .Notice the character “ and ”. > > > > When I am importing, it gets stored as - “Future of Mobile Value Added > > Services (VAS) in Australiaâ€�. > > > > The datasource config clearly mentions use of UTF-8 as follows: > > > >> driver="com.mysql.jdbc.Driver" > > url="jdbc:mysql://localhost/ohapp_devel" > > user="username" > > useUnicode="true" > > characterEncoding="UTF-8" > > password="password" > > zeroDateTimeBehavior="convertToNull" > > name="app" /> > > > > > > A plain SQL Select statement on the MySQL Console gives appropriate > text. I > > even tried using following scriptTransformer to get rid of this char, but > > it was of no particular use in my case. > > > > function gsub(source, pattern, replacement) { > > var match, result; > > if (!((pattern != null) && (replacement != null))) { > > return source; > > } > > result = ''; > > while (source.length > 0) { > > if ((match = source.match(pattern))) { > > result += source.slice(0, match.index); > > result += replacement; > > source = source.slice(match.index + match[0].length); > > } else { > > result += source; > > source = ''; > > } > > } > > return result; > > } > > > > function fixQuotes(c){ > > c = gsub(c, /\342\200(?:\234|\235)/,'"'); > > c = gsub(c, /\342\200(?:\230|\231)/,"'"); > > c = gsub(c, /\342\200\223/,"-"); > > c = gsub(c, /\342\200\246/,"..."); > > c = gsub(c, /\303\242\342\202\254\342\204\242/,"'"); > > c = gsub(c, /\303\242\342\202\254\302\235/,'"'); > > c = gsub(c, /\303\242\342\202\254\305\223/,'"'); > > c = gsub(c, /\303\242\342\202\254"/,'-'); > > c = gsub(c, /\342\202\254\313\234/,'"'); > > c = gsub(c, /“/, '"'); > > return c; > > } > > > > function cleanFields(row){ > > var fieldsToClean = ['title', 'description']; > > for(i =0; i< fieldsToClean.length; i++){ > > var old_text = String(row.get(fieldsToClean[i])); > > row.put(fieldsToClean[i], fixQuotes(old_text) ); > > } > > return row; > > } > > > > My understanding goes that this must be a very common problem. It also > > occurs with human names which have these chars. What is an appropriate > way > > to get the appropriate text indexed and searchable? The fieldtype where > > this is stored goes as follows > > > > > > > > > > > > > > > > > >> protected="protwords.txt"/> > > > synonyms="synonyms.txt" > > ignoreCase="true" > > expand="true" /> > > > words="stopwords_en.txt" > > ignoreCase="true" /> > > > words="stopwords_en.txt" > > ignoreCase="true" /> > > > generateWordParts="1" > > generateNumberParts="1" > > catenateWords="1" > > catenateNumbers="1" > > catenateAll="0" > > preserveOriginal="1" /> > > > > > > > > > > *Pranav Prakash* > > > > "temet nosce" >
Re: DIH import from MySQL results in garbage text for special chars
I looked at the HEX codes of the texts. The hex code in MySQL is different from that which is stored in the index. The hex code in index is longer than the hex code in MySQL, this leads me to the fact that somewhere in between smething is messing up, *Pranav Prakash* "temet nosce" On Fri, Sep 21, 2012 at 11:19 AM, Pranav Prakash wrote: > I am seeing the garbage text in browser, Luke Index Toolbox and everywhere > it is the same. My servlet container is Jetty which is the out-of-box one. > Many other special chars are getting indexed and stored properly, only few > characters causes pain. > > *Pranav Prakash* > > "temet nosce" > > > > > On Fri, Sep 14, 2012 at 6:36 PM, Erick Erickson > wrote: > >> Is your _browser_ set to handle the appropriate character set? Or whatever >> you're using to inspect your data? How about your servlet container? >> >> >> >> Best >> Erick >> >> On Mon, Sep 10, 2012 at 7:47 AM, Pranav Prakash wrote: >> > Hi Folks, >> > >> > I am attempting to import documents to Solr from MySQL using DIH. One of >> > the field contains the text - “Future of Mobile Value Added Services >> (VAS) >> > in Australia” .Notice the character “ and ”. >> > >> > When I am importing, it gets stored as - “Future of Mobile Value Added >> > Services (VAS) in Australiaâ€�. >> > >> > The datasource config clearly mentions use of UTF-8 as follows: >> > >> > > > driver="com.mysql.jdbc.Driver" >> > url="jdbc:mysql://localhost/ohapp_devel" >> > user="username" >> > useUnicode="true" >> > characterEncoding="UTF-8" >> > password="password" >> > zeroDateTimeBehavior="convertToNull" >> > name="app" /> >> > >> > >> > A plain SQL Select statement on the MySQL Console gives appropriate >> text. I >> > even tried using following scriptTransformer to get rid of this char, >> but >> > it was of no particular use in my case. >> > >> > function gsub(source, pattern, replacement) { >> > var match, result; >> > if (!((pattern != null) && (replacement != null))) { >> > return source; >> > } >> > result = ''; >> > while (source.length > 0) { >> > if ((match = source.match(pattern))) { >> > result += source.slice(0, match.index); >> > result += replacement; >> > source = source.slice(match.index + match[0].length); >> > } else { >> > result += source; >> > source = ''; >> > } >> > } >> > return result; >> > } >> > >> > function fixQuotes(c){ >> > c = gsub(c, /\342\200(?:\234|\235)/,'"'); >> > c = gsub(c, /\342\200(?:\230|\231)/,"'"); >> > c = gsub(c, /\342\200\223/,"-"); >> > c = gsub(c, /\342\200\246/,"..."); >> > c = gsub(c, /\303\242\342\202\254\342\204\242/,"'"); >> > c = gsub(c, /\303\242\342\202\254\302\235/,'"'); >> > c = gsub(c, /\303\242\342\202\254\305\223/,'"'); >> > c = gsub(c, /\303\242\342\202\254"/,'-'); >> > c = gsub(c, /\342\202\254\313\234/,'"'); >> > c = gsub(c, /“/, '"'); >> > return c; >> > } >> > >> > function cleanFields(row){ >> > var fieldsToClean = ['title', 'description']; >> > for(i =0; i< fieldsToClean.length; i++){ >> > var old_text = String(row.get(fieldsToClean[i])); >> > row.put(fieldsToClean[i], fixQuotes(old_text) ); >> > } >> > return row; >> > } >> > >> > My understanding goes that this must be a very common problem. It also >> > occurs with human names which have these chars. What is an appropriate >> way >> > to get the appropriate text indexed and searchable? 
The fieldtype where >> > this is stored goes as follows >> > >> > >> > >> > >> > >> > >> > >> > >> > > language="English" >> > protected="protwords.txt"/> >> > > > synonyms="synonyms.txt" >> > ignoreCase="true" >> > expand="true" /> >> > > > words="stopwords_en.txt" >> > ignoreCase="true" /> >> > > > words="stopwords_en.txt" >> > ignoreCase="true" /> >> > > > generateWordParts="1" >> > generateNumberParts="1" >> > catenateWords="1" >> > catenateNumbers="1" >> > catenateAll="0" >> > preserveOriginal="1" /> >> > >> > >> > >> > >> > *Pranav Prakash* >> > >> > "temet nosce" >> > >
Re: DIH import from MySQL results in garbage text for special chars
The output of SHOW VARIABLES goes like this. I have verified the hex values and they are different in MySQL and Solr.

+--------------------------+----------------------------+
| Variable_name            | Value                      |
+--------------------------+----------------------------+
| character_set_client     | latin1                     |
| character_set_connection | latin1                     |
| character_set_database   | latin1                     |
| character_set_filesystem | binary                     |
| character_set_results    | latin1                     |
| character_set_server     | latin1                     |
| character_set_system     | utf8                       |
| character_sets_dir       | /usr/share/mysql/charsets/ |
+--------------------------+----------------------------+

*Pranav Prakash* "temet nosce" On Wed, Sep 26, 2012 at 6:45 PM, Gora Mohanty wrote: > On 21 September 2012 11:19, Pranav Prakash wrote: > > > I am seeing the garbage text in browser, Luke Index Toolbox and > everywhere > > it is the same. My servlet container is Jetty which is the out-of-box > one. > > Many other special chars are getting indexed and stored properly, only > few > > characters causes pain. > > > > Could you double-check the encoding on the mysql side? > What is the output of > > mysql> SHOW VARIABLES LIKE 'character\_set\_%'; > > Regards, > Gora >