Solr 3.3 Grouping vs Deduplication, and a Deduplication Use Case
Solr 3.3 has a feature called "Grouping". Is it practically the same as deduplication? Here is my use case for duplicate removal - we have many documents with very similar (up to 99%) content. For some search queries, almost all of them come up in the first page of results. Of all these documents, essentially one is the original and the others are duplicates. We are able to identify the original on the basis of a number of factors - who uploaded it, when, and how many viral shares it has. It is also possible that the duplicates are uploaded earlier (and hence already exist in the search index) while the original is uploaded later (and gets added to the index later). AFAIK, Deduplication works at index time. Is there a way I can specify the original which should be returned, and keep the duplicates from coming up? *Pranav Prakash* "temet nosce" Twitter <http://twitter.com/pranavprakash> | Blog <http://blog.myblive.com> | Google <http://www.google.com/profiles/pranny>
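A rough illustration of the difference, since both features come up here: Deduplication (SignatureUpdateProcessorFactory) computes a signature per document at index time and can either overwrite duplicates or merely tag them, while Result Grouping collapses documents that share a field value at query time, so the choice of which copy to show can be made per query. A minimal query sketch, assuming the documents carry a signature field populated by the dedup processor with overwriteDupes=false (the field name and URL are placeholders):

  http://localhost:8983/solr/select?q=some+keyword&group=true&group.field=signature&group.limit=1

All copies stay in the index; grouping returns one document per signature value, and the one returned is the best-ranked document within its group, so the ranking (or, where supported, a within-group sort) can be tuned to favour the "original" by uploader, upload date or share count.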
How To Implement Sweet Spot Similarity?
I was wondering if there is *any* article on the web that provides implementation details and some sort of analysis of Sweet Spot Similarity. Google shows me all the JIRA commits and comments, but no article about an actual implementation. What are the various configurations that can be done? What are good approaches for figuring out the sweet spots? Can a combination of multiple Similarity classes be used? Any information would be much appreciated. *Pranav Prakash* "temet nosce" Twitter <http://twitter.com/pranavprakash> | Blog <http://blog.myblive.com> | Google <http://www.google.com/profiles/pranny>
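For what it's worth, the class itself lives in Lucene's misc contrib (org.apache.lucene.misc.SweetSpotSimilarity), and in Solr the global Similarity is swapped in at the bottom of schema.xml. A minimal sketch, assuming the lucene-misc jar is available on Solr's classpath:

  <!-- at the end of schema.xml, replacing the default similarity -->
  <similarity class="org.apache.lucene.misc.SweetSpotSimilarity"/>

Out of the box it behaves much like DefaultSimilarity; to get the actual "sweet spot" behaviour you would normally extend the class in Java, call its setters (baseline tf factors, length-norm min/max/steepness) with values tuned to what a typical good document in your corpus looks like, and reference the subclass here instead. Combining multiple Similarity implementations in one index is not possible in Solr 3.x - the similarity is global, not per-field.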
java.io.CharConversionException While Indexing in Solr 3.4
Hi List, I tried Solr 3.4.0 today, and while indexing I got the error java.lang.RuntimeException: [was class java.io.CharConversionException] Invalid UTF-8 middle byte 0x73 (at char #66611, byte #65289) My earlier version was Solr 1.4, and this same document went into the index successfully. Looking around, I see issue https://issues.apache.org/jira/browse/SOLR-2381 which seems to address this. I thought that patch was already applied in Solr 3.4.0. Is there something I am missing? Is there anything else I should provide - logs, my document details, etc.? *Pranav Prakash* "temet nosce" Twitter <http://twitter.com/pranavprakash> | Blog <http://blog.myblive.com> | Google <http://www.google.com/profiles/pranny>
Re: java.io.CharConversionException While Indexing in Solr 3.4
Just in case someone might be interested, here is the log:

SEVERE: java.lang.RuntimeException: [was class java.io.CharConversionException] Invalid UTF-8 middle byte 0x73 (at char #66641, byte #65289)
  at com.ctc.wstx.util.ExceptionUtil.throwRuntimeException(ExceptionUtil.java:18)
  at com.ctc.wstx.sr.StreamScanner.throwLazyError(StreamScanner.java:731)
  at com.ctc.wstx.sr.BasicStreamReader.safeFinishToken(BasicStreamReader.java:3657)
  at com.ctc.wstx.sr.BasicStreamReader.getText(BasicStreamReader.java:809)
  at org.apache.solr.handler.XMLLoader.readDoc(XMLLoader.java:287)
  at org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:146)
  at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:77)
  at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:67)
  at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
  at org.apache.solr.core.SolrCore.execute(SolrCore.java:1368)
  at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:356)
  at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:252)
  at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
  at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
  at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
  at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
  at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
  at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
  at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
  at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
  at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
  at org.mortbay.jetty.Server.handle(Server.java:326)
  at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
  at org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:945)
  at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:756)
  at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:218)
  at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
  at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:228)
  at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)
Caused by: java.io.CharConversionException: Invalid UTF-8 middle byte 0x73 (at char #66641, byte #65289)
  at com.ctc.wstx.io.UTF8Reader.reportInvalidOther(UTF8Reader.java:313)
  at com.ctc.wstx.io.UTF8Reader.read(UTF8Reader.java:204)
  at com.ctc.wstx.io.MergedReader.read(MergedReader.java:101)
  at com.ctc.wstx.io.ReaderSource.readInto(ReaderSource.java:84)
  at com.ctc.wstx.io.BranchingReaderSource.readInto(BranchingReaderSource.java:57)
  at com.ctc.wstx.sr.StreamScanner.loadMore(StreamScanner.java:992)
  at com.ctc.wstx.sr.BasicStreamReader.readTextSecondary(BasicStreamReader.java:4628)
  at com.ctc.wstx.sr.BasicStreamReader.readCoalescedText(BasicStreamReader.java:4126)
  at com.ctc.wstx.sr.BasicStreamReader.finishToken(BasicStreamReader.java:3701)
  at com.ctc.wstx.sr.BasicStreamReader.safeFinishToken(BasicStreamReader.java:3649)
  ... 26 more

Also, is there a setting by which I can change the depth of the backtrace? That would be helpful for showing the complete stack instead of "... 26 more".
*Pranav Prakash* "temet nosce" Twitter <http://twitter.com/pranavprakash> | Blog <http://blog.myblive.com> | Google <http://www.google.com/profiles/pranny> On Mon, Sep 19, 2011 at 14:16, Pranav Prakash wrote: > > Hi List, > > I tried Solr 3.4.0 today and while indexing I got the error > java.lang.RuntimeException: [was class java.io.CharConversionException] > Invalid UTF-8 middle byte 0x73 (at char #66611, byte #65289) > > My earlier version was Solr 1.4 and this same document went into index > successfully. Looking around, I see issue > https://issues.apache.org/jira/browse/SOLR-2381 which seems to fix the > issue. I thought this patch is already applied to Solr 3.4.0. Is there > something I am missing? > > Is there anything else I need to mention? Logs/ My document details etc.? > > *Pranav Prakash* > > "temet nosce" > > Twitter <http://twitter.com/pranavprakash> | Blog<http://blog.myblive.com> | > Google <http://www.google.com/profiles/pranny> >
Re: java.io.CharConversionException While Indexing in Solr 3.4
I managed to resolve this issue. It turns out that the problem was a faulty XML file being generated by the ruby-solr gem. I installed libxml-ruby and rsolr, and used the rsolr gem instead of ruby-solr. Also, if you face this kind of issue, the test-utf8.sh file included in exampledocs is a good way to test Solr's behaviour towards UTF-8 chars. Great work Solr team, and special thanks to Erik Hatcher. *Pranav Prakash* "temet nosce" Twitter <http://twitter.com/pranavprakash> | Blog <http://blog.myblive.com> | Google <http://www.google.com/profiles/pranny> On Mon, Sep 19, 2011 at 15:54, Pranav Prakash wrote: > Just in case someone might be interested, here is the log > > SEVERE: java.lang.RuntimeException: [was class java.io.CharConversionException] Invalid UTF-8 middle byte 0x73 (at char #66641, byte #65289) > [rest of the quoted stack trace trimmed - see the previous message in this thread]
Re: Stemming and other tokenizers
I have a similar use case, but slightly more flexible and straight forward. In my case, I have a field "language" which stores 'en', 'es' or whatever the language of the document is. Then the field 'transcript' stores the actual content which is in the language as described in language field. Following up with the conversation, is this how I am supposed to proceed: 1. Create one field type in my schema per supported language. This would cause me to create ~30 fields. 2. Since, I already know the language of my content, I can skip SOLR-1979 (which is expected in Solr 3.5) The point where I am unclear is, how do I specify at Index time, to use a certain field for a certain language? *Pranav Prakash* "temet nosce" Twitter <http://twitter.com/pranavprakash> | Blog <http://blog.myblive.com> | Google <http://www.google.com/profiles/pranny> On Mon, Sep 12, 2011 at 20:55, Jan Høydahl wrote: > Hi, > > Do they? Can you explain the layout of the documents? > > There are two ways to handle multi lingual docs. If all your docs have both > an English and a Norwegian version, you may either split these into two > separate documents, each with the "language" field filled by LangId - which > then also lets you filter by language. Or you may assign a title_en and > title_no to the same document (expand with more fields if you have more > languages per document), and keep it as one document. Your client will then > be adapted to search the language(s) that the user wants. > > If one document has multiple languages within the same field, e.g. "body", > say one paragraph of English and the next is Norwegian, then we currently do > not have any capability in Solr to apply different analysis (tokenization, > stemming etc) to each paragraph. > > -- > Jan Høydahl, search solution architect > Cominvent AS - www.cominvent.com > Solr Training - www.solrtraining.com > > On 12. sep. 2011, at 11:37, Manish Bafna wrote: > > > What is single document has multiple languages? > > > > On Mon, Sep 12, 2011 at 2:23 PM, Jan Høydahl > wrote: > > > >> Hi > >> > >> Everybody else use dedicated field per language, so why can't you? > >> Please explain your use case, and perhaps we can better help understand > >> what you're trying to do. > >> Do you always know the query language in advance? > >> > >> -- > >> Jan Høydahl, search solution architect > >> Cominvent AS - www.cominvent.com > >> Solr Training - www.solrtraining.com > >> > >> On 12. sep. 2011, at 08:28, Patrick Sauts wrote: > >> > >>> I can't create one field per language, that is the problem but I'll dig > >> into > >>> it following your indications. > >>> I let you know what I could come out with. > >>> > >>> Patrick. > >>> > >>> 2011/9/11 Jan Høydahl > >>> > >>>> Hi, > >>>> > >>>> You'll not be able to detect language and change stemmer on the same > >> field > >>>> in one go. You need to create one fieldType in your schema per > language > >> you > >>>> want to use, and then use LanguageIdentification (SOLR-1979) to do the > >> magic > >>>> of detecting language and renaming the field. If you set > >>>> langid.override=false, languid.map=true and populate your "language" > >> field > >>>> with the known language, you will probably get the desired effect. > >>>> > >>>> -- > >>>> Jan Høydahl, search solution architect > >>>> Cominvent AS - www.cominvent.com > >>>> Solr Training - www.solrtraining.com > >>>> > >>>> On 10. sep. 
2011, at 03:24, Patrick Sauts wrote: > >>>> > >>>>> Hello, > >>>>> > >>>>> > >>>>> > >>>>> I want to implement some king of AutoStemming that will detect the > >>>> language > >>>>> of a field based on a tag at the start of this field like #en# my > field > >>>> is > >>>>> stored on disc but I don't want this tag to be stored. Is there a way > >> to > >>>>> avoid this field to be stored ? > >>>>> > >>>>> To me all the filters and the tokenizers interact only with the > indexed > >>>>> field and not the stored one. > >>>>> > >>>>> Am I wrong ? > >>>>> > >>>>> Is it possible to you to do such a filter. > >>>>> > >>>>> > >>>>> > >>>>> Patrick. > >>>>> > >>>> > >>>> > >> > >> > >
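Since the language of each document is known up front, the index-time routing can be done entirely by the indexing client: define one analysed field per language in the schema and have the client write the transcript into the field that matches the stored language code. A sketch with illustrative names (the actual fieldType definitions would carry the per-language stemmers and stopword lists):

  <!-- schema.xml: one field (and fieldType) per supported language -->
  <field name="transcript_en" type="text_en" indexed="true" stored="true"/>
  <field name="transcript_es" type="text_es" indexed="true" stored="true"/>
  <!-- ...roughly 30 of these, one per language -->

At index time the client populates transcript_en when the language field is 'en', transcript_es when it is 'es', and so on; at query time the application searches the field(s) matching the user's language (for example via qf in a dismax handler). SOLR-1979's detection and field mapping is only needed when the language is not already known.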
StopWords coming in Top 10 terms despite using StopFilterFactory
Hi List, I included StopFilterFactory and I can see it taking action in the Analyzer Interface. However, when I go to Schema Analyzer, I see those stop words in the top 10 terms. Is this normal? *Pranav Prakash* "temet nosce" Twitter <http://twitter.com/pranavprakash> | Blog <http://blog.myblive.com> | Google <http://www.google.com/profiles/pranny>
Re: StopWords coming in Top 10 terms despite using StopFilterFactory
> You've got CommonGramsFilterFactory and StopFilterFactory both using
> stopwords.txt, which is a confusing configuration. Normally you'd want one
> or the other, not both ... but if you did legitimately have both, you'd want
> them to each use a different wordlist.

Maybe I am wrong, but my intention in using both of them is this - first, I want to use phrase queries, so I used CommonGramsFilterFactory. Secondly, I don't want those stopwords in my index, so I have used StopFilterFactory to remove them.

> The commongrams filter turns each found occurrence of a word in the file
> into two tokens - one prepended with the token before it, one appended with
> the token after it. If it's the first or last term in a field, it only
> produces one token. When it gets to the stopfilter, the combined terms no
> longer match what's in stopwords.txt, so no action is taken.
>
> If I had to guess, what you are seeing in the top 10 terms is the
> concatenation of your most common stopword with another word. If it were
> English, I would guess that to be "of_the" or something similar. If my
> guess is wrong, then I'm not sure what's going on, and some cut/paste of
> what you're actually seeing might be in order.

Here are the top terms I am actually seeing:

term    frequency
to      26164
and     25804
the     25566
of      25022
a       24918
in      24590
for     23646
n       23588
with    23055
is      22510

> Did you do delete and do a full reindex after you changed your schema?

Yup, I did that a couple of times.

> Thanks,
> Shawn

*Pranav Prakash* "temet nosce" Twitter <http://twitter.com/pranavprakash> | Blog <http://blog.myblive.com/> | Google <http://www.google.com/profiles/pranny>
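For reference, the usual way to get fast phrase queries on stopwords without also keeping the bare stopwords prominent is to let CommonGrams do both jobs - CommonGramsFilterFactory at index time and CommonGramsQueryFilterFactory at query time - rather than stacking a StopFilter on top. A minimal fieldType sketch (the surrounding analyzer details are assumptions, not taken from the original schema):

  <fieldType name="text_cg" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.CommonGramsFilterFactory" words="stopwords.txt" ignoreCase="true"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.CommonGramsQueryFilterFactory" words="stopwords.txt" ignoreCase="true"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

With this chain the single stopword tokens still exist in the index alongside the "word_word" grams (which is why they show up in the top terms); if they really must be dropped, the StopFilter has to run after CommonGrams so the grams are already formed before the bare stopwords are removed.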
Can't use ms() function on non-numeric legacy date field
Hi, I have been trying to boost my recent documents, using what is described here: http://wiki.apache.org/solr/FunctionQuery#Date_Boosting My date field looks like [the schema snippet was stripped by the mailing list]. However, upon trying to do ms(NOW, created_at), it shows the error: Can't use ms() function on non-numeric legacy date field created_at *Pranav Prakash* "temet nosce" Twitter <http://twitter.com/pranavprakash> | Blog <http://blog.myblive.com> | Google <http://www.google.com/profiles/pranny>
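That error message indicates the created_at field uses the older solr.DateField type; the ms() function needs a Trie-based date field. A hedged schema sketch of the usual fix (the type and field names are assumed to mirror the original schema):

  <fieldType name="tdate" class="solr.TrieDateField" precisionStep="6" positionIncrementGap="0"/>
  <field name="created_at" type="tdate" indexed="true" stored="true"/>

Changing the field type requires a full reindex, after which recip(ms(NOW,created_at),...) style boosts work as described on the wiki page.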
Suggestions on how to perform infrastructure migration from 1.4 to 3.4?
Hi List, Our production search infrastructure is - 1 indexing master and 2 identical serving slaves. They are all Solr 1.4 beasts. Apart from this, we have 1 beast on Solr 3.4, which we have benchmarked against our production setup (for performance and relevancy), and we would now like to upgrade our production setup. Something like this has not happened before in our organization. I'd like the community's opinions on the ways in which this migration can be performed. Will there be any downtime, and if so, for roughly how long? What are some of the common issues that might come up along the way? *Pranav Prakash* "temet nosce" Twitter <http://twitter.com/pranavprakash> | Blog <http://blog.myblive.com> | Google <http://www.google.com/profiles/pranny>
How to achieve Indexing @ 270GiB/hr
Greetings, While going through the article 265% indexing speedup with Lucene's concurrent flushing <http://java.dzone.com/news/265-indexing-speedup-lucenes?mz=33057-solr_lucene>, I was stunned by the possibilities for increasing indexing speed. I'd like to take inputs from everyone here as to how to achieve this kind of speed. As far as I understand, there are two broad ways of feeding data to Solr - 1. Using DataImportHandler 2. Using HTTP to POST docs to Solr. The speeds the article describes seem too much to expect from the second approach. Or is it possible with multiple instances feeding docs to Solr? My current setup does the following - 1. Execute SQL queries to create the set of documents that needs to be fed. 2. Go through the columns one by one, create XML for them, and send it over to Solr in batches of at most 500 docs. Even when using DataImportHandler, in what ways could this be optimized? If I am able to solve the problem of indexing data in our current setup, my life would become a lot easier. *Pranav Prakash* "temet nosce" Twitter <http://twitter.com/pranavprakash> | Blog <http://blog.myblive.com> | Google <http://www.google.com/profiles/pranny>
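Since the current pipeline already builds XML by hand and POSTs it, here is a small sketch of the batching approach using the rsolr gem (the connection URL, batch size and the docs array are assumptions, not taken from the original script):

  require 'rsolr'

  solr = RSolr.connect :url => 'http://localhost:8983/solr'

  # docs is an array of hashes, one per document, keyed by Solr field name
  docs.each_slice(500) do |batch|
    solr.add(batch)        # one HTTP request per 500 docs
  end
  solr.commit              # commit once at the end instead of per batch

Sending larger batches, indexing from several worker processes in parallel, and committing only once at the end (or relying on a generous autoCommit) usually matters far more for throughput than the choice of transport.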
Painfully slow indexing
Hi guys, I have set up a Solr instance and, upon attempting to index documents, the whole process is painfully slow. I will try to put as much info as I can in this mail. Please feel free to ask me for anything else that might be required. I am sending documents in batches not exceeding 2,000. The size of each batch varies but is usually around 10-15MiB. My indexing script tells me that Solr took T seconds to add N documents of size S. For the same data, the Solr log add QTime is QT. Some sample data:

N         | S                | T     | QT
390 docs  | 3,478,804 Bytes  | 14.5s | 2297
852 docs  | 6,039,535 Bytes  | 25.3s | 4237
1345 docs | 11,147,512 Bytes | 47s   | 8543
1147 docs | 9,457,717 Bytes  | 44s   | 2297
1096 docs | 13,058,204 Bytes | 54.3s | 8782

The time T includes the time to convert an array of Hash objects into XML, POST it to Solr, and receive the acknowledgement from Solr. Clearly, there is a huge difference between T and QT. After a lot of effort, I have no clue why these times do not match. The server has 16 cores and 48GiB RAM. JVM options are -Xms5000M -Xmx5000M -XX:+UseParNewGC. I believe my indexing is getting slow. The relevant portion of my configuration is as follows. On a related note, every document has one dynamic field. At this rate, it takes me ~30hrs to do a full index of my database. I would really appreciate the kindness of the community in getting this indexing faster. [The indexing config snippet was stripped by the mailing list; only the bare values survive: false 10 10 2048 2147483647 300 1000 5 256 10 false true true 1 0 false 10] *Pranav Prakash* "temet nosce" Twitter <http://twitter.com/pranavprakash> | Blog <http://blog.myblive.com> | Google <http://www.google.com/profiles/pranny>
Re: Painfully slow indexing
Hey guys, Your responses are welcome, but I still haven't gained a lot of improvements *Are you posting through HTTP/SOLRJ?* I am using RSolr gem, which internally uses Ruby HTTP lib to POST document to Solr *Your script time 'T' includes time between sending POST request -to- the response fetched after successful response right??* Correct. It also includes the time taken to convert all those documents from a Ruby Hash to XML. *generate the ready-for-indexing XML documents on a file system* Alain, I have somewhere 6m documents for Indexing. You mean to say that I should convert all of it into one XML file and then index? *are you calling commit after your batches or do an optimize by any chance?* I am not optimizing, but I am performing an autocommit every 10 docs. *Pranav Prakash* "temet nosce" Twitter <http://twitter.com/pranavprakash> | Blog <http://blog.myblive.com> | Google <http://www.google.com/profiles/pranny> On Fri, Oct 21, 2011 at 16:32, Simon Willnauer < simon.willna...@googlemail.com> wrote: > On Wed, Oct 19, 2011 at 3:58 PM, Pranav Prakash wrote: > > Hi guys, > > > > I have set up a Solr instance and upon attempting to index document, the > > whole process is painfully slow. I will try to put as much info as I can > in > > this mail. Pl. feel free to ask me anything else that might be required. > > > > I am sending documents in batches not exceeding 2,000. The size of each > of > > them depends but usually is around 10-15MiB. My indexing script tells me > > that Solr took T seconds to add N documents of size S. For the same data, > > the Solr Log add QTime is QT. Some of the sample data are: > > > > N ST QT > > - > > 390 docs | 3,478,804 Bytes | 14.5s| 2297 > > 852 docs | 6,039,535 Bytes | 25.3s| 4237 > > 1345 docs | 11,147,512 Bytes | 47s | 8543 > > 1147 docs | 9,457,717 Bytes | 44s | 2297 > > 1096 docs | 13,058,204 Bytes | 54.3s | 8782 > > > > The time T includes the time of converting an array of Hash objects into > > XML, POSTing it to Solr and response acknowledged from Solr. Clearly, > there > > is a huge difference between both the time T and QT. After a lot of > efforts, > > I have no clue why these times do not match. > > > > The Server has 16 cores, 48GiB RAM. JVM options are -Xms5000M -Xmx5000M > > -XX:+UseParNewGC > > > > I believe my Indexing is getting slow. Relevant portion from my schema > file > > are as follows. On a related note, every document has one dynamic field. > > Based on this rate, it takes me ~30hrs to do a full index of my database. > > I would really appreciate kindness of community in order to get this > > indexing faster. > > > > > > > > false > > > > > > > > 10 > > > > 10 > > > > > > > > 2048 > > > > 2147483647 > > > > 300 > > > > 1000 > > > > 5 > > > > 256 > > > > 10 > > > > false > > > > > > > > > > > > > > > > true > > > > true > > > > > > > > 1 > > > > 0 > > > > > > > > false > > > > > > > > > > > > > > > > 10 > > > > > > > > > > > > > > *Pranav Prakash* > > > > "temet nosce" > > > > Twitter <http://twitter.com/pranavprakash> | Blog < > http://blog.myblive.com> | > > Google <http://www.google.com/profiles/pranny> > > > > hey, > > are you calling commit after your batches or do an optimize by any chance? > > I would suggest you to stream your documents to solr and try to commit > only if you really need to. Set your RAM Buffer to something between > 256 and 320 MB and remove the maxBufferedDocs setting completely. You > can also experiment with your merge settings a little and 10 merging > threads seem to be a lot. 
I know you have lots of CPU but IO will be > the bottleneck here. > > simon >
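To put Simon's advice into config form, a rough Solr 3.x solrconfig.xml sketch (the element names are the stock ones; the values are illustrative assumptions, not the poster's actual settings):

  <indexDefaults>
    <!-- let RAM usage drive segment flushes; omit maxBufferedDocs entirely -->
    <ramBufferSizeMB>320</ramBufferSizeMB>
    <mergeFactor>10</mergeFactor>
  </indexDefaults>

Beyond that, committing rarely (or letting a generous autoCommit handle it) and streaming documents in large batches usually closes most of the gap between the client-side time T and the server-side QTime.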
How to programmatically check if the index is optimized or not?
Hi, After the commit, my optimize usually takes 20 minutes. The thing is that I need to know programmatically whether the optimization has completed or not. Is there an API call through which I can know the status of the optimize? *Pranav Prakash* "temet nosce" Twitter <http://twitter.com/pranavprakash> | Blog <http://blog.myblive.com> | Google <http://www.google.com/profiles/pranny>
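One low-tech possibility, assuming the stock admin handlers are enabled (host, port and core path are placeholders): poll the Luke request handler, which reports whether the index is optimized.

  http://localhost:8983/solr/admin/luke?numTerms=0&wt=json

The index section of the response includes an optimized flag (along with version and document counts), so a script can issue the optimize and then poll this URL until optimized is true. Note also that the optimize call itself blocks until completion unless waitSearcher/waitFlush are set to false, so simply waiting for the HTTP response of the optimize request is often enough.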
Highlighting uses lots of memory and eventually slows down Solr
Hi Group, I would like to have highlighting for search, and I have the fields indexed with the following schema (Solr 3.4): [the schema snippet was stripped by the mailing list] And the following highlighter config (the element names were stripped; only the bare values survive: 100, 20, 0.5, [-\w ,/\n\"']{20,200}). The problem is that when I turn on highlighting, I face memory issues. The memory usage on the system goes higher and higher until it consumes almost all the memory (I don't receive OOM errors; there is always around 300 MB free). The total memory I have is 48GiB. My index size is 138GiB and there are about 10m documents in the index. I also get the following warning, but I am not sure how to address it: WARNING: Deprecated syntax found. [the element names were stripped; the warning is about moving the old highlighting config into the newer searchComponent form] My Solr log with highlighting turned on looks something like this: [core0] webapp=/solr path=/select params={mm=3<90%25&qf=title^2&hl.simple.pre=&hl.fl=title,transcript,transcript_en&wt=ruby&hl=true&rows=12&defType=dismax&fl=id,title,description&debugQuery=false&start=0&q=asdfghjkl&bf=recip(ms(NOW,created_at),1.88e-11,1,1)&hl.simple.post=&ps=50} Any help on this would be greatly appreciated. Thanks in advance!! *Pranav Prakash* "temet nosce" Twitter <http://twitter.com/pranavprakash> | Blog <http://blog.myblive.com> | Google <http://www.google.com/profiles/pranny>
Re: Highlighting uses lots of memory and eventually slows down Solr
No response!! Bumping it up. *Pranav Prakash* "temet nosce" Twitter <http://twitter.com/pranavprakash> | Blog <http://blog.myblive.com> | Google <http://www.google.com/profiles/pranny> On Fri, Dec 9, 2011 at 14:11, Pranav Prakash wrote: > Hi Group, > > I would like to have highlighting for search and I have the fields indexed with the following schema (Solr 3.4) > [rest of the quoted message trimmed - see the original post above]
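Two things that commonly help with highlighter memory and CPU on large text fields (a hedged sketch - the field name transcript and the sizes below are assumptions): cap how much of each field is analysed per snippet, and give the highlighter term vectors so it does not have to re-analyse the stored text.

  <!-- schema.xml: store term vectors for the highlighted field (requires a reindex) -->
  <field name="transcript" type="text" indexed="true" stored="true"
         termVectors="true" termPositions="true" termOffsets="true"/>

  <!-- request parameters (or handler defaults) -->
  &hl.maxAnalyzedChars=51200&hl.useFastVectorHighlighter=true

hl.useFastVectorHighlighter needs all three termVector* attributes above; hl.maxAnalyzedChars alone already bounds the per-document highlighting work if re-indexing 138GiB is not an option.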
Something like "featured results" in solr response?
Hi, I believe there is a feature in Solr which allows returning a set of "featured" documents for a query. I read about it a couple of months back, and now that I have decided to work on it, I somehow can't find the reference. Here is the description - for a search keyword, apart from the results generated by Solr (which are based on relevancy score), there is another set of documents which just comes up. It is very similar to the "sponsored results" feature of Google. Can you guys point me to the appropriate resources for this? *Pranav Prakash* "temet nosce" Twitter <http://twitter.com/pranavprakash> | Blog <http://blog.myblive.com> | Google <http://www.google.com/profiles/pranny>
Re: Something like "featured results" in solr response?
Thanks a lot :-) This is exactly what I had read back then. However, going through it now, it seems that everytime a document needs to be elevated, it has to be in the config file. Which means that Solr should be restarted. This does not make a lot of sense for a production environment, where Solr restarts are as infrequent as config changes. What could be a sound way to implement this? *Pranav Prakash* "temet nosce" Twitter <http://twitter.com/pranavprakash> | Blog <http://blog.myblive.com> | Google <http://www.google.com/profiles/pranny> 2012/1/30 Rafał Kuć > Hello! > > Please look at http://wiki.apache.org/solr/QueryElevationComponent. > > -- > Regards, > Rafał Kuć > Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch > > > Hi, > > > I believe, there is a feature in Solr, which allows to return a set of > > "featured" documents for a query. I did read it couple of months back, > and > > now when I have decided to work on it, I somehow can't find it's > reference. > > > Here is the description - For a search keyword, apart from the results > > generated by Solr (which is based on relevancy, score), there is another > > set of documents which just comes up. It is very much similar to the > > "sponsored results" feature of Google. > > > Can you guys point me to the appropriate resources for the same? > > > > *Pranav Prakash* > > > "temet nosce" > > > Twitter <http://twitter.com/pranavprakash> | Blog < > http://blog.myblive.com> | > > Google <http://www.google.com/profiles/pranny> > > > > >
Re: Something like "featured results" in solr response?
Wow, this looks interesting. *Pranav Prakash* "temet nosce" Twitter <http://twitter.com/pranavprakash> | Blog <http://blog.myblive.com> | Google <http://www.google.com/profiles/pranny> On Mon, Jan 30, 2012 at 21:16, Erick Erickson wrote: > There's the tricky line: > "If the file exists in the /conf/ directory it will be loaded once at > start-up. If it exists in the data directory, it will be reloaded for > each IndexReader." > > on the page: http://wiki.apache.org/solr/QueryElevationComponent > > Which basically means that if your config file is in the right directory, > it'll be reloaded whenever the index changes, i.e. when a replication > happens in a master/slave setup or when a commit happens on > a single machine used for both indexing and searching. > > Best > Erick > > On Mon, Jan 30, 2012 at 8:31 AM, Pranav Prakash wrote: > > Thanks a lot :-) This is exactly what I had read back then. However, > going > > through it now, it seems that everytime a document needs to be elevated, > it > > has to be in the config file. Which means that Solr should be restarted. > > This does not make a lot of sense for a production environment, where > Solr > > restarts are as infrequent as config changes. > > > > What could be a sound way to implement this? > > > > *Pranav Prakash* > > > > "temet nosce" > > > > Twitter <http://twitter.com/pranavprakash> | Blog < > http://blog.myblive.com> | > > Google <http://www.google.com/profiles/pranny> > > > > > > 2012/1/30 Rafał Kuć > > > >> Hello! > >> > >> Please look at http://wiki.apache.org/solr/QueryElevationComponent. > >> > >> -- > >> Regards, > >> Rafał Kuć > >> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch > >> > >> > Hi, > >> > >> > I believe, there is a feature in Solr, which allows to return a set of > >> > "featured" documents for a query. I did read it couple of months back, > >> and > >> > now when I have decided to work on it, I somehow can't find it's > >> reference. > >> > >> > Here is the description - For a search keyword, apart from the results > >> > generated by Solr (which is based on relevancy, score), there is > another > >> > set of documents which just comes up. It is very much similar to the > >> > "sponsored results" feature of Google. > >> > >> > Can you guys point me to the appropriate resources for the same? > >> > >> > >> > *Pranav Prakash* > >> > >> > "temet nosce" > >> > >> > Twitter <http://twitter.com/pranavprakash> | Blog < > >> http://blog.myblive.com> | > >> > Google <http://www.google.com/profiles/pranny> > >> > >> > >> > >> > >> >
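For reference, the setup Erick describes looks roughly like this (a sketch - the component name and file placement follow the usual convention, and the doc ids are made up). Keeping elevate.xml in the data directory rather than conf/ is what makes it reload on each new IndexReader:

  <!-- solrconfig.xml -->
  <searchComponent name="elevator" class="solr.QueryElevationComponent">
    <str name="queryFieldType">string</str>
    <str name="config-file">elevate.xml</str>
  </searchComponent>

  <!-- elevate.xml (placed in the data dir, re-read on commit/replication) -->
  <elevate>
    <query text="solr tutorial">
      <doc id="DOC1"/>
      <doc id="DOC2" exclude="true"/>
    </query>
  </elevate>

The "elevator" component also has to be added to the search handler's last-components list so it runs as part of normal queries.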
Typical Cache Values
Based on the hit ratios of my caches, they seem to be pretty low. Here they are. What are typical values in your production setups? What are some of the things that can be done to improve the ratios?

queryResultCache
  lookups : 3234602
  hits : 496
  hitratio : 0.00
  inserts : 3234239
  evictions : 3230143
  size : 4096
  warmupTime : 8886
  cumulative_lookups : 3465734
  cumulative_hits : 526
  cumulative_hitratio : 0.00
  cumulative_inserts : 3465208
  cumulative_evictions : 3457151

documentCache
  lookups : 17647360
  hits : 11935609
  hitratio : 0.67
  inserts : 5711851
  evictions : 5707755
  size : 4096
  warmupTime : 0
  cumulative_lookups : 19009142
  cumulative_hits : 12813630
  cumulative_hitratio : 0.67
  cumulative_inserts : 6195512
  cumulative_evictions : 6187460

fieldValueCache
  lookups : 0
  hits : 0
  hitratio : 0.00
  inserts : 0
  evictions : 0
  size : 0
  warmupTime : 0
  cumulative_lookups : 0
  cumulative_hits : 0
  cumulative_hitratio : 0.00
  cumulative_inserts : 0
  cumulative_evictions : 0

filterCache
  lookups : 30059278
  hits : 28813869
  hitratio : 0.95
  inserts : 1245744
  evictions : 1245232
  size : 512
  warmupTime : 28005
  cumulative_lookups : 32155745
  cumulative_hits : 30845811
  cumulative_hitratio : 0.95
  cumulative_inserts : 1309934
  cumulative_evictions : 1309245

*Pranav Prakash* "temet nosce" Twitter <http://twitter.com/pranavprakash> | Blog <http://blog.myblive.com> | Google <http://www.google.com/profiles/pranny>
Re: Typical Cache Values
> > * > * > This is not unusual, but there's also not much reason to give this much > memory in your case. This is the cache that is hit when a user pages > through result set. Your numbers would seem to indicate one of two things: > 1> your window is smaller than 2 pages, see solrconfig.xml, > > or > 2> your users are rarely going to the next page. > > this cache isn't doing you much good, but then it's also not using that > much in the way of resources. > > True it is. Although the queryResultWindowSize is 30, I will be reducing it to 4 or so. And yes, we have observed that mostly people don't go beyond the first page > > documentCache > > > > lookups : 17647360 > > hits : 11935609 > > hitratio : 0.67 > > inserts : 5711851 > > evictions : 5707755 > > size : 4096 > > warmupTime : 0 > > cumulative_lookups : 19009142 > > cumulative_hits : 12813630 > > cumulative_hitratio : 0.67 > > cumulative_inserts : 6195512 > > cumulative_evictions : 6187460 > > > > Again, this is actually quite reasonable. This cache > is used to hold document data, and often doesn't have > a great hit ratio. It is necessary though, it saves quite > a bit of disk seeks when servicing a single query. > > > > > fieldValueCache > > > > lookups : 0 > > hits : 0 > > hitratio : 0.00 > > inserts : 0 > > evictions : 0 > > size : 0 > > warmupTime : 0 > > cumulative_lookups : 0 > > cumulative_hits : 0 > > cumulative_hitratio : 0.00 > > cumulative_inserts : 0 > > cumulative_evictions : 0 > > > > Not doing much in the way of faceting, are you? > > No. We don't facet results > > > > filterCache > > > > lookups : 30059278 > > hits : 28813869 > > hitratio : 0.95 > > inserts : 1245744 > > evictions : 1245232 > > size : 512 > > warmupTime : 28005 > > cumulative_lookups : 32155745 > > cumulative_hits : 30845811 > > cumulative_hitratio : 0.95 > > cumulative_inserts : 1309934 > > cumulative_evictions : 1309245 > > > > > > Not a bad hit ratio here, this is where > fq filters are stored. One caution here; > it is better to break out your filter > queries where possible into small chunks. > Rather than write fq=field1:val1 AND field2:val2, > it's better to write fq=field1:val1&fq=field2:val2 > Think of this cache as a map with the query > as the key. If you write the fq the first way above, > subsequent fqs for either half won't use the cache. > That was a great advise. We do use the former approach but going forward we would stick to the latter one. Thanks, Pranav
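Erick's point about splitting filters, illustrated with two hypothetical fields (the names are made up):

  # cached as a single entry; a later query filtering only on lang=en cannot reuse it
  fq=lang:en AND type:video

  # cached as two independent entries; each can be reused by other queries
  fq=lang:en&fq=type:video

Each distinct fq string becomes its own filterCache key, which is why the second form gives better cache reuse, at the cost of one extra cache entry per filter.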
Deduplication in MLT
I have an implementation of Deduplication as mentioned at http://wiki.apache.org/solr/Deduplication. It is helpful in grouping search results. I would like to achieve the same functionality in my MLT queries, where the result set should include grouped documents. What is a good way to do the same? *Pranav Prakash* "temet nosce"
Questions about Solr MLT Handler, performance, Indexes
Hi folks, I am new to Solr and using it for a web application. I have been experimenting with it and have a couple of doubts which I was unable to resolve via Google. Our portal allows users to upload content, and the fields we use are - title, description, transcript, tags. Each piece of content also has counters - hits, downloads, favorites - and an auto-calculated value - rating. We have a master/slave configuration (1 master, 2 slaves). Solr version: 1.4.0 Java version "1.6.0_16" Java(TM) SE Runtime Environment (build 1.6.0_16-b01) Java HotSpot(TM) 64-Bit Server VM (build 14.2-b01, mixed mode) 32GiB RAM and 8 cores Index size: ~100 GiB One of my use cases is to find related documents given a document ID. I have been using the MoreLikeThis handler to generate related documents, using a DisMax query. Now, I have to filter out certain content from the results Solr gives me. So, if for a document id X Solr returns a list of 20 related documents, I want to apply a filter so that these 20 documents do not contain "black listed words". This is fairly straightforward in a direct query using the NOT operator. How is it possible to implement similar behavior in MoreLikeThisHandler? Every week we perform a full index of all the documents, plus a nightly incremental indexing. This is done by a script which reads data from MySQL and updates it into Solr. Sometimes the script fails after updating 60% of the documents; a commit has not been performed at this stage. The next cron executes, adds some more documents, and commits them. Will this commit include the current update as well as the earlier uncommitted updates? Are those uncommitted changes (which are stored in a temp file) deleted after some time? Is there a way to clean uncommitted changes? Of late, Solr has started to perform slowly. When Solr is started it responds to requests in ~100ms. Gradually (very gradually) it gets to a point where the average response time of the last 10 queries goes beyond 5000ms, and that is when requests start to pile up. As I compose this mail, an optimize command is being executed, which I hope should help, but to what extent I will need to see. Finally, what happens if the schema of master and slave are different (there exists a field in master which does not exist in slave)? I thought that replication would show me some kind of error, but it went on successfully. Thanks, Pranav
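On the blacklist question: the MoreLikeThis handler generally accepts filter queries (fq) just as the standard handler does - worth verifying on 1.4.0 - so one option is a negative fq. A sketch with made-up host, field and terms:

  http://localhost:8983/solr/mlt?q=id:X&mlt.fl=title,description,transcript,tags&mlt.mintf=1&mlt.mindf=2&fq=-transcript:(badword1 OR badword2)&rows=20

If the handler version in use does not honor fq, a fallback is to over-fetch (say rows=40) and drop blacklisted documents on the application side.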
Removing duplicate documents from search results
How can I remove very similar documents from search results? My scenario is that there are documents in the index which are almost identical (people submitting the same stuff multiple times, sometimes different people submitting the same stuff). Now, when a search is performed for a "keyword", the same document quite frequently comes up multiple times in the top N results. I want to remove those duplicate (or probable duplicate) documents - very similar to what Google does when it says "In order to show you the most relevant results, duplicates have been removed". How can I achieve this functionality using Solr? Does Solr have anything built in, or a plugin, which could help me with it? *Pranav Prakash* "temet nosce" Twitter <http://twitter.com/pranavprakash> | Blog <http://blog.myblive.com> | Google <http://www.google.com/profiles/pranny>
Re: Removing duplicate documents from search results
This approach would definitely work if the two documents are *exactly* the same. But this is very fragile - even if one extra space has been added, the whole hash would change. What I am really looking for is some percentage similarity between documents, so I can remove those documents which are more than 95% similar. *Pranav Prakash* "temet nosce" Twitter <http://twitter.com/pranavprakash> | Blog <http://blog.myblive.com> | Google <http://www.google.com/profiles/pranny> On Thu, Jun 23, 2011 at 15:16, Omri Cohen wrote: > What you need to do, is to calculate some HASH (using any message digest > algorithm you want, md5, sha-1 and so on), then do some reading on solr > field collapse capabilities. Should not be too complicated.. > > *Omri Cohen* > > Co-founder @ yotpo.com | o...@yotpo.com | +972-50-7235198 | +972-3-6036295 > > [rest of the quoted signature and the forwarded copy of the original question trimmed]
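For near-duplicate rather than exact matching, Solr's Deduplication support can use a fuzzy signature. A hedged solrconfig.xml sketch following the wiki's example (the field list and chain name are placeholders for whatever fits the actual schema):

  <updateRequestProcessorChain name="dedupe">
    <processor class="solr.processor.SignatureUpdateProcessorFactory">
      <bool name="enabled">true</bool>
      <str name="signatureField">signature</str>
      <!-- keep near-duplicates in the index and collapse them at query time instead -->
      <bool name="overwriteDupes">false</bool>
      <str name="fields">title,description,transcript</str>
      <str name="signatureClass">org.apache.solr.update.processor.TextProfileSignature</str>
    </processor>
    <processor class="solr.LogUpdateProcessorFactory"/>
    <processor class="solr.RunUpdateProcessorFactory"/>
  </updateRequestProcessorChain>

TextProfileSignature is the fuzzy option (it tolerates small differences such as extra whitespace); with overwriteDupes=true it would drop or overwrite near-duplicates at index time instead. The chain also has to be attached to the update request handler - the parameter name differs between versions (update.processor in older releases, update.chain in newer ones).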
Re: how to index data in solr form database automatically
Cron is a time-based job scheduler in Unix-like computer operating systems. en.wikipedia.org/wiki/Cron *Pranav Prakash* "temet nosce" Twitter <http://twitter.com/pranavprakash> | Blog <http://blog.myblive.com> | Google <http://www.google.com/profiles/pranny> On Fri, Jun 24, 2011 at 12:26, Romi wrote: > Yeah i am using data-import to get data from database and indexing it. but > what is cron can you please provide a link for it > > - > Thanks & Regards > Romi > -- > View this message in context: > http://lucene.472066.n3.nabble.com/how-to-index-data-in-solr-form-database-automatically-tp3102893p3103072.html > Sent from the Solr - User mailing list archive at Nabble.com. >
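As a concrete example (the paths, host and schedule are placeholders), a crontab entry that fires a DataImportHandler delta-import every night at 2am, assuming a /dataimport handler is configured:

  0 2 * * * /usr/bin/curl -s 'http://localhost:8983/solr/dataimport?command=delta-import&clean=false&commit=true' > /dev/null

command=full-import (usually with clean=true) would be the periodic full-rebuild variant of the same idea.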
Custom Handler support in Solr-ruby
Hi, I found the solr-ruby gem (http://wiki.apache.org/solr/solr-ruby) really inflexible in terms of specifying the request handler. The Solr::Request::Select class defines the handler as "select", and all the other request classes inherit from it. And since the methods in Solr::Connection use one of the classes from Solr::Request, I don't see a direct way to use a custom handler (which I have made for MoreLikeThis). Currently, the approach I am using is to build the query URL, do a curl, parse the response and return it. Even if I were to extend the classes, I'd end up making a new Solr::Request::CustomSelect, which would be similar to Solr::Request::Select except for the flexibility to let the user provide a handler (defaulting to 'select'), and then creating separate classes for DisMax and the rest, derived from Solr::Request::CustomSelect. Isn't this too much overhead? Or am I missing something? Also, where can I file bugs against solr-ruby? *Pranav Prakash* "temet nosce" Twitter <http://twitter.com/pranavprakash> | Blog <http://blog.myblive.com> | Google <http://www.google.com/profiles/pranny>
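A lighter-weight possibility - a sketch only, which assumes (as described above) that Select exposes the handler name through an overridable method; the exact method name in the installed gem version should be checked - is to subclass Select once and override just the handler:

  require 'solr'

  # Hypothetical subclass: same behaviour as Select, but pointed at the
  # custom MoreLikeThis handler registered in solrconfig.xml.
  class Solr::Request::MoreLikeThis < Solr::Request::Select
    def handler
      'mlt'
    end
  end

It would then be constructed and sent the same way the existing Select/DisMax requests are; whether Solr::Connection accepts an arbitrary Request subclass like this depends on the gem internals, so treat it as a starting point rather than a drop-in.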
Index Version and Epoch Time?
Hi, I am not sure what the index version value is. It looks like an epoch time, but in my case it points to one month back; however, I can see documents which were added last week in the index. Even after I did a commit, the index version did not change. Isn't it supposed to change on every commit? If not, is there a way to look up the last index time? Also, this page http://wiki.apache.org/solr/SolrReplication#Replication_Dashboard shows a Replication Dashboard. How is this dashboard invoked? Is there a URL which needs to be called? *Pranav Prakash* "temet nosce" Twitter <http://twitter.com/pranavprakash> | Blog <http://blog.myblive.com> | Google <http://www.google.com/profiles/pranny>
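If the ReplicationHandler is configured, it can also be queried directly rather than through the dashboard (host, port and core name below are placeholders):

  http://localhost:8983/solr/core0/replication?command=indexversion
  http://localhost:8983/solr/core0/replication?command=details

indexversion returns the version and generation of the latest replicatable commit, and details returns a fuller report - essentially the same data the Replication Dashboard at /solr/core0/admin/replication/index.jsp renders.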
Re: Removing duplicate documents from search results
I found the deduplication thing really useful. Although I have not yet started to work on it, as there are some other low hanging fruits I've to capture. Will share my thoughts soon. *Pranav Prakash* "temet nosce" Twitter <http://twitter.com/pranavprakash> | Blog <http://blog.myblive.com> | Google <http://www.google.com/profiles/pranny> 2011/6/28 François Schiettecatte > Maybe there is a way to get Solr to reject documents that already exist in > the index but I doubt it, maybe someone else with can chime here here. You > could do a search for each document prior to indexing it so see if it is > already in the index, that is probably non-optimal, maybe it is easiest to > check if the document exists in your Riak repository, it no add it and index > it, and drop if it already exists. > > François > > On Jun 28, 2011, at 8:24 AM, Mohammad Shariq wrote: > > > I am making the Hash from URL, but I can't use this as UniqueKey because > I > > am using UUID as UniqueKey, > > Since I am using SOLR as index engine Only and using Riak(key-value > > storage) as storage engine, I dont want to do the overwrite on duplicate. > > I just need to discard the duplicates. > > > > > > > > 2011/6/28 François Schiettecatte > > > >> Create a hash from the url and use that as the unique key, md5 or sha1 > >> would probably be good enough. > >> > >> Cheers > >> > >> François > >> > >> On Jun 28, 2011, at 7:29 AM, Mohammad Shariq wrote: > >> > >>> I also have the problem of duplicate docs. > >>> I am indexing news articles, Every news article will have the source > URL, > >>> If two news-article has the same URL, only one need to index, > >>> removal of duplicate at index time. > >>> > >>> > >>> > >>> On 23 June 2011 21:24, simon wrote: > >>> > >>>> have you checked out the deduplication process that's available at > >>>> indexing time ? This includes a fuzzy hash algorithm . > >>>> > >>>> http://wiki.apache.org/solr/Deduplication > >>>> > >>>> -Simon > >>>> > >>>> On Thu, Jun 23, 2011 at 5:55 AM, Pranav Prakash > >> wrote: > >>>>> This approach would definitely work is the two documents are > *Exactly* > >>>> the > >>>>> same. But this is very fragile. Even if one extra space has been > added, > >>>> the > >>>>> whole hash would change. What I am really looking for is some %age > >>>>> similarity between documents, and remove those documents which are > more > >>>> than > >>>>> 95% similar. > >>>>> > >>>>> *Pranav Prakash* > >>>>> > >>>>> "temet nosce" > >>>>> > >>>>> Twitter <http://twitter.com/pranavprakash> | Blog < > >>>> http://blog.myblive.com> | > >>>>> Google <http://www.google.com/profiles/pranny> > >>>>> > >>>>> > >>>>> On Thu, Jun 23, 2011 at 15:16, Omri Cohen wrote: > >>>>> > >>>>>> What you need to do, is to calculate some HASH (using any message > >> digest > >>>>>> algorithm you want, md5, sha-1 and so on), then do some reading on > >> solr > >>>>>> field collapse capabilities. Should not be too complicated.. > >>>>>> > >>>>>> *Omri Cohen* > >>>>>> > >>>>>> > >>>>>> > >>>>>> Co-founder @ yotpo.com | o...@yotpo.com | +972-50-7235198 | > >>>> +972-3-6036295 > >>>>>> > >>>>>> > >>>>>> > >>>>>> > >>>>>> My profiles: [image: LinkedIn] <http://www.linkedin.com/in/omric> > >>>> [image: > >>>>>> Twitter] <http://www.twitter.com/omricohe> [image: > >>>>>> WordPress]<http://omricohen.me> > >>>>>> Please consider your environmental responsibility. Before printing > >> this > >>>>>> e-mail message, ask yourself whether you really need a hard copy. 
> >>>>>> IMPORTANT: The contents of this email and any attachments are > >>>> confidential. > >>>>>> They are intended for the named recipien
Re: Index Version and Epoch Time?
Hi, I am facing multiple issues with Solr and I am not sure what happens in each case. I am quite new to Solr and there are some scenarios I'd like to discuss with you. We have a huge volume of documents to be indexed - somewhere around 5 million. We have a full indexer script, which essentially picks up all the documents from the database and updates them into Solr, and an incremental script which adds new documents to Solr. The relevant areas of my config file look like this: [the replication config snippet was stripped by the mailing list; it used the ${enable.master:false} and ${enable.slave:false} properties, replicated after startup and commit, and pointed the slave at http://hostname:port/solr/core0/replication] Sometimes the full indexer script breaks while adding documents to Solr. The script adds the documents and then commits the operation, so when the script breaks, we have a huge amount of data which has been updated but not committed. Next, the incremental index script executes, figures out all the new entries, and adds them to Solr. It works successfully and commits the operation. - Will the commit by the incremental indexer script also commit the previously uncommitted changes made by the full indexer script before it broke? Sometimes during execution, Solr's avg response time (avg resp time for the last 10 requests, read from the log file) goes as high as 9000ms (I am still unclear why - any ideas on how to start hunting for the problem?), so a watchdog process restarts Solr (because otherwise requests queue up at the application server, which causes the app server to crash). In my local environment, I performed an experiment by adding docs to Solr, killing the process and restarting it. I found that the uncommitted changes were applied and searchable, even though the updates were never committed. Could you explain to me how this is happening, or is there a configuration that can be adjusted for this? Also, what would the index state be if, after restarting Solr, a commit is or is not applied? I'd be happy to provide any other information that might be needed. *Pranav Prakash* "temet nosce" Twitter <http://twitter.com/pranavprakash> | Blog <http://blog.myblive.com> | Google <http://www.google.com/profiles/pranny> On Tue, Jun 28, 2011 at 20:55, Shalin Shekhar Mangar wrote: > On Tue, Jun 28, 2011 at 4:18 PM, Pranav Prakash wrote: > > > > > I am not sure what is the index number value? It looks like an epoch > time, > > but in my case, this points to one month back. However, i can see > documents > > which were added last week, to be in the index. > > > > The index version shown on the dashboard is the time at which the most > recent index segment was created. I'm not sure why it has a value older > than > a month if a commit has happened after that time. > > > > > Even after I did a commit, the index number did not change? Isn't it > > supposed to change on every commit? If not, is there a way to look into > the > > last index time? > > > > Yeah, it changes after every commit which added/deleted a document. > > > > Also, this page > > http://wiki.apache.org/solr/SolrReplication#Replication_Dashboard shows > a > > Replication Dashboard. How is this dashboard invoked? Is there any URL > > which > > needs to be called? > > > > > If you have configured replication correctly, the admin dashboard should > show a "Replication" link right next to the "Schema Browser" link. The path > should be /admin/replication/index.jsp > > -- > Regards, > Shalin Shekhar Mangar. >
Dealing with keyword stuffing
I guess most of you have already dealt with - and many of you might still be dealing with - keyword stuffing. Here is my scenario. We have a huge index containing about 6m docs (not sure if that counts as huge :-) ), and every document contains title, description, tags and content (textual data). People have been doing keyword stuffing on the documents, so when a search is made for a "query term", the first results are always the over-optimized ones. So, instead of getting relevant results, people get spam content (highly optimized, keyword-stuffed content) as the first few results. I have tried a couple of things, like giving different boosts to different fields, but almost everything seems to fail. I'd like to know how you guys fixed this. *Pranav Prakash* "temet nosce" Twitter <http://twitter.com/pranavprakash> | Blog <http://blog.myblive.com> | Google <http://www.google.com/profiles/pranny>
Re: Dealing with keyword stuffing
On Thu, Jul 28, 2011 at 08:31, Chris Hostetter wrote: > > : Presumably, they are doing this by increasing tf (term frequency), > : i.e., by repeating keywords multiple times. If so, you can use a custom > : similarity class that caps term frequency, and/or ensures that the > scoring > : increases less than linearly with tf. Please see > In some cases, yes they are repeating keywords multiple times. Stuffing different combinations - Solr, Solr Lucene, Solr Search, Solr Apache, Solr Guide. > > in paticular, using something like SweetSpotSimilarity tuned to know what > values make sense for "good" content in your domain can be useful because > it can actaully penalize docsuments that are too short/long or have term > freqs that are outside of a reasonble expected range. > I am not a Solr expert, But I was thinking in this direction. The ratio of tokens/total_length would be nearer to 1 for a stuffed document, while it would be nearer to 0 for a bogus document. Somewhere between the two lies documents that are more likely to be meaningful. I am not sure how to use SweetSpotSimilarity. I am googling on this, but any useful insights are so much appreciated.
Re: Index
Every indexed document has to have a unique ID associated with it. You may do a search by ID something like http://localhost:/solr/select?q=id:X If you see a result, then the document has been indexed and is searchable. You might also want to check Luke (http://code.google.com/p/luke) to gain more insight about the index. *Pranav Prakash* "temet nosce" Twitter <http://twitter.com/pranavprakash> | Blog <http://blog.myblive.com> | Google <http://www.google.com/profiles/pranny> On Fri, Jul 29, 2011 at 03:40, GAURAV PAREEK wrote: > Yes NICK you are correct ? > how can you check whether it has been indexed by solr, and is searchable? > > On Fri, Jul 29, 2011 at 3:27 AM, Nicholas Chase >wrote: > > > Do you mean, how can you check whether it has been indexed by solr, and > is > > searchable? > > > > Nick > > > > > > On 7/28/2011 5:45 PM, GAURAV PAREEK wrote: > > > >> Hi All, > >> > >> How we can check the particular;ar file is not INDEX in solr ? > >> > >> Regards, > >> Gaurav > >> > >> >
Re: Dealing with keyword stuffing
Cool. So I used SweetSpotSimilarity with default params and I see some improvements. However, I can still see some of the 'stuffed' documents coming up in the results. I feel that SweetSpotSimilarity alone is not enough. Going through http://trec.nist.gov/pubs/trec16/papers/ibm-haifa.mq.final.pdf I figured out that there are other things - pivoted length normalization and term frequency normalization - that need fine tuning too. Should I create a custom Similarity class that overrides all the default behavior? I guess that should help me get more relevant results. Where should I begin with it? Pl. do not assume less obvious things, I am still learning !! :-) *Pranav Prakash* "temet nosce" Twitter <http://twitter.com/pranavprakash> | Blog <http://blog.myblive.com> | Google <http://www.google.com/profiles/pranny> On Thu, Jul 28, 2011 at 17:03, Gora Mohanty wrote: > On Thu, Jul 28, 2011 at 3:48 PM, Pranav Prakash wrote: > [...] > > I am not sure how to use SweetSpotSimilarity. I am googling on this, but > > any useful insights are much appreciated. > > Replace the existing DefaultSimilarity class in schema.xml (look towards > the bottom of the file) with the SweetSpotSimilarity class, e.g., have a > line > like: > > > Regards, > Gora >
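The example line in Gora's reply was stripped by the archive; presumably it is something like the following, placed near the bottom of schema.xml (assuming the Lucene misc contrib jar that contains SweetSpotSimilarity is on the classpath; tuning its length-norm and tf factors in Solr 3.x generally means wrapping it in a small subclass, since this element takes no parameters):

<similarity class="org.apache.lucene.misc.SweetSpotSimilarity"/>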
Re: Solr Incremental Indexing
There could be multiple ways of getting this done, and the exact one depends a lot on factors like: what system are you using? How quickly does the change have to be reflected back into the system? How is the indexing/replication done? Usually, in cases where the tolerance is about 6 hrs (i.e. your DB change won't be reflected in the Solr index for up to 6 hrs), you can set up a cron job to be triggered every 6 hrs. It will pick up all the changes made since the last run, update the index, and commit. In cases where there is a more real-time requirement, there could be a trigger in the application (and not at the DB level), which forks a process to update Solr about the change by means of a delayed task. If using this approach, it is suggested to use autocommit every N documents, where N could be anything depending on your app. *Pranav Prakash* "temet nosce" Twitter <http://twitter.com/pranavprakash> | Blog <http://blog.myblive.com> | Google <http://www.google.com/profiles/pranny> On Sun, Jul 31, 2011 at 02:32, Alexei Martchenko < ale...@superdownloads.com.br> wrote: > I always have a field in my databases called datelastmodified, so whenever > I > update that record, i set it to getdate() - mssql func - and then get all > latest records order by that field. > > 2011/7/29 Mohammed Lateef Hussain > > > Hi > > > > Need some help in Solr incremental indexing approach. > > > > I have built my Solr index using SolrJ API and now want to update the > index > > whenever any changes has been made in > > database. My requirement is not to use DB triggers to call any update > > events. > > > > I want to update my index on the fly whenever my application updates any > > record in database. > > > > Note: My indexing logic to get the required data from DB is some what > > complex and involves many tables. > > > > Please suggest me how can I proceed here. > > > > Thanks > > Lateef > > > > > > -- > > *Alexei Martchenko* | *CEO* | Superdownloads > ale...@superdownloads.com.br | ale...@martchenko.com.br | (11) > 5083.1018/5080.3535/5080.3533 >
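A minimal sketch of the autocommit suggestion above, in the updateHandler section of solrconfig.xml (the thresholds are placeholders to tune per application; whichever limit is hit first triggers the commit):

<updateHandler class="solr.DirectUpdateHandler2">
  <autoCommit>
    <maxDocs>10000</maxDocs>  <!-- commit after this many pending docs -->
    <maxTime>60000</maxTime>  <!-- or after this many milliseconds -->
  </autoCommit>
</updateHandler>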
Re: Solr 3.3 crashes after ~18 hours?
What do you mean by it just crashes? Does the process stop execution? Does it take too long to respond, which might result in lots of 503s in your application? Does the system run out of resources? Are you indexing and serving from the same server? It happened once with us that Solr was performing a commit and then an optimize while the load from the app server was at its peak. This caused slow responses from the search server, which caused requests to get stacked up at the app server, causing 503s. Could you check whether you have similar symptoms? *Pranav Prakash* "temet nosce" Twitter <http://twitter.com/pranavprakash> | Blog <http://blog.myblive.com> | Google <http://www.google.com/profiles/pranny> On Tue, Aug 2, 2011 at 15:31, alexander sulz wrote: > Hello folks, > > I'm using the latest stable Solr release -> 3.3 and I encounter strange > phenomena with it. > After about 19 hours it just crashes, but I can't find anything in the > logs, no exceptions, no warnings, > no suspicious info entries.. > > I have an index-job running from 6am to 8pm every 10 minutes. After each > job there is a commit. > An optimize-job is done twice a day at 12:15pm and 9:15pm. > > Does anyone have an idea what could possibly be wrong or where to look for > further debug info? > > regards and thank you > alex >
Re: PivotFaceting in solr 3.3
From what I know, this is a feature in Solr 4.0, tracked as SOLR-792 in JIRA. Is this what you are looking for? https://issues.apache.org/jira/browse/SOLR-792 *Pranav Prakash* "temet nosce" Twitter <http://twitter.com/pranavprakash> | Blog <http://blog.myblive.com> | Google <http://www.google.com/profiles/pranny> On Wed, Aug 3, 2011 at 10:16, Isha Garg wrote: > Hi All! > > Can anyone tell which patch I should apply to solr 3.3 to enable pivot > faceting in it? > > Thanks in advance! > Isha garg >
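Once on a build that includes SOLR-792, pivot faceting is requested through the facet.pivot parameter; a sketch with hypothetical field names:

http://localhost:8983/solr/select?q=*:*&facet=true&facet.pivot=category,subcategory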
Re: Is optimize needed on slaves if it replicates from optimized master?
That is not true. Replication is roughly a copy of the diff between the >> master and the slave's index. > > In my case, during replication the entire index is copied from master to slave, during which the size of the index grows to a little over double. Then it shrinks back to its original size. Am I doing something wrong? How can I get the master to serve only a delta instead of the whole index, with the slaves merging the new and old index? *Pranav Prakash*
How come this query string starts with wildcard?
While going through my Solr error logs, I found that a user had fired the query - jawapan ujian bulanan thn 4 (bahasa melayu). This was converted to the following for autosuggest purposes - jawapan?ujian?bulanan?thn?4?(bahasa?melayu)* - by the JavaScript code. Solr threw the exception: Cannot parse 'jawapan?ujian?bulanan?thn?4?(bahasa?melayu)*': '*' or '?' not allowed as first character in WildcardQuery. How come this query string begins with a wildcard character? When I changed the query to remove the brackets, everything went smoothly. There were no results, probably because my search index didn't have any. *Pranav Prakash* "temet nosce" Twitter <http://twitter.com/pranavprakash> | Blog <http://blog.myblive.com> | Google <http://www.google.com/profiles/pranny>
Re: Is optimize needed on slaves if it replicates from optimized master?
Very well explained. Thanks. Yes, we do optimize Index before replication. I am not particularly worried about disk space usage. I was more curious of that behavior. *Pranav Prakash* "temet nosce" Twitter <http://twitter.com/pranavprakash> | Blog <http://blog.myblive.com> | Google <http://www.google.com/profiles/pranny> On Wed, Aug 10, 2011 at 19:55, Erick Erickson wrote: > This is expected behavior. You might be optimizing > your index on the master after every set of changes, > in which case the entire index is copied. During this > period, the space on disk will at least double, there's no > way around that. > > If you do NOT optimize, then the slave will only copy changed > segments instead of the entire index. Optimizing isn't > usually necessary except periodically (daily, perhaps weekly, > perhaps never actually). > > All that said, depending on how merging happens, you will always > have the possibility of the entire index being copied sometimes > because you'll happen to hit a merge that merges all segments > into one. > > There are some advanced options that can control some parts > of merging, but you need to get to the bottom of why the whole > index is getting copied every time before you go there. I'd bet > you're issuing an optimize. > > Best > Erick > > On Wed, Aug 10, 2011 at 5:30 AM, Pranav Prakash wrote: > > That is not true. Replication is roughly a copy of the diff between the > >>> master and the slave's index. > >> > >> > > In my case, during replication entire index is copied from master to > slave, > > during which the size of index goes a little over double. Then it shrinks > to > > its original size. Am I doing something wrong? How can I get the master > to > > serve only delta index instead of serving whole index and the slaves > merging > > the new and old index? > > > > *Pranav Prakash* > > >
OOM due to JRE Issue (LUCENE-1566)
Hi, This has probably been discussed a long time back, but I got this error recently on one of my production slaves. SEVERE: java.lang.OutOfMemoryError: OutOfMemoryError likely caused by the Sun VM Bug described in https://issues.apache.org/jira/browse/LUCENE-1566; try calling FSDirectory.setReadChunkSize with a a value smaller than the current chunk size (2147483647) I am currently using Solr 1.4. Going through the JIRA issue comments, I found that this patch applies to 2.9 or above. We are also planning an upgrade to Solr 3.3. Is this patch included in 3.3, so that I don't have to apply it manually? What are the other workarounds for the problem? Thanks in adv. *Pranav Prakash* "temet nosce" Twitter <http://twitter.com/pranavprakash> | Blog <http://blog.myblive.com> | Google <http://www.google.com/profiles/pranny>
Re: OOM due to JRE Issue (LUCENE-1566)
> > > AFAIK, solr 1.4 is on Lucene 2.9.1 so this patch is already applied to > the version you are using. > maybe you can provide the stacktrace and more details about your > problem and report back? > Unfortunately, I have only this much information with me. However, the following are my specifications, in case they are helpful: /usr/bin/java -d64 -Xms5000M -Xmx5000M -XX:+UseParallelGC -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:$GC_LOGFILE -XX:+CMSPermGenSweepingEnabled -Dsolr.solr.home=multicore -Denable.slave=true -jar start.jar 32GiB RAM Any thoughts? Will a switch to a concurrent GC help in any way?
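On the concurrent-GC question: the usual flags to swap the parallel collector for CMS are shown below, as a hedged sketch (heap sizing left as above). Note that the LUCENE-1566 OOM comes from the JRE's handling of very large single reads rather than from garbage-collection pressure, so a collector change alone may not make that particular error go away.

-XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly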
Top 5 high freq words - UpdateProcessorChain or DIH Script?
Hi, I want to store the top 5 highest-frequency non-stopword words. I use DIH to import data. Now I have two approaches - 1. Use a DIH JavaScript transformer to find the top 5 most frequent words and put them in a copy field. The copy field will then stem them and remove stop words using the appropriate tokenizers. 2. Write a custom function for the same and add it to the UpdateRequestProcessor chain. Which of the two would be better suited? I find the first approach rather simple, but the issue is that I won't have access to stop words/synonyms etc. at DIH time. In the second approach, if I add it to the UpdateRequestProcessor chain and insert the function after StopWordsFilterFactory and DuplicateRemoveFilterFactory, would that be a good way of doing this? -- *Pranav Prakash* "temet nosce"
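For the second approach, the custom step would be registered as an update request processor chain in solrconfig.xml. A rough sketch with a hypothetical factory class and parameter names (note that stop word and stemming filters belong to the field analysis chain, so the processor would have to run the field type's analyzer itself, or work on already-cleaned input):

<updateRequestProcessorChain name="top-terms">
  <processor class="com.example.TopTermsUpdateProcessorFactory">
    <str name="sourceField">content</str>
    <str name="destField">top_terms</str>
    <int name="count">5</int>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>

The chain is then selected on the update handler (for example via its defaults) so that it runs for every add.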
DIH XML configs for multi environment
The DIH XML config file has to specify a dataSource. In my case, and possibly for many others, the logon credentials as well as the MySQL server paths differ between environments (dev, stag, prod). I don't want to end up with three different DIH config files, three different handlers and so on. What is a good way to deal with this? *Pranav Prakash* "temet nosce"
Re: DIH XML configs for multi environment
That's cool. Is there something similar for Jetty as well? We use Jetty! *Pranav Prakash* "temet nosce" On Wed, Jul 11, 2012 at 1:49 PM, Rahul Warawdekar < rahul.warawde...@gmail.com> wrote: > Hi Pranav, > > If you are using Tomcat to host Solr, you can define your data source in > context.xml file under tomcat configuration. > You have to refer to this datasource with the same name in all the 3 > environments from DIH data-config.xml. > This context.xml file will vary across 3 environments having different > credentials for dev, stag and prod. > > eg > DIH data-config.xml will refer to the datasource as listed below > type="JdbcDataSource" readOnly="true" /> > > context.xml file which is located under "//conf" folder will > have the resource entry as follows >type="" username="X" password="X" > driverClassName="" > url="" > maxActive="8" > /> > > On Wed, Jul 11, 2012 at 1:31 PM, Pranav Prakash wrote: > > > The DIH XML config file has to be specified dataSource. In my case, and > > possibly with many others, the logon credentials as well as mysql server > > paths would differ based on environments (dev, stag, prod). I don't want > to > > end up coming with three different DIH config files, three different > > handlers and so on. > > > > What is a good way to deal with this? > > > > > > *Pranav Prakash* > > > > "temet nosce" > > > > > > -- > Thanks and Regards > Rahul A. Warawdekar >
How To apply transformation in DIH for multivalued numeric field?
I have a multivalued integer field and a multivalued string field defined in my schema as The DIH entity and field definition for the same goes as The value for the field community_tags comes through correctly as an array of strings. However, the value of the field community_tag_ids is not proper; it comes out as [B@390c0a18 (which looks like a raw Java byte array). I tried chaining NumberFormatTransformer with formatStyle="number" but that throws DataImportHandlerException: Failed to apply NumberFormat on column. Could it be due to NULL values from the database, or because the value is not proper? How do we handle NULL in this case? *Pranav Prakash* "temet nosce"
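The entity and field definitions were stripped from the mail above; judging from the SQL quoted later in the thread, the shape was roughly the following. With the GROUP_CONCAT approach, the RegexTransformer's splitBy attribute is what turns each concatenated string back into multiple values, and it needs to be set on both columns (this reconstruction is an assumption, not the poster's exact config):

<entity name="tags" dataSource="app" transformer="RegexTransformer"
        query="SELECT group_concat(a.id SEPARATOR ',') AS community_tag_ids,
                      group_concat(a.title SEPARATOR ',') AS community_tags
               FROM tags a JOIN tag_dets b ON a.id = b.tag_id
               WHERE b.doc_id = ${document.id}">
  <field column="community_tag_ids" splitBy=","/>
  <field column="community_tags" splitBy=","/>
</entity>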
Re: DIH XML configs for multi environment
That approach would work for core dependent parameters. In my case, the params are environment dependent. I think a simpler approach would be to pass the url param as JVM options, and these XMLs get it from there. I haven't tried it yet. *Pranav Prakash* "temet nosce" On Tue, Jul 17, 2012 at 5:09 PM, Markus Klose wrote: > Hi > > There is one more approach using the property mechanism. > > You could specify the datasource like this: > > > And you can specifiy the properties in the solr.xml in your core > configuration like this: > > > > > > > > Viele Grüße aus Augsburg > > Markus Klose > SHI Elektronische Medien GmbH > > > Adresse: Curt-Frenzel-Str. 12, 86167 Augsburg > > Tel.: 0821 7482633 26 > Tel.: 0821 7482633 0 (Zentrale) > Mobil:0176 56516869 > Fax: 0821 7482633 29 > > E-Mail: markus.kl...@shi-gmbh.com > Internet: http://www.shi-gmbh.com > > Registergericht Augsburg HRB 17382 > Geschäftsführer: Peter Spiske > USt.-ID: DE 182167335 > > > > > > -Ursprüngliche Nachricht- > Von: Rahul Warawdekar [mailto:rahul.warawde...@gmail.com] > Gesendet: Mittwoch, 11. Juli 2012 11:21 > An: solr-user@lucene.apache.org > Betreff: Re: DIH XML configs for multi environment > > http://wiki.eclipse.org/Jetty/Howto/Configure_JNDI_Datasource > http://docs.codehaus.org/display/JETTY/DataSource+Examples > > > On Wed, Jul 11, 2012 at 2:30 PM, Pranav Prakash wrote: > > > That's cool. Is there something similar for Jetty as well? We use Jetty! > > > > *Pranav Prakash* > > > > "temet nosce" > > > > > > > > On Wed, Jul 11, 2012 at 1:49 PM, Rahul Warawdekar < > > rahul.warawde...@gmail.com> wrote: > > > > > Hi Pranav, > > > > > > If you are using Tomcat to host Solr, you can define your data > > > source in context.xml file under tomcat configuration. > > > You have to refer to this datasource with the same name in all the 3 > > > environments from DIH data-config.xml. > > > This context.xml file will vary across 3 environments having > > > different credentials for dev, stag and prod. > > > > > > eg > > > DIH data-config.xml will refer to the datasource as listed below > > > > > type="JdbcDataSource" readOnly="true" /> > > > > > > context.xml file which is located under "//conf" folder > > > will have the resource entry as follows > > >> > type="" username="X" password="X" > > > driverClassName="" > > > url="" > > > maxActive="8" > > > /> > > > > > > On Wed, Jul 11, 2012 at 1:31 PM, Pranav Prakash > > wrote: > > > > > > > The DIH XML config file has to be specified dataSource. In my > > > > case, and possibly with many others, the logon credentials as well > > > > as mysql > > server > > > > paths would differ based on environments (dev, stag, prod). I > > > > don't > > want > > > to > > > > end up coming with three different DIH config files, three > > > > different handlers and so on. > > > > > > > > What is a good way to deal with this? > > > > > > > > > > > > *Pranav Prakash* > > > > > > > > "temet nosce" > > > > > > > > > > > > > > > > -- > > > Thanks and Regards > > > Rahul A. Warawdekar > > > > > > > > > -- > Thanks and Regards > Rahul A. Warawdekar >
Re: How To apply transformation in DIH for multivalued numeric field?
I had tried with splitBy for numeric field, but that also did not worked for me. However I got rid of group_concat and it was all good to go. Thanks a lot!! I really had a difficult time understanding this behavior. *Pranav Prakash* "temet nosce" On Thu, Jul 19, 2012 at 1:34 AM, Dyer, James wrote: > Don't you want to specify "splitBy" for the integer field too? > > Actually though, you shouldn't need to use GROUP_CONCAT and > RegexTransformer at all. DIH is designed to handle "1>many" relations > between parent and child entities by populating all the child fields as > multi-valued automatically. I guess your approach leads to a lot fewer > rows getting sent from your db to Solr though. > > James Dyer > E-Commerce Systems > Ingram Content Group > (615) 213-4311 > > > -Original Message- > From: Pranav Prakash [mailto:pra...@gmail.com] > Sent: Wednesday, July 18, 2012 2:38 PM > To: solr-user@lucene.apache.org > Subject: How To apply transformation in DIH for multivalued numeric field? > > I have a multivalued integer field and a multivalued string field defined > in my schema as > > type="integer" > indexed="true" > stored="true" > multiValued="true" > omitNorms="true" /> > type="text" > indexed="true" > termVectors="true" > stored="true" > multiValued="true" > omitNorms="true" /> > > > The DIH entity and field defn for the same goes as > >dataSource="app" > onError="skip" > transformer="RegexTransformer" > query="..."> > > transformer="RegexTransformer" > query="SELECT > group_concat(a.id SEPARATOR ',') AS community_tag_ids, > group_concat(a.title SEPARATOR ',') AS community_tags > FROM tags a JOIN tag_dets b ON a.id = b.tag_id > WHERE b.doc_id = ${document.id}" > > > > > > > > The value for field community_tags comes correctly as an array of strings. > However the value of field community_tag_ids is not proper > > > [B@390c0a18 > > > I tried chaining NumberFormatTransformer with formatStyle="number" but that > throws DataImportHandlerException: Failed to apply NumberFormat on column. > Could it be due to NULL values from database or because the value is not > proper? How do we handle NULL in this case? > > > *Pranav Prakash* > > "temet nosce" > >
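For completeness, the GROUP_CONCAT-free route James describes leans on DIH's built-in handling of one-to-many child entities: a child entity that returns several rows fills the mapped fields as multi-valued automatically. A hedged sketch reusing the tables from the query above:

<entity name="tags" dataSource="app"
        query="SELECT a.id AS community_tag_ids, a.title AS community_tags
               FROM tags a JOIN tag_dets b ON a.id = b.tag_id
               WHERE b.doc_id = ${document.id}"/>

As James notes, this sends more rows from the database to Solr than the GROUP_CONCAT version.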
Re: can solr admin tab statistics be customized... how can this be achived.
You can check out the Solr source code, do the patch work in the admin JSP files, and use it as your custom Solr instance. *Pranav Prakash* "temet nosce" On Fri, Jul 20, 2012 at 12:14 PM, yayati wrote: > > > Hi, > > I want to compute my own stats in addition to solr default stats. How can i > enhance statistics in solr? How can this be achieved? Solr computes > stats cumulatively; is there any way to get per-instant stats? > > Thanks... waiting for good replies.. > > > > > > -- > View this message in context: > http://lucene.472066.n3.nabble.com/can-solr-admin-tab-statistics-be-customized-how-can-this-be-achived-tp3996128.html > Sent from the Solr - User mailing list archive at Nabble.com. >
Re: DIH XML configs for multi environment
Jerry, Glad it worked for you. I will also do the same thing. This seems easier for me, as I have a solr start shell script, which sets the JVM params for master/slave, Xmx and so on according to the environment. Setting a jdbc connect url in the start script is convenient than changing the configs. *Pranav Prakash* "temet nosce" On Tue, Jul 24, 2012 at 1:17 AM, jerry.min...@gmail.com < jerry.min...@gmail.com> wrote: > Pranav, > > Sorry, I should have checked my response a little better as I > misspelled your name and, mentioned that I tried what Marcus suggested > then described something totally different. > I didn't try using the property mechanism as Marcus suggested as I am > not using a solr.xml file. > > What you mentioned in your post on Wed, Jul 18, 2012 at 3:46 PM will > work as I have done it successfully. > That is I created a JVM variable to contain the connect URLs for each > of my environments and one of those to set the URL parameter of the > dataSource entity > in my data config files. > > Best, > Jerry > > > On Mon, Jul 23, 2012 at 3:34 PM, jerry.min...@gmail.com > wrote: > > Pranay, > > > > I tried two similar approaches to resolve this in my system which is > > Solr 4.0 running in Tomcat 7.x on Ubuntu 9.10. > > > > My preference was to use an alias for each of my database environments > > as a JVM parameter because it makes more sense to me that the database > > connection be stored in the data config file rather than in a Tomcat > > configuration or startup file. > > Because of preference, I first attempted the following: > > 1. Set a JVM environment variable 'solr.dbEnv' to the represent the > > database environment that should be accessed. For example, in my dev > > environment, the JVM environment variable was set as -Dsolr.dbEnv=dev. > > 2. In the data config file I had 3 data sources. Each data source had > > a name that matched one of the database environment aliases. > > 3. In the entity of my data config file "dataSource" parameter was set > > as follows dataSource=${solr.dbEnv}. > > > > Unfortunately, this fails to work. Setting "dataSource" parameter in > > the data config file does not override the default. The default > > appears to be the first data source defined in the data config file. > > > > Second, I tried what Marcus suggested. > > > > That is, I created a JVM variable to contain the connect URLs for each > > of my environments. > > I use that variable to set the URL parameter of the dataSource entity > > in the data config file. > > > > This works well. > > > > > > Best, > > Jerry Mindek > > > > Unfortunately, the first option did not work. It seemed as though > > On Wed, Jul 18, 2012 at 3:46 PM, Pranav Prakash > wrote: > >> That approach would work for core dependent parameters. In my case, the > >> params are environment dependent. I think a simpler approach would be to > >> pass the url param as JVM options, and these XMLs get it from there. > >> > >> I haven't tried it yet. > >> > >> *Pranav Prakash* > >> > >> "temet nosce" > >> > >> > >> > >> On Tue, Jul 17, 2012 at 5:09 PM, Markus Klose wrote: > >> > >>> Hi > >>> > >>> There is one more approach using the property mechanism. > >>> > >>> You could specify the datasource like this: > >>> > >>> > >>> And you can specifiy the properties in the solr.xml in your core > >>> configuration like this: > >>> > >>> > >>> > >>> > >>> > >>> > >>> > >>> Viele Grüße aus Augsburg > >>> > >>> Markus Klose > >>> SHI Elektronische Medien GmbH > >>> > >>> > >>> Adresse: Curt-Frenzel-Str. 
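Since Jerry's actual config isn't shown in the thread, here is a hedged sketch of one way to wire a JVM property through to DIH (the property and parameter names are invented): start Solr with -Ddb.url=jdbc:mysql://prod-host:3306/app_prod, expose it to the import handler as a default request parameter in solrconfig.xml, and reference that parameter in data-config.xml:

<!-- solrconfig.xml: system property substitution works here -->
<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">data-config.xml</str>
    <str name="dburl">${db.url:jdbc:mysql://localhost:3306/app_devel}</str>
  </lst>
</requestHandler>

<!-- data-config.xml: handler defaults are visible to DIH as request parameters -->
<dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver"
            url="${dataimporter.request.dburl}" user="..." password="..."/>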
12, 86167 Augsburg > >>> > >>> Tel.: 0821 7482633 26 > >>> Tel.: 0821 7482633 0 (Zentrale) > >>> Mobil:0176 56516869 > >>> Fax: 0821 7482633 29 > >>> > >>> E-Mail: markus.kl...@shi-gmbh.com > >>> Internet: http://www.shi-gmbh.com > >>> > >>> Registergericht Augsburg HRB 17382 > >>> Geschäftsführer: Peter Spiske > >>> USt.-ID: DE 182167335 > >>> > >>> > >>> > >>> > >>> > >>> -Ursprüngliche Nachricht- > >>> Von: Rahul War
Exact match on few fields, fuzzy on others
Hi Folks, I am using Solr 3.4 and my document schema has the attributes title, transcript, and author_name. Presently, I am using DisMax to search for a user query across transcript. I would also like to do an exact search on author_name, so that for a query "Albert Einstein" I get all the documents which contain Albert or Einstein in transcript, and also those documents whose author_name is exactly 'Albert Einstein'. Can we do this with the dismax query parser? The schema for both fields is below: -- *Pranav Prakash* "temet nosce"
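The schema snippet was stripped from the mail; a common pattern for this kind of requirement is to keep the analyzed field for dismax and add an untokenized copy for exact matching. A sketch with hypothetical field and type names:

<field name="author_name" type="text" indexed="true" stored="true"/>
<field name="author_name_exact" type="string" indexed="true" stored="false"/>
<copyField source="author_name" dest="author_name_exact"/>

The dismax qf would then stay on transcript (and author_name if desired), while the exact requirement is expressed as a separate clause on author_name_exact, for example as a boost query or by OR-ing in author_name_exact:"Albert Einstein" from the client, depending on whether those documents should merely rank higher or must be included.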
Re: Importing of unix date format from mysql database and dates of format 'Thu, 06 Sep 2012 22:32:33 +0000' in Solr 4.0
I am experiencing a similar problem related to encoding. In my case, characters like " (double quote) are also garbled. I believe this is because the encoding in my MySQL table is latin1, while the JDBC connection is being specified as UTF-8. Is there a way to specify the latin1 charset in JDBC? That would probably resolve this. *Pranav Prakash* "temet nosce" On Sat, Sep 8, 2012 at 3:16 AM, Shawn Heisey wrote: > On 9/6/2012 6:54 PM, kiran chitturi wrote: > >> The error i am getting is 'org.apache.solr.common.SolrException: >> Invalid >> Date String: '1345743552'. >> >> I think it was being saved as a string in DB, so i will use the >> DateFormatTransformer. >> > > To go along with all the other replies that you have gotten: I import > from MySQL with a unix format date field. It's a bigint, not a string, but > a quick test on MySQL 5.1 shows that the function works with strings too. > This is how my SELECT handles that field - I have MySQL convert it before > it gets to Solr: > > from_unixtime(`d`.`post_date`) AS `pd` > > When it comes to the character set issues, this is how I have defined the > driver in the dataimport config. The character set in the database is utf8. > >driver="com.mysql.jdbc.Driver" > encoding="UTF-8" > url="jdbc:mysql://${dataimporter.request.dbHost}:3306/${dataimporter.request.dbSchema}?zeroDateTimeBehavior=convertToNull" > batchSize="-1" > user="" > password=""/> > > Thanks, > Shawn > >
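If the bytes in the table really are latin1, Connector/J can be told to read them as such. A hedged sketch of the dataSource line (ISO8859_1 is the Java encoding name corresponding to MySQL's latin1; whether it is the right target depends on what was actually written into the table):

<dataSource type="JdbcDataSource"
            driver="com.mysql.jdbc.Driver"
            url="jdbc:mysql://localhost/dbname"
            useUnicode="true"
            characterEncoding="ISO8859_1"
            user="..." password="..."/>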
Re: Importing of unix date format from mysql database and dates of format 'Thu, 06 Sep 2012 22:32:33 +0000' in Solr 4.0
The character is actually - “ and not " *Pranav Prakash* "temet nosce" On Mon, Sep 10, 2012 at 2:45 PM, Pranav Prakash wrote: > I am experiencing similar problem related to encoding. In my case, the > char like " (double quote) > is also garbaled. > > I believe this is because the encoding in my MySQL table is latin1 and in > the JDBC it is being specified as UTF-8. Is there a way to specify latin1 > charset in JDBC? probably that would resolve this. > > > *Pranav Prakash* > > "temet nosce" > > > > > On Sat, Sep 8, 2012 at 3:16 AM, Shawn Heisey wrote: > >> On 9/6/2012 6:54 PM, kiran chitturi wrote: >> >>> The error i am getting is 'org.apache.solr.common.**SolrException: >>> Invalid >>> Date String: '1345743552'. >>> >>> I think it was being saved as a string in DB, so i will use the >>> DateFormatTransformer. >>> >> >> To go along with all the other replies that you have gotten: I import >> from MySQL with a unix format date field. It's a bigint, not a string, but >> a quick test on MySQL 5.1 shows that the function works with strings too. >> This is how my SELECT handles that field - I have MySQL convert it before >> it gets to Solr: >> >> from_unixtime(`d`.`post_date`) AS `pd` >> >> When it comes to the character set issues, this is how I have defined the >> driver in the dataimport config. The character set in the database is utf8. >> >> > driver="com.mysql.jdbc.Driver" >> encoding="UTF-8" >> url="jdbc:mysql://${**dataimporter.request.dbHost}:** >> 3306/${dataimporter.request.**dbSchema}?**zeroDateTimeBehavior=** >> convertToNull" >> batchSize="-1" >> user="" >> password=""/> >> >> Thanks, >> Shawn >> >> >
Re: DIH import from MySQL results in garbage text for special chars
I am seeing the garbage text in browser, Luke Index Toolbox and everywhere it is the same. My servlet container is Jetty which is the out-of-box one. Many other special chars are getting indexed and stored properly, only few characters causes pain. *Pranav Prakash* "temet nosce" On Fri, Sep 14, 2012 at 6:36 PM, Erick Erickson wrote: > Is your _browser_ set to handle the appropriate character set? Or whatever > you're using to inspect your data? How about your servlet container? > > > > Best > Erick > > On Mon, Sep 10, 2012 at 7:47 AM, Pranav Prakash wrote: > > Hi Folks, > > > > I am attempting to import documents to Solr from MySQL using DIH. One of > > the field contains the text - “Future of Mobile Value Added Services > (VAS) > > in Australia” .Notice the character “ and ”. > > > > When I am importing, it gets stored as - “Future of Mobile Value Added > > Services (VAS) in Australiaâ€�. > > > > The datasource config clearly mentions use of UTF-8 as follows: > > > >> driver="com.mysql.jdbc.Driver" > > url="jdbc:mysql://localhost/ohapp_devel" > > user="username" > > useUnicode="true" > > characterEncoding="UTF-8" > > password="password" > > zeroDateTimeBehavior="convertToNull" > > name="app" /> > > > > > > A plain SQL Select statement on the MySQL Console gives appropriate > text. I > > even tried using following scriptTransformer to get rid of this char, but > > it was of no particular use in my case. > > > > function gsub(source, pattern, replacement) { > > var match, result; > > if (!((pattern != null) && (replacement != null))) { > > return source; > > } > > result = ''; > > while (source.length > 0) { > > if ((match = source.match(pattern))) { > > result += source.slice(0, match.index); > > result += replacement; > > source = source.slice(match.index + match[0].length); > > } else { > > result += source; > > source = ''; > > } > > } > > return result; > > } > > > > function fixQuotes(c){ > > c = gsub(c, /\342\200(?:\234|\235)/,'"'); > > c = gsub(c, /\342\200(?:\230|\231)/,"'"); > > c = gsub(c, /\342\200\223/,"-"); > > c = gsub(c, /\342\200\246/,"..."); > > c = gsub(c, /\303\242\342\202\254\342\204\242/,"'"); > > c = gsub(c, /\303\242\342\202\254\302\235/,'"'); > > c = gsub(c, /\303\242\342\202\254\305\223/,'"'); > > c = gsub(c, /\303\242\342\202\254"/,'-'); > > c = gsub(c, /\342\202\254\313\234/,'"'); > > c = gsub(c, /“/, '"'); > > return c; > > } > > > > function cleanFields(row){ > > var fieldsToClean = ['title', 'description']; > > for(i =0; i< fieldsToClean.length; i++){ > > var old_text = String(row.get(fieldsToClean[i])); > > row.put(fieldsToClean[i], fixQuotes(old_text) ); > > } > > return row; > > } > > > > My understanding goes that this must be a very common problem. It also > > occurs with human names which have these chars. What is an appropriate > way > > to get the appropriate text indexed and searchable? The fieldtype where > > this is stored goes as follows > > > > > > > > > > > > > > > > > >> protected="protwords.txt"/> > > > synonyms="synonyms.txt" > > ignoreCase="true" > > expand="true" /> > > > words="stopwords_en.txt" > > ignoreCase="true" /> > > > words="stopwords_en.txt" > > ignoreCase="true" /> > > > generateWordParts="1" > > generateNumberParts="1" > > catenateWords="1" > > catenateNumbers="1" > > catenateAll="0" > > preserveOriginal="1" /> > > > > > > > > > > *Pranav Prakash* > > > > "temet nosce" >
Re: DIH import from MySQL results in garbage text for special chars
I looked at the HEX codes of the texts. The hex code in MySQL is different from that which is stored in the index. The hex code in index is longer than the hex code in MySQL, this leads me to the fact that somewhere in between smething is messing up, *Pranav Prakash* "temet nosce" On Fri, Sep 21, 2012 at 11:19 AM, Pranav Prakash wrote: > I am seeing the garbage text in browser, Luke Index Toolbox and everywhere > it is the same. My servlet container is Jetty which is the out-of-box one. > Many other special chars are getting indexed and stored properly, only few > characters causes pain. > > *Pranav Prakash* > > "temet nosce" > > > > > On Fri, Sep 14, 2012 at 6:36 PM, Erick Erickson > wrote: > >> Is your _browser_ set to handle the appropriate character set? Or whatever >> you're using to inspect your data? How about your servlet container? >> >> >> >> Best >> Erick >> >> On Mon, Sep 10, 2012 at 7:47 AM, Pranav Prakash wrote: >> > Hi Folks, >> > >> > I am attempting to import documents to Solr from MySQL using DIH. One of >> > the field contains the text - “Future of Mobile Value Added Services >> (VAS) >> > in Australia” .Notice the character “ and ”. >> > >> > When I am importing, it gets stored as - “Future of Mobile Value Added >> > Services (VAS) in Australiaâ€�. >> > >> > The datasource config clearly mentions use of UTF-8 as follows: >> > >> > > > driver="com.mysql.jdbc.Driver" >> > url="jdbc:mysql://localhost/ohapp_devel" >> > user="username" >> > useUnicode="true" >> > characterEncoding="UTF-8" >> > password="password" >> > zeroDateTimeBehavior="convertToNull" >> > name="app" /> >> > >> > >> > A plain SQL Select statement on the MySQL Console gives appropriate >> text. I >> > even tried using following scriptTransformer to get rid of this char, >> but >> > it was of no particular use in my case. >> > >> > function gsub(source, pattern, replacement) { >> > var match, result; >> > if (!((pattern != null) && (replacement != null))) { >> > return source; >> > } >> > result = ''; >> > while (source.length > 0) { >> > if ((match = source.match(pattern))) { >> > result += source.slice(0, match.index); >> > result += replacement; >> > source = source.slice(match.index + match[0].length); >> > } else { >> > result += source; >> > source = ''; >> > } >> > } >> > return result; >> > } >> > >> > function fixQuotes(c){ >> > c = gsub(c, /\342\200(?:\234|\235)/,'"'); >> > c = gsub(c, /\342\200(?:\230|\231)/,"'"); >> > c = gsub(c, /\342\200\223/,"-"); >> > c = gsub(c, /\342\200\246/,"..."); >> > c = gsub(c, /\303\242\342\202\254\342\204\242/,"'"); >> > c = gsub(c, /\303\242\342\202\254\302\235/,'"'); >> > c = gsub(c, /\303\242\342\202\254\305\223/,'"'); >> > c = gsub(c, /\303\242\342\202\254"/,'-'); >> > c = gsub(c, /\342\202\254\313\234/,'"'); >> > c = gsub(c, /“/, '"'); >> > return c; >> > } >> > >> > function cleanFields(row){ >> > var fieldsToClean = ['title', 'description']; >> > for(i =0; i< fieldsToClean.length; i++){ >> > var old_text = String(row.get(fieldsToClean[i])); >> > row.put(fieldsToClean[i], fixQuotes(old_text) ); >> > } >> > return row; >> > } >> > >> > My understanding goes that this must be a very common problem. It also >> > occurs with human names which have these chars. What is an appropriate >> way >> > to get the appropriate text indexed and searchable? 
The fieldtype where >> > this is stored goes as follows >> > >> > >> > >> > >> > >> > >> > >> > >> > > language="English" >> > protected="protwords.txt"/> >> > > > synonyms="synonyms.txt" >> > ignoreCase="true" >> > expand="true" /> >> > > > words="stopwords_en.txt" >> > ignoreCase="true" /> >> > > > words="stopwords_en.txt" >> > ignoreCase="true" /> >> > > > generateWordParts="1" >> > generateNumberParts="1" >> > catenateWords="1" >> > catenateNumbers="1" >> > catenateAll="0" >> > preserveOriginal="1" /> >> > >> > >> > >> > >> > *Pranav Prakash* >> > >> > "temet nosce" >> > >
Re: DIH import from MySQL results in garbage text for special chars
The output of SHOW VARIABLES goes like this. I have verified the hex values and they are different in MySQL and Solr.

+--------------------------+----------------------------+
| Variable_name            | Value                      |
+--------------------------+----------------------------+
| character_set_client     | latin1                     |
| character_set_connection | latin1                     |
| character_set_database   | latin1                     |
| character_set_filesystem | binary                     |
| character_set_results    | latin1                     |
| character_set_server     | latin1                     |
| character_set_system     | utf8                       |
| character_sets_dir       | /usr/share/mysql/charsets/ |
+--------------------------+----------------------------+

*Pranav Prakash* "temet nosce" On Wed, Sep 26, 2012 at 6:45 PM, Gora Mohanty wrote: > On 21 September 2012 11:19, Pranav Prakash wrote: > > > I am seeing the garbage text in browser, Luke Index Toolbox and > everywhere > > it is the same. My servlet container is Jetty which is the out-of-box > one. > > Many other special chars are getting indexed and stored properly, only > few > > characters causes pain. > > > > Could you double-check the encoding on the mysql side? > What is the output of > > mysql> SHOW VARIABLES LIKE 'character\_set\_%'; > > Regards, > Gora >