LSH in Solr/Lucene

2014-01-20 Thread Shashi Kant
Hi folks, have any of you successfully implemented LSH (MinHash) in Solr? If so, could you share some details of how you went about it? I know LSH is available in Mahout, but I was hoping someone has a Solr or Lucene implementation. Thanks
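
For readers unfamiliar with the technique asked about here, the following is a minimal sketch of MinHash with LSH banding in plain Python. It is not a Solr or Lucene implementation. The function names, the word-shingle size, the number of hash functions and the number of bands are all assumptions chosen for illustration. The band keys it produces are the sort of tokens one could index as terms in a multivalued field, so that documents sharing a band key become candidate near-duplicates.

```python
import hashlib


def shingles(text, k=3):
    """Return the set of k-word shingles of a text (k is an illustrative choice)."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}


def _seeded_hash(seed, item):
    """64-bit hash of an item under a given seed; each seed acts as one hash function."""
    digest = hashlib.sha1(f"{seed}:{item}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(items, num_hashes=64):
    """MinHash signature: the minimum hash value of the set under each seeded hash."""
    return [min(_seeded_hash(seed, item) for item in items) for seed in range(num_hashes)]


def estimate_jaccard(sig_a, sig_b):
    """The fraction of matching signature positions estimates Jaccard similarity."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


def lsh_band_keys(signature, bands=16):
    """Split the signature into bands and hash each band into one bucket key."""
    rows = len(signature) // bands
    keys = []
    for band in range(bands):
        chunk = signature[band * rows:(band + 1) * rows]
        digest = hashlib.sha1(repr(chunk).encode("utf-8")).hexdigest()[:12]
        keys.append(f"{band}_{digest}")
    return keys
```

More bands with fewer rows each make it likelier that loosely similar documents share a key; fewer bands with more rows make the match stricter.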

Searching Numeric Data

2014-01-11 Thread Shashi Kant
Hi all, I have a use case where I need to search a set of numeric values using a query set. My business case is: 1. I have various rock samples from various locations {R1...Rn} with multiple measurements like Porosity [255] - an array of values, Conductivity [1028] - also an array of number

Re: Solr Patent

2013-09-14 Thread Shashi Kant
You can ask on this site http://patents.stackexchange.com/ On Sat, Sep 14, 2013 at 10:03 AM, Michael Sokolov wrote: > On 9/13/2013 9:14 PM, Zaizen Ushio wrote: >> >> Hello >> I have a question about patent. I believe Apache license is protecting >> Solr developers from patent issue in Solr com

Re: Document Similarity Algorithm at Solr/Lucene

2013-07-23 Thread Shashi Kant
Here is a paper that I found useful: http://theory.stanford.edu/~aiken/publications/papers/sigmod03.pdf On Tue, Jul 23, 2013 at 10:42 AM, Furkan KAMACI wrote: > Thanks for your comments. > > 2013/7/23 Tommaso Teofili > >> if you need a specialized algorithm for detecting blogposts plagiarism /
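
The linked paper describes the winnowing approach to document fingerprinting. A rough sketch of the idea follows, under stated assumptions: the k-gram length, the window size, the use of CRC32 as the hash and all function names are illustrative choices, not taken from the thread. Each window of consecutive k-gram hashes contributes its minimum (the rightmost one on ties) as a fingerprint, so two texts sharing a long enough passage share at least one fingerprint.

```python
import zlib


def kgram_hashes(text, k=5):
    """Hash every k-character substring of the text after stripping non-alphanumerics."""
    cleaned = "".join(ch.lower() for ch in text if ch.isalnum())
    return [zlib.crc32(cleaned[i:i + k].encode("utf-8"))
            for i in range(len(cleaned) - k + 1)]


def winnow(hashes, window=4):
    """Select the minimum hash of each window, preferring the rightmost on ties.

    Returns a set of (hash, position) pairs, so a minimum that stays selected
    across several overlapping windows is recorded only once.
    """
    fingerprints = set()
    for start in range(len(hashes) - window + 1):
        chunk = hashes[start:start + window]
        smallest = min(chunk)
        offset = window - 1 - chunk[::-1].index(smallest)  # rightmost minimum
        fingerprints.add((smallest, start + offset))
    return fingerprints


def fingerprint_values(text, k=5, window=4):
    """Position-free fingerprint set, convenient for comparing two documents."""
    return {value for value, _ in winnow(kgram_hashes(text, k), window)}
```

Comparing `fingerprint_values` of two posts and counting the overlap gives a cheap signal of copied passages.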

Re: Search for misspelled words in corpus

2013-06-08 Thread Shashi Kant
N-grams might help, followed by an edit distance metric such as Jaro-Winkler or Smith-Waterman-Gotoh to filter further. On Sun, Jun 9, 2013 at 1:59 AM, Otis Gospodnetic wrote: > Interesting problem. The first thing that comes to mind is to do > "word expansion" during indexing. Kind of lik
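
The two-stage idea in this reply (a cheap n-gram overlap test, then an edit-distance filter) can be sketched as below. Note the substitution: plain Levenshtein distance stands in for the Jaro-Winkler and Smith-Waterman-Gotoh metrics named in the message, purely to keep the example short. The function names and both thresholds are assumptions.

```python
def char_ngrams(word, n=2):
    """Character n-grams of a word, padded with '$' so word edges count too."""
    padded = f"${word.lower()}$"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,                # deletion
                               current[j - 1] + 1,             # insertion
                               previous[j - 1] + (ca != cb)))  # substitution
        previous = current
    return previous[-1]


def find_variants(target, vocabulary, min_overlap=0.3, max_distance=2):
    """Keep words that pass the n-gram overlap test and then the edit-distance test."""
    target_grams = char_ngrams(target)
    matches = []
    for word in vocabulary:
        grams = char_ngrams(word)
        overlap = len(target_grams & grams) / len(target_grams | grams)
        if overlap >= min_overlap and levenshtein(target.lower(), word.lower()) <= max_distance:
            matches.append(word)
    return matches
```

The n-gram test is what an index can answer quickly; the edit-distance pass only runs on the survivors.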

Re: How apache solr stores indexes

2013-05-28 Thread Shashi Kant
Better still start here: http://en.wikipedia.org/wiki/Inverted_index http://nlp.stanford.edu/IR-book/html/htmledition/a-first-take-at-building-an-inverted-index-1.html And there are several books on search engines and related algorithms. On Tue, May 28, 2013 at 10:41 PM, Alexandre Rafalovitch

Re: Could I use Solr to index multiple applications?

2012-07-17 Thread Shashi Kant
to put each core name into > Solr config XML, if we add another core and change XML, do we > need to restart Solr? > > Best regards, Lisheng > > -Original Message- > From: shashi@gmail.com [mailto:shashi....@gmail.com]On Behalf Of > Shashi Kant > Sent: Tuesday,

Re: Could I use Solr to index multiple applications?

2012-07-17 Thread Shashi Kant
Look up multicore solr. Another choice could be ElasticSearch - which is more straightforward in managing multiple indexes IMO. On Tue, Jul 17, 2012 at 7:53 PM, Zhang, Lisheng wrote: > Hi, > > We have an application where we index data into many different directories > (each directory > is cor

Re: Does Solr fit my needs?

2012-04-27 Thread Shashi Kant
We have used both Solr and graph databases for our XML file indexing. Both are equivalent in terms of performance, but a graph db (such as Neo4j) offers a lot more flexibility in joining across the nodes and traversing. If your data is strictly hierarchical Solr might do it, alternately suggest loo

Re: How to Sort By a PageRank-Like Complicated Strategy?

2012-01-23 Thread Shashi Kant
ot a static one. It must update > on the fly. As I know, Lucene index is not suitable to be updated too > frequently. If so, how to deal with that? > > Best regards, > Bing > > > On Sun, Jan 22, 2012 at 12:43 PM, Shashi Kant wrote: >> >> Lucene has a mecha

Re: How to Sort By a PageRank-Like Complicated Strategy?

2012-01-21 Thread Shashi Kant
Lucene has a mechanism to "boost" up/down documents using your custom ranking algorithm. So if you come up with something like Pagerank you might do something like doc.SetBoost(myboost), before writing to index. On Sat, Jan 21, 2012 at 5:07 PM, Bing Li wrote: > Hi, Kai, > > Thanks so much for y

Re: Solr, SQL Server's LIKE

2011-12-29 Thread Shashi Kant
For a simple, hackish (albeit inefficient) approach, look up wildcard searches, e.g. foo*, *bar On Thu, Dec 29, 2011 at 12:38 PM, Devon Baumgarten wrote: > I have been tinkering with Solr for a few weeks, and I am convinced that it > could be very helpful in many of my upcoming projects. I am t

Re: How to run the solr dedup for the document which match 80% or match almost.

2011-12-27 Thread Shashi Kant
You can also look at cosine similarity (or related metrics) to measure document similarity. On Tue, Dec 27, 2011 at 6:51 AM, vibhoreng04 wrote: > Hi iorixxx, > > Thanks for the quick update.I hope I can take it from here ! > > > Regards, > > Vibhor > > -- > View this message in context: > http:/
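
As an illustration of the cosine similarity mentioned in this reply, here is a small sketch over raw term counts. The whitespace tokenizer, the function names and the 0.8 threshold (echoing the "80%" in the subject line) are assumptions; a real setup would more likely use analyzed terms and TF-IDF weights.

```python
import math
from collections import Counter


def term_vector(text):
    """Bag-of-words term counts using a naive lowercase whitespace tokenizer."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two term-count vectors (0.0 if either is empty)."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_near_duplicate(text_a, text_b, threshold=0.8):
    """Treat two texts as near-duplicates when their cosine similarity reaches the threshold."""
    return cosine_similarity(term_vector(text_a), term_vector(text_b)) >= threshold
```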

Re: Score

2011-08-15 Thread Shashi Kant
https://wiki.apache.org/lucene-java/ScoresAsPercentages On Mon, Aug 15, 2011 at 8:13 PM, Bill Bell wrote: > How do I change the score to scale it between 0 and 100 irregardless of the > score? > > q.alt=*:*&bq=lang:Spanish&defType=dismax > > Bill Bell > Sent from mobile > >

Re: Multiple Cores on different machines?

2011-08-09 Thread Shashi Kant
"Betamax VCR"? really ? :-) On Tue, Aug 9, 2011 at 3:38 PM, Chris Hostetter wrote: > > : A quick question - is it possible to have 2 cores in Solr on two > different > : machines? > > your question is a little vague ... like asking "is it possible to have to > have two betamax VCRs in two diffe

Re: Solr can not index "F**K"!

2011-07-31 Thread Shashi Kant
Check your Stop words list On Jul 31, 2011 6:25 PM, "François Schiettecatte" wrote: > That seems a little far fetched, have you checked your analysis? > > François > > On Jul 31, 2011, at 4:58 PM, randohi wrote: > >> One of our clients (a hot girl!) brought this to our attention: >> In this docume

Re: searching a subset of SOLR index

2011-07-05 Thread Shashi Kant
Range query On Tue, Jul 5, 2011 at 4:37 AM, Jame Vaalet wrote: > Hi, > Let say, I have got 10^10 documents in an index with unique id being document > id which is assigned to each of those from 1 to 10^10 . > Now I want to search a particular query string in a subset of these documents > say (

Re: Solr vs ElasticSearch

2011-05-31 Thread Shashi Kant
Here is a very interesting comparison http://engineering.socialcast.com/2011/05/realtime-search-solr-vs-elasticsearch/ > -Original Message- > From: Mark > Sent: May-31-11 10:33 PM > To: solr-user@lucene.apache.org > Subject: Solr vs ElasticSearch > > I've been hearing more and more about

Re: I need an available solr lucene consultant

2011-05-17 Thread Shashi Kant
You might be better off looking for freelancers on sites such as odesk.com, guru.com, rentacoder.com, elance.com & many more On Tue, May 17, 2011 at 4:09 PM, Markus Jelsma wrote: > Check this out: > http://wiki.apache.org/solr/Support > >> Hi, >> >> I am looking for an experienced and skille

Re: Looking for help with Solr implementation

2010-11-12 Thread Shashi Kant
Have you tried posting on odesk.com? I have had decent success finding Solr/Lucene resources there. On Thu, Nov 11, 2010 at 7:52 PM, AC wrote: > Hi, > > > Not sure if this is the correct place to post but I'm looking for someone > to > help finish a Solr install on our LAMP based website. This

Re: Would it be nuts to store a bunch of large attachments (images, videos) in stored but-not-indexed fields

2010-10-29 Thread Shashi Kant
On Fri, Oct 29, 2010 at 6:00 PM, Ron Mayer wrote: > I have some documents with a bunch of attachments (images, thumbnails > for them, audio clips, word docs, etc); and am currently dealing with > them by just putting a path on a filesystem to them in solr; and then > jumping through hoops of keep

Re: Color search for images

2010-09-17 Thread Shashi Kant
> > What I am envisioning (at least to start) is have all this add two fields in > the index.  One would be for color information for the color similarity > search.  The other would be a simple multivalued text field that we put > keywords into based on what OpenCV can detect about the image.  If i

Re: Get all results from a solr query

2010-09-16 Thread Shashi Kant
return all the rows in the > results? > > -- Chris > > > > On Thu, Sep 16, 2010 at 4:43 PM, Shashi Kant wrote: >> q=*:* >> >> On Thu, Sep 16, 2010 at 4:39 PM, Christopher Gross wrote: >>> I have some queries that I'm running against a solr instan

Re: Get all results from a solr query

2010-09-16 Thread Shashi Kant
q=*:* On Thu, Sep 16, 2010 at 4:39 PM, Christopher Gross wrote: > I have some queries that I'm running against a solr instance (older, > 1.2 I believe), and I would like to get *all* the results back (and > not have to put an absurdly large number as a part of the rows > parameter). > > Is there
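
For context on the `q=*:*` answer: it matches every document, and the `start` and `rows` parameters then control which slice of the matches comes back. The sketch below only builds request URLs for paging through a match-all query; the base URL, the page size and the function names are hypothetical, and no request is actually sent.

```python
from urllib.parse import urlencode


def select_url(base_url, start, rows):
    """Build a select URL for the match-all query with an explicit page window."""
    params = {"q": "*:*", "start": start, "rows": rows}
    return f"{base_url}/select?{urlencode(params)}"


def page_urls(base_url, total_hits, rows=100):
    """One URL per page, stepping `start` by `rows` until total_hits is covered."""
    return [select_url(base_url, start, rows) for start in range(0, total_hits, rows)]
```

A first request with a small `rows` value returns the total hit count, which can then be fed to `page_urls`.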

Re: Color search for images

2010-09-16 Thread Shashi Kant
> Lire looks promising, but how hard is it to integrate the content-based > search into Solr as opposed to Lucene?  I myself am not a Java developer.  I > have access to people who are, but their time is scarce. > Lire is a nascent effort and based on a cursory overview a while back, IMHO was an

Re: Color search for images

2010-09-16 Thread Shashi Kant
On Thu, Sep 16, 2010 at 3:21 AM, Lance Norskog wrote: > Yes, notice the flowers are all a medium-dark crimson red. There are a bunch > of these image-indexing & search technologies, but there is no (to my > knowledge) "finished technology"- it's very much an area of research. If you > want to sear

Re: Color search for images

2010-09-15 Thread Shashi Kant
> I'm sure there's some post doctoral types who could get a graphic shape > analyzer, color analyzer, to at least say it's a flower. > > However, even Google would have to build new datacenters to have the > horsepower to do that kind of graphic processing. > Not necessarily true. Like.com - whi

Re: Color search for images

2010-09-15 Thread Shashi Kant
> > On a related note, I'm curious if anyone has run across a good set of > algorithms (or hopefully a library) for doing naive image > classification. I'm looking for something that can classify images > into something similar to the broad categories that Google image > search has (Face, Photo, Cl

Re: Color search for images

2010-09-15 Thread Shashi Kant
Shawn, I have done some research into this; machine vision, especially on a large scale, is a hard problem, not to be entered into lightly. I would recommend starting with OpenCV - a comprehensive toolkit for extracting various features such as color, edges, etc. from images. Also there is a project LIR

Re: Indexing all versions of Microsoft Office Documents

2010-04-27 Thread Shashi Kant
If you are on Windows try the Microsoft IFilter API - it supports current Office versions. http://www.microsoft.com/downloads/details.aspx?FamilyId=60C92A37-719C-4077-B5C6-CAC34F4227CC&displaylang=en On Tue, Apr 27, 2010 at 6:08 AM, Roland Villemoes wrote: > Hi All, > > Does anyone have a runn

Re: LucidWorks Solr

2010-04-21 Thread Shashi Kant
Why do these approaches have to be mutually exclusive? Do a dictionary lookup; if no satisfactory match is found, use an algorithmic stemmer. You would probably save a few CPU cycles by stemming algorithmically only if necessary. On Wed, Apr 21, 2010 at 1:31 PM, Robert Muir wrote: > sy to look at the "faults" o
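
The dictionary-first, stemmer-second idea suggested here can be sketched in a few lines. Everything named below is hypothetical: the tiny lemma dictionary is made-up sample data, and the suffix stripper is a toy stand-in for a real algorithmic stemmer such as Porter's.

```python
# Hypothetical sample dictionary; a real one would be loaded from a lexicon.
LEMMA_DICTIONARY = {"mice": "mouse", "went": "go", "better": "good"}


def suffix_stem(word):
    """Toy algorithmic stemmer: strip one common suffix, keeping at least 3 characters."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word


def normalize(word):
    """Try the dictionary first; fall back to the algorithmic stemmer on a miss."""
    word = word.lower()
    if word in LEMMA_DICTIONARY:
        return LEMMA_DICTIONARY[word]
    return suffix_stem(word)
```

The dictionary handles irregular forms that no suffix rule would catch, while the stemmer covers words the dictionary has never seen.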

Re: Query time only Ranges

2010-03-31 Thread Shashi Kant
In that case, you could just calculate an offset from 00:00:00 in seconds (ignore the date) Pretty simple. On Wed, Mar 31, 2010 at 4:57 PM, abhatna...@vantage.com wrote: > > Hi Sashi, > Could you elaborate point no .1 in the light of case where in a field should > have just time? > > > Ankit > >

Re: Query time only Ranges

2010-03-31 Thread Shashi Kant
I suggest approaching it thus: 1. Create a datetime offset from a baseline date (say Jan 1, 1900, 00:00:00) and store the date diff in secs from that date-time. 2. Use numeric range query. I find this approach works faster and would also give you the granularity you want. On Wed, Mar 31, 2010
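
The offset approach described in these two replies can be sketched as follows: store either seconds since a fixed baseline or, for time-only fields, seconds since midnight, then query with a numeric range. The baseline matches the Jan 1, 1900 example in the message; the field name used in the test and all function names are assumptions.

```python
from datetime import datetime, time

# Baseline from the message above: Jan 1, 1900, 00:00:00.
BASELINE = datetime(1900, 1, 1)


def seconds_since_baseline(moment):
    """Whole seconds between the baseline and a datetime (value to index)."""
    return int((moment - BASELINE).total_seconds())


def seconds_since_midnight(t):
    """Offset of a time of day from 00:00:00 in seconds, ignoring the date."""
    return t.hour * 3600 + t.minute * 60 + t.second


def range_query(field, low, high):
    """Inclusive numeric range clause in the standard query syntax."""
    return f"{field}:[{low} TO {high}]"
```

For example, documents stamped between 09:30 and 17:00 on any day fall in the range from `seconds_since_midnight(time(9, 30))` to `seconds_since_midnight(time(17, 0))`.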

Re: boost on certain keywords

2010-01-28 Thread Shashi Kant
http://www.lucidimagination.com/blog/2009/08/05/getting-started-with-payloads/ On Thu, Jan 28, 2010 at 6:54 AM, Shashi Kant wrote: > Look at Payload. > > On Thu, Jan 28, 2010 at 6:48 AM, murali k wrote: >> >> Say I have a clothes store,  i have ladies clothes, mens cloth

Re: boost on certain keywords

2010-01-28 Thread Shashi Kant
Look at Payload. On Thu, Jan 28, 2010 at 6:48 AM, murali k wrote: > > Say I have a clothes store,  i have ladies clothes, mens clothes > > when someone searches for "clothes", i want to prioritize mens clothing > results, > how can I achieve this ? > this logic should only apply for this keyword,

Re: HI

2009-12-13 Thread Shashi Kant
http://lmgtfy.com/?q=lucene+basics On Sun, Dec 13, 2009 at 1:01 PM, Faire Mii wrote: > Hi, > > I am a beginner and i wonder what a document, entity and a field relates to > in a database? > > And i wonder if there are some good tutorials that learn you how to design > your schema. Because all o

Re: Migrating to Solr

2009-11-24 Thread Shashi Kant
Here is a link that might be helpful: http://sesat.no/moving-from-fast-to-solr-review.html The site is chock-a-block with great information on their migration experience. On Tue, Nov 24, 2009 at 8:55 AM, Tommy Molto wrote: > Hi, > > I'm new at Solr and i need to make a "test pilot" of a migrati

Re: Solr - Load Increasing.

2009-11-16 Thread Shashi Kant
I think it would be useful for members of this list to realize that not everyone uses the same metrology and terms. It is very easy for "Americans" to use the imperial system and presume everyone does the same; Europeans to use the metric system etc. Hopefully members on this list would be persuad

Re: Search Within

2009-04-04 Thread Shashi Kant
This post describes the search-within-search implementation. http://sujitpal.blogspot.com/2007/04/lucene-search-within-search-with.html Shashi On Sat, Apr 4, 2009 at 1:21 PM, Vernon Chapman wrote: > Bess, > > I think that might work I'll try it out and see how it works for my case. > > thanks

Re: Hardware Questions...

2009-03-24 Thread Shashi Kant
Have you looked at http://wiki.apache.org/solr/SolrPerformanceData ? On Tue, Mar 24, 2009 at 4:51 PM, solr wrote: > We have three Solr servers (several two processor Dell PowerEdge > servers). I'd like to get three newer servers and I wanted to se

Re: Use of scanned documents for text extraction and indexing

2009-02-26 Thread Shashi Kant
Can anyone back that up? IMHO Tesseract is the state-of-the-art in OCR, but not sure that "Ocropus builds on Tesseract". Can you confirm that Vikram has a point? Shashi - Original Message From: Vikram Kumar To: solr-user@lucene.apache.org; Shashi Kant Sent: Thursday, F

Re: Use of scanned documents for text extraction and indexing

2009-02-26 Thread Shashi Kant
Another project worth investigating is Tesseract. http://code.google.com/p/tesseract-ocr/ - Original Message From: Hannes Carl Meyer To: solr-user@lucene.apache.org Sent: Thursday, February 26, 2009 11:35:14 AM Subject: Re: Use of scanned documents for text extraction and indexing H

Re: why don't we have a forum for discussion?

2009-02-18 Thread Shashi Kant
Steve - could you not just subscribe to the list from another (off-mobile device) email, Gmail or Yahoo for example? We discourage using corporate email for subscribing to mailing lists precisely for such reasons: volume, spam, malware risks, etc. Shashi - Original Message From: Steph

Re: why don't we have a forum for discussion?

2009-02-18 Thread Shashi Kant
one man's "crap" is another man's treasure. :-P So how would you decide what is worth posting? If you feel the list is overwhelming your email, set some filters. Shashi - Original Message From: Tony Wang To: solr-user@lucene.apache.org Sent: Wednesday, February 18, 2009 2:06:57 PM S