Re: Solr with many indexes

2011-08-02 Thread Vikram Kumar
We have a multi-tenant Solr deployment with a core for each user.

Due to the limitations we are facing with the number of cores and with
lazy loading (and the associated warm-up times), we are researching
consolidating several users into one core, with queries limited by a
user-id field.

My question is about autosuggest.

1. Are there ways we can limit autosuggest to only documents with a
matching user id?

2. What other Solr operations like this need further consideration
when merging multiple indices and limiting by a field?

-- Vikram

On Sat, Jan 22, 2011 at 4:02 PM, Erick Erickson  wrote:
> See below.
>
> On Wed, Jan 19, 2011 at 7:26 PM, Joscha Feth  wrote:
>
>> Hello Erick,
>>
>> Thanks for your answer!
>>
>> But I question why you *require* many different indexes. [...] including
>> > isolating one
>> > users'
>> > data from all others, [...]
>>
>>
>> Yes, that's exactly what I am after - I need to make sure that indexes don't
>> mix, as every user shall only be able to query his own data (index).
>>
>
> Well, this can also be handled by simply appending the equivalent of
> +user:theuser
> to each query. This solution does have some "interesting" side effects,
> though. In particular, if you autosuggest based on combined documents,
> users will see terms NOT in documents they own.
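[A sketch of that per-user restriction. The core URL and the `user_id` field name are placeholder assumptions, not from this thread; the restriction is expressed here as an `fq` filter query, a common variant of appending the clause to `q`, since Solr caches filter queries separately.]

```python
from urllib.parse import urlencode

def user_query(base_url, user_id, query_text):
    """Build a Solr select URL that restricts results to one user's
    documents by adding a filter query (fq) on the user_id field."""
    params = {
        "q": query_text,
        "fq": "user_id:%s" % user_id,  # per-user restriction
        "wt": "json",
    }
    return base_url + "/select?" + urlencode(params)

url = user_query("http://localhost:8983/solr/docs", "u42", "title:report")
```

As Erick notes, this isolates search results but not shared index statistics or terms, which is why autosuggest still needs separate handling.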
>
>
>>
>> And even using lots of cores can be made to work if you don't pre-warm
>> > newly-opened
>> > cores, assuming that the response time when using "cold searchers" is
>> > adequate.
>> >
>>
>> Could you explain that further or point me to some documentation? Are you
>> talking about http://wiki.apache.org/solr/CoreAdmin#UNLOAD? If yes, LOAD
>> does not seem to be implemented yet. Or does this have something to do with
>> http://wiki.apache.org/solr/SolrCaching#autowarmCount only? What kind of
>> delay per X documents are we talking about here if auto-warming is
>> disabled? Is there more documentation about this setting?
>>
>>
> It's the autoWarm parameter. When you open a core, the first few queries
> that run on it will pay some penalty for filling caches etc. If your cores
> are small enough, this penalty may not be noticeable to your users, in
> which case you can just not bother autowarming. You might also be able to
> get away with having very small caches; it mostly depends on your usage
> patterns. If your pattern is that a user signs on, makes one search and
> signs off, there may not be much good in having large caches. On the
> other hand, if users sign on and search for hours continually, their
> experience may be enhanced by having significant caches. It all depends.
>
> Hope that helps
> Erick
>
>
>> Kind regards,
>> Joscha
>>
>



-- 
- Vikram


Re: What is the best scalable scheme to support multiple users?

2009-02-26 Thread Vikram Kumar
Hi Wunder,
Can you please elaborate?

Vikram

On Thu, Feb 26, 2009 at 10:13 AM, Walter Underwood
wrote:

> 1a. Multiple Solr instances partitioned by user_id%N, with index
> files segmented by user_id field.
>
> That can scale rather gracefully, though it does need reindexing
> to add a server.
>
> wunder
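[The user_id%N routing above can be sketched as follows. The routing function is deterministic, which is exactly why adding a server forces a reindex: most ids map to a different partition once N changes. Names here are illustrative.]

```python
def partition_for(user_id: int, num_servers: int) -> int:
    """Route a user's documents and queries to one Solr instance."""
    return user_id % num_servers

# With 4 servers, user 10 lives on partition 2.
# Growing from 4 to 5 servers relocates most users, hence the
# reindexing cost Walter mentions:
moved = sum(1 for uid in range(1000)
            if partition_for(uid, 4) != partition_for(uid, 5))
```

Consistent-hashing schemes reduce the fraction of users that move, but the simple modulo scheme is what is described here.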
>
> On 2/26/09 3:44 AM, "Vikram B. Kumar"  wrote:
>
> > Hi All,
> >
> > Our web based document management system has a few thousand users and is
> > growing rapidly. Like any SaaS, while we support a lot of customers,
> > only a few of them (those logged in) will be reading their index, and
> > only a subset of those logged in (those who are adding documents) will
> > be writing to their index,
> >
> > i.e., TU > L > U
> >
> > and TU ~ 100 x L
> >
> > where TU is the total number of users, L is the logged-in users who are
> > searching, and U is the uploaders who are updating their index.
> >
> > We have been using Lucene over a simple RESTful server for searching.
> > Indexing is currently done using regular JavaSE based setup, instead of
> > a server. We are thinking about moving to Solr to scale better and to
> > get rid of the latency associated with our non-live JavaSE based
> > indexer. We have a custom Analyzer/Filter that adds some payload to each
> > term to support our web based service.
> >
> > My message is about how best to partition the index to support
> > multiple users.
> >
> > Hardware: The servers I have are 64 bit 1.7GHz x 2xDual Core (i.e., 4
> > cores total) with 1/2 TB disks. By my estimate, 1/2 TB can support
> > 8000-1 users before I need to start sharding them across multiple
> hosts.
> >
> > I have thought of the following options:
> >
> > 1. One Monolithic index, but index files segmented by user_id field.
> >
> > 2. MultiCore - One core per user.
> >
> > 3. Multiple Solr instances - Non scalable.
> >
> > 4. Don't use Solr, but enhance our Lucene + RESTful server model to
> > support indexing as well. - Least favored approach, as we would be doing
> > a lot of things that Solr already does (replication, live
> > add/update/delete). Most of the things we are doing can be done with
> > Solr's pluggable query handlers. (I guess this is not a true option at
> all).
> >
> > I am currently favouring Option 2, though I want to try out whether 1
> > works as well.
> >
> > Looks like some of the most obvious problems with MultiCore are "too
> > many open files" problems, which can be handled with hardware and
> > software boundaries (properly closing the index after updating and
> > after users log out).
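[The "close cores on logout" idea maps to the CoreAdmin UNLOAD action linked earlier in this digest. A sketch that only builds the admin URL; host and core name are placeholder values.]

```python
from urllib.parse import urlencode

def unload_core_url(solr_base, core_name):
    """Build the CoreAdmin URL that unloads (closes) a per-user core,
    freeing its open file handles until the user logs in again."""
    params = {"action": "UNLOAD", "core": core_name}
    return solr_base + "/admin/cores?" + urlencode(params)

url = unload_core_url("http://localhost:8983/solr", "user_12345")
```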
> >
> > My questions:
> >
> > 1. Can our analyzers/filters be plugged into Solr during the time of
> > indexing?
> > 2. Does option 2 fit the above needs? Has anybody done option 2 with
> > thousands of cores in a Solr instance?
> > 3. Does option 2 support horizontal scaling (sharding)?
> >
> > Thanks,
> > Vikram
> >
> >
>
>


Re: Use of scanned documents for text extraction and indexing

2009-02-26 Thread Vikram Kumar
Tesseract is pure OCR. Ocropus builds on Tesseract.
Vikram

On Thu, Feb 26, 2009 at 12:11 PM, Shashi Kant  wrote:

> Another project worth investigating is Tesseract.
>
> http://code.google.com/p/tesseract-ocr/
>
>
>
>
> - Original Message 
> From: Hannes Carl Meyer 
> To: solr-user@lucene.apache.org
> Sent: Thursday, February 26, 2009 11:35:14 AM
> Subject: Re: Use of scanned documents for text extraction and indexing
>
> Hi Sithu,
>
> there is a project called ocropus done by the DFKI, check the online demo
> here: http://demo.iupr.org/cgi-bin/main.cgi
>
> And also http://sites.google.com/site/ocropus/
>
> Regards
>
> Hannes
>
> m...@hcmeyer.com
> http://mimblog.de
>
> On Thu, Feb 26, 2009 at 5:29 PM, Sudarsan, Sithu D. <
> sithu.sudar...@fda.hhs.gov> wrote:
>
> >
> > Hi All:
> >
> > Is there any study / research done on using scanned paper documents as
> > images (may be PDF), and then use some OCR or other technique for
> > extracting text, and the resultant index quality?
> >
> >
> > Thanks in advance,
> > Sithu D Sudarsan
> >
> > sithu.sudar...@fda.hhs.gov
> > sdsudar...@ualr.edu
> >
> >
> >
>
>


Re: Use of scanned documents for text extraction and indexing

2009-02-27 Thread Vikram Kumar
Check this: http://code.google.com/p/ocropus/wiki/FrequentlyAskedQuestions

> How well does it work?
>
> The character recognition accuracy of OCRopus right now (04/2007) is about
> like Tesseract. That's because the only character recognition plug-in in
> OCRopus is, in fact, Tesseract. In the future, there will be additional
> character recognition plug-ins, both for Latin and for other character sets.
>
> The big area of improvement relative to other open source OCR systems right
> now is in the area of layout analysis; in our benchmarks, OCRopus greatly
> reduces layout errors compared to other open source systems.
OCR is only a part of the solution with scanned documents; i.e., it only
recognizes text.

For structural/semantic understanding of documents, you need engines like
OCRopus that can do layout analysis and provide meaningful data for document
analysis and understanding.
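[A sketch of where the OCR output fits in an indexing pipeline: the recognized text becomes a field of a Solr add document. The field names (`id`, `text`, `page_count`) and the JSON update format are illustrative assumptions, not a schema from this thread; the OCR step itself (Tesseract/OCRopus) is assumed to have already run.]

```python
import json

def ocr_to_solr_doc(doc_id, ocr_text, page_count):
    """Wrap OCR output as a Solr JSON add command; quality of the
    resulting index depends directly on the OCR accuracy discussed
    in this thread."""
    doc = {"id": doc_id, "text": ocr_text, "page_count": page_count}
    return json.dumps({"add": {"doc": doc}})

payload = ocr_to_solr_doc("scan-001", "recognized page text ...", 3)
```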

From their own Wiki:

> Should I use OCRopus or Tesseract?
>
> You might consider using OCRopus right now if you require layout analysis,
> if you want to contribute to it, if you find its output format more
> convenient (HTML with embedded OCR information), and/or if you anticipate
> requiring some of its other capabilities in the future (pluggability,
> multiple scripts, statistical language models, etc.).
>
> In terms of character error rates, OCRopus performs similarly to Tesseract.
> In terms of layout analysis, OCRopus is significantly better than
> Tesseract.
>
> The main reasons not to use OCRopus yet are that it hasn't been packaged
> yet, that it has limited multi-platform support, and that it runs somewhat
> slower. We hope to address all those issues by the beta release.


On Thu, Feb 26, 2009 at 11:35 PM, Shashi Kant  wrote:

> Can anyone back that up?
>
> IMHO Tesseract is the state-of-the-art in OCR, but not sure that "Ocropus
> builds on Tesseract".
> Can you confirm that Vikram has a point?
>
> Shashi
>
>
>
>
> - Original Message 
> From: Vikram Kumar 
> To: solr-user@lucene.apache.org; Shashi Kant 
> Sent: Thursday, February 26, 2009 9:21:07 PM
> Subject: Re: Use of scanned documents for text extraction and indexing
>
> Tesseract is pure OCR. Ocropus builds on Tesseract.
> Vikram