+1 to Charlie's guidance.

And...

> 60,000 documents, mostly pdfs and emails.
> However, there's a premium on precision (and recall) in searches.

Please, oh, please, no matter what you're using for content/text extraction 
and/or OCR, run tika-eval[1] on the output to ensure that you are getting 
mostly language-y content out of your documents.  Ping us on the Tika user's 
list if you have any questions.

Bad text, bad search. 😊

[1] https://wiki.apache.org/tika/TikaEval
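
To make "language-y" a bit more concrete, here is a rough Python sketch of the kind of sanity check I mean. This is NOT tika-eval and does not use its API; it is a crude stand-in that assumes English text, and the function names, the tiny common-word list, and the thresholds are all made up for illustration. tika-eval computes far more (and better) statistics, so use it for real work.

```python
import re

# Hypothetical, deliberately tiny list of common English words.
# tika-eval ships proper per-language common-word lists; this is only a sketch.
COMMON_WORDS = {
    "the", "of", "and", "to", "a", "in", "is", "that",
    "for", "it", "on", "with", "as", "was", "be",
}

# Runs of letters only (no digits, underscores, or punctuation).
TOKEN_RE = re.compile(r"[^\W\d_]+")


def text_quality(text):
    """Return (alphabetic_ratio, common_word_ratio) for extracted text."""
    non_space = [c for c in text if not c.isspace()]
    if not non_space:
        return 0.0, 0.0
    # Share of non-whitespace characters that are letters.
    alpha_ratio = sum(c.isalpha() for c in non_space) / len(non_space)
    tokens = [t.lower() for t in TOKEN_RE.findall(text)]
    if not tokens:
        return alpha_ratio, 0.0
    # Share of alphabetic tokens that are very common words.
    common_ratio = sum(t in COMMON_WORDS for t in tokens) / len(tokens)
    return alpha_ratio, common_ratio


def looks_like_language(text, min_alpha=0.6, min_common=0.1):
    """Crude pass/fail check; both thresholds are illustrative guesses."""
    alpha_ratio, common_ratio = text_quality(text)
    return alpha_ratio >= min_alpha and common_ratio >= min_common


if __name__ == "__main__":
    samples = [
        "The court ruled that the motion was filed in time.",
        "%%## 0x1F ~~~ 1234 //// @@@@",
        "xq zvk jjw pqlm trrw zzkx",
    ]
    for sample in samples:
        print(looks_like_language(sample), text_quality(sample))
```

The idea is to run something like this over the extracted text of every document and look hard at the ones that fail: junk from a bad OCR pass or a broken PDF font mapping tends to show up either as a low share of letters or as "words" that never match a common-word list.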

-----Original Message-----
From: Charlie Hull [mailto:char...@flax.co.uk] 
Sent: Tuesday, April 17, 2018 4:17 AM
To: solr-user@lucene.apache.org
Subject: Re: Specialized Solr Application

On 16/04/2018 19:48, Terry Steichen wrote:
> I have from time-to-time posted questions to this list (and received 
> very prompt and helpful responses).  But it seems that many of you are 
> operating in a very different space from me.  The problems (and
> lessons-learned) which I encounter are often very different from those 
> that are reflected in exchanges with most other participants.

Hi Terry,

Sounds like a fascinating use case. We have some similar clients - small scale 
law firms and publishers - who have taken advantage of Solr.

One thing I would encourage you to do is to blog and/or talk about what you've 
built. Lucene Revolution is worth applying to talk at and if you do manage to 
get accepted - or if you go anyway - you'll meet lots of others with similar 
challenges and come away with a huge amount of useful information and contacts. 
Otherwise there are lots of smaller Meetup events (we run the London, UK one).

Don't assume just because some people here are describing their 350 billion 
document learning-to-rank clustered monster that the small applications don't 
matter - they really do, and the fact that they're possible to build at all is 
a testament to the open source model and how we share information and tips.

Cheers

Charlie
> 
> So I thought it would be useful to describe what I'm about, and see if 
> there are others out there with similar implementations (or interest 
> in moving in that direction).  A sort of pay-forward.
> 
> We (the Lakota Peoples Law Office) are a small public interest, pro 
> bono law firm actively engaged in defending Native American North 
> Dakota Water Protector clients against (ridiculously excessive) criminal 
> charges.
> 
> I have a small Solr (6.6.0) implementation - just one shard.  I'm 
> using the cloud mode mainly to be able to implement access controls.  
> The server is an ordinary (2.5GHz) laptop running Ubuntu 16.04 with 
> 8GB of RAM and 4 cpu processors.  We presently have 8 collections with 
> a total of about 60,000 documents, mostly pdfs and emails.  The 
> indexed documents are partly our own files and partly those we obtain 
> through legal discovery (which, surprisingly, is allowed in ND for 
> criminal cases).  We only have a few users (our lawyers and a couple 
> of researchers mostly), so traffic is minimal.  However, there's a 
> premium on precision (and recall) in searches.
> 
> The document repository is local to the server.  I piggyback on the 
> embedded Jetty httpd in order to serve files (selected from the 
> hitlists).  I just use a symbolic link to tie the repository to 
> Solr/Jetty's "webapp" subdirectory.
> 
> We provide remote access via ssh with port forwarding.  It provides 
> very snappy performance, with fully encrypted links.  Appears quite stable.
> 
> I've had some bizarre behavior apparently caused by an interaction 
> between repository permissions, solr permissions and the ssh link.  It 
> seems "solved" for the moment, but time will tell for how long.
> 
> If there are any folks out there who have similar requirements, I'd be 
> more than happy to share the insights I've gained and problems I've 
> encountered and (I think) overcome.  There are so many unique parts of 
> this small scale, specialized application (many dimensions of which 
> are not strictly internal to Solr) that it probably won't be 
> appreciated to dump them on this (excellent) Solr list.  So, if you 
> encounter problems peculiar to this kind of setup, we can perhaps help 
> handle them off-list (although if they have more general Solr 
> application, we should, of course, post them to the list).
> 
> Terry Steichen
> 


--
Charlie Hull
Flax - Open Source Enterprise Search

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.flax.co.uk
