Re: Easiest way to export the entire index

2020-01-29 Thread Steve Ge
@Amanda
You can try using curl and writing the output to a file:
  curl "http://localhost:8983/solr/{yourCore}/select?q={theSolrQuery}" > out.json
  theSolrQuery - you need to list every field you want exported, not just *
If you are on Windows, there is a Windows build of curl you can download.




Steve  
 
  On Wed, Jan 29, 2020 at 10:21 AM, Emir Arnautović wrote:

Hi Amanda,
I assume that you have all the fields stored, so you will be able to export
full documents.

Several thousand records should not be too many for regular start+rows
pagination, but the proper way to do this is to use cursors. Adjust the page
size to avoid creating huge responses, and use curl or a similar tool instead
of the admin console. I did a quick search and there are several blog posts
with scripts that do what you need.
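
A minimal sketch of the cursor approach described above, in Python with only
the standard library. The core name "mycore", the uniqueKey field "id", and the
page size are assumptions, not details from this thread; fl=* returns stored
fields only. Treat it as a starting point rather than a tested export tool.

```python
import json
import urllib.parse
import urllib.request

# Assumed values: adjust the host, core name, and uniqueKey field for your setup.
SOLR_SELECT_URL = "http://localhost:8983/solr/mycore/select"
UNIQUE_KEY = "id"


def fetch_page(cursor_mark, rows=200):
    """Request one page of results from Solr and return the parsed JSON."""
    params = urllib.parse.urlencode({
        "q": "*:*",
        "fl": "*",
        "wt": "json",
        "rows": rows,
        # Cursors require a sort that includes the uniqueKey field.
        "sort": UNIQUE_KEY + " asc",
        "cursorMark": cursor_mark,
    })
    with urllib.request.urlopen(SOLR_SELECT_URL + "?" + params) as resp:
        return json.loads(resp.read().decode("utf-8"))


def export_all(fetch=fetch_page):
    """Follow nextCursorMark until it stops changing; return all documents."""
    docs = []
    cursor = "*"  # "*" asks Solr to start a new cursor
    while True:
        page = fetch(cursor)
        docs.extend(page["response"]["docs"])
        next_cursor = page["nextCursorMark"]
        if next_cursor == cursor:  # same mark returned again: nothing left
            break
        cursor = next_cursor
    return docs


def write_export(path, fetch=fetch_page):
    """Write every exported document to a single JSON file."""
    with open(path, "w", encoding="utf-8") as out:
        json.dump(export_all(fetch), out, ensure_ascii=False, indent=2)
```

Calling write_export("out.json") against a running Solr would produce one JSON
array of documents. The fetch argument exists only so the loop can be tried
against canned pages without a server.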

HTH,
Emir

--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 29 Jan 2020, at 15:43, Amanda Shuman  wrote:
> 
> Dear all:
> 
> I've been asked to produce a JSON file of our index so it can be combined
> and indexed with other records. (We run solr 5.3.1 on this project; we're
> not going to upgrade, in part because funding has ended.) The index has
> several thousand rows, but nothing too drastic. Unfortunately, this is too
> much to handle for a simple query dump from the admin console. I tried to
> follow instructions related to running /export directly but I guess the
> export handler isn't installed. I tried to divide the query into rows, but
> after a certain amount it freezes, and it also freezes when I try to limit
> rows (e.g., rows 501-551 freezes the console). Is there any other way to
> export the index short of having to install the export handler considering
> we're not working on this project anymore?
> 
> Thanks,
> Amanda
> 
> --
> Dr. Amanda Shuman
> Researcher and Lecturer, Institute of Chinese Studies, University of
> Freiburg
> Coordinator for the MA program in Modern China Studies
> Database Administrator, The Maoist Legacy 
> PhD, University of California, Santa Cruz
> http://www.amandashuman.net/
> http://www.prchistoryresources.org/
> Office: +49 (0) 761 203 96748
  


Re: regarding Extracting text from Images

2020-01-22 Thread Steve Ge
In my experience, enabling Tika at the server level can exhaust the JVM heap
under a high volume of extraction and bring down Solr entirely. This is likely
because the garbage collector cannot keep up with the load; even tuning the
garbage collector didn't resolve the problem completely. I don't recommend it.
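
A rough sketch of the external approach suggested further down the thread:
extract text with a standalone Tika server, then send only the text to Solr.
The Tika server address (its default port 9998), the core name "mycore", and
the field name "content" are assumptions. For image files, Tika would also need
Tesseract installed on the machine where Tika runs; that part is not shown.

```python
import json
import urllib.request

# Assumed endpoints: a standalone Tika server on its default port and a
# hypothetical Solr core named "mycore".
TIKA_URL = "http://localhost:9998/tika"
SOLR_UPDATE_URL = "http://localhost:8983/solr/mycore/update?commit=true"


def extract_text(file_bytes):
    """Send raw file bytes to the Tika server and return extracted plain text."""
    req = urllib.request.Request(
        TIKA_URL,
        data=file_bytes,
        method="PUT",
        headers={"Accept": "text/plain"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("utf-8")


def build_update_body(doc_id, text):
    """Build the JSON body for Solr's update handler (a list of documents)."""
    # "content" is a hypothetical field name; use one defined in your schema.
    return json.dumps([{"id": doc_id, "content": text.strip()}]).encode("utf-8")


def index_file(doc_id, path, extract=extract_text):
    """Extract text outside Solr, then post only the resulting text to Solr."""
    with open(path, "rb") as fh:
        text = extract(fh.read())
    req = urllib.request.Request(
        SOLR_UPDATE_URL,
        data=build_update_body(doc_id, text),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status
```

With this split, the extraction work and its memory use stay in the Tika
process, so a heavy extraction load cannot take the Solr JVM down with it.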
Steve  
 
  On Wed, Oct 23, 2019 at 7:08 PM, suresh pendap wrote:

Hi Alex,
Thanks for your reply. How do we integrate tesseract with Solr? Do we have
to implement a custom update processor or extend the
ExtractingRequestHandler?

Regards
Suresh

On Wed, Oct 23, 2019 at 11:21 AM Alexandre Rafalovitch wrote:

> I believe Tika that powers this can do so with extra libraries (tesseract?)
> But Solr does not bundle those extras.
>
> In any case, you may want to run Tika externally so that the
> conversion/extraction process does not become a burden on Solr itself.
>
> Regards,
>      Alex
>
> On Wed, Oct 23, 2019, 1:58 PM suresh pendap wrote:
>
> > Hello,
> > I am reading the Solr documentation about integration with Tika and Solr
> > Cell framework over here
> >
> > https://lucene.apache.org/solr/guide/6_6/uploading-data-with-solr-cell-using-apache-tika.html
> >
> > I would like to know if the Solr Cell framework can also be used to
> > extract text from image files?
> >
> > Regards
> > Suresh
> >
>