Re: PDF extraction using Tika

2020-08-25 Thread Charlie Hull

On 25/08/2020 06:04, Srinivas Kashyap wrote:

Hi Alexandre,

Yes, these are the same PDF files running on Windows and Linux. There are
around 30 PDF files and I tried indexing a single file, but faced the same error. Is
it related to how the PDFs are stored on Linux?
Did you try running Tika (the same version as you're using in Solr) 
standalone on the file as Alexandre suggested?


And with regard to DIH and Tika going away, can you share any program which
extracts from PDF and pushes into Solr?


https://lucidworks.com/post/indexing-with-solrj/ is one example. You 
should run Tika separately as it's entirely possible for it to fail to 
parse a PDF and crash - and if you're running it in DIH & Solr it then 
brings down everything. Separate your PDF processing from your Solr 
indexing.
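
A minimal SolrJ sketch of that decoupled push step (a sketch only, assuming the
text has already been extracted to .txt files by a separate Tika run; the core
name and field names here are illustrative, not from the thread):

import java.nio.file.*;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class PushExtractedText {
  public static void main(String[] args) throws Exception {
    // One .txt file per PDF, produced by a standalone Tika step that ran earlier
    try (SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/mycore").build();
         DirectoryStream<Path> dir = Files.newDirectoryStream(Paths.get("extracted"), "*.txt")) {
      for (Path p : dir) {
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", p.getFileName().toString());   // illustrative field names
        doc.addField("_text_", Files.readString(p));
        solr.add(doc);
      }
      solr.commit();   // commit once after the batch, not per document
    }
  }
}

If Tika crashes on a bad PDF during the extraction step, only that one file's text
is missing; Solr itself is never involved in the parse.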



Cheers

Charlie



Thanks,
Srinivas Kashyap

-Original Message-
From: Alexandre Rafalovitch 
Sent: 24 August 2020 20:54
To: solr-user 
Subject: Re: PDF extraction using Tika

The issue seems to be more with a specific file and at the level way below 
Solr's or possibly even Tika's:
Caused by: java.io.IOException: expected='>' actual='
' at offset 2383
 at
org.apache.pdfbox.pdfparser.BaseParser.readExpectedChar(BaseParser.java:1045)

Are you indexing the same files on Windows and Linux? I am guessing not. I 
would try to narrow down which of the files it is. One way could be to get a 
standalone Tika (make sure to match the version Solr
embeds) and run it over the documents by itself. It will probably complain with 
the same error.

Regards,
Alex.
P.s. Additionally, both DIH and Embedded Tika are not recommended for 
production. And both will be going away in future Solr versions. You may have a 
much less brittle pipeline if you save the structured outputs from those Tika 
standalone runs and then index them into Solr, possibly pre-processed.
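
A minimal standalone check along those lines (a sketch, assuming the Tika jars
matching the version Solr embeds are on the classpath and the PDF directory is
passed as the first argument):

import java.io.InputStream;
import java.nio.file.*;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

public class FindBadPdf {
  public static void main(String[] args) throws Exception {
    AutoDetectParser parser = new AutoDetectParser();
    try (DirectoryStream<Path> pdfs = Files.newDirectoryStream(Paths.get(args[0]), "*.pdf")) {
      for (Path pdf : pdfs) {
        try (InputStream in = Files.newInputStream(pdf)) {
          // -1 removes the default write limit so big PDFs don't fail for the wrong reason
          parser.parse(in, new BodyContentHandler(-1), new Metadata(), new ParseContext());
          System.out.println("OK    " + pdf);
        } catch (Exception e) {
          System.out.println("FAIL  " + pdf + " -> " + e);
        }
      }
    }
  }
}

Whichever file prints FAIL with the PDFBox readExpectedChar error is the one to
inspect (or re-export) before indexing.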

On Mon, 24 Aug 2020 at 11:09, Srinivas Kashyap 
 wrote:

Hello,

We are using TikaEntityProcessor to extract the content out of PDFs and make the
content searchable.

When Jetty is run on a Windows-based machine, we are able to successfully load
documents using a DIH full import (Tika entity). Here the PDFs are maintained in the
Windows file system.

But when Jetty/Solr is run on a Linux machine and we try to run DIH, we
are getting the below exception (here the PDFs are maintained in the Linux
filesystem):

Full Import failed:java.lang.RuntimeException: java.lang.RuntimeException: 
org.apache.solr.handler.dataimport.DataImportHandlerException: Unable to read 
content Processing Document # 1
 at 
org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:271)
 at 
org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:424)
 at 
org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:483)
 at 
org.apache.solr.handler.dataimport.DataImporter.lambda$runAsync$0(DataImporter.java:466)
 at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.RuntimeException: 
org.apache.solr.handler.dataimport.DataImportHandlerException: Unable to read 
content Processing Document # 1
 at 
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:417)
 at 
org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:330)
 at 
org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:233)
 ... 4 more
Caused by: org.apache.solr.handler.dataimport.DataImportHandlerException: 
Unable to read content Processing Document # 1
 at 
org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:69)
 at 
org.apache.solr.handler.dataimport.TikaEntityProcessor.nextRow(TikaEntityProcessor.java:171)
 at 
org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:267)
 at 
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:476)
 at 
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:517)
 at 
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:415)
 ... 6 more
Caused by: org.apache.tika.exception.TikaException: Unable to extract PDF 
content
 at 
org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:139)
 at 
org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:172)
 at 
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
 at 
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
 at 
org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
 at 
org.apache.solr.handler.dataimport.TikaEntityProcessor.nextRow(TikaE


Re: PDF extraction using Tika

2020-08-25 Thread Joe Doupnik
    More properly, it would be best to fix Tika and thus not push extra 
complexity onto many, many users. Error handling is one thing; crashes, 
though, ought to be designed out.

    Thanks,
    Joe D.

On 25/08/2020 10:54, Charlie Hull wrote:

On 25/08/2020 06:04, Srinivas Kashyap wrote:

Hi Alexandre,

Yes, these are the same PDF files running in windows and linux. There 
are around 30 pdf files and I tried indexing single file, but faced 
same error. Is it related to how PDF stored in linux?
Did you try running Tika (the same version as you're using in Solr) 
standalone on the file as Alexandre suggested?


And with regard to DIH and TIKA going away, can you share if any 
program which extracts from PDF and pushes into solr?


https://lucidworks.com/post/indexing-with-solrj/ is one example. You 
should run Tika separately as it's entirely possible for it to fail to 
parse a PDF and crash - and if you're running it in DIH & Solr it then 
brings down everything. Separate your PDF processing from your Solr 
indexing.



Cheers

Charlie



Thanks,
Srinivas Kashyap

-Original Message-
From: Alexandre Rafalovitch 
Sent: 24 August 2020 20:54
To: solr-user 
Subject: Re: PDF extraction using Tika

The issue seems to be more with a specific file and at the level way 
below Solr's or possibly even Tika's:

Caused by: java.io.IOException: expected='>' actual='
' at offset 2383
 at
org.apache.pdfbox.pdfparser.BaseParser.readExpectedChar(BaseParser.java:1045) 



Are you indexing the same files on Windows and Linux? I am guessing 
not. I would try to narrow down which of the files it is. One way 
could be to get a standalone Tika (make sure to match the version Solr
embeds) and run it over the documents by itself. It will probably 
complain with the same error.


Regards,
    Alex.
P.s. Additionally, both DIH and Embedded Tika are not recommended for 
production. And both will be going away in future Solr versions. You 
may have a much less brittle pipeline if you save the structured 
outputs from those Tika standalone runs and then index them into 
Solr, possibly pre-processed.


On Mon, 24 Aug 2020 at 11:09, Srinivas Kashyap 
 wrote:

Hello,

We are using TikaEntityProcessor to extract the content out of PDF 
and make the content searchable.


When jetty is run on windows based machine, we are able to 
successfully load documents using full import DIH(tika entity). Here 
PDF's is maintained in windows file system.


But when jetty solr is run on linux machine, and try to run DIH, we
are getting below exception: (Here PDF's are maintained in linux
filesystem)

Full Import failed:java.lang.RuntimeException: 
java.lang.RuntimeException: 
org.apache.solr.handler.dataimport.DataImportHandlerException: 
Unable to read content Processing Document # 1
 at 
org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:271)
 at 
org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:424)
 at 
org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:483)
 at 
org.apache.solr.handler.dataimport.DataImporter.lambda$runAsync$0(DataImporter.java:466)

 at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.RuntimeException: 
org.apache.solr.handler.dataimport.DataImportHandlerException: 
Unable to read content Processing Document # 1
 at 
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:417)
 at 
org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:330)
 at 
org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:233)

 ... 4 more
Caused by: 
org.apache.solr.handler.dataimport.DataImportHandlerException: 
Unable to read content Processing Document # 1
 at 
org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:69)
 at 
org.apache.solr.handler.dataimport.TikaEntityProcessor.nextRow(TikaEntityProcessor.java:171)
 at 
org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:267)
 at 
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:476)
 at 
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:517)
 at 
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:415)

 ... 6 more
Caused by: org.apache.tika.exception.TikaException: Unable to 
extract PDF content
 at 
org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:139)
 at 
org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:172)
 at 
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
 at 
org.apache.tika.parser.CompositeParser.parse(CompositePa

Re: Apache Solr 8.6.0 with SSL

2020-08-25 Thread Patrik Peng
Thanks for your input regarding SOLR-14711, that makes sense.

I wasn't able to reproduce the bin/solr script issue on a Debian
machine, so I guess there's something wrong with my setup.

Patrik

On 24.08.20 17:26, Jan Høydahl wrote:
> I think you’re experiencing this:
>
> https://issues.apache.org/jira/browse/SOLR-14711
>
> No idea why the bin/solr script won’t work with SSL...
>
> Jan
>
>> 24. aug. 2020 kl. 15:52 skrev Patrik Peng :
>>
>> Greetings
>>
>> I'm in the process of setting up a SolrCloud cluster with 3 ZooKeeper
>> and 3 Solr nodes on FreeBSD and wish to enable SSL between the Solr nodes.
>> Before enabling SSL, everything worked as expected and I followed the
>> instructions described in the Solr 8.6 docs. But after
>> enabling SSL, the solr command line utility stopped working for various
>> tasks.
>>
>> For example:
>>
>> $ /usr/local/solr/bin/solr status
>>
>> Found 1 Solr nodes:
>>
>> Solr process 974 from /var/db/solr/solr-8983.pid not found.
>>
>> $ /usr/local/solr/bin/solr create_collection -c test
>> Failed to determine the port of a local Solr instance, cannot create test!
>>
>> Also the following line appears in the logfile even though SSL is enabled:
>>
>> 2020-08-24 15:29:52.612 WARN  (main) [   ] o.a.s.c.CoreContainer Solr 
>> authentication is enabled, but SSL is off.  Consider enabling SSL to protect 
>> user credentials and data with encryption.
>>
>> Apart from these oddities, the cluster is working fine and dandy. The
>> dashboard is available via HTTPS and the nodes can communicate via SSL.
>>
>> Does anyone have any clue what's causing this? Any help would be
>> appreciated.
>>
>> Regards
>> Patrik
>>



Issues deploying LTR into SolrCloud

2020-08-25 Thread Dmitry Kan
Hi,

There is a recent thread "Replication of Solr Model and feature store" on
deploying LTR feature store and model into a master/slave Solr topology.

I'm facing an issue deploying into SolrCloud (Solr 7.5.0), where
collections have shards with replicas. This is the process I've been
following:

1. Deploy a feature store from a JSON file to each collection.
2. Reload all collections as advised in the documentation:
https://lucene.apache.org/solr/guide/7_5/learning-to-rank.html#applying-changes
3. Deploy the related model from a JSON file.
4. Reload all collections again.
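
A sketch of those four steps as a script (assuming Java 11's HttpClient, the
standard /schema/feature-store and /schema/model-store endpoints from the LTR
guide, and illustrative host and collection names):

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Path;

public class DeployLtr {
  static final HttpClient HTTP = HttpClient.newHttpClient();
  static final String SOLR = "http://server1:8983/solr";              // illustrative
  static final String[] COLLECTIONS = {"collection1", "collection2"}; // illustrative

  public static void main(String[] args) throws Exception {
    put("schema/feature-store", Path.of("features.json"));   // 1. feature store
    reload();                                                 // 2. reload
    put("schema/model-store", Path.of("model.json"));         // 3. model
    reload();                                                 // 4. reload again
  }

  static void put(String endpoint, Path json) throws Exception {
    for (String c : COLLECTIONS) {
      HttpRequest req = HttpRequest.newBuilder(URI.create(SOLR + "/" + c + "/" + endpoint))
          .header("Content-Type", "application/json")
          .PUT(HttpRequest.BodyPublishers.ofFile(json))
          .build();
      System.out.println(c + " " + endpoint + " -> "
          + HTTP.send(req, HttpResponse.BodyHandlers.ofString()).statusCode());
    }
  }

  static void reload() throws Exception {
    for (String c : COLLECTIONS) {
      HttpRequest req = HttpRequest.newBuilder(
          URI.create(SOLR + "/admin/collections?action=RELOAD&name=" + c)).GET().build();
      System.out.println("RELOAD " + c + " -> "
          + HTTP.send(req, HttpResponse.BodyHandlers.ofString()).statusCode());
    }
  }
}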


The problem is that even after reloading the collections, shard replicas
continue to not have the model:

Error from server at http://server1:8983/solr/collection1_shard1_replica_n1:
cannot find model 'model_name'

What is the proper way to address this issue, and could it potentially be a
bug in SolrCloud?

Is there any workaround I can try, like saving the feature store and model
JSON files into the collection config path and creating the SolrCloud
collection from there?

Thanks,

Dmitry

-- 
Dmitry Kan
Luke Toolbox: http://github.com/DmitryKey/luke
Blog: http://dmitrykan.blogspot.com and https://medium.com/@dmitry.kan
Twitter: http://twitter.com/dmitrykan
SemanticAnalyzer: https://semanticanalyzer.info


How to Prevent Recovery?

2020-08-25 Thread Anshuman Singh
Hi,

We have a 10-node (150G RAM, 1TB SAS HDD, 32 cores) Solr 8.5.1 cluster with
50 shards, rf 2 (NRT replicas), and 7B docs. We have 5 ZooKeeper nodes, with 2
running on the same nodes where Solr is running. Our use case requires continuous
ingestion (mostly updates). If we ingest at 40k records per sec, after
10-15 mins some replicas go into recovery, with the errors observed given at
the end. We also observed high CPU during these ingestions (60-70%), and
disks frequently reach 100% utilization.

We know our hardware is limited, but this system will be used by only a few
users, and search times of a few minutes and slow ingestion are fine, so
we are trying to run with these specifications for now; recovery, however, is
becoming a bottleneck.

So to prevent recovery, which I'm thinking could be due to high CPU/disk load
during ingestion, we reduced the data rate to 10k records per sec. Now CPU
usage is not high and recovery is not as frequent, but it can still happen over a
long run of 2-3 hrs. We further reduced the rate to 4k records per sec, but
again it happened after 3-4 hrs. Logs were filled with the error below on
the instance on which recovery happened. It seems that reducing the data rate is
not helping with recovery.

*2020-08-25 12:16:11.008 ERROR (qtp1546693040-235) [c:collection s:shard41
r:core_node565 x:collection_shard41_replica_n562] o.a.s.s.HttpSolrCall
null:java.io.IOException: java.util.concurrent.TimeoutException: Idle
timeout expired: 30/30 ms*

A Solr thread dump showed commit threads taking up to 10-15 minutes. Currently
auto-commit happens at 10M docs or 30 seconds.

Can someone point me in the right direction? Also, can we perform
core-binding (CPU pinning) for Solr processes?

*2020-08-24 12:32:55.835 WARN  (zkConnectionManagerCallback-11-thread-1) [
  ] o.a.s.c.c.ConnectionManager Watcher
org.apache.solr.common.cloud.ConnectionManager@372ea2bc name:
ZooKeeperConnection Watcher:x.x.x.7:2181,x.x.x.8:2181/solr got event
WatchedEvent state:Disconnected type:None path:null path: null type: None*














*2020-08-24 12:41:02.005 WARN  (main-SendThread(x.x.x.8:2181)) [   ]
o.a.z.ClientCnxn Unable to reconnect to ZooKeeper service, session
0x273f9a8fb229269 has expired2020-08-24 12:41:06.177 WARN
 (MetricsHistoryHandler-8-thread-1) [   ] o.a.s.h.a.MetricsHistoryHandler
Could not obtain overseer's address, skipping. =>
org.apache.zookeeper.KeeperException$SessionExpiredException:
KeeperErrorCode = Session expired for /overseer_elect/leaderat
org.apache.zookeeper.KeeperException.create(KeeperException.java:134)org.apache.zookeeper.KeeperException$SessionExpiredException:
KeeperErrorCode = Session expired for /overseer_elect/leaderat
org.apache.zookeeper.KeeperException.create(KeeperException.java:134)
~[?:?]at
org.apache.zookeeper.KeeperException.create(KeeperException.java:54)
~[?:?]at
org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:2131)
~[?:?]2020-08-24 12:41:13.365 WARN
 (zkConnectionManagerCallback-11-thread-1) [   ]
o.a.s.c.c.ConnectionManager Watcher
org.apache.solr.common.cloud.ConnectionManager@372ea2bc name:
ZooKeeperConnection Watcher:x.x.x.7:2181,x.x.x.8:2181/solr got event
WatchedEvent state:Expired type:None path:null path: null type:
None2020-08-24 12:41:13.366 WARN  (zkConnectionManagerCallback-11-thread-1)
[   ] o.a.s.c.c.ConnectionManager Our previous ZooKeeper session was
expired. Attempting to reconnect to recover relationship with
ZooKeeper...2020-08-24 12:41:16.705 ERROR (qtp1546693040-163255)
[c:collection s:shard31 r:core_node525 x:collection_shard31_replica_n522]
o.a.s.h.RequestHandlerBase org.apache.solr.common.SolrException: Cannot
talk to ZooKeeper - Updates are disabled*


Re: How to Prevent Recovery?

2020-08-25 Thread Houston Putman
Are you able to use TLOG replicas? That should reduce the time it takes to
recover significantly. It doesn't seem like you have a hard need for
near-real-time, since slow ingestions are fine.
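
A minimal SolrJ sketch of creating such a collection (a sketch only; the
collection name, config set name and ZooKeeper addresses are placeholders, and
for an existing 7B-doc collection you would more likely add TLOG replicas and
drop the NRT ones rather than recreate it):

import java.util.List;
import java.util.Optional;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.request.CollectionAdminRequest;

public class CreateTlogCollection {
  public static void main(String[] args) throws Exception {
    try (CloudSolrClient solr = new CloudSolrClient.Builder(
            List.of("zk1:2181", "zk2:2181", "zk3:2181"), Optional.of("/solr")).build()) {
      // 50 shards, 0 NRT replicas, 2 TLOG replicas per shard, 0 PULL replicas
      CollectionAdminRequest
          .createCollection("collection_tlog", "myconfig", 50, 0, 2, 0)
          .process(solr);
    }
  }
}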

- Houston

On Tue, Aug 25, 2020 at 12:03 PM Anshuman Singh 
wrote:

> Hi,
>
> We have a 10 node (150G RAM, 1TB SAS HDD, 32 cores) Solr 8.5.1 cluster with
> 50 shards, rf 2 (NRT replicas), 7B docs, We have 5 Zk with 2 running on the
> same nodes where Solr is running. Our use case requires continuous
> ingestions (updates mostly). If we ingest at 40k records per sec, after
> 10-15mins some replicas go into recovery with the errors observed given in
> the end. We also observed high CPU during these ingestions (60-70%) and
> disks frequently reach 100% utilization.
>
> We know our hardware is limited but this system will be used by only a few
> users and search times taking a few minutes and slow ingestions are fine so
> we are trying to run with these specifications for now but recovery is
> becoming a bottleneck.
>
> So to prevent recovery which I'm thinking could be due to high CPU/Disk
> during ingestions, we reduced the data rate to 10k records per sec. Now CPU
> usage is not high and recovery is not that frequent but it can happen in a
> long run of 2-3 hrs. We further reduced the rate to 4k records per sec but
> again it happened after 3-4 hrs. Logs were filled with the below error on
> the instance on which recovery happened. Seems like reducing data rate is
> not helping with recovery.
>
> *2020-08-25 12:16:11.008 ERROR (qtp1546693040-235) [c:collection s:shard41
> r:core_node565 x:collection_shard41_replica_n562] o.a.s.s.HttpSolrCall
> null:java.io.IOException: java.util.concurrent.TimeoutException: Idle
> timeout expired: 30/30 ms*
>
> Solr thread dump showed commit threads taking upto 10-15 minutes. Currently
> auto commit happens at 10M docs or 30seconds.
>
> Can someone point me in the right direction? Also can we perform
> core-binding for Solr processes?
>
> *2020-08-24 12:32:55.835 WARN  (zkConnectionManagerCallback-11-thread-1) [
>   ] o.a.s.c.c.ConnectionManager Watcher
> org.apache.solr.common.cloud.ConnectionManager@372ea2bc name:
> ZooKeeperConnection Watcher:x.x.x.7:2181,x.x.x.8:2181/solr got event
> WatchedEvent state:Disconnected type:None path:null path: null type: None*
>
>
>
>
>
>
>
>
>
>
>
>
>
>
> *2020-08-24 12:41:02.005 WARN  (main-SendThread(x.x.x.8:2181)) [   ]
> o.a.z.ClientCnxn Unable to reconnect to ZooKeeper service, session
> 0x273f9a8fb229269 has expired2020-08-24 12:41:06.177 WARN
>  (MetricsHistoryHandler-8-thread-1) [   ] o.a.s.h.a.MetricsHistoryHandler
> Could not obtain overseer's address, skipping. =>
> org.apache.zookeeper.KeeperException$SessionExpiredException:
> KeeperErrorCode = Session expired for /overseer_elect/leaderat
>
> org.apache.zookeeper.KeeperException.create(KeeperException.java:134)org.apache.zookeeper.KeeperException$SessionExpiredException:
> KeeperErrorCode = Session expired for /overseer_elect/leaderat
> org.apache.zookeeper.KeeperException.create(KeeperException.java:134)
> ~[?:?]at
> org.apache.zookeeper.KeeperException.create(KeeperException.java:54)
> ~[?:?]at
> org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:2131)
> ~[?:?]2020-08-24 12:41:13.365 WARN
>  (zkConnectionManagerCallback-11-thread-1) [   ]
> o.a.s.c.c.ConnectionManager Watcher
> org.apache.solr.common.cloud.ConnectionManager@372ea2bc name:
> ZooKeeperConnection Watcher:x.x.x.7:2181,x.x.x.8:2181/solr got event
> WatchedEvent state:Expired type:None path:null path: null type:
> None2020-08-24 12:41:13.366 WARN  (zkConnectionManagerCallback-11-thread-1)
> [   ] o.a.s.c.c.ConnectionManager Our previous ZooKeeper session was
> expired. Attempting to reconnect to recover relationship with
> ZooKeeper...2020-08-24 12:41:16.705 ERROR (qtp1546693040-163255)
> [c:collection s:shard31 r:core_node525 x:collection_shard31_replica_n522]
> o.a.s.h.RequestHandlerBase org.apache.solr.common.SolrException: Cannot
> talk to ZooKeeper - Updates are disabled*
>


Re: How to Prevent Recovery?

2020-08-25 Thread Erick Erickson
Commits should absolutely not be taking that much time, that’s where I’d focus 
first.

Some sneaky places things go wonky:
1> you have a suggester configured that builds whenever there's a commit.
2> you send commits from the client
3> you’re optimizing on commit
4> you have too much data for your hardware

My guess though is that the root cause of your recovery is that the followers
get backed up. If there are enough merge threads running, the
next update can block until at least one is done. Then the scenario
goes something like this:

leader sends doc to follower
follower does not index the document in time
leader puts follower into “leader initiated recovery”.

So one thing to look for, if that scenario is correct, is whether there are
messages in your logs with "leader-initiated recovery". I'd personally grep my
logs with something like

grep initiated logfile | grep recovery | grep leader

‘cause I never remember whether that’s the exact form. If it is this, you can
lengthen the timeouts, look particularly for:
• distribUpdateConnTimeout
• distribUpdateSoTimeout

All that said, your symptoms are consistent with a lot of merging going on. 
With NRT
nodes, all replicas do all indexing and thus merging. Have you considered
using TLOG/PULL replicas? In your case they could even all be TLOG replicas. In 
that
case, only the leader does the indexing, the other TLOG replicas of a shard 
just stuff
the documents into their local tlogs without indexing at all.

Speaking of which, you could reduce some of the disk pressure if you can put 
your
tlogs on another drive, don’t know if that’s possible. Ditto the Solr logs.

Beyond that, it may be a matter of increasing the hardware. You're really 
indexing
120K records per second ((1 leader + 2 followers) * 40K/sec).

Best,
Erick

> On Aug 25, 2020, at 12:02 PM, Anshuman Singh  
> wrote:
> 
> Hi,
> 
> We have a 10 node (150G RAM, 1TB SAS HDD, 32 cores) Solr 8.5.1 cluster with
> 50 shards, rf 2 (NRT replicas), 7B docs, We have 5 Zk with 2 running on the
> same nodes where Solr is running. Our use case requires continuous
> ingestions (updates mostly). If we ingest at 40k records per sec, after
> 10-15mins some replicas go into recovery with the errors observed given in
> the end. We also observed high CPU during these ingestions (60-70%) and
> disks frequently reach 100% utilization.
> 
> We know our hardware is limited but this system will be used by only a few
> users and search times taking a few minutes and slow ingestions are fine so
> we are trying to run with these specifications for now but recovery is
> becoming a bottleneck.
> 
> So to prevent recovery which I'm thinking could be due to high CPU/Disk
> during ingestions, we reduced the data rate to 10k records per sec. Now CPU
> usage is not high and recovery is not that frequent but it can happen in a
> long run of 2-3 hrs. We further reduced the rate to 4k records per sec but
> again it happened after 3-4 hrs. Logs were filled with the below error on
> the instance on which recovery happened. Seems like reducing data rate is
> not helping with recovery.
> 
> *2020-08-25 12:16:11.008 ERROR (qtp1546693040-235) [c:collection s:shard41
> r:core_node565 x:collection_shard41_replica_n562] o.a.s.s.HttpSolrCall
> null:java.io.IOException: java.util.concurrent.TimeoutException: Idle
> timeout expired: 30/30 ms*
> 
> Solr thread dump showed commit threads taking upto 10-15 minutes. Currently
> auto commit happens at 10M docs or 30seconds.
> 
> Can someone point me in the right direction? Also can we perform
> core-binding for Solr processes?
> 
> *2020-08-24 12:32:55.835 WARN  (zkConnectionManagerCallback-11-thread-1) [
>  ] o.a.s.c.c.ConnectionManager Watcher
> org.apache.solr.common.cloud.ConnectionManager@372ea2bc name:
> ZooKeeperConnection Watcher:x.x.x.7:2181,x.x.x.8:2181/solr got event
> WatchedEvent state:Disconnected type:None path:null path: null type: None*
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> *2020-08-24 12:41:02.005 WARN  (main-SendThread(x.x.x.8:2181)) [   ]
> o.a.z.ClientCnxn Unable to reconnect to ZooKeeper service, session
> 0x273f9a8fb229269 has expired2020-08-24 12:41:06.177 WARN
> (MetricsHistoryHandler-8-thread-1) [   ] o.a.s.h.a.MetricsHistoryHandler
> Could not obtain overseer's address, skipping. =>
> org.apache.zookeeper.KeeperException$SessionExpiredException:
> KeeperErrorCode = Session expired for /overseer_elect/leaderat
> org.apache.zookeeper.KeeperException.create(KeeperException.java:134)org.apache.zookeeper.KeeperException$SessionExpiredException:
> KeeperErrorCode = Session expired for /overseer_elect/leaderat
> org.apache.zookeeper.KeeperException.create(KeeperException.java:134)
> ~[?:?]at
> org.apache.zookeeper.KeeperException.create(KeeperException.java:54)
> ~[?:?]at
> org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:2131)
> ~[?:?]2020-08-24 12:41:13.365 WARN
> (zkConnectionManagerCallback-11-thread-1) [   ]
> o.a.s.c.c.ConnectionM

How does Solr suggest sort results when weight is 0

2020-08-25 Thread Hanjan, Harinderdeep S.
Hello,

I can't find anything in the docs to understand how Solr sorts suggest results
when the weight is the same (0 in my case).

Here is my suggester config:
---
  <str name="name">mySuggester</str>
  <str name="lookupImpl">AnalyzingInfixLookupFactory</str>
  <str name="dictionaryImpl">DocumentDictionaryFactory</str>
  <str name="field">autocomplete</str>
  <str name="payloadField">payload</str>
  <str name="suggestAnalyzerFieldType">text_general</str>
  <str name="buildOnStartup">true</str>
---

Results for query: /select?q=plas
---
"plas":{
"numFound":5,
"suggestions":[{
"term":"Lids - plastic, small",
"weight":0,
"payload":"/bottle-caps.html"},
  {
"term":"Body lotion bottle - plastic",
"weight":0,
"payload":"/body-lotion-bottle.html"},
  {
"term":"Lotion bottle - plastic",
"weight":0,
"payload":"/body-lotion-bottle.html"},
  {
"term":"Suncreen bottle - plastic",
"weight":0,
"payload":"/body-lotion-bottle.html"},
  {
"term":"Hand lotion bottle - plastic",
"weight":0,
"payload":"/body-lotion-bottle.html"}]}
---

Looking at the above response, it looked like it was sorting by payload. However,
when I increase the number of results to 20, I see air-freshener-plastic-bottle.html
at the bottom.

Clearly it's not sorting alphabetically (by either term or payload). Solr can't
sort by the weight field here either. So is it sorting by relevancy? How can I
find the relevancy score? debugQuery=true does not work.

Thanks!





RE: PDF extraction using Tika

2020-08-25 Thread Phil Scadden
Code for SolrJ is going to be very dependent on your needs, but the beating
heart of my code is below (note that I do OCR as a separate step before feeding
files into the indexer). The SolrJ and Tika docs should help.

File f = new File(filename);
ContentHandler textHandler = new BodyContentHandler(Integer.MAX_VALUE);
Metadata metadata = new Metadata();
Parser parser = new AutoDetectParser();
ParseContext context = new ParseContext();
if (filename.toLowerCase().contains("pdf")) {
    // OCR is done in a separate step, so disable it (and inline images) here
    PDFParserConfig pdfConfig = new PDFParserConfig();
    pdfConfig.setExtractInlineImages(false);
    pdfConfig.setOcrStrategy(PDFParserConfig.OCR_STRATEGY.NO_OCR);
    context.set(PDFParserConfig.class, pdfConfig);
    context.set(Parser.class, parser);
}
InputStream input = new FileInputStream(f);
try {
    parser.parse(input, textHandler, metadata, context);
} catch (Exception e) {
    Logger.getLogger(JsMapAdminService.class.getName()).log(Level.SEVERE,
            null, String.format("File %s failed", f.getCanonicalPath()));
    e.printStackTrace();
    writeLog(String.format("File %s failed", f.getCanonicalPath()));
    return false;
}
// Build the Solr document and push it with SolrJ
SolrInputDocument up = new SolrInputDocument();
if (title == null) title = metadata.get("title");
if (author == null) author = metadata.get("author");
up.addField("id", f.getCanonicalPath());
up.addField("location", idString);
up.addField("title", title);
up.addField("author", author);
// ... addField calls for the rest of your fields
String content = textHandler.toString();
up.addField("_text_", content);
UpdateRequest req = new UpdateRequest();
req.add(up);
req.setBasicAuthCredentials("solrAdmin", password);
UpdateResponse ur = req.process(solr, "prindex");
req.commit(solr, "prindex");
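
For reference, the imports this fragment appears to assume (Tika 1.x and SolrJ
package names; the surrounding class and the title, author, idString, solr and
password variables come from the rest of the original code and are not shown):

import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
import java.util.logging.Level;
import java.util.logging.Logger;
import org.apache.solr.client.solrj.request.UpdateRequest;
import org.apache.solr.client.solrj.response.UpdateResponse;
import org.apache.solr.common.SolrInputDocument;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.parser.pdf.PDFParserConfig;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;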

-Original Message-
From: Srinivas Kashyap 
Sent: Tuesday, 25 August 2020 17:04
To: solr-user@lucene.apache.org
Subject: RE: PDF extraction using Tika

Hi Alexandre,

Yes, these are the same PDF files running in windows and linux. There are 
around 30 pdf files and I tried indexing single file, but faced same error. Is 
it related to how PDF stored in linux?

And with regard to DIH and TIKA going away, can you share if any program which 
extracts from PDF and pushes into solr?

Thanks,
Srinivas Kashyap

-Original Message-
From: Alexandre Rafalovitch 
Sent: 24 August 2020 20:54
To: solr-user 
Subject: Re: PDF extraction using Tika

The issue seems to be more with a specific file and at the level way below 
Solr's or possibly even Tika's:
Caused by: java.io.IOException: expected='>' actual='
' at offset 2383
at
org.apache.pdfbox.pdfparser.BaseParser.readExpectedChar(BaseParser.java:1045)

Are you indexing the same files on Windows and Linux? I am guessing not. I 
would try to narrow down which of the files it is. One way could be to get a 
standalone Tika (make sure to match the version Solr
embeds) and run it over the documents by itself. It will probably complain with 
the same error.

Regards,
   Alex.
P.s. Additionally, both DIH and Embedded Tika are not recommended for 
production. And both will be going away in future Solr versions. You may have a 
much less brittle pipeline if you save the structured outputs from those Tika 
standalone runs and then index them into Solr, possibly pre-processed.

On Mon, 24 Aug 2020 at 11:09, Srinivas Kashyap 
 wrote:
>
> Hello,
>
> We are using TikaEntityProcessor to extract the content out of PDF and make 
> the content searchable.
>
> When jetty is run on windows based machine, we are able to successfully load 
> documents using full import DIH(tika entity). Here PDF's is maintained in 
> windows file system.
>
> But when jetty solr is run on linux machine, and try to run DIH, we
> are getting below exception: (Here PDF's are maintained in linux
> filesystem)
>
> Full Import failed:java.lang.RuntimeException: java.lang.RuntimeException: 
> org.apache.solr.handler.dataimport.DataImportHandlerException: Unable to read 
> content Processing Document # 1
> at 
> org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:271)
> at 
> org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:424)
> at 
> org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:483)
> at 
> org.apache.solr.handler.dataimport.DataImporter.lambda$runAsync$0(DataImporter.java:466)
> at java.lang.Thread.run(Thread.java:748)
> Caused by: java.lang.RuntimeException: 
> org.apache.solr.handler.dataimport.DataImportHandlerException: Unable to read 
> content Processing

RE: PDF extraction using Tika

2020-08-25 Thread Srinivas Kashyap
Thanks Phil,

I will modify it according to the need.

Thanks,
Srinivas

-Original Message-
From: Phil Scadden  
Sent: 26 August 2020 02:44
To: solr-user@lucene.apache.org
Subject: RE: PDF extraction using Tika

Code for solrj is going to be very dependent on your needs but the beating 
heart of my code is below ( note that I do OCR as separate step before feeding 
files into indexer). Solrj and tika docs should help.

File f = new File(filename);
 ContentHandler textHandler = new 
BodyContentHandler(Integer.MAX_VALUE);
 Metadata metadata = new Metadata();
 Parser parser = new AutoDetectParser();
 ParseContext context = new ParseContext();
 if (filename.toLowerCase().contains("pdf")) {
   PDFParserConfig pdfConfig = new PDFParserConfig();
   pdfConfig.setExtractInlineImages(false);
   pdfConfig.setOcrStrategy(PDFParserConfig.OCR_STRATEGY.NO_OCR);
   context.set(PDFParserConfig.class,pdfConfig);
   context.set(Parser.class,parser);
 }
 InputStream input = new FileInputStream(f);
 try {
   parser.parse(input, textHandler, metadata, context);
 } catch (Exception e) {
   
Logger.getLogger(JsMapAdminService.class.getName()).log(Level.SEVERE, 
null,String.format("File %s failed", f.getCanonicalPath()));
   e.printStackTrace();
   writeLog(String.format("File %s failed", f.getCanonicalPath()));
   return false;
  }
 SolrInputDocument up = new SolrInputDocument();
 if (title==null) title = metadata.get("title");
 if (author==null) author = metadata.get("author");
 up.addField("id",f.getCanonicalPath());
 up.addField("location",idString);
 up.addField("title",title);
 up.addField("author",author); etc for all your fields.
 String content = textHandler.toString();
 up.addField("_text_",content);
 UpdateRequest req = new UpdateRequest();
 req.add(up);
 req.setBasicAuthCredentials("solrAdmin", password);
 UpdateResponse ur =  req.process(solr,"prindex");
 req.commit(solr, "prindex");

-Original Message-
From: Srinivas Kashyap 
Sent: Tuesday, 25 August 2020 17:04
To: solr-user@lucene.apache.org
Subject: RE: PDF extraction using Tika

Hi Alexandre,

Yes, these are the same PDF files running in windows and linux. There are 
around 30 pdf files and I tried indexing single file, but faced same error. Is 
it related to how PDF stored in linux?

And with regard to DIH and TIKA going away, can you share if any program which 
extracts from PDF and pushes into solr?

Thanks,
Srinivas Kashyap

-Original Message-
From: Alexandre Rafalovitch 
Sent: 24 August 2020 20:54
To: solr-user 
Subject: Re: PDF extraction using Tika

The issue seems to be more with a specific file and at the level way below 
Solr's or possibly even Tika's:
Caused by: java.io.IOException: expected='>' actual='
' at offset 2383
at
org.apache.pdfbox.pdfparser.BaseParser.readExpectedChar(BaseParser.java:1045)

Are you indexing the same files on Windows and Linux? I am guessing not. I 
would try to narrow down which of the files it is. One way could be to get a 
standalone Tika (make sure to match the version Solr
embeds) and run it over the documents by itself. It will probably complain with 
the same error.

Regards,
   Alex.
P.s. Additionally, both DIH and Embedded Tika are not recommended for 
production. And both will be going away in future Solr versions. You may have a 
much less brittle pipeline if you save the structured outputs from those Tika 
standalone runs and then index them into Solr, possibly pre-processed.

On Mon, 24 Aug 2020 at 11:09, Srinivas Kashyap 
 wrote:
>
> Hello,
>
> We are using TikaEntityProcessor to extract the content out of PDF and make 
> the content searchable.
>
> When jetty is run on windows based machine, we are able to successfully load 
> documents using full import DIH(tika entity). Here PDF's is maintained in 
> windows file system.
>
> But when jetty solr is run on linux machine, and try to run DIH, we 
> are getting below exception: (Here PDF's are maintained in linux
> filesystem)
>
> Full Import failed:java.lang.RuntimeException: java.lang.RuntimeException: 
> org.apache.solr.handler.dataimport.DataImportHandlerException: Unable to read 
> content Processing Document # 1
> at 
> org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:271)
> at 
> org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:424)
> at 
> org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:483)
> at 
> org.apache.solr.handler.dataimport.DataImporter.lambda$runAsync$0(Data

About solr.HyphenatedWordsFilter

2020-08-25 Thread Kayak28
Hello, Solr community:

I would like to tokenize the following sentence.
I want tokens that retain the hyphens.
So, for example,
original text: This is a new abc-edg and xyz-abc is coming soon!
desired output tokens: this/is/a/new/abc-edg/and/xyz-abc/is/coming/soon/!

Is there any way to keep the hyphens in the tokens?

I thought HyphenatedWordsFilter had similar functionality, but it
gets rid of hyphens.

Any help will be appreciated.




-- 

Sincerely,
Kaya
github: https://github.com/28kayak


Re: About solr.HyphenatedWordsFilter

2020-08-25 Thread Shawn Heisey

On 8/26/2020 12:05 AM, Kayak28 wrote:
I would like to tokenize the following sentence. I want tokens that retain 
the hyphens. So, for example, original text: This is a new abc-edg and 
xyz-abc is coming soon! Desired output tokens: 
this/is/a/new/abc-edg/and/xyz-abc/is/coming/soon/! Is there any way to keep 
the hyphens in the tokens? I thought HyphenatedWordsFilter had similar 
functionality, but it gets rid of hyphens.


I doubt that filter is what you need.  It is fully described in Javadocs:

https://lucene.apache.org/core/8_6_0/analyzers-common/org/apache/lucene/analysis/miscellaneous/HyphenatedWordsFilter.html

Your requirement is a little odd.  Are you SURE that you want to 
preserve hyphens like that?


I think that you could probably achieve it with a carefully configured 
WordDelimiterGraphFilter.  This filter can be highly customized with its 
"types" parameter.  This parameter refers to a file in the conf 
directory that can change how the filter recognizes certain characters.  
I think that if you used the whitespace tokenizer along with the word 
delimiter filter, and put the following line into the file referenced by 
the "types" parameter, it would do most of what you're after:


- => ALPHA

What that config would do is cause the word delimiter filter to treat 
the hyphen as an alpha character -- so it will not use it as a 
delimiter.  One thing about the way it works -- the exclamation point at 
the end of your sentence would NOT be emitted as a token as you have 
described.  If that is critically important, and I cannot imagine that 
it would be, you're probably going to want to write your own custom 
filter.  That would be very much an expert option.
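
A small Lucene sketch of that chain, for testing outside Solr (a sketch,
assuming a conf/wdftypes.txt file containing the "- => ALPHA" line above; the
lowercase filter is an extra assumption added to match the lower-cased tokens
in the question):

import java.nio.file.Paths;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.custom.CustomAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class HyphenAnalysisDemo {
  public static void main(String[] args) throws Exception {
    Analyzer analyzer = CustomAnalyzer.builder(Paths.get("conf"))
        .withTokenizer("whitespace")
        .addTokenFilter("wordDelimiterGraph", "types", "wdftypes.txt")
        .addTokenFilter("lowercase")
        .build();
    try (TokenStream ts = analyzer.tokenStream("f",
        "This is a new abc-edg and xyz-abc is coming soon!")) {
      CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
      ts.reset();
      while (ts.incrementToken()) {
        // abc-edg and xyz-abc come through whole; the trailing "!" is still
        // dropped, as noted above
        System.out.println(term);
      }
      ts.end();
    }
  }
}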


Thanks,
Shawn