When I worked for a search engine vendor, we did exactly the same thing.
We always ran the document crackers in a different process because they tended
to hang, crash, run forever, or use all of memory. Adobe PDFlib was not an
exception to that rule.
wunder
Walter Underwood
Ultraseek Server (at
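That out-of-process pattern is easy to reproduce. A minimal Python sketch, offered as an illustration rather than anyone's actual implementation (the `pdftotext` command in the usage comment is just a stand-in for whatever document cracker you run):

```python
import subprocess

def run_isolated(cmd, timeout_s=60):
    """Run an extraction command in a child process. If it hangs, crashes,
    or runs forever, only the child dies; the caller keeps going."""
    try:
        proc = subprocess.run(cmd, capture_output=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return None  # runaway parser: subprocess.run kills the child for us
    if proc.returncode != 0:
        return None  # parser crashed or rejected the document
    return proc.stdout.decode("utf-8", errors="replace")

# e.g. run_isolated(["pdftotext", "some.pdf", "-"])  # hypothetical extractor
```

A memory cap would need an extra OS-level limit on the child (e.g. ulimit), which this sketch does not attempt.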
There are a bunch of variables. If there are too many merge threads going on,
for instance, then
the commit will block until one of the merge threads finishes. It could
well be that the one you identify as “slow” is coincidentally after the hard
commit, which
could accumulate for 10 minutes o
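For reference, the merge concurrency knobs live under `<indexConfig>` in solrconfig.xml. A sketch with illustrative values, not recommendations:

```xml
<indexConfig>
  <mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler">
    <!-- more merges allowed in flight means a commit is less likely to
         block waiting on one; keep maxMergeCount >= maxThreadCount -->
    <int name="maxMergeCount">6</int>
    <int name="maxThreadCount">3</int>
  </mergeScheduler>
</indexConfig>
```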
I found it better to offload PDF parsing and text extraction to a standalone
Tika Server instead. This way, if a PDF crashes the Tika Server, it will not
take down the JVM where your code is running.
You could easily have multiple instances of Tika Server running (perhaps on
another machine) and
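A minimal sketch of that call in Python, assuming a stock Tika Server on its default port (9998). PUT-ing the document body to `/tika` with `Accept: text/plain` is Tika Server's documented text-extraction endpoint:

```python
import urllib.request

TIKA_URL = "http://localhost:9998/tika"  # assumption: default tika-server port

def build_tika_request(doc_bytes, url=TIKA_URL):
    # Tika Server extracts plain text from any document PUT to /tika
    req = urllib.request.Request(url, data=doc_bytes, method="PUT")
    req.add_header("Accept", "text/plain")
    return req

def extract_via_tika(doc_bytes, url=TIKA_URL, timeout_s=60):
    try:
        with urllib.request.urlopen(build_tika_request(doc_bytes, url),
                                    timeout=timeout_s) as resp:
            return resp.read().decode("utf-8", errors="replace")
    except Exception:
        return None  # server crashed or timed out; our own process survives
```

If the server dies on a bad PDF, the client just gets a failed HTTP call, which is exactly the isolation being described.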
Thanks for your quick reply.
Commit is not called from client side.
We do not use any cache. Here is my solrconfig.xml :
https://drive.google.com/file/d/1LwA1d4OiMhQQv806tR0HbZoEjA8IyfdR/view
We set SOLR_OPTS=%SOLR_OPTS% -Dsolr.autoSoftCommit.maxTime=100 because we
want quick view after in
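For context, that system property normally feeds a placeholder in the stock solrconfig.xml:

```xml
<autoSoftCommit>
  <!-- -Dsolr.autoSoftCommit.maxTime=100 lands here; -1 disables it -->
  <maxTime>${solr.autoSoftCommit.maxTime:-1}</maxTime>
</autoSoftCommit>
```

Note that 100 ms means Solr may open a new searcher up to ten times a second, which is worth keeping in mind when commits look slow.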
Hello,
I opened a bug issue, not knowing that it was not the correct place to ask
the question at hand, so I was directed to send an e-mail to the mailing
list; hopefully I'm correct this time.
Here's a link to the issue opened:
https://issues.apache.org/jira/browse/SOLR-14780?page=com.atlassian.jira.pl
When I worked for a search engine vendor in my previous life, the PDF parsing
pipeline looked something like this:
Try parsing the PDF file with tool X
If failure or timeout, try instead with tool Y
If failure or timeout, try instead with tool Z
In this case X would be the preferred parser, but Y
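That retry ladder is only a few lines in any language. A hedged Python sketch — tool_x/tool_y/tool_z are placeholders, and each parser callable is expected to enforce its own timeout internally:

```python
def extract_with_fallback(path, parsers):
    """Try each parser in preference order; fall through on failure.
    Each entry is a callable taking a path and returning extracted text,
    or raising on failure."""
    for parse in parsers:
        try:
            text = parse(path)
            if text:          # treat empty output as a failure too
                return text
        except Exception:
            continue          # crash or timeout in this parser: try the next
    return None               # X, Y and Z all failed

# e.g. extract_with_fallback("doc.pdf", [tool_x, tool_y, tool_z])
```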
Followup regarding the bin/solr issue for anyone running Solr on FreeBSD.
The script uses "ps auxww | grep ..." in various places, like:
SOLR_PROC=`ps auxww |grep -w $SOLR_PID|grep start\.jar |grep jetty\.port`
For reasons unknown to me, FreeBSD's "ps auxww" truncates the COMMAND
column output a
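One possible workaround, offered as a sketch rather than a tested fix: ask ps for the full argument list of the one PID instead of grepping the global listing. FreeBSD's ps documents repeated -w (-ww) as removing the column-width limit:

```shell
# SOLR_PID would be Solr's PID inside bin/solr; $$ stands in so this runs anywhere.
SOLR_PID=${SOLR_PID:-$$}
# -ww disables ps's column-width truncation; "args=" prints the full,
# headerless command line for just that PID, so no global grep is needed.
SOLR_PROC=$(ps -ww -p "$SOLR_PID" -o args=)
```

The existing greps for start.jar and jetty.port can then run over that single line.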
It depends on how the commit is called. You have openSearcher=true, which means
the call
won’t return until all your autowarming is done. This _looks_ like it might be
a commit
called from a client, which you should not do.
It’s also suspicious that these are soft commits 1 second apart. The oth
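For reference, the usual pattern is to let hard commits handle durability without opening a searcher at all, so they never pay the autowarming cost (values illustrative):

```xml
<autoCommit>
  <maxTime>60000</maxTime>           <!-- flush to disk every minute -->
  <openSearcher>false</openSearcher> <!-- no new searcher, no autowarming -->
</autoCommit>
```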
Another option is to suggest from a copyField with a very simple analysis
chain. Say:
PatternReplaceCharFilterFactory - to remove everything you don’t want to keep.
WhitespaceTokenizerFactory
LowerCaseFilterFactory - maybe
And I think you miss Shawn’s point about the exclamation point. If you ju
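Spelled out as a schema fragment — field names and the keep-pattern are placeholders to adapt, not a recommended configuration:

```xml
<fieldType name="suggest_text" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- keep only letters, digits and "-"; everything else becomes a space -->
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="[^A-Za-z0-9\-]" replacement=" "/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<field name="suggest_field" type="suggest_text" indexed="true" stored="true"/>
<copyField source="product_id" dest="suggest_field"/>
```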
Hello, Shawn
Thank you for your response.
Yes. I am sure that I need to preserve "-" in the words.
What I want to do is not actually search, it is for a suggestion.
"abc-efg" is a dummy sample of our product ID.
So, there are several product IDs, such as abc-efg, abc-hij, abc-klm, and so
on.
When
Maybe to add to this: additionally, try to batch the requests from the queue -
don’t do it one by one, but take n items at the same time.
On the Solr side, also look at the configuration of soft commits vs. hard
commits: soft commits define how close to real time this is and can be.
You do not provide many details, but a queuing mechanism seems to be
appropriate for this use case.
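The take-n-at-a-time idea is a few lines around a standard in-process queue; a hedged Python sketch (batch size and timeout are arbitrary):

```python
import queue

def drain_batch(q, max_batch=100, timeout_s=1.0):
    """Pull up to max_batch queued documents in one go instead of
    indexing them one by one. Blocks for the first item only."""
    batch = [q.get(timeout=timeout_s)]    # wait for at least one doc
    while len(batch) < max_batch:
        try:
            batch.append(q.get_nowait())  # grab whatever else is ready now
        except queue.Empty:
            break
    return batch                          # send this to Solr as one update
```

At ~30 rows/second, each drain would typically carry a few dozen documents, which is far cheaper than 30 separate update requests.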
> Am 26.08.2020 um 11:30 schrieb Tushar Arora :
>
> Hi,
>
> One of our use cases requires real time indexing of data in solr from DB.
> Approximately, 30 rows are updated in a second in DB. And
Hi,
One of our use cases requires real time indexing of data in solr from DB.
Approximately, 30 rows are updated in a second in DB. And I also want these
to be updated in the index simultaneously.
Is the Queuing mechanism like Rabbitmq helpful in my case?
Please suggest the ways to achieve it.
Re
I am using solr 6.1.0. We have 2 shards and each has one replica.
When I checked the shard1 log, I found that the commit process was going too
slowly for some collections.
Slow commit:
2020-08-25 09:08:10.328 INFO (commitScheduler-124-thread-1) [c:forms s:shard1
r:core_node1 x:forms] o.a.s.u.DirectUpdateHa
Hi Joe,
Tika is pretty amazing at coping with the things people throw at it and
I know the team behind it have added a very extensive testing framework.
However, the reality is that malformed, huge or just plain crazy
documents may cause crashes - PDFs are mad, you can even embed
Javascript i
Hello,
Just noticed my numbering is off, should be:
1. Deploy a feature store from a JSON file to each collection.
2. Reload all collections as advised in the documentation:
https://lucene.apache.org/solr/guide/7_5/learning-to-rank.html#applying-changes
3. Deploy the related model from a JSON fil
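For anyone following along, steps 1-3 map onto the LTR REST endpoints roughly like this; collection name, base URL and JSON file names are placeholders:

```shell
deploy_ltr() {
  coll="$1"; solr="${2:-http://localhost:8983/solr}"
  # 1. deploy the feature store from a JSON file
  curl -XPUT "$solr/$coll/schema/feature-store" \
       -H 'Content-type:application/json' --data-binary @features.json
  # 2. reload so the new feature store is visible
  curl "$solr/admin/collections?action=RELOAD&name=$coll"
  # 3. deploy the model that references those features
  curl -XPUT "$solr/$coll/schema/model-store" \
       -H 'Content-type:application/json' --data-binary @model.json
}
# usage: deploy_ltr mycollection
```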
Hi Joe,
Yes, I had made these changes to get HDFS working with Solr. Below are the
config changes I carried out:
Changes in solr.in.cmd
set SOLR_OPTS=%SOLR_OPTS% -Dsolr.directoryFactory=HdfsDirectoryFactory
set SOLR_OPTS=%SOL
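On the solrconfig.xml side, the same factory can also be configured directly; a sketch with placeholder HDFS paths:

```xml
<directoryFactory name="DirectoryFactory" class="solr.HdfsDirectoryFactory">
  <str name="solr.hdfs.home">hdfs://namenode:8020/solr</str>
  <bool name="solr.hdfs.blockcache.enabled">true</bool>
</directoryFactory>
```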