Re: Limit Solr Disk IO

2020-06-07 Thread Anshuman Singh
Hi Eric,

Thanks for your reply!
I have one more question which I think you missed in my previous email.
*"When our core size becomes ~100 G, indexing becomes really slow. Why is
this happening? Do we need to put a limit on how large each core can grow?"*

This question is unrelated to segments. I think I missed setting the
context properly in my previous email.

We have a collection with 20 shards and rf 2. Basically we want to hold
500M documents in each shard. Depending on our avg doc size (~1KB), it will
grow up to 400G. Is this shard size feasible or should we split it?

On Sat, Jun 6, 2020 at 10:50 PM Erick Erickson 
wrote:

> New segments are created when
> 1> the RAMBufferSizeMB is exceeded
> or
> 2> a commit happens.
>
> The maximum segment size defaults to 5G, but TieredMergePolicy can be
> configured in solrconfig.xml to have larger max sizes by setting
> maxMergedSegmentMB
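
As a sketch only (the value is illustrative, not a recommendation), the setting Erick mentions would sit in the <indexConfig> section of solrconfig.xml roughly like this:

```xml
<indexConfig>
  <mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory">
    <!-- raise the ~5G default maximum merged segment size -->
    <double name="maxMergedSegmentMB">10000</double>
  </mergePolicyFactory>
</indexConfig>
```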
>
> Depending on your indexing rate, requiring commits every 100K records may
> be too frequent; I have no idea what your indexing rate is. In general I
> prefer a time based autocommit policy. Say, for some reason, you stop
> indexing after 50K records. They’ll never be searchable unless you have a
> time-based commit. Besides, it’s much easier to explain to users “it may
> take 60 seconds for your doc to be searchable” than “well, depending on the
> indexing rate, it may be between 10 seconds and 6 hours for your docs to be
> searchable”. Of course if you’re indexing at a very fast rate, that may not
> matter.
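
A time-based policy like the "60 seconds to be searchable" example above is typically expressed in solrconfig.xml along these lines (a sketch; the intervals are illustrative):

```xml
<autoCommit>
  <!-- hard commit: flushes segments to disk but does not open a new searcher -->
  <maxTime>60000</maxTime>
  <openSearcher>false</openSearcher>
</autoCommit>
<autoSoftCommit>
  <!-- soft commit: makes newly indexed documents visible within ~60 seconds -->
  <maxTime>60000</maxTime>
</autoSoftCommit>
```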
>
> There’s no such thing as “low disk read during segment merging”. If 5
> segments need to be read, they all must be read in their entirety and the
> new segment must be completely written out. At best you can try to cut down
> on the number of times segment merges happen, but from what you’re
> describing that may not be feasible.
>
> Attachments are aggressively stripped by the mail server, your graph did
> not come through.
>
> Once a segment grows to the max size (5g by default), it is not merged
> again unless and until it accumulates quite a number of deleted documents.
> So one question is whether you update existing documents frequently. Is
> that the case? If not, then the index size really shouldn’t matter and your
> problem is something else.
>
> And I sincerely hope that part of your indexing does _NOT_ include
> optimize/forcemerge or expungeDeletes. Those are very expensive operations,
> and prior to Solr 7.5 would leave your index in an awkward state, see:
> https://lucidworks.com/post/segment-merging-deleted-documents-optimize-may-bad/.
> There’s a link for how this is different in Solr 7.5+ in that article.
>
> But something smells fishy about this situation. Segment merging is
> typically not very noticeable. Perhaps you just have too much data on too
> small hardware? You’ve got some evidence that segment merging is the root
> cause, but I wonder if what’s happening is you’re just swapping instead?
> Segment merging will certainly increase the I/O pressure, but by and large
> that shouldn’t really affect search speed if the OS memory space is large
> enough to hold the important portions of your index. If the OS memory space
> isn’t large enough, the additional I/O pressure from merging may be enough
> to start your system swapping, which is A Bad Thing.
>
> See:
> https://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html
> for how Lucene uses MMapDirectory...
>
> Best,
> Erick
>
> > On Jun 6, 2020, at 11:29 AM, Anshuman Singh 
> wrote:
> >
> > Hi Eric,
> >
> > We are looking into TLOG/PULL replicas. But I have some doubts regarding
> segments. Can you explain what causes creation of a new segment and how
> large it can grow?
> > And this is my index config:
> > maxMergeAtOnce - 20
> > segmentsPerTier - 20
> > ramBufferSizeMB - 512 MB
> >
> > Can I configure these settings optimally for low disk read during
> segment merging? Like increasing segmentsPerTier may help but a large
> number of segments may impact search. And as per the documentation,
> ramBufferSizeMB can trigger segment merging so maybe that can be tweaked.
> >
> > One more question:
> > This graph represents indexing time with respect to core size (0-100G). Commits
> > were happening automatically every 100k records.
> >
> >
> >
> > As you can see the density of spikes is increasing as the core size is
> increasing. When our core size becomes ~100 G, indexing becomes really
> slow. Why is this happening? Do we need to put a limit on how large each
> core can grow?
> >
> >
> > On Fri, Jun 5, 2020 at 5:59 PM Erick Erickson 
> wrote:
> > Have you considered TLOG/PULL replicas rather than NRT replicas?
> > That way, all the indexing happens on a single machine and you can
> > use shards.preference to confine the searches to the PULL replicas,
> > see:  https://lucene.apache.org/solr/guide/7_7/distributed-requests.html
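
shards.preference is an ordinary request parameter, so routing searches to PULL replicas is just a matter of adding it to the request. A stdlib-only sketch of building such a URL (the host, collection, and helper name are illustrative; `replica.type:PULL` is the documented value for preferring PULL replicas):

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class PullPreference {
    // Build a /select request URL that asks Solr to prefer PULL replicas.
    static String buildUrl(String collectionBase, String q) {
        return collectionBase + "/select?q="
             + URLEncoder.encode(q, StandardCharsets.UTF_8)
             + "&shards.preference="
             + URLEncoder.encode("replica.type:PULL", StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(buildUrl("http://localhost:8983/solr/mycollection", "*:*"));
    }
}
```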
> >
> > No, you can’t really limit the number of segments. While that seems like
> a
> > good idea, it quickly becomes c

Re: Edismax query using different strings for different fields

2020-06-07 Thread David Zimmermann
Thanks for the support, Erick. Not using the “qf” parameter at all seems to give
me valid query results now. The query debug information:

"debug": {
  "rawquerystring": "claims_en:(An English sentence) description_en:(An English sentence) claims_de:(Ein Deutscher Satz) description_de:(Ein Deutscher Satz)",
  "querystring": "claims_en:(An English sentence) description_en:(An English sentence) claims_de:(Ein Deutscher Satz) description_de:(Ein Deutscher Satz)",
  "parsedquery": "+((claims_en:english claims_en:sentenc) (description_en:english description_en:sentenc) (claims_de:deutsch claims_de:satz) (description_de:deutsch description_de:satz))",
  "parsedquery_toString": "+((claims_en:english claims_en:sentenc) (description_en:english description_en:sentenc) (claims_de:deutsch claims_de:satz) (description_de:deutsch description_de:satz))"
}

But this way it now seems like the “tie” parameter has no impact anymore. The
fact that I wanted something between a sum and a max query was the original
reason why I intended to use an edismax query. Also, since I have full sentences
as queries, I thought it would be a good idea to use the phrase query feature at
a later stage.

If an edismax query is not the way to achieve my goal, do you see a proper way
to do this? The only alternative I see is running two separate edismax queries,
one for the English fields and one for the German fields, and then recombining
the results. But that way I don’t know whether the resulting scores are
comparable. Can I assume a score of 15 from the English edismax is better than
a score of 13 from the German edismax?

Best regards
David


On 5 Jun 2020, at 19:39, Erick Erickson wrote:

Let’s see the results of adding &debug=query to the query, in particular the 
parsed version.

Because what you’re reporting doesn’t really make sense. edismax should be 
totally
ignoring the “qf” parameter since you’re specifically qualifying all the 
clauses with
a field. Unless you’re not really enclosing the search text in parentheses (or 
quotes
if they should be phrases).

Also, if you’re willing to form separate clauses like this, there's no reason 
to even
use edismax since its purpose is to automatically distribute search terms over 
multiple
fields and you’re explicitly specifying the fields.

Best,
Erick

On Jun 5, 2020, at 10:10 AM, David Zimmermann wrote:

I could use some advice on how to handle a particular cross-language search
with Solr. I posted it on Stack Overflow two months ago, but could not find a
solution.
I have documents in 3 languages (English, German, French). For simplicity let's 
assume it's just two languages (English and German). The documents are 
standardised in the sense that they contain the same parts (text_part1 and 
text_part2), just the language they are written in is different. The language 
of the documents is known. In my index schema I use one core with different 
fields for each language.

For a German document the index will look something like this:

*   text_part1_en: empty
*   text_part2_en: empty
*   text_part1_de: German text
*   text_part2_de: Another German text

For an English document it will be the other way around.

What I want to achieve: a user entering a query in English should receive both
English and German documents that are relevant to their search. Further
conditions are:

*   I want results with hits in text_part1 and text_part2 to be higher ranked 
than results with hits only in one field (tie value > 0).
*   The queries will not be single words, but full sentences (stop word removal 
needed and partial hits [only a few words out of the sentences] must be valid).
*   English and German documents must output into one ranking. I need to be 
able to compare the relevance of an English document to the relevance of a 
German document.
*   The text parts need to stay separate; I want to boost the importance of
one part (let's say part1) over the other.

My general approach so far has been to get a German translation of the user's
query by sending it to a translation API. Then I want to use an edismax query,
since it seems to fulfill all of my requirements. The problem is that I cannot
manage to search for the German query in only the German fields and the English
query in only the English fields. The Solr edismax documentation states that it
supports the full Lucene query parser syntax, but I can't find a way to address
different fields with different inputs. I tried:

q=text_part1_en: (A sentence in English) text_part1_de: (Ein Satz auf Deutsch) 
text_part2_en: (A sentence in English) text_part2_de: (Ein Satz auf Deutsch)
qf=text_part1_en text_part2_en text_part1_de text_part2_de


This syntax should be in line with what MatsLindh wrote in this thread.
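
For what it's worth, a field-qualified q string like the one above can be assembled per language before it is handed to the edismax handler. A stdlib-only sketch using the field names from the earlier debug output (the class and helper names are made up for illustration):

```java
public class CrossLanguageQuery {
    // Assemble a q string that sends the English text to the *_en fields and
    // the German translation to the *_de fields, mirroring the
    // rawquerystring shown in the thread's debug output.
    static String buildQ(String english, String german) {
        return "claims_en:(" + english + ") description_en:(" + english + ") "
             + "claims_de:(" + german + ") description_de:(" + german + ")";
    }

    public static void main(String[] args) {
        System.out.println(buildQ("An English sentence", "Ein Deutscher Satz"));
        // With SolrJ this string would go into a SolrQuery with defType=edismax.
        // Note that "tie" only applies when edismax itself distributes terms
        // over qf, which explicitly field-qualified clauses bypass.
    }
}
```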

Re: Periodically 100% cpu and high load/IO

2020-06-07 Thread Phill Campbell
Can you switch to 8.5.2 and see if it still happens.
In my testing of 8.5.1 I had one of my machines get really hot and bring the 
entire system to a crawl.
What seemed to cause my issue was memory usage. I could give the JVM running 
Solr less heap and the problem wouldn’t manifest.
I haven’t seen it with 8.5.2. Just a thought.

> On Jun 3, 2020, at 8:27 AM, Marvin Bredal Lillehaug 
>  wrote:
> 
> Yes, there are light/moderate indexing most of the time.
> The setup has NRT replicas. And the shards are around 45GB each.
> Index merging has been the hypothesis for some time, but we haven't dared
> to activate info stream logging.
> 
> On Wed, Jun 3, 2020 at 2:34 PM Erick Erickson 
> wrote:
> 
>> One possibility is merging index segments. When this happens, are you
>> actively indexing? And are these NRT replicas or TLOG/PULL? If the latter,
>> are your TLOG leaders on the affected machines?
>> 
>> Best,
>> Erick
>> 
>>> On Jun 3, 2020, at 3:57 AM, Marvin Bredal Lillehaug <
>> marvin.lilleh...@gmail.com> wrote:
>>> 
>>> Hi,
>>> We have a cluster with five Solr(8.5.1, Java 11) nodes, and sometimes one
>>> or two nodes has Solr running with 100% cpu on all cores, «load» over
>> 400,
>>> and high IO. It usually lasts five to ten minutes, and the node is hardly
>>> responding.
>>> Does anyone have any experience with this type of behaviour? Is there any
>>> logging other than infostream that could give any information?
>>> 
>>> We managed to trigger a thread dump,
>>> 
>>>> java.base@11.0.6/java.nio.channels.spi.AbstractInterruptibleChannel.close(AbstractInterruptibleChannel.java:112)
>>>> org.apache.lucene.util.IOUtils.fsync(IOUtils.java:483)
>>>> org.apache.lucene.store.FSDirectory.fsync(FSDirectory.java:331)
>>>> org.apache.lucene.store.FSDirectory.sync(FSDirectory.java:286)
>>>> org.apache.lucene.store.NRTCachingDirectory.sync(NRTCachingDirectory.java:158)
>>>> org.apache.lucene.store.LockValidatingDirectoryWrapper.sync(LockValidatingDirectoryWrapper.java:68)
>>>> org.apache.lucene.index.IndexWriter.startCommit(IndexWriter.java:4805)
>>>> org.apache.lucene.index.IndexWriter.prepareCommitInternal(IndexWriter.java:3277)
>>>> org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:3445)
>>>> org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:3410)
>>>> org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:678)
>>>> org.apache.solr.cloud.RecoveryStrategy.doSyncOrReplicateRecovery(RecoveryStrategy.java:636)
>>>> org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:337)
>>>> org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:318)
>>> 
>>> 
>>> But not sure if this is from the incident or just right after. It seems
>>> strange that a fsync should behave like this.
>>> 
>>> Swappiness is set to default for RHEL 7 (Ops have resisted turning it
>> off)
>>> 
>>> --
>>> Kind regards,
>>> Marvin B. Lillehaug
>> 
>> 
> 
> -- 
> med vennlig hilsen,
> Marvin B. Lillehaug



Solr admin error message - where are relevant log files?

2020-06-07 Thread Jim Anderson
Hi,

I'm a newbie with Solr, and going through tutorials and trying to get Solr
working with Nutch.

Today, I started up Solr and then brought up Solr Admin at:

http://localhost:8983/solr/

The admin page comes up with:

SolrCore Initialization Failures

   - *{{core}}:* {{error}}

Please check your logs for more information


I look in my .../solr/server/logs directory and cannot find any meaningful
errors or warnings.


Should I be looking elsewhere?

Jim A.


Re: Solr admin error message - where are relevant log files?

2020-06-07 Thread Shawn Heisey

On 6/7/2020 10:16 AM, Jim Anderson wrote:

The admin pages comes up with:

SolrCore Initialization Failures





I look in my .../solr/server/logs directory and cannot find and meaningful
errors or warnings.

Should I be looking elsewhere?


That depends.  Did you install Solr with the installer script, or just 
start it up after extracting the archive?  Does the solr/server/logs 
directory you mentioned contain files with timestamps that are current? 
If not, then the logs are likely going somewhere else.


If you go to the "Logging" tab when the admin UI shows that error, you 
will be able to see any log messages at WARN or higher severity.  Often 
such log entries will need to be expanded by clicking on the little "i" 
icon.  It will close again quickly, so you need to read fast.


Thanks,
Shawn


Indexing PDF on SOLR 8.5

2020-06-07 Thread Fiz N
Hello SOLR Experts,

I am working on a POC to index millions of PDF documents present in
multiple folders on a fileshare.

Could you please let me know the best practices and steps to implement it.

Thanks
Fiz Nadiyal.


Re: Indexing PDF on SOLR 8.5

2020-06-07 Thread Jörn Franke
You have to write an external application that creates multiple threads, parses
the PDFs, and indexes them in Solr. Ideally you parse the PDFs once, store the
resulting text on some file system, and then index it. The reason is that if
you upgrade across two major versions of Solr you may need to reindex; then you
save time because you don’t need to parse the PDFs again.
It can also be useful in case you are not yet sure about the final schema and
need to index several times with different schemas, etc.

You can also use Apache manifoldCF.
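
Following that outline, a stdlib-only sketch of the driver (thread pool plus folder walk) might look like the following. The Tika and SolrJ calls are left as comments since they depend on the chosen schema, and all names here are illustrative, not from the thread:

```java
import java.nio.file.*;
import java.util.List;
import java.util.concurrent.*;
import java.util.stream.*;

public class PdfIndexPipeline {
    // Walk the fileshare and collect all PDF paths (stdlib only).
    static List<Path> findPdfs(Path root) throws java.io.IOException {
        try (Stream<Path> s = Files.walk(root)) {
            return s.filter(p -> p.toString().toLowerCase().endsWith(".pdf"))
                    .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws Exception {
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        ExecutorService pool = Executors.newFixedThreadPool(8);
        for (Path pdf : findPdfs(root)) {
            pool.submit(() -> {
                // 1. Parse with Tika and write the extracted text next to the
                //    PDF, so a future reindex can skip the expensive parse.
                // 2. Build a SolrInputDocument from that text and send it via
                //    a SolrJ client, batching adds and relying on server-side
                //    autoCommit rather than committing per document.
                System.out.println("would index: " + pdf);
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }
}
```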



> Am 07.06.2020 um 19:19 schrieb Fiz N :
> 
> Hello SOLR Experts,
> 
> I am working on a POC to Index millions of PDF documents present in
> Multiple Folder in fileshare.
> 
> Could you please let me the best practices and step to implement it.
> 
> Thanks
> Fiz Nadiyal.


Re: Periodically 100% cpu and high load/IO

2020-06-07 Thread Marvin Bredal Lillehaug
We have an upgrade to 8.5.2 on the way to production, so we'll see.

We are running with the default merge config, and based on the description at
https://lucene.apache.org/solr/guide/8_5/taking-solr-to-production.html#dynamic-defaults-for-concurrentmergescheduler
I don't understand why all CPUs are maxed.


On Sun, 7 Jun 2020, 16:59 Phill Campbell,  wrote:

> Can you switch to 8.5.2 and see if it still happens.
> In my testing of 8.5.1 I had one of my machines get really hot and bring
> the entire system to a crawl.
> What seemed to cause my issue was memory usage. I could give the JVM
> running Solr less heap and the problem wouldn’t manifest.
> I haven’t seen it with 8.5.2. Just a thought.
>
> > On Jun 3, 2020, at 8:27 AM, Marvin Bredal Lillehaug <
> marvin.lilleh...@gmail.com> wrote:
> >
> > Yes, there are light/moderate indexing most of the time.
> > The setup has NRT replicas. And the shards are around 45GB each.
> > Index merging has been the hypothesis for some time, but we haven't dared
> > to activate info stream logging.
> >
> > On Wed, Jun 3, 2020 at 2:34 PM Erick Erickson 
> > wrote:
> >
> >> One possibility is merging index segments. When this happens, are you
> >> actively indexing? And are these NRT replicas or TLOG/PULL? If the
> latter,
> >> are your TLOG leaders on the affected machines?
> >>
> >> Best,
> >> Erick
> >>
> >>> On Jun 3, 2020, at 3:57 AM, Marvin Bredal Lillehaug <
> >> marvin.lilleh...@gmail.com> wrote:
> >>>
> >>> Hi,
> >>> We have a cluster with five Solr(8.5.1, Java 11) nodes, and sometimes
> one
> >>> or two nodes has Solr running with 100% cpu on all cores, «load» over
> >> 400,
> >>> and high IO. It usually lasts five to ten minutes, and the node is
> hardly
> >>> responding.
> >>> Does anyone have any experience with this type of behaviour? Is there
> any
> >>> logging other than infostream that could give any information?
> >>>
> >>> We managed to trigger a thread dump,
> >>>
> >>>> java.base@11.0.6/java.nio.channels.spi.AbstractInterruptibleChannel.close(AbstractInterruptibleChannel.java:112)
> >>>> org.apache.lucene.util.IOUtils.fsync(IOUtils.java:483)
> >>>> org.apache.lucene.store.FSDirectory.fsync(FSDirectory.java:331)
> >>>> org.apache.lucene.store.FSDirectory.sync(FSDirectory.java:286)
> >>>> org.apache.lucene.store.NRTCachingDirectory.sync(NRTCachingDirectory.java:158)
> >>>> org.apache.lucene.store.LockValidatingDirectoryWrapper.sync(LockValidatingDirectoryWrapper.java:68)
> >>>> org.apache.lucene.index.IndexWriter.startCommit(IndexWriter.java:4805)
> >>>> org.apache.lucene.index.IndexWriter.prepareCommitInternal(IndexWriter.java:3277)
> >>>> org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:3445)
> >>>> org.apache.lucene.index.IndexWriter.commit(IndexWriter.java:3410)
> >>>> org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:678)
> >>>> org.apache.solr.cloud.RecoveryStrategy.doSyncOrReplicateRecovery(RecoveryStrategy.java:636)
> >>>> org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:337)
> >>>> org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:318)
> >>>
> >>>
> >>> But not sure if this is from the incident or just right after. It seems
> >>> strange that a fsync should behave like this.
> >>>
> >>> Swappiness is set to default for RHEL 7 (Ops have resisted turning it
> >> off)
> >>>
> >>> --
> >>> Kind regards,
> >>> Marvin B. Lillehaug
> >>
> >>
> >
> > --
> > med vennlig hilsen,
> > Marvin B. Lillehaug
>
>


Re: Solr admin error message - where are relevant log files?

2020-06-07 Thread Jim Anderson
 >>> Did you install Solr with the installer script

I was not aware that there is an install script. I will look for it, but if
you can point me to it, that will help.

>>> or just
>>> start it up after extracting the archive?

I extracted the files from a tar ball and did a bit of setting up. For
example, I created a core and modified my schema.xml file a bit.

>> Does the solr/server/logs
>> directory you mentioned contain files with timestamps that are current?

The log files were current.

>>> If you go to the "Logging" tab when the admin UI shows that error

I cannot go to the "Logging" tab. When the admin UI comes up, it shows the
error message and hangs with the cursor spinning.

Thanks for the input. Again, if you can provide the install script, that
will likely help. I'm going to go back and start with installing Solr again.

Jim



On Sun, Jun 7, 2020 at 1:09 PM Shawn Heisey  wrote:

> On 6/7/2020 10:16 AM, Jim Anderson wrote:
> > The admin pages comes up with:
> >
> > SolrCore Initialization Failures
>
> 
>
> > I look in my .../solr/server/logs directory and cannot find and
> meaningful
> > errors or warnings.
> >
> > Should I be looking elsewhere?
>
> That depends.  Did you install Solr with the installer script, or just
> start it up after extracting the archive?  Does the solr/server/logs
> directory you mentioned contain files with timestamps that are current?
> If not, then the logs are likely going somewhere else.
>
> If you go to the "Logging" tab when the admin UI shows that error, you
> will be able to see any log messages at WARN or higher severity.  Often
> such log entries will need to be expanded by clicking on the little "i"
> icon.  It will close again quickly, so you need to read fast.
>
> Thanks,
> Shawn
>


Re: Indexing PDF on SOLR 8.5

2020-06-07 Thread Erick Erickson
Here’s a skeletal SolrJ program using Tika as another alternative.

Best,
Erick

> On Jun 7, 2020, at 2:06 PM, Jörn Franke  wrote:
> 
> You have to write an external application that creates multiple threads, 
> parses the PDFs and index them in Solr. Ideally you parse the PDFs once and 
> store the resulting text on some file system and then index it. Reason is 
> that if you upgrade to two major versions of Solr you might need to reindex 
> again. Then you can save time because you don’t need to parse the PDFs again. 
> It can be also useful in case you are not sure yet about the final schema and 
> need to index several times in different schemas etc
> 
> You can also use Apache manifoldCF.
> 
> 
> 
>> Am 07.06.2020 um 19:19 schrieb Fiz N :
>> 
>> Hello SOLR Experts,
>> 
>> I am working on a POC to Index millions of PDF documents present in
>> Multiple Folder in fileshare.
>> 
>> Could you please let me the best practices and step to implement it.
>> 
>> Thanks
>> Fiz Nadiyal.



Re: Indexing PDF on SOLR 8.5

2020-06-07 Thread Fiz N
Thanks Jorn and Erick.

Hi Erick, looks like the skeletal SOLRJ program attachment is missing.

Thanks
Fiz

On Sun, Jun 7, 2020 at 12:20 PM Erick Erickson 
wrote:

> Here’s a skeletal SolrJ program using Tika as another alternative.
>
> Best,
> Erick
>
> > On Jun 7, 2020, at 2:06 PM, Jörn Franke  wrote:
> >
> > You have to write an external application that creates multiple threads,
> parses the PDFs and index them in Solr. Ideally you parse the PDFs once and
> store the resulting text on some file system and then index it. Reason is
> that if you upgrade to two major versions of Solr you might need to reindex
> again. Then you can save time because you don’t need to parse the PDFs
> again.
> > It can be also useful in case you are not sure yet about the final
> schema and need to index several times in different schemas etc
> >
> > You can also use Apache manifoldCF.
> >
> >
> >
> >> Am 07.06.2020 um 19:19 schrieb Fiz N :
> >>
> >> Hello SOLR Experts,
> >>
> >> I am working on a POC to Index millions of PDF documents present in
> >> Multiple Folder in fileshare.
> >>
> >> Could you please let me the best practices and step to implement it.
> >>
> >> Thanks
> >> Fiz Nadiyal.
>
>


Re: Solr admin error message - where are relevant log files?

2020-06-07 Thread Jan Høydahl
Try force reloading the admin page in your browser a few times. Or try another 
browser?

Jan Høydahl

> 7. jun. 2020 kl. 21:07 skrev Jim Anderson :
> 
> 
 Did you install Solr with the installer script
> 
> I was not aware that there is an install script. I will look for it, but if
> you can point me to it, that will help
> 
 or just
 start it up after extracting the archive?
> 
> I extracted the files from a tar ball and did a bit of setting up. For
> example, I created a core and modified my schema.xml file a bit.
> 
>>> Does the solr/server/logs
>>> directory you mentioned contain files with timestamps that are current?
> 
> The log files were current.
> 
 If you go to the "Logging" tab when the admin UI shows that error
> 
> I cannot go to the "Logging" tab. When the admin UI comes up, it shows the
> error message and hangs with the cursor spinning.
> 
> Thanks for the input. Again, if you can provide the install script, that
> will likely help. I'm going to go back and start with installing Solr again.
> 
> Jim
> 
> 
> 
>>> On Sun, Jun 7, 2020 at 1:09 PM Shawn Heisey  wrote:
>>> On 6/7/2020 10:16 AM, Jim Anderson wrote:
>>> The admin pages comes up with:
>>> SolrCore Initialization Failures
>> 
>>> I look in my .../solr/server/logs directory and cannot find and
>> meaningful
>>> errors or warnings.
>>> Should I be looking elsewhere?
>> That depends.  Did you install Solr with the installer script, or just
>> start it up after extracting the archive?  Does the solr/server/logs
>> directory you mentioned contain files with timestamps that are current?
>> If not, then the logs are likely going somewhere else.
>> If you go to the "Logging" tab when the admin UI shows that error, you
>> will be able to see any log messages at WARN or higher severity.  Often
>> such log entries will need to be expanded by clicking on the little "i"
>> icon.  It will close again quickly, so you need to read fast.
>> Thanks,
>> Shawn


Re: Solr admin error message - where are relevant log files?

2020-06-07 Thread Jim Anderson
An update.

I started over by removing my Solr 7.3.1 installation and untarring again.

Then went to the bin root directory and entered:

bin/solr -start

Next, I brought up the solr admin window and it still gives the same error
message and hangs up. As far as I can tell I am running solr straight out
of the box.

Jim

On Sun, Jun 7, 2020 at 3:07 PM Jim Anderson 
wrote:

> >>> Did you install Solr with the installer script
>
> I was not aware that there is an install script. I will look for it, but
> if you can point me to it, that will help
>
> >>> or just
> >>> start it up after extracting the archive?
>
> I extracted the files from a tar ball and did a bit of setting up. For
> example, I created a core and modified my schema.xml file a bit.
>
> >> Does the solr/server/logs
> >> directory you mentioned contain files with timestamps that are current?
>
> The log files were current.
>
> >>> If you go to the "Logging" tab when the admin UI shows that error
>
> I cannot go to the "Logging" tab. When the admin UI comes up, it shows the
> error message and hangs with the cursor spinning.
>
> Thanks for the input. Again, if you can provide the install script, that
> will likely help. I'm going to go back and start with installing Solr again.
>
> Jim
>
>
>
> On Sun, Jun 7, 2020 at 1:09 PM Shawn Heisey  wrote:
>
>> On 6/7/2020 10:16 AM, Jim Anderson wrote:
>> > The admin pages comes up with:
>> >
>> > SolrCore Initialization Failures
>>
>> 
>>
>> > I look in my .../solr/server/logs directory and cannot find and
>> meaningful
>> > errors or warnings.
>> >
>> > Should I be looking elsewhere?
>>
>> That depends.  Did you install Solr with the installer script, or just
>> start it up after extracting the archive?  Does the solr/server/logs
>> directory you mentioned contain files with timestamps that are current?
>> If not, then the logs are likely going somewhere else.
>>
>> If you go to the "Logging" tab when the admin UI shows that error, you
>> will be able to see any log messages at WARN or higher severity.  Often
>> such log entries will need to be expanded by clicking on the little "i"
>> icon.  It will close again quickly, so you need to read fast.
>>
>> Thanks,
>> Shawn
>>
>


Re: Solr admin error message - where are relevant log files?

2020-06-07 Thread Jim Anderson
@Jan

Thanks for the suggestion. I tried Opera instead of Firefox and it worked.
I will try clearing the cache in Firefox, restart it, and see if it works
there.

Jim

On Sun, Jun 7, 2020 at 3:28 PM Jim Anderson 
wrote:

> An update.
>
> I started over by removing my Solr 7.3.1 installation and untarring again.
>
> Then went to the bin root directory and entered:
>
> bin/solr -start
>
> Next, I brought up the solr admin window and it still gives the same error
> message and hangs up. As far as I can tell I am running solr straight out
> of the box.
>
> Jim
>
> On Sun, Jun 7, 2020 at 3:07 PM Jim Anderson 
> wrote:
>
>> >>> Did you install Solr with the installer script
>>
>> I was not aware that there is an install script. I will look for it, but
>> if you can point me to it, that will help
>>
>> >>> or just
>> >>> start it up after extracting the archive?
>>
>> I extracted the files from a tar ball and did a bit of setting up. For
>> example, I created a core and modified my schema.xml file a bit.
>>
>> >> Does the solr/server/logs
>> >> directory you mentioned contain files with timestamps that are
>> current?
>>
>> The log files were current.
>>
>> >>> If you go to the "Logging" tab when the admin UI shows that error
>>
>> I cannot go to the "Logging" tab. When the admin UI comes up, it shows
>> the error message and hangs with the cursor spinning.
>>
>> Thanks for the input. Again, if you can provide the install script, that
>> will likely help. I'm going to go back and start with installing Solr again.
>>
>> Jim
>>
>>
>>
>> On Sun, Jun 7, 2020 at 1:09 PM Shawn Heisey  wrote:
>>
>>> On 6/7/2020 10:16 AM, Jim Anderson wrote:
>>> > The admin pages comes up with:
>>> >
>>> > SolrCore Initialization Failures
>>>
>>> 
>>>
>>> > I look in my .../solr/server/logs directory and cannot find and
>>> meaningful
>>> > errors or warnings.
>>> >
>>> > Should I be looking elsewhere?
>>>
>>> That depends.  Did you install Solr with the installer script, or just
>>> start it up after extracting the archive?  Does the solr/server/logs
>>> directory you mentioned contain files with timestamps that are current?
>>> If not, then the logs are likely going somewhere else.
>>>
>>> If you go to the "Logging" tab when the admin UI shows that error, you
>>> will be able to see any log messages at WARN or higher severity.  Often
>>> such log entries will need to be expanded by clicking on the little "i"
>>> icon.  It will close again quickly, so you need to read fast.
>>>
>>> Thanks,
>>> Shawn
>>>
>>


Re: Solr admin error message - where are relevant log files?

2020-06-07 Thread Jim Anderson
I cleared the Firefox cache and restarted and things are working ok now.

Jim

On Sun, Jun 7, 2020 at 3:44 PM Jim Anderson 
wrote:

> @Jan
>
> Thanks for the suggestion. I tried opera instead of firefox and it worked.
> I will try cleaner the cache on firefox, restart it and see if it works
> there.
>
> Jim
>
> On Sun, Jun 7, 2020 at 3:28 PM Jim Anderson 
> wrote:
>
>> An update.
>>
>> I started over by removing my Solr 7.3.1 installation and untarring again.
>>
>> Then went to the bin root directory and entered:
>>
>> bin/solr -start
>>
>> Next, I brought up the solr admin window and it still gives the same
>> error message and hangs up. As far as I can tell I am running solr straight
>> out of the box.
>>
>> Jim
>>
>> On Sun, Jun 7, 2020 at 3:07 PM Jim Anderson 
>> wrote:
>>
>>> >>> Did you install Solr with the installer script
>>>
>>> I was not aware that there is an install script. I will look for it, but
>>> if you can point me to it, that will help
>>>
>>> >>> or just
>>> >>> start it up after extracting the archive?
>>>
>>> I extracted the files from a tar ball and did a bit of setting up. For
>>> example, I created a core and modified my schema.xml file a bit.
>>>
>>> >> Does the solr/server/logs
>>> >> directory you mentioned contain files with timestamps that are
>>> current?
>>>
>>> The log files were current.
>>>
>>> >>> If you go to the "Logging" tab when the admin UI shows that error
>>>
>>> I cannot go to the "Logging" tab. When the admin UI comes up, it shows
>>> the error message and hangs with the cursor spinning.
>>>
>>> Thanks for the input. Again, if you can provide the install script, that
>>> will likely help. I'm going to go back and start with installing Solr again.
>>>
>>> Jim
>>>
>>>
>>>
>>> On Sun, Jun 7, 2020 at 1:09 PM Shawn Heisey  wrote:
>>>
 On 6/7/2020 10:16 AM, Jim Anderson wrote:
 > The admin page comes up with:
 >
 > SolrCore Initialization Failures

 

 > I look in my .../solr/server/logs directory and cannot find any
 > meaningful errors or warnings.
 >
 > Should I be looking elsewhere?

 That depends.  Did you install Solr with the installer script, or just
 start it up after extracting the archive?  Does the solr/server/logs
 directory you mentioned contain files with timestamps that are current?
 If not, then the logs are likely going somewhere else.

 If you go to the "Logging" tab when the admin UI shows that error, you
 will be able to see any log messages at WARN or higher severity.  Often
 such log entries will need to be expanded by clicking on the little "i"
 icon.  It will close again quickly, so you need to read fast.

 Thanks,
 Shawn

>>>
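Shawn's log-hunting advice above can also be scripted outside the admin UI. A minimal sketch, assuming Solr was started straight from the extracted archive with `bin/solr start` so the logs sit under `server/logs`; a service-script install usually writes elsewhere (e.g. `/var/solr/logs`), so adjust `log_dir` accordingly:

```python
import pathlib
import re

def recent_errors(log_dir="server/logs", pattern=r"ERROR|SolrCore", limit=20):
    """Return the last `limit` log lines matching `pattern`.

    Assumption: logs live under server/logs relative to where Solr was
    started, which holds for an extracted-archive install.
    """
    hits = []
    # solr.log* also picks up rotated files (solr.log.1, ...)
    for log in sorted(pathlib.Path(log_dir).glob("solr.log*")):
        for line in log.read_text(errors="replace").splitlines():
            if re.search(pattern, line):
                hits.append(line)
    return hits[-limit:]
```

If this returns nothing while the admin UI still shows "SolrCore Initialization Failures", the logs are almost certainly going to a different directory, as Shawn notes.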


Re: Indexing PDF on SOLR 8.5

2020-06-07 Thread Erick Erickson
https://lucidworks.com/post/indexing-with-solrj/


> On Jun 7, 2020, at 3:22 PM, Fiz N  wrote:
> 
> Thanks Jorn and Erick.
> 
> Hi Erick, looks like the skeletal SOLRJ program attachment is missing.
> 
> Thanks
> Fiz
> 
> On Sun, Jun 7, 2020 at 12:20 PM Erick Erickson 
> wrote:
> 
>> Here’s a skeletal SolrJ program using Tika as another alternative.
>> 
>> Best,
>> Erick
>> 
>>> On Jun 7, 2020, at 2:06 PM, Jörn Franke  wrote:
>>> 
>>> You have to write an external application that creates multiple threads,
>> parses the PDFs, and indexes them in Solr. Ideally you parse the PDFs once
>> and store the resulting text on a file system, then index it. The reason
>> is that if you upgrade across two major versions of Solr you may need to
>> reindex, and then you save time because you don’t need to parse the PDFs
>> again.
>>> It can also be useful if you are not yet sure about the final
>> schema and need to index several times into different schemas.
>>> 
>>> You can also use Apache manifoldCF.
>>> 
>>> 
>>> 
 Am 07.06.2020 um 19:19 schrieb Fiz N :
 
 Hello SOLR Experts,
 
 I am working on a POC to index millions of PDF documents present in
 multiple folders on a fileshare.
 
 Could you please let me know the best practices and steps to implement it.
 
 Thanks
 Fiz Nadiyal.
>> 
>> 
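The parse-once-then-index pipeline Jörn describes can be sketched roughly as below. This is a standalone illustration, not Erick's skeletal program: the text extraction is stubbed (a real pipeline would call Tika there) and the Solr update call is left as a comment, so every function name here is illustrative.

```python
import concurrent.futures
import pathlib

def extract_text(pdf_path: pathlib.Path) -> str:
    # Stub: a real pipeline would run Tika here. Stubbed so the sketch
    # runs without external dependencies.
    return pdf_path.stem

def parse_once(pdf_path: pathlib.Path, text_dir: pathlib.Path) -> pathlib.Path:
    """Cache the extracted text so a later reindex (e.g. after upgrading
    across two major Solr versions) skips the expensive PDF parsing."""
    out = text_dir / (pdf_path.stem + ".txt")
    if not out.exists():  # reuse cached text on a reindex
        out.write_text(extract_text(pdf_path))
    return out

def index_folder(pdf_dir: pathlib.Path, text_dir: pathlib.Path, workers: int = 8):
    pdfs = sorted(pdf_dir.glob("**/*.pdf"))
    # Parse in multiple threads, as Jörn suggests
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
        text_files = list(pool.map(lambda p: parse_once(p, text_dir), pdfs))
    docs = [{"id": t.stem, "text": t.read_text()} for t in text_files]
    # A real run would POST `docs` as JSON to
    # http://host:8983/solr/<collection>/update and then commit;
    # returned here so the sketch stays self-contained.
    return docs
```

The key design point is the intermediate text cache: parsing is the expensive, Solr-version-independent step, so keeping its output lets you reindex into a new schema or new Solr version without touching the PDFs again.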



Re: Indexing PDF on SOLR 8.5

2020-06-07 Thread Fiz N
Thanks Erick...

On Sun, Jun 7, 2020 at 1:50 PM Erick Erickson 
wrote:

> https://lucidworks.com/post/indexing-with-solrj/
>
>
> > On Jun 7, 2020, at 3:22 PM, Fiz N  wrote:
> >
> > Thanks Jorn and Erick.
> >
> > Hi Erick, looks like the skeletal SOLRJ program attachment is missing.
> >
> > Thanks
> > Fiz
> >
> > On Sun, Jun 7, 2020 at 12:20 PM Erick Erickson 
> > wrote:
> >
> >> Here’s a skeletal SolrJ program using Tika as another alternative.
> >>
> >> Best,
> >> Erick
> >>
> >>> On Jun 7, 2020, at 2:06 PM, Jörn Franke  wrote:
> >>>
> >>> You have to write an external application that creates multiple threads,
> >> parses the PDFs, and indexes them in Solr. Ideally you parse the PDFs once
> >> and store the resulting text on a file system, then index it. The
> >> reason is that if you upgrade across two major versions of Solr you may
> >> need to reindex, and then you save time because you don’t need to parse
> >> the PDFs again.
> >>> It can also be useful if you are not yet sure about the final
> >> schema and need to index several times into different schemas.
> >>>
> >>> You can also use Apache manifoldCF.
> >>>
> >>>
> >>>
>  Am 07.06.2020 um 19:19 schrieb Fiz N :
> 
>  Hello SOLR Experts,
> 
>  I am working on a POC to index millions of PDF documents present in
>  multiple folders on a fileshare.
> 
>  Could you please let me know the best practices and steps to implement it.
> 
>  Thanks
>  Fiz Nadiyal.
> >>
> >>
>
>


Re: Highlighting values of non stored fields

2020-06-07 Thread mosh bla


Thanks Erick for the reply. Your answer is exactly what I was expecting from 
the highlight component, but it seems like I am getting different behaviour.
I'll try to give a simple example and I hope you can explain where my 
mistake is.
Say I have the following fields configuration:
[field definitions for doc_text and doc_text_lw were stripped by the mail
archive]
 
And I indexed the following document:
{
"doc_text": "MOSH"
}
 
When executing the following query 
"http://.../select?q=doc_text_lw:mosh&hl=true&hl.fl=doc_text", the document is 
matched and returned in the response, but the highlighted fragment is empty.
I also tried changing the 'hl.method' param to 'unified' and 'fastVector', but 
no luck either. My conclusion was that the 'hl.fl' param should be set to 
'doc_text_lw' and that field must also be stored...
 
 
 

Sent: Tuesday, June 02, 2020 at 3:15 PM
From: "Erick Erickson" 
To: solr-user@lucene.apache.org
Subject: Re: Highlighting values of non stored fields
Why do you think even variants need to be stored/highlighted? Usually
when you store variants for ranking purposes those extra copies are
invisible to the user. So most often people store exactly one copy
of a particular field and highlight _that_ field in the return.

So say my field is f1 and I have indexed f1_1, f1_2, f1_3. I just store
f1_1 and return the highlighted text from that one.

You could even just store the data only once in a field that’s never
indexed and return/highlight that if you wanted.

Best,
Erick
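
The pattern Erick describes can be sketched in schema.xml roughly as follows. Field and type names reuse mosheB's example below and are illustrative, not a tested configuration:

```xml
<!-- One stored copy for display/highlighting; variants are index-only -->
<field name="doc_text" type="text_general" indexed="true" stored="true"/>
<field name="doc_text_bigrams" type="text_bigrams" indexed="true" stored="false"/>
<field name="doc_text_phrases" type="text_phrases" indexed="true" stored="false"/>
<copyField source="doc_text" dest="doc_text_bigrams"/>
<copyField source="doc_text" dest="doc_text_phrases"/>
```

The query then highlights only the stored field, e.g. `...&qf=doc_text^2 doc_text_bigrams^3&hl=on&hl.fl=doc_text`. Since `hl.requireFieldMatch` defaults to false, the highlighter builds fragments from doc_text's stored value even when the query matched one of the variant fields.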

> On Jun 2, 2020, at 3:24 AM, mosheB  wrote:
>
> Our use case is as follow:
> We are indexing free text documents. Each document contains metadata fields
> (such as author, creation date...) which are kinda small, and one "big"
> field that holds the document's text itself.
>
> For ranking purposes each field is indexed in more than one "variation" and
> query is executed with edismax query parser. Things are working alright, but
> now a new feature is requested by the customer - highlighting.
> To enable highlighting every field must be stored, including all variations
> of the big text field. This pushes our storage to the limit (and probably
> the document cache...) and feels a bit redundant, as the stored value is
> duplicated n times... Is there any way to “reference” stored value from one
> field to another?
> For example:
> Say we have the following config:
> [field and copyField definitions stripped by the mail archive: doc_text
> (stored) plus the non-stored variants doc_text_bigrams and
> doc_text_phrases]
>
> And we execute the following query:
> http://.../select?defType=edismax&q=desired_terms&qf=doc_text^2
> doc_text_bigrams^3
> doc_text_phrases^4&hl=on&hl.fl=doc_text,doc_text_bigrams,doc_text_phrases
>
> Highlight fragments in response will be blank if match occurred on the
> non-stored fields (doc_text_bigrams or doc_text_phrases). Is it possible to
> pass extra parameter to the highlight component, to point it to the stored
> data of the “original” doc_text field? a kind of “stored value reference
> field”?
>
> Thanks in advance.
>
>
>
> --
> Sent from: https://lucene.472066.n3.nabble.com/Solr-User-f472068.html