RE: How to restore deleted collection from filesystem

2020-05-26 Thread Kommu, Vinodh K.
Thanks Eric.

We were able to successfully restore the deleted collection data as suggested. In 
fact, we tried both approaches below and both worked fine:

1) Create collection with same number of shards and replication factor = 1
2) Create collection with same number of shards and same replication factor as 
deleted collection.

As we create collections using the rule-based replica placement method, with the 
first approach it is a little difficult to determine on which node the replicas 
should be added manually. With the 2nd approach, as the replicas are already 
created, we just copied the shard1 leader's index files from the restored data to 
all corresponding shard1 replicas' index directories in the newly created 
collection. Once the copy was done, we brought up the Solr nodes and everything 
worked fine.


Thanks & Regards,
Vinodh

-Original Message-
From: Erick Erickson  
Sent: Thursday, May 21, 2020 11:09 PM
To: solr-user@lucene.apache.org
Subject: Re: How to restore deleted collection from filesystem


See inline.

> On May 21, 2020, at 10:13 AM, Kommu, Vinodh K.  wrote:
>
> Thanks Eric for quick response.
>
> Yes, our VMs are equipped with NetBackup which is like file based backup and 
> it can restore any files or directories that were deleted from latest 
> available full backup.
>
> Can we create an empty collection with the same name as the one which was deleted, 
> with the same number of shards & replicas, and copy the content from the restored 
> core to the corresponding core?

Kind of. It is NOT necessary that it has the same name. There is no need at all 
(and I do NOT recommend) that you create the same number of replicas to start. 
As I said earlier, create a single-replica (i.e. leader-only) collection with 
the same number of shards. Copy _one_ data dir (not everything under the core) to 
that _one_ corresponding replica. It doesn't matter which replica you copy from.

> I mean, copy all contents (directories & files) under 
> Oldcollection_shard1_replica1 core from old collection to corresponding 
> Newcollection_shard1_replica1 core in the new collection. Would this approach 
> work?
>

As above, do not do this. Just copy the data dir from one of your backup copies 
to the leader-only replica. It doesn’t matter at all if the replica names are 
the same. The only thing that matters is that the shard number is identical. 
For instance, copy blah/blah/collection1_shard1_replica_57/data to 
blah/blah/collection1_shard1_replica_1/data if you want.

Once you have a one-replica collection with the data in it and you’ve done a 
bit of verification, use ADDREPLICA to build it out.
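
If you script it, the two Solr-side steps are just Collections API calls. A rough 
SolrJ sketch (collection name, configset, shard count and ZooKeeper addresses below 
are placeholders, not your actual values):

import java.util.Arrays;
import java.util.Optional;

import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.request.CollectionAdminRequest;

public class LeaderOnlyRestore {
  public static void main(String[] args) throws Exception {
    try (CloudSolrClient client = new CloudSolrClient.Builder(
        Arrays.asList("zk1:2181", "zk2:2181", "zk3:2181"), Optional.empty()).build()) {

      // Step 1: leader-only collection, same number of shards, replicationFactor = 1
      CollectionAdminRequest.createCollection("restored_collection", "my_configset", 4, 1)
          .process(client);

      // (stop the Solr nodes, copy each backed-up data dir onto the matching shard's
      //  single replica, restart, and verify before going any further)

      // Step 2: build the collection back out, a few replicas at a time
      CollectionAdminRequest.addReplicaToShard("restored_collection", "shard1").process(client);
      CollectionAdminRequest.addReplicaToShard("restored_collection", "shard2").process(client);
    }
  }
}

The data-dir copy in between is still a manual filesystem step done while the nodes 
are down.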

> Lastly anything needs to be aware in core.properties in newly created 
> collection or any reference pointing to new collection specific?

Do not copy or touch  core.properties, you can mess this up thoroughly by 
hand-editing. The _only_ thing you copy is the data directory, which will 
contain a tlog and index directory. And, the tlog isn’t even necessary.

Best,
Erick

>
>
> Thanks & Regards,
> Vinodh
>
> -Original Message-
> From: Erick Erickson 
> Sent: Thursday, May 21, 2020 6:17 PM
> To: solr-user@lucene.apache.org
> Subject: Re: How to restore deleted collection from filesystem
>
>
> So what I’m reading here is that you have the _data_ saved somewhere, right? 
> By “data” I just mean the data directories under the replica.
>
> 1> Go ahead and recreate the collection. It _must_ have the same number of 
> shards. Make it leader-only, i.e. replicationFactor == 1
> 2> The collection will be empty, now shut down the Solr instances hosting any 
> of the replicas.
> 3> Replace the data directory under each replica with the corresponding one 
> from the backup. “Corresponding” means from the same shard, which should be 
> obvious from the replica name.
> 4> Start your Solr instances back up and verify it’s as you expect.
> 5> Use ADDREPLICA to build out your collection to have as many replicas of 
> each shard as you require. NOTE: I’d do this gradually, maybe 2-3 at a time 
> then wait for them to become active before adding more. The point here is 
> that each ADDREPLICA will cause the entire index to be copied down from the leader 
> and with that many documents you don’t want to saturate your network.
>
> Best,
> Erick
>
>> On May 21, 2020, at 8:17 AM, Kommu, Vinodh K.  wrote:
>>
>> Hi,
>>
>> One of our largest collections, which holds 3.2 billion docs, was deleted 
>> accidentally in a QA environment. Unfortunately we don't have a recent Solr 
>> backup of this collection to restore from either. The only option left for us is 
>> to restore the deleted replica directories under the data directory using the 
>> NetBackup restore process.
>>
>> We haven't done a restore this way before, so the following things are not clear:
>>
>> 1. As the collection was deleted (not created yet), if the necessary replica 

Re: Indexing huge data onto solr

2020-05-26 Thread Erick Erickson
It Depends (tm). Often, you can create a single (albeit, perhaps complex)
SQL query that does this for you and just process the response.

I’ve also seen situations where it’s possible to hold one of the tables 
in memory on the client and just use that rather than a separate query.

It depends on the characteristics of your particular database, your DBA
could probably help.
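
To make the "hold a table in memory" idea concrete, here's a rough SolrJ sketch; 
the JDBC URL, table, column and field names are invented for illustration, so adapt 
them to your schema:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class JdbcBulkLoad {
  public static void main(String[] args) throws Exception {
    try (Connection db = DriverManager.getConnection("jdbc:<your connection string>", "user", "pass");
         CloudSolrClient solr = new CloudSolrClient.Builder(
             Arrays.asList("zk1:2181"), Optional.empty()).build()) {

      // Cache the smaller child table in memory, keyed by the join column.
      Map<String, List<String>> childByParent = new HashMap<>();
      try (Statement st = db.createStatement();
           ResultSet rs = st.executeQuery("SELECT parent_id, child_value FROM child_table")) {
        while (rs.next()) {
          childByParent.computeIfAbsent(rs.getString(1), k -> new ArrayList<>()).add(rs.getString(2));
        }
      }

      // Stream the big parent table once and send documents in batches.
      List<SolrInputDocument> batch = new ArrayList<>();
      try (Statement st = db.createStatement();
           ResultSet rs = st.executeQuery("SELECT id, title FROM parent_table")) {
        while (rs.next()) {
          SolrInputDocument doc = new SolrInputDocument();
          String id = rs.getString("id");
          doc.addField("id", id);
          doc.addField("title", rs.getString("title"));
          // attach the cached child values as a multi-valued field
          for (String v : childByParent.getOrDefault(id, new ArrayList<>())) {
            doc.addField("child_value", v);
          }
          batch.add(doc);
          if (batch.size() >= 1000) {          // batch size worth experimenting with
            solr.add("my_collection", batch);
            batch = new ArrayList<>();
          }
        }
      }
      if (!batch.isEmpty()) {
        solr.add("my_collection", batch);      // send the last, partial batch
      }
      solr.commit("my_collection");
    }
  }
}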

Best,
Erick

> On May 25, 2020, at 11:56 PM, Srinivas Kashyap 
>  wrote:
> 
> Hi Erick,
> 
> Thanks for the below response. The link which you provided holds good if you 
> have a single entity where you can join the tables and index it. But in our 
> scenario, we have nested entities joining different tables as shown below:
> 
> db-data-config.xml:
> 
> 
> 
> (table 1 join table 2)
> (table 3 join table 4)
> (table 5 join table 6)
> (table 7 join table 8)
> 
> 
> 
> Do you have any recommendations for running multiple SQLs and assembling the results 
> into a single Solr document that can be sent over SolrJ for indexing?
> 
> Say the parent entity has 100 documents; should I iterate over each of the parent 
> tuples and execute the child entity SQLs (with a where condition on the parent) to 
> create one Solr document? Won't that put more load on the database by executing more 
> SQLs? Is there an optimal solution?
> 
> Thanks,
> Srinivas
> From: Erick Erickson 
> Sent: 22 May 2020 22:52
> To: solr-user@lucene.apache.org
> Subject: Re: Indexing huge data onto solr
> 
> You have a lot more control over the speed and form of importing data if
> you just do the initial load in SolrJ. Here’s an example, taking the Tika
> parts out is easy:
> 
> https://lucidworks.com/post/indexing-with-solrj/
> 
> It’s especially instructive to comment out just the call to 
> CloudSolrClient.add(doclist…); If
> that _still_ takes a long time, then your DB query is the root of the 
> problem. Even with 100M
> records, I’d be really surprised if Solr is the bottleneck, but the above 
> test will tell you
> where to go to try to speed things up.
> 
> Best,
> Erick
> 
>> On May 22, 2020, at 12:39 PM, Srinivas Kashyap 
>> mailto:srini...@bamboorose.com.INVALID>> 
>> wrote:
>> 
>> Hi All,
>> 
>> We are running Solr 8.4.1. We have a database table which has more than 100 
>> million records. Till now we were using DIH to do a full-import on the 
>> tables. But for this table, a full-import via DIH takes more 
>> than 3-4 days to complete and also consumes a fair bit of JVM memory while 
>> running.
>> 
>> Are there any speedier/alternate ways to load data onto this Solr core?
>> 
>> P.S.: Only the initial data import is a problem; further updates/additions to this 
>> core are being done through SolrJ.
>> 
>> Thanks,
>> Srinivas
>> 



Re: Solr Deletes

2020-05-26 Thread Emir Arnautović
Hi Dwane,
DBQ does not play well with concurrent updates - it’ll block updates on 
replicas causing replicas to fall behind, trigger full replication and 
potentially OOM. My advice is to go with cursors (or even better use some DB as 
source of IDs) and DBID with some batching. You’ll need some tests to see which 
batch size is best in your case.
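
If it helps, a minimal SolrJ sketch of the cursor plus batched delete-by-id approach 
(this assumes "id" is your uniqueKey; the query, collection and batch size are only 
examples):

import java.util.ArrayList;
import java.util.List;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.request.UpdateRequest;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.params.CursorMarkParams;

public class CursorDelete {
  // Walks the matching IDs with a cursor and deletes them in batches of `rows`.
  static void deleteMatching(SolrClient solr, String collection, String query, int rows) throws Exception {
    SolrQuery q = new SolrQuery(query);
    q.setFields("id");
    q.setRows(rows);
    q.setSort("id", SolrQuery.ORDER.asc);        // cursors require a sort on the unique key

    String cursor = CursorMarkParams.CURSOR_MARK_START;
    while (true) {
      q.set(CursorMarkParams.CURSOR_MARK_PARAM, cursor);
      QueryResponse rsp = solr.query(collection, q);

      List<String> ids = new ArrayList<>();
      for (SolrDocument d : rsp.getResults()) {
        ids.add((String) d.getFieldValue("id"));
      }
      if (!ids.isEmpty()) {
        new UpdateRequest().deleteById(ids).process(solr, collection);
      }

      String next = rsp.getNextCursorMark();
      if (next.equals(cursor)) {
        break;                                    // cursor stopped moving: nothing left
      }
      cursor = next;
    }
    solr.commit(collection);
  }
}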

HTH,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 26 May 2020, at 01:48, Dwane Hall  wrote:
> 
> Hey Solr users,
> 
> 
> 
> I'd really appreciate some community advice if somebody can spare some time 
> to assist me.  My question relates to initially deleting a large amount of 
> unwanted data from a Solr Cloud collection, and then advice on best patterns 
> for managing delete operations on a regular basis.   We have a situation 
> where data in our index can be 're-mastered' and as a result orphan records 
> are left dormant and unneeded in the index (think of a scenario similar to 
> client resolution where an entity can switch between golden records depending 
> on the information available at the time).  I'm considering removing these 
> dormant records with a large initial bulk delete, and then running a delete 
> process on a regular maintenance basis.  The initial record backlog is 
> ~50million records in a ~1.2billion document index (~4%) and the maintenance 
> deletes are small in comparison ~20,000/week.
> 
> 
> 
> So with this scenario in mind I'm wondering what my best approach is for the 
> initial bulk delete:
> 
>  1.  Do nothing with the initial backlog and remove the unwanted documents 
> during the next large reindexing process?
>  2.  Delete by query (DBQ) with a specific delete query using the document 
> id's?
>  3.  Delete by id (DBID)?
> 
> Are there any significant performance advantages to using DBID over a 
> specific DBQ? Should I break the delete operations up into batches of say 
> 1000, 1, 10, N DOC_ID's at a time if I take this approach?
> 
> 
> 
> The Solr Reference guide mentions DBQ ignores the commitWithin parameter but 
> you can specify multiple documents to remove with an OR (||) clause in a DBQ 
> i.e.
> 
> 
> Option 1 – Delete by id
> 
> {"delete":["",""]}
> 
> 
> 
> Option 2 – Delete by query (commitWithin ignored)
> 
> {"delete":{"query":"DOC_ID:( || )"}}
> 
> 
> 
> Shawn also provides a great explanation in this user group post from 2015 of 
> the DBQ process 
> (https://lucene.472066.n3.nabble.com/SolrCloud-delete-by-query-performance-td4206726.html)
> 
> 
> 
> I follow the Solr release notes fairly closely and also noticed this 
> excellent addition and discussion from Hossman and committers in the Solr 8.5 
> release and it looks ideal for this scenario 
> (https://issues.apache.org/jira/browse/SOLR-14241).  Unfortunately we're 
> still on the 7.7.2 branch and are unable to take advantage of the streaming 
> deletes feature.
> 
> 
> 
> If I do implement a weekly delete maintenance regime is there any advice the 
> community can offer from experience?  I'll definitely want to avoid times of 
> heavy indexing but how do deletes affect query performance?  Will users 
> notice decreased performance during delete operations so they should be 
> avoided during peak query windows as well?
> 
> 
> 
> As always any advice greatly is appreciated,
> 
> 
> 
> Thanks,
> 
> 
> 
> Dwane
> 
> 
> 
> Environment
> 
> SolrCloud 7.7.2, 30 shards, 2 replicas
> 
> ~3 qps during peak times



Re: Solr Deletes

2020-05-26 Thread Erick Erickson
Dwane:

DBQ for very large deletes is “iffy”. The problem is this: Solr must lock out 
_all_ indexing for _all_ replicas while the DBQ runs and this can just take a 
long time. This is just a consequence of distributed computing. Imagine a 
scenario where one of the documents affected by the DBQ is added by some other 
process. That has to be processed in order relative to the DBQ, but the DBQ can 
take a long time to find and delete the docs. But this has other implications, 
namely if updates don’t complete in a timely manner, the leader can throw the 
replicas into recovery...

So best practice is to go ahead and use delete-by-id. Do note that this means 
you’re responsible for resolving the issue above, but in your case it sounds 
like you’re guaranteed that none of the docs being deleted will be modified 
during the operation so you can ignore it.

What I’d do is use streaming to get my IDs (this is not using the link you 
provided, this is essentially doing that patch yourself but on the client) and 
use that to generate delete-by-id requests. This is just something like

create search stream source
while (more tuples) {
   assemble delete-by-id request (perhaps one with multiple IDs)
   send to Solr
}
don’t forget to send the last batch of deletes if you’re sending batches, I 
have ;)
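
Here's a rough SolrJ version of that loop, assuming the id field has docValues so 
the /export handler can stream it (ZooKeeper address, collection name and query are 
placeholders):

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.io.SolrClientCache;
import org.apache.solr.client.solrj.io.Tuple;
import org.apache.solr.client.solrj.io.stream.CloudSolrStream;
import org.apache.solr.client.solrj.io.stream.StreamContext;
import org.apache.solr.client.solrj.request.UpdateRequest;
import org.apache.solr.common.params.ModifiableSolrParams;

public class StreamingDelete {
  public static void main(String[] args) throws Exception {
    String zkHost = "zk1:2181,zk2:2181,zk3:2181";
    String collection = "my_collection";

    ModifiableSolrParams params = new ModifiableSolrParams();
    params.set("q", "orphan_flag:true");   // whatever identifies the orphan records
    params.set("fl", "id");
    params.set("sort", "id asc");
    params.set("qt", "/export");           // stream the whole result set, not just one page

    CloudSolrStream stream = new CloudSolrStream(zkHost, collection, params);
    StreamContext context = new StreamContext();
    SolrClientCache cache = new SolrClientCache();
    context.setSolrClientCache(cache);
    stream.setStreamContext(context);

    try (CloudSolrClient solr = new CloudSolrClient.Builder(
        Arrays.asList(zkHost), Optional.empty()).build()) {
      List<String> batch = new ArrayList<>();
      try {
        stream.open();
        while (true) {
          Tuple tuple = stream.read();
          if (tuple.EOF) {
            break;
          }
          batch.add(tuple.getString("id"));
          if (batch.size() >= 1000) {
            new UpdateRequest().deleteById(batch).process(solr, collection);
            batch = new ArrayList<>();
          }
        }
        if (!batch.isEmpty()) {              // the last, partial batch
          new UpdateRequest().deleteById(batch).process(solr, collection);
        }
      } finally {
        stream.close();
        cache.close();
      }
      solr.commit(collection);
    }
  }
}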

Joel Bernstein’s blog is the most authoritative source, see: 
https://joelsolr.blogspot.com/2015/04/the-streaming-api-solrjio-basics.html. 
Although IDK whether that example is up to date, it'll give you an idea of 
where to start. And Joel is pretty responsive about questions….
 
I'd package up maybe 1,000 ids per request.  I regularly package up that many 
updates, and deletes are relatively cheap. You’ll avoid the overhead of 
establishing a request for every ID. This may seem contrary to the points above 
about DBQ taking a long time, but we’re talking orders of magnitude differences 
in the time it takes to delete 1,000 docs and query/delete vastly larger 
numbers, plus this does not require all the indexes be locked.

Your users likely won’t notice this running, so while it’s usually good 
practice to do maintenance during off hours, I wouldn’t stress about it.

And a question you didn’t ask for extra credit ;). The streaming expression 
will _not_ reflect any changes to the collection while it runs. The underlying 
index searcher is kept open for the duration and it only knows about segments 
that were closed when it started. But let’s assume your autocommit interval 
expires while this process is running and opens a new searcher. _Other_ 
requests from other clients _will_ see the changes. Again, I doubt you care 
since I’m assuming your orphan records are never seen by other clients anyway.

Best,
Erick

> On May 25, 2020, at 7:48 PM, Dwane Hall  wrote:
> 
> Hey Solr users,
> 
> 
> 
> I'd really appreciate some community advice if somebody can spare some time 
> to assist me.  My question relates to initially deleting a large amount of 
> unwanted data from a Solr Cloud collection, and then advice on best patterns 
> for managing delete operations on a regular basis.   We have a situation 
> where data in our index can be 're-mastered' and as a result orphan records 
> are left dormant and unneeded in the index (think of a scenario similar to 
> client resolution where an entity can switch between golden records depending 
> on the information available at the time).  I'm considering removing these 
> dormant records with a large initial bulk delete, and then running a delete 
> process on a regular maintenance basis.  The initial record backlog is 
> ~50million records in a ~1.2billion document index (~4%) and the maintenance 
> deletes are small in comparison ~20,000/week.
> 
> 
> 
> So with this scenario in mind I'm wondering what my best approach is for the 
> initial bulk delete:
> 
>  1.  Do nothing with the initial backlog and remove the unwanted documents 
> during the next large reindexing process?
>  2.  Delete by query (DBQ) with a specific delete query using the document 
> id's?
>  3.  Delete by id (DBID)?
> 
> Are there any significant performance advantages between using DBID over a 
> specific DBQ? Should I break the delete operations up into batches of say 
> 1000, 1, 10, N DOC_ID's at a time if I take this approach?
> 
> 
> 
> The Solr Reference guide mentions DBQ ignores the commitWithin parameter but 
> you can specify multiple documents to remove with an OR (||) clause in a DBQ 
> i.e.
> 
> 
> Option 1 – Delete by id
> 
> {"delete":["",""]}
> 
> 
> 
> Option 2 – Delete by query (commitWithin ignored)
> 
> {"delete":{"query":"DOC_ID:( || )"}}
> 
> 
> 
> Shawn also provides a great explanation in this user group post from 2015 of 
> the DBQ process 
> (https://lucene.472066.n3.nabble.com/SolrCloud-delete-by-query-performance-td4206726.html)
> 
> 
> 
> I follow the Solr release notes fairly closely and also noticed this 
> excellent addition and discussion from Hossman and committer

Strange Synonym Graph Filter Bug in Admin UI

2020-05-26 Thread Audrey Lorberfeld - audrey.lorberf...@ibm.com
Hi All,

We are coming across a strange bug in the Analysis section of the Admin UI. For 
our non-English schema components, instead of the Synonym Graph Filter (SGF) 
showing in the UI, it's showing something called a "List Based Token Stream" 
(LBTS) in its place. We found an old issue that documented this bug, but it 
doesn't seem to have been resolved: 
https://issues.apache.org/jira/browse/SOLR-10366. Has anyone else come across 
this &/or have a solve?

Thanks!

Best,
Audrey



RE: Log slow queries to SQL Database using Log4j2 (JDBC)

2020-05-26 Thread Krönert Florian
Hi Walter,

thanks for your response.
That sounds like a feasible approach, although I would like to keep the stack 
as small as possible.

But the direction that you pointed out seems promising, the JDBC issues with 
log4j2 don't seem to lead anywhere.

Kind Regards,
Florian

-Original Message-
From: Walter Underwood 
Sent: Dienstag, 26. Mai 2020 02:06
To: solr-user@lucene.apache.org
Subject: Re: Log slow queries to SQL Database using Log4j2 (JDBC)

I would back up and do this a different way, with off-the-shelf parts.

Send the logs to syslog or your favorite log aggregator. From there, configure 
something that puts them into an ELK stack (Elasticsearch, Logstash, Kibana). A 
commercial version of this is logz.io .

Traditional relational databases are not designed for time-series data like 
logs.

Also, do you want your search service to be slow when the SQL database gets 
slow? That is guaranteed to happen. Writing to logs should be very, very low 
overhead. Do all of the processing after Solr writes the log line.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On May 25, 2020, at 3:53 AM, Krönert Florian  
> wrote:
>
> Hi everyone,
>
> For our Solr instance I have the requirement that all queries should be 
> logged, so that we can later analyze which search texts were queried most 
> often.
> We're using Solr 8.3.1 with the official Docker image, hosted on Azure.
> My approach for implementing this was to configure a Slow Request rule 
> of 0ms, so that in fact every request is logged to the slow requests file.
> This part works without any issues.
>
> However now I need to process these logs.
> It would be convenient if I had those logs already inside a SQL database.
> I saw that log4j2 is able to log to a JDBC database, so for me it seemed the 
> easiest way to just add a new appender for JDBC, which also logs the slow 
> requests.
> Unfortunately I can’t seem to get the JDBC driver loaded properly. I have the 
> correct jar and the correct driver namespace. I’m sure because I use the same 
> setup for the dataimport handler and it works flawlessly.
>
> My log4j2.xml looks like this (removed non-relevant appenders):
>
> <Configuration>
>   <Appenders>
>     <JDBC name="SlowQueryDatabaseAppender" tableName="dbo.solr_requestlog">
>       <DriverManager
>         connectionString="${jvmrunargs:dataimporter.datasource.url}"
>         driverClassName="com.microsoft.sqlserver.jdbc.SQLServerDriver"
>         username="${jvmrunargs:dataimporter.datasource.user}"
>         password="${jvmrunargs:dataimporter.datasource.password}"
>       />
>       <Column name="date" isEventTimestamp="true"/>
>       <Column name="id" pattern="%u{RANDOM}"/>
>       <Column name="logLevel" pattern="%level"/>
>       <Column name="logger" pattern="%logger"/>
>       <Column name="message" pattern="%message"/>
>       <Column name="exception" pattern="%ex{full}"/>
>     </JDBC>
>   </Appenders>
>   <Loggers>
>     <Logger name="org.apache.solr.core.SolrCore.SlowRequest" level="info" additivity="false">
>       <AppenderRef ref="SlowQueryDatabaseAppender"/>
>     </Logger>
>   </Loggers>
> </Configuration>
>
> I have loaded the JDBC driver per core in solrconfig.xml and also “globally” 
> inside solr.xml by adding its containing folder as libDir.
>
> Unfortunately log4j2 still can't find the JDBC driver; I always receive these 
> issues inside the sysout log:
>
>
> 2020-05-25T10:49:08.616562330Z DEBUG StatusLogger Acquiring JDBC
> connection from jdbc:sqlserver://--redacted--
>  2020-05-25T10:49:08.618023875Z DEBUG
> StatusLogger Loading driver class
> com.microsoft.sqlserver.jdbc.SQLServerDriver
>
> 2020-05-25T10:49:08.623000529Z DEBUG StatusLogger Cannot reestablish
> JDBC connection to FactoryData
> [connectionSource=jdbc:sqlserver://--redacted--
> , tableName=dbo.solr_requestlog,
> columnConfigs=[{ name=date, layout=null, literal=null, timestamp=true
> }, { name=id, layout=%u{RANDOM}, literal=null, timestamp=false }, {
> name=logLevel, layout=%level, literal=null, timestamp=false }, {
> name=logger, layout=%logger, literal=null, timestamp=false }, {
> name=message, layout=%message, literal=null, timestamp=false }, {
> name=exception, layout=%ex{full}, literal=null, timestamp=false }],
> columnMappings=[], immediateFail=false, retry=true,
> reconnectIntervalMillis=5000, truncateStrings=true]: The
> DriverManagerConnectionSource could not load the JDBC driver
> com.microsoft.sqlserver.jdbc.SQLServerDriver:
> java.lang.ClassNotFoundException:
> com.microsoft.sqlserver.jdbc.SQLServerDriver
>
> 2020-05-25T10:49:08.627598771Z  java.sql.SQLException: The
> DriverManagerConnectionSource could not load the JDBC driver
> com.microsoft.sqlserver.jdbc.SQLServerDriver:
> java.lang.ClassNotFoundException:
> com.microsoft.sqlserver.jdbc.SQLServerDriver
>
> 2020-05-25T10:49:08.628239791Z  at 
> org.apache.logging.log4j.core.appender.db.jdbc.AbstractDriverManagerConnectionSource.loadDriver(AbstractDriverManagerConnectionSource.java:203)
>
> 2020-05-25T10:49:08.628442097Z  at 
> org.apache.logging.log4j.core.appender.db.jdbc.AbstractDriverManagerConnectionSource.loadDriver(AbstractDriverManagerConnectionSource.java:185)
>
> 2020-05-25T10:49:08.628637603Z  at 
> org.apache.logging.log4j.core.appender.db.jdbc.AbstractDriverManagerConnectionSource.getConnection(AbstractDriverManagerConnectionSource.java:147)
>
> 2020-05-25T10:49:08.628921112Z  at 
> org.apache.loggi

RE: Use cases for the graph streams

2020-05-26 Thread Nightingale, Jonathan A (US)
Without getting too far into the weeds with our product, we have a bunch of Solr 
records that represent entities and their relationships to other entities or 
files. For example a document may describe a bunch of people. We have entries 
for the people as well as the document. We also have entries that represent 
connections between these people based on things described in the document.

From those Solr records we have a bunch of docIds as references in each record 
that let us link them. We build a graph in a separate graph store so we can do 
traversal on it. We do semantic filtering like find people nodes that are 
related to people nodes described in this document. That relationship can be 
expanded to allow for greater walks on the graph, so find everything up to N 
steps from this person.

We also allow just viewing all nodes on the graph that are connected by N steps 
from a target node and allow the user to traverse that way to just explore the 
information as they then shift to another node and display the new subgraph 
from their new focus node.

So those are the kinds of things I was hoping to do with the gather nodes 
functions in solr but I couldn't find a simple way to do it.
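
Concretely, the kind of two-hop walk I was trying to express looks roughly like the 
sketch below; the collection and field names are made up to mirror our model, so 
treat it as an illustration only:

import org.apache.solr.client.solrj.io.Tuple;
import org.apache.solr.client.solrj.io.stream.SolrStream;
import org.apache.solr.common.params.ModifiableSolrParams;

public class TwoHopWalk {
  public static void main(String[] args) throws Exception {
    // Hop 1: people mentioned in one document; hop 2: people related to those people.
    String expr =
        "nodes(entities,"
      + "  nodes(entities,"
      + "    search(entities, q=\"docId:doc123\", fl=\"personId\", sort=\"personId asc\", qt=\"/export\"),"
      + "    walk=\"personId->personId\","
      + "    gather=\"relatedPersonId\"),"
      + "  walk=\"node->personId\","
      + "  gather=\"relatedPersonId\","
      + "  scatter=\"branches,leaves\")";

    ModifiableSolrParams params = new ModifiableSolrParams();
    params.set("expr", expr);
    params.set("qt", "/stream");

    SolrStream stream = new SolrStream("http://localhost:8983/solr/entities", params);
    try {
      stream.open();
      for (Tuple t = stream.read(); !t.EOF; t = stream.read()) {
        System.out.println(t.getString("node"));   // each gathered node id, one or two hops out
      }
    } finally {
      stream.close();
    }
  }
}

That only covers a fixed two-step walk; the "everything up to N steps from a focus 
node" case is where I got stuck.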

Jonathan

-Original Message-
From: Joel Bernstein  
Sent: Thursday, May 21, 2020 9:57 AM
To: solr-user@lucene.apache.org
Subject: Re: Use cases for the graph streams


Good question. Let me first point to an interesting example in the Visual Guide 
to Streaming Expressions and Math Expressions:

https://github.com/apache/lucene-solr/blob/visual-guide/solr/solr-ref-guide/src/search-sample.adoc#nodes

This example gets to the heart of the core use case for the nodes expression 
which is to discover the relationships between nodes in a graph.
So it's a discovery tool to learn something new about the data that you can't 
see without having this specific ability of walking the nodes in a graph.

In the broader context the nodes expression is part of a much wider set of 
tools that allow people to use Solr to explore the relationships in their data. 
This is described here:

https://github.com/apache/lucene-solr/blob/visual-guide/solr/solr-ref-guide/src/math-expressions.adoc

The goal of all this is to move search engines beyond basic aggregations to 
study the correlations and relationships within the data set.

Graph traversal is part of this broader goal which will get developed more over 
time. I'd be interested in hearing more about specific graph use cases that 
you're interested in solving.

Joel Bernstein
http://joelsolr.blogspot.com/


On Wed, May 20, 2020 at 12:32 PM Nightingale, Jonathan A (US) < 
jonathan.nighting...@baesystems.com> wrote:

> This is kind of  broad question, but I was playing with the graph 
> streams and was having trouble making the tools work for what I wanted 
> to do. I'm wondering if the use case for the graph streams really 
> supports standard graph queries you might use with Gremlin or the like? 
> I ask because right now we have two implementations of our data 
> storage to support these two ways of looking at it, the standard query and 
> the semantic filtering.
>
> The usecases I usually see for the graph streams always seem to be 
> limited to one link traversal for finding things related to nodes 
> gathered from a query. But even with that it wasn't clear the best way 
> to do things with lists of docvalues. So for example if you wanted to 
> represent a node that had many doc values I had to use cross products 
> to make a node for each doc value. The traversal didn't allow for that 
> kind of node linking inherently it seemed.
>
> So my question really is (and maybe this is not the place for this) 
> what is the intent of these graph features and what is the goal for 
> them in the future? I was really hoping at one point to only use solr 
> for our product but it didn't seem feasible, at least not easily.
>
> Thanks for all your help
> Jonathan
>
> Jonathan Nightingale
> GXP Solutions Engineer
> (office) 315 838 2273
> (cell) 315 271 0688
>
>


Thanks to developers

2020-05-26 Thread Serkan KAZANCI
 

Have been using SOLR for almost five years. Great, powerful software.

 

Thanks to all the developers for this masterpiece.

 

Appreciate it,

 

Serkan

 



Re: unified highlighter performance in solr 8.5.1

2020-05-26 Thread David Smiley
Please create an issue.  I haven't reproduced it yet but it seems unlikely
to be user-error.

~ David


On Mon, May 25, 2020 at 9:28 AM Michal Hlavac  wrote:

> Hi,
>
> I have field:
>  stored="true" indexed="false" storeOffsetsWithPositions="true"/>
>
> and configuration:
> true
> unified
> true
> content_txt_sk_highlight
> 2
> true
>
> Doing query with hl.bs.type=SENTENCE it takes around 1000 - 1300 ms which
> is really slow.
> Same query with hl.bs.type=WORD takes from 8 - 45 ms
>
> is this normal behaviour or should I create issue?
>
> thanks, m.
>


Re: TimestampUpdateProcessorFactory updates the field even if the value if present

2020-05-26 Thread Chris Hostetter
: Subject: TimestampUpdateProcessorFactory updates the field even if the value
: if present
: 
: Hi,
: 
: Following is the update request processor chain.
: 
:  <
: processor class="solr.TimestampUpdateProcessorFactory"> index_time_stamp_create
: 
: And, here is how the field is defined in schema.xml
: 
: 
: 
: Every time I index the same document, above field changes its value with
: latest timestamp. According to TimestampUpdateProcessorFactory  javadoc
: page, if a document does not contain a value in the timestamp field, a new

based on the wording of your question, i suspect you are confused about 
the overall behavior of how "updating" an existing document works in solr, 
and how update processors "see" an *input document* when processing an 
add/update command.


First off, completely ignoring TimestampUpdateProcessorFactory and 
assuming just the simplest possible update change, let's clarify how 
"updates" work. Let's assume that when you say you "index the same 
document" twice you do so with a few diff field values ...

First Time...

{  id:"x",  title:"" }

Second time...

{  id:"x",  body:"      xxx" }

Solr does not implicitly know that you are trying to *update* that 
document, the final result will not be a document containing both a 
"title" field and "body" field in addition to the "id", it will *only* 
have the "id" and "body" fields and the title field will be lost.

The way to "update" a document *and keep existing field values* is with 
one of the "Atomic Update" command options...

https://lucene.apache.org/solr/guide/8_4/updating-parts-of-documents.html#UpdatingPartsofDocuments-AtomicUpdates

{  id:"x",  title:"" }

Second time...

{  id:"x",  body: { set: "      xxx" } }


Now, with that background info clarified: let's talk about update 
processors


The docs for TimestampUpdateProcessorFactory are referring to how it 
modifies an *input* document that it receives (as part of the processor 
chain). It adds the timestamp field if it's not already in the *input* 
document; it doesn't know anything about whether that document is already 
in the index, or if it has a value for that field in the index.


When processors like TimestampUpdateProcessorFactory (or any other 
processor that modifies an *input* document) are run they don't know if the 
document you are "indexing" already exists in the index or not.  even if 
you are using the "atomic update" options to set/remove/add a field value, 
with the intent of preserving all other field values, the documents passed 
down the processor chain don't include those values until the "document 
merger" logic is run -- as part of the DistributedUpdateProcessor (which, 
if not explicit in your chain, happens immediately before the 
RunUpdateProcessorFactory)

Off the top of my head i don't know if there is an "easy" way to have a 
Timestamp added to "new" documents, but left "as is" for existing 
documents.

Untested idea:

Use an explicitly configured 
DistributedUpdateProcessorFactory, so that (in addition to putting 
TimestampUpdateProcessorFactory before it) you can 
also put MinFieldValueUpdateProcessorFactory on the timestamp field 
*after* DistributedUpdateProcessorFactory (but before 
RunUpdateProcessorFactory).  

I think that would work?

Just putting TimestampUpdateProcessorFactory after the 
DistributedUpdateProcessorFactory would be dangerous, because it would 
introduce discrepancies -- each replica would wind up with its own 
locally computed timestamp.  having the timestamp generated before the 
distributed update processor ensures the value is computed only once.

-Hoss
http://www.lucidworks.com/


Re: Solr Deletes

2020-05-26 Thread Bram Van Dam
On 26/05/2020 14:07, Erick Erickson wrote:
> So best practice is to go ahead and use delete-by-id. 


I've noticed that this can cause issues when using implicit routing, at
least on 7.x. Though I can't quite remember whether the issue was a
performance issue, or whether documents would sometimes not get deleted.

In either case, I worked around it by doing something like this:

UpdateRequest req = new UpdateRequest();
req.deleteById(id);
req.setCommitWithin(-1);
req.setParam(ShardParams._ROUTE_, shard);

Maybe that'll help if you run into either of those issues.

 - Bram


Solr Collection core initialization Error with length mismatch

2020-05-26 Thread Mohamed Sirajudeen Mayitti Ahamed Pillai
Hello team,

We have 4 Solr VMs in SolrCloud 7.4. Only on a specific node's Admin UI are we 
seeing the message:
· cs_signals_shard1_replica_n1: 
org.apache.solr.common.SolrException:org.apache.solr.common.SolrException: 
Error opening new searcher

When restarting Solr, we noticed the below error for the collection in solr.log.

2020-05-26T12:28:29.759-0500 - ERROR 
[coreContainerWorkExecutor-2-thread-1-processing-n:fzcexfsnbepd02:8983_solr:solr.core.CoreContainer@714]
 - {node_name=n:fzcexfsnbepd02:8983_solr} - Error waiting for SolrCore to be 
loaded on startup
org.apache.solr.common.SolrException: Unable to create core 
[cs_signals_shard1_replica_n1]
at 
org.apache.solr.core.CoreContainer.createFromDescriptor(CoreContainer.java:1156)
 ~[solr-core-7.4.0.jar:7.4.0 9060ac689c270b02143f375de0348b7f626adebc - jpountz 
- 2018-06-18 16:55:13]
at 
org.apache.solr.core.CoreContainer.lambda$load$13(CoreContainer.java:681) 
~[solr-core-7.4.0.jar:7.4.0 9060ac689c270b02143f375de0348b7f626adebc - jpountz 
- 2018-06-18 16:55:13]
at 
com.codahale.metrics.InstrumentedExecutorService$InstrumentedCallable.call(InstrumentedExecutorService.java:197)
 ~[metrics-core-3.2.6.jar:3.2.6]
at java.util.concurrent.FutureTask.run(FutureTask.java:266) 
~[?:1.8.0_212]
at 
org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:209)
 ~[solr-solrj-7.4.0.jar:7.4.0 9060ac689c270b02143f375de0348b7f626adebc - 
jpountz - 2018-06-18 16:55:14]
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) 
[?:1.8.0_212]
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) 
[?:1.8.0_212]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_212]
Caused by: org.apache.solr.common.SolrException: Error opening new searcher
at org.apache.solr.core.SolrCore.(SolrCore.java:1012) 
~[solr-core-7.4.0.jar:7.4.0 9060ac689c270b02143f375de0348b7f626adebc - jpountz 
- 2018-06-18 16:55:13]
at org.apache.solr.core.SolrCore.(SolrCore.java:867) 
~[solr-core-7.4.0.jar:7.4.0 9060ac689c270b02143f375de0348b7f626adebc - jpountz 
- 2018-06-18 16:55:13]
at 
org.apache.solr.core.CoreContainer.createFromDescriptor(CoreContainer.java:1135)
 ~[solr-core-7.4.0.jar:7.4.0 9060ac689c270b02143f375de0348b7f626adebc - jpountz 
- 2018-06-18 16:55:13]
... 7 more
Caused by: org.apache.solr.common.SolrException: Error opening new searcher
at 
org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:2126) 
~[solr-core-7.4.0.jar:7.4.0 9060ac689c270b02143f375de0348b7f626adebc - jpountz 
- 2018-06-18 16:55:13]
at 
org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:2246) 
~[solr-core-7.4.0.jar:7.4.0 9060ac689c270b02143f375de0348b7f626adebc - jpountz 
- 2018-06-18 16:55:13]
at 
org.apache.solr.core.SolrCore.initSearcher(SolrCore.java:1095) 
~[solr-core-7.4.0.jar:7.4.0 9060ac689c270b02143f375de0348b7f626adebc - jpountz 
- 2018-06-18 16:55:13]
at org.apache.solr.core.SolrCore.(SolrCore.java:984) 
~[solr-core-7.4.0.jar:7.4.0 9060ac689c270b02143f375de0348b7f626adebc - jpountz 
- 2018-06-18 16:55:13]
at org.apache.solr.core.SolrCore.(SolrCore.java:867) 
~[solr-core-7.4.0.jar:7.4.0 9060ac689c270b02143f375de0348b7f626adebc - jpountz 
- 2018-06-18 16:55:13]
at 
org.apache.solr.core.CoreContainer.createFromDescriptor(CoreContainer.java:1135)
 ~[solr-core-7.4.0.jar:7.4.0 9060ac689c270b02143f375de0348b7f626adebc - jpountz 
- 2018-06-18 16:55:13]
... 7 more
Caused by: org.apache.lucene.index.CorruptIndexException: length should be 
5271336964 bytes, but is 5272582148 instead 
(resource=MMapIndexInput(path="/opt/solr-7.4/data/solr/cs_signals_shard1_replica_n1/data/index.20200514060942617/_52b3.cfs"))
at 
org.apache.lucene.codecs.lucene50.Lucene50CompoundReader.(Lucene50CompoundReader.java:91)
 ~[lucene-core-7.4.0.jar:7.4.0 9060ac689c270b02143f375de0348b7f626adebc - 
jpountz - 2018-06-18 16:51:45]
at 
org.apache.lucene.codecs.lucene50.Lucene50CompoundFormat.getCompoundReader(Lucene50CompoundFormat.java:70)
 ~[lucene-core-7.4.0.jar:7.4.0 9060ac689c270b02143f375de0348b7f626adebc - 
jpountz - 2018-06-18 16:51:45]
at 
org.apache.lucene.index.IndexWriter.readFieldInfos(IndexWriter.java:960) 
~[lucene-core-7.4.0.jar:7.4.0 9060ac689c270b02143f375de0348b7f626adebc - 
jpountz - 2018-06-18 16:51:45]
at 
org.apache.lucene.index.IndexWriter.getFieldNumberMap(IndexWriter.java:977) 
~[lucene-core-7.4.0.jar:7.4.0 9060ac689c270b02143f375de0348b7f626adebc - 
jpountz - 2018-06-18 16:51:45]
at 
org.apache.lucene.index.IndexWriter.(IndexWriter.java:869) 
~[lucene-core-7.4.0.jar:7.4.0 9060ac689c270b02143f375de0348b7f626adebc - 
jpountz - 2018-06-18

Re: Solr 8.5.1 startup error - lengthTag=109, too big.

2020-05-26 Thread Mike Drob
Did you have SSL enabled with 8.2.1?

The error looks common to certificate handling and not specific to Solr.

I would verify that you have no extra characters in your certificate file
(including line endings) and that the keystore type that you specified
matches the file you are presenting (JKS or PKCS12)

Mike

On Sat, May 23, 2020 at 10:11 PM Zheng Lin Edwin Yeo 
wrote:

> Hi,
>
> I'm trying to upgrade from Solr 8.2.1 to Solr 8.5.1, with Solr SSL
> Authentication and Authorization.
>
> However, I get the following error when I enable SSL. The Solr itself can
> start up if there is no SSL.  The main error that I see is this
>
>   java.io.IOException: DerInputStream.getLength(): lengthTag=109, too big.
>
> What could be the reason that causes this?
>
>
> INFO  - 2020-05-24 10:38:20.080;
> org.apache.solr.util.configuration.SSLConfigurations; Setting
> javax.net.ssl.keyStorePassword
> INFO  - 2020-05-24 10:38:20.081;
> org.apache.solr.util.configuration.SSLConfigurations; Setting
> javax.net.ssl.trustStorePassword
> Waiting up to 120 to see Solr running on port 8983
> java.lang.reflect.InvocationTargetException
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
> at java.lang.reflect.Method.invoke(Unknown Source)
> at org.eclipse.jetty.start.Main.invokeMain(Main.java:218)
> at org.eclipse.jetty.start.Main.start(Main.java:491)
> at org.eclipse.jetty.start.Main.main(Main.java:77)
> Caused by: java.security.PrivilegedActionException: java.io.IOException:
> DerInputStream.getLength(): lengthTag=109, too big.
> at java.security.AccessController.doPrivileged(Native Method)
> at
> org.eclipse.jetty.xml.XmlConfiguration.main(XmlConfiguration.java:1837)
> ... 7 more
> Caused by: java.io.IOException: DerInputStream.getLength(): lengthTag=109,
> too big.
> at sun.security.util.DerInputStream.getLength(Unknown Source)
> at sun.security.util.DerValue.init(Unknown Source)
> at sun.security.util.DerValue.(Unknown Source)
> at sun.security.util.DerValue.(Unknown Source)
> at sun.security.pkcs12.PKCS12KeyStore.engineLoad(Unknown Source)
> at java.security.KeyStore.load(Unknown Source)
> at
>
> org.eclipse.jetty.util.security.CertificateUtils.getKeyStore(CertificateUtils.java:54)
> at
>
> org.eclipse.jetty.util.ssl.SslContextFactory.loadKeyStore(SslContextFactory.java:1188)
> at
>
> org.eclipse.jetty.util.ssl.SslContextFactory.load(SslContextFactory.java:323)
> at
>
> org.eclipse.jetty.util.ssl.SslContextFactory.doStart(SslContextFactory.java:245)
> at
>
> org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:72)
> at
>
> org.eclipse.jetty.util.component.ContainerLifeCycle.start(ContainerLifeCycle.java:169)
> at
>
> org.eclipse.jetty.util.component.ContainerLifeCycle.doStart(ContainerLifeCycle.java:117)
> at
>
> org.eclipse.jetty.server.SslConnectionFactory.doStart(SslConnectionFactory.java:92)
> at
>
> org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:72)
> at
>
> org.eclipse.jetty.util.component.ContainerLifeCycle.start(ContainerLifeCycle.java:169)
> at
>
> org.eclipse.jetty.util.component.ContainerLifeCycle.doStart(ContainerLifeCycle.java:117)
> at
>
> org.eclipse.jetty.server.AbstractConnector.doStart(AbstractConnector.java:320)
> at
>
> org.eclipse.jetty.server.AbstractNetworkConnector.doStart(AbstractNetworkConnector.java:81)
> at
> org.eclipse.jetty.server.ServerConnector.doStart(ServerConnector.java:231)
> at
>
> org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:72)
> at org.eclipse.jetty.server.Server.doStart(Server.java:385)
> at
>
> org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:72)
> at
>
> org.eclipse.jetty.xml.XmlConfiguration.lambda$main$0(XmlConfiguration.java:1888)
> ... 9 more
> java.lang.reflect.InvocationTargetException
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
> at java.lang.reflect.Method.invoke(Unknown Source)
> at org.eclipse.jetty.start.Main.invokeMain(Main.java:218)
> at org.eclipse.jetty.start.Main.start(Main.java:491)
> at org.eclipse.jetty.start.Main.main(Main.java:77)
> Caused by: java.security.PrivilegedActionException: java.io.IOException:
> DerInputStream.getLength(): lengthTag=109, too big.
> at java.security.AccessController.doPrivileged(Native Method)
> at
> org.eclipse.jetty.xml.XmlConfiguration.main(XmlConfiguration.java:18

Re: unified highlighter performance in solr 8.5.1

2020-05-26 Thread Michal Hlavac
fine, I'll try to write a simple test, thanks

On utorok 26. mája 2020 17:44:52 CEST David Smiley wrote:
> Please create an issue.  I haven't reproduced it yet but it seems unlikely
> to be user-error.
> 
> ~ David
> 
> 
> On Mon, May 25, 2020 at 9:28 AM Michal Hlavac  wrote:
> 
> > Hi,
> >
> > I have field:
> >  > stored="true" indexed="false" storeOffsetsWithPositions="true"/>
> >
> > and configuration:
> > true
> > unified
> > true
> > content_txt_sk_highlight
> > 2
> > true
> >
> > Doing query with hl.bs.type=SENTENCE it takes around 1000 - 1300 ms which
> > is really slow.
> > Same query with hl.bs.type=WORD takes from 8 - 45 ms
> >
> > is this normal behaviour or should I create issue?
> >
> > thanks, m.
> >
> 


[ANNOUNCE] Apache Solr 8.5.2 released

2020-05-26 Thread Mike Drob
26 May 2020, Apache Solr™ 8.5.2 available

The Lucene PMC is pleased to announce the release of Apache Solr 8.5.2

Solr is the popular, blazing fast, open source NoSQL search platform from
the Apache Lucene project. Its major features include powerful full-text
search, hit highlighting, faceted search and analytics, rich document
parsing, geospatial search, extensive REST APIs as well as parallel SQL.
Solr is enterprise grade, secure and highly scalable, providing fault
tolerant distributed search and indexing, and powers the search and
navigation features of many of the world's largest internet sites.

This release contains two bug fixes. The release is available for immediate
download at:


https://lucene.apache.org/solr/downloads.html

Solr 8.5.2 Bug Fixes:

   - SOLR-14411 : Fix
   regression from SOLR-14359 (Admin UI 'Select an Option')
   - SOLR-14471 : base
   replica selection strategy not applied to "last place" shards.preference
   matches

Solr 8.5.2 also includes 1 bugfix in the corresponding Apache Lucene
release:



Please report any feedback to the mailing lists (
https://lucene.apache.org/solr/community.html#mailing-lists-irc)

Note: The Apache Software Foundation uses an extensive mirroring network
for distributing releases. It is possible that the mirror you are using may
not have replicated the release yet. If that is the case, please try
another mirror. This also goes for Maven access.


Re: Solr Deletes

2020-05-26 Thread Dwane Hall
Thank you very much Erick, Emir, and Bram, this is extremely useful advice; I 
sincerely appreciate everyone's input!


Before I received your responses I ran a controlled DBQ test in our DR 
environment and exactly what you said occurred.  It was like reading a step by 
step playbook of events with heavy blocking occurring on the Solr nodes and 
lots of threads going into a TIMED_WAITING state. Several shards were pushed 
into recovery mode and things were starting to get ugly, fast!


I'd read snippets in blog posts and JIRA tickets about DBQ being a blocking 
operation, but I did not expect that such a specific DBQ (i.e. by IDs) would 
operate very differently from DBID (which I expected to block as well). Boy 
was I wrong! They're used interchangeably in the Solr ref guide examples, so 
it's very useful to understand the performance implications of each.  
Additionally, none of the information I found on delete operations 
mentioned query performance, so I was unsure of the impact in this dimension.


Erick, thanks again for your comprehensive response. Your blogs and user group 
responses are always a pleasure to read; I'm constantly picking up useful pieces of 
information that I use on a daily basis in managing our Solr/Fusion clusters. 
Additionally, I've been looking for an excuse to use streaming expressions and 
I did not think to use them the way you suggested.  I've watched quite a few of 
Joel's presentations on YouTube and his blog is brilliant.  Streaming 
expressions are expanding with every Solr release; they really are a very 
exciting part of Solr's evolution.  Your final point on searcher state while 
streaming expressions are running, and its relationship with new searchers, is a 
very interesting additional piece of information I'll add to the toolbox. Thank 
you.



At the moment we're fortunate to have all the ID's of the documents to remove 
in a DB so I'll be able to construct batches of DBID requests relatively easily 
and store them in a backlog table for processing without needing to traverse 
Solr with cursors, streaming (or other means) to identify them.  We follow a 
similar approach for updates in batches of around ~1000 docs/batch.  
Inspiration for that sweet spot was once again determined after reading one of 
Erik's Lucidworks blog posts and testing 
(https://lucidworks.com/post/really-batch-updates-solr-2/).



Again thanks to the community and users for everyone’s contribution on the 
issue it is very much appreciated.


Successful Solr-ing to all,


Dwane


From: Bram Van Dam 
Sent: Wednesday, 27 May 2020 5:34 AM
To: solr-user@lucene.apache.org 
Subject: Re: Solr Deletes

On 26/05/2020 14:07, Erick Erickson wrote:
> So best practice is to go ahead and use delete-by-id.


I've noticed that this can cause issues when using implicit routing, at
least on 7.x. Though I can't quite remember whether the issue was a
performance issue, or whether documents would sometimes not get deleted.

In either case, I worked it around it by doing something like this:

UpdateRequest req = new UpdateRequest();
req.deleteById(id);
req.setCommitWithin(-1);
req.setParam(ShardParams._ROUTE_, shard);

Maybe that'll help if you run into either of those issues.

 - Bram


Question for SOLR-14471

2020-05-26 Thread Kayak28
Hello, Solr community members:

I am working on translating Solr's release notes for every release.
Now, I am not clear about what SOLR-14471 actually fixes.

URL for SOLR-14471: https://issues.apache.org/jira/browse/SOLR-14471

My questions are the following.
- what does "all inherently equivalent groups of replicas mean"?
- does it mean children of the same shard?
- are they different from "all available replicas"?
- what does "last place" mean?
 - does it mean that a replica, which is created at the last?

Honestly, I am not familiar with Solr Cloud structure, so
I would be happy if anyone could help me to understand the issue.



-- 

Sincerely,
Kaya
github: https://github.com/28kayak