Document centric external version conflict not returning 409

2020-10-04 Thread Deepu
Hi Team,

I am using Solr document centric external version configuration to control
concurrent updates.
I followed the sample configuration at the GitHub path below:
https://github.com/apache/lucene-solr/blob/master/solr/core/src/test-files/solr/collection1/conf/solrconfig-externalversionconstraint.xml

The document is not updated when the new document's external version is lower
than the existing one, but the response does not come back with status code
409; I still get status=0 in the update response. Do I need to configure
anything else to get status=409 in the response?

My update chain configuration (following the linked sample):

  <updateRequestProcessorChain name="external-version-constraint" default="true">
    <!-- real adds get live_b=true before the version check -->
    <processor class="solr.DefaultValueUpdateProcessorFactory">
      <str name="fieldName">live_b</str>
      <bool name="value">true</bool>
    </processor>

    <processor class="solr.DocBasedVersionConstraintsProcessorFactory">
      <str name="versionField">_external_version_</str>
      <str name="deleteVersionParam">_external_version_</str>
    </processor>

    <!-- tombstone docs generated for deletes get live_b=false -->
    <processor class="solr.DefaultValueUpdateProcessorFactory">
      <str name="fieldName">live_b</str>
      <bool name="value">false</bool>
    </processor>

    <processor class="solr.TimestampUpdateProcessorFactory">
      <str name="fieldName">update_timestamp_tdt</str>
    </processor>

    <processor class="solr.LogUpdateProcessorFactory" />
    <processor class="solr.RunUpdateProcessorFactory" />
  </updateRequestProcessorChain>


I am using Solr version 8.6.1 and SolrJ version 8.4.0.
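For what it's worth, the behavior described (a stale update silently dropped while the response still says status=0) matches what the processor's ignoreOldUpdates option does when set to true; when it is false (the default), a stale update should fail with a 409. A simplified sketch of that decision logic (not the actual Solr source):

```java
// Simplified sketch of how DocBasedVersionConstraintsProcessorFactory treats
// a stale update. ignoreOldUpdates=true silently drops old versions, so the
// update response still reports status=0; ignoreOldUpdates=false rejects them
// with a 409 version-conflict error.
public class VersionConstraintSketch {
    static int handleUpdate(long existingVersion, long incomingVersion, boolean ignoreOldUpdates) {
        if (incomingVersion > existingVersion) {
            return 0;   // newer version: update is applied, status=0
        }
        if (ignoreOldUpdates) {
            return 0;   // stale update silently discarded, response still status=0
        }
        return 409;     // stale update rejected with a version conflict
    }

    public static void main(String[] args) {
        System.out.println(handleUpdate(10, 5, true));  // stale, silently dropped
        System.out.println(handleUpdate(10, 5, false)); // stale, conflict reported
    }
}
```

So the first thing to check is whether the chain in use sets ignoreOldUpdates to true.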

Thanks,
Deepu


Order of applying tokens/filter

2020-10-04 Thread Jayadevan Maymala
Hi all,

Is this the best order (performance-wise as well as for efficacy) of applying
analyzers/filters? We have an eCom site where many products are listed,
and users may type in search words and get relevant results.

1) Tokenize on whitespace (WhitespaceTokenizerFactory)
2) Remove stopwords (StopFilterFactory)
3) Stem (PorterStemFilterFactory)
4) Convert to lowercase  (LowerCaseFilterFactory)
5) Add synonyms (SynonymGraphFilterFactory,FlattenGraphFilterFactory)

Any possible gotchas?

Regards,
Jayadevan


Re: Solr 7.7 - Few Questions

2020-10-04 Thread Rahul Goswami
Charlie,
Thanks for providing an alternate approach to doing this. It would be
interesting to know how one could go about organizing the docs in this
case (nested documents?). How would join queries perform on a large
index (200+ million docs)?

Thanks,
Rahul
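For context, a cross-document join in Solr is usually expressed with the join query parser; a hedged sketch, where thread_id and content_type are hypothetical field names for the shared-ID scheme Charlie described:

```
q={!join from=thread_id to=thread_id}content_type:pdf
```

Join performance depends heavily on the cardinality of the join field, so it is worth benchmarking on a realistically sized index before committing to this design.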



On Fri, Oct 2, 2020 at 5:55 AM Charlie Hull  wrote:

> Hi Rahul,
>
> In addition to the wise advice below: remember in Solr, a 'document' is
> just the name for the thing that would appear as one of the results when
> you search (analogous to a database record). It's not the same
> conceptually as a 'Word document' or a 'PDF document'. If your source
> documents are so big, consider how they might be broken into parts, or
> whether you really need to index all of them for retrieval purposes, or
> what parts of them need to be extracted as text. Thus, the Solr
> documents don't necessarily need to be as large as your source documents.
>
> Consider an email size 20kb with ten PDF attachments, each 20MB. You
> probably shouldn't push all this data into a single Solr document, but
> you *could* index them as 11 separate Solr documents, but with metadata
> to indicate that one is an email and ten are PDFs, and a shared ID of
> some kind to indicate they're related. Then at query time there are
> various ways for you to group these together, so for example if the
> query hit one of the PDFs you could show the user the original email,
> plus the 9 other attachments, using the shared ID as a key.
>
> HTH,
>
> Charlie
>
> On 02/10/2020 01:53, Rahul Goswami wrote:
> > Manisha,
> > In addition to what Shawn has mentioned above, I would also like you to
> > reevaluate your use case. Do you *need to* index the whole document? eg:
> > If it's an email, the body of the email *might* be more important than any
> > attachments, in which case you could choose to only index the email body
> > and ignore (or only partially index) the text from attachments. If you
> > could afford to index the documents partially, you could consider Solr's
> > "Limit token count filter"; see the link below.
> >
> > https://lucene.apache.org/solr/guide/7_7/filter-descriptions.html#limit-token-count-filter
> >
> > You'll need to configure it in the schema for the "index" analyzer for the
> > data type of the field with large text.
> > Indexing documents of the order of half a GB will definitely come to hurt
> > your operations, if not now, later (think OOM, extremely slow atomic
> > updates, long running merges etc.).
> >
> > - Rahul
> >
> > On Thu, Oct 1, 2020 at 7:06 PM Shawn Heisey  wrote:
> >
> >> On 10/1/2020 6:57 AM, Manisha Rahatadkar wrote:
> >>> We are using Apache Solr 7.7 on Windows platform. The data is synced to
> >>> Solr using Solr.Net commit. The data is being synced to SOLR in batches.
> >>> The document size is very huge (~0.5GB average) and solr indexing is
> >>> taking long time. Total document size is ~200GB. As the solr commit is
> >>> done as a part of API, the API calls are failing as document indexing
> >>> is not completed.
> >>
> >> A single document is five hundred megabytes?  What kind of documents do
> >> you have?  You can't even index something that big without tweaking
> >> configuration parameters that most people don't even know about.
> >> Assuming you can even get it working, there's no way that indexing a
> >> document like that is going to be fast.
> >>
> >>> 1.  What is your advise on syncing such a large volume of data to
> >>> Solr KB.
> >>
> >> What is "KB"?  I have never heard of this in relation to Solr.
> >>
> >>> 2.  Because of the search requirements, almost 8 fields are defined
> >>> as Text fields.
> >>
> >> I can't figure out what you are trying to say with this statement.
> >>
> >>> 3.  Currently Solr_JAVA_MEM is set to 2gb. Is that enough for such a
> >>> large volume of data?
> >>
> >> If just one of the documents you're sending to Solr really is five
> >> hundred megabytes, then 2 gigabytes would probably be just barely enough
> >> to index one document into an empty index ... and it would probably be
> >> doing garbage collection so frequently that it would make things REALLY
> >> slow.  I have no way to predict how much heap you will need.  That will
> >> require experimentation.  I can tell you that 2GB is definitely not enough.
> >>
> >>> 4.  How to set up Solr in production on Windows? Currently it's set
> >>> up as a standalone engine and client is requested to take the backup of
> >>> the drive. Is there any other better way to do? How to set up for the
> >>> disaster recovery?
> >>
> >> I would suggest NOT doing it on Windows.  My reasons for that come down
> >> to costs -- a Windows Server license isn't cheap.
> >>
> >> That said, there's nothing wrong with running on Windows, but you're on
> >> your own as far as 

Re: Daylight savings time issue using NOW in Solr 6.1.0

2020-10-04 Thread vishal patel
Hello,

Can anyone help me?

Regards,
Vishal

Sent from Outlook

From: vishal patel 
Sent: Thursday, October 1, 2020 4:51 PM
To: solr-user@lucene.apache.org 
Subject: Daylight savings time issue using NOW in Solr 6.1.0


Hi

I am using Solr 6.1.0. My SOLR_TIMEZONE=UTC in solr.in.cmd.
My current Solr server machine's time zone is also UTC.

My collection's schema has a date field, action_date.

Suppose my current Solr server machine time is 2020-10-01 10:00:00.000. I have
one document in that collection, and in that document action_date is
2020-10-01T09:45:46Z.
When I search in Solr with action_date:[2020-10-01T08:00:00Z TO NOW], that
record is not returned. I checked my Solr log and found that the Solr log time
differs from the Solr server machine time (almost 1 hour difference).

Why do I not get the result? Why is NOW not taking 2020-10-01T10:00:00Z?
Which time does "NOW" use? Is the difference due to daylight saving
time? How can I configure
or change the timezone to account for daylight saving time?
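For what it's worth, Solr derives NOW from the JVM's epoch clock, and epoch milliseconds carry no timezone or DST component, so NOW itself cannot shift with daylight saving (a log-time discrepancy is more likely a logging timezone/format setting). A small illustration, using an arbitrary epoch value:

```java
import java.time.Instant;
import java.time.format.DateTimeFormatter;

public class NowIsUtc {
    public static void main(String[] args) {
        // Epoch millis are timezone-free; formatting as an ISO instant always yields UTC,
        // no matter what the machine's default timezone is set to.
        long epochMillis = 1601546400000L; // corresponds to 2020-10-01T10:00:00Z
        String utc = DateTimeFormatter.ISO_INSTANT.format(Instant.ofEpochMilli(epochMillis));
        System.out.println(utc); // 2020-10-01T10:00:00Z
    }
}
```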

Regards,
Vishal



Re: Order of applying tokens/filter

2020-10-04 Thread Walter Underwood
Several problems.

1. Do not remove stopwords. That is a 1970s-era hack for saving disk space.
Want to search for “vitamin a”? Better not remove stopwords.
2. Apply synonyms before the stemmer, especially with the Porter stemmer,
whose output isn’t English words (so English synonym entries would never match
stemmed tokens).
3. Use KStem instead of Porter. Porter is a clever hack from 1980, but we have
better technology now.
4. Add RemoveDuplicatesFilter as the last step, just in case your synonyms stem
to the same word. It is cheap insurance.

Also, I really recommend using the ICUNormalizer2CharFilterFactory with “nfkc” 
mode as the first step before the tokenizer. Otherwise, you’ll get bitten by 
some weird Unicode thing that takes days to debug. And if you are going to 
lower-case everything, let ICU do that for you with “nfkc_cf” mode.

So that gives:

ICUNormalizer2CharFilterFactory name=“nfkc_cf” (the default)
WhitespaceTokenizerFactory
SynonymGraphFilterFactory
FlattenGraphFilterFactory
KStemFilterFactory
RemoveDuplicatesFilterFactory
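Expressed as a schema fieldType, that chain would look roughly like the sketch below (the fieldType name and synonyms file are illustrative, and the ICU factories require the ICU analysis jars on the classpath; a query-time analyzer would normally omit FlattenGraphFilterFactory, which is only needed at index time):

```xml
<fieldType name="text_search" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <charFilter class="solr.ICUNormalizer2CharFilterFactory" name="nfkc_cf"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt" ignoreCase="true"/>
    <filter class="solr.FlattenGraphFilterFactory"/>
    <filter class="solr.KStemFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>
```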

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Oct 4, 2020, at 9:24 PM, Jayadevan Maymala  
> wrote:
> 
> Hi all,
> 
> Is this the best (performance-wise as well as efficacy) order of applying
> analyzers/filters? We have an eCom site where many products are listed,
> and users may type in search words and get relevant results.
> 
> 1) Tokenize on whitespace (WhitespaceTokenizerFactory)
> 2) Remove stopwords (StopFilterFactory)
> 3) Stem (PorterStemFilterFactory)
> 4) Convert to lowercase  (LowerCaseFilterFactory)
> 5) Add synonyms (SynonymGraphFilterFactory,FlattenGraphFilterFactory)
> 
> Any possible gotchas?
> 
> Regards,
> Jayadevan



Converting a 3 node cloud to 1-node

2020-10-04 Thread Jayadevan Maymala
Hi all,
We have a Solr cluster (version 7.3) with 3 nodes in production. We are
moving from one platform provider to another. Even after the move is
complete, I would like to keep the Solr cloud on the existing platform
running for a few weeks, just to compare search results, refer to config
files, etc. Since live searches won't be going to the existing cloud, and
availability is not critical, I would like to make it a one-node instance.
What are the steps to be followed? All our collections have 2 replicas and
are not sharded. The Solr cloud uses a 3-node ZooKeeper ensemble. I would
like to move to a one-Solr-node, one-ZooKeeper setup.

A few things I can think of -
On node1 -
1) Edit /etc/default/solr.in.sh and change
ZK_HOST=
to refer to only one server.

2) Edit /opt/zookeeper/conf/zoo.cfg
and remove the server.2 and server.3 entries.

Shut down node2 and node3. On node1, restart the Solr service and restart
ZooKeeper.
Anything else?
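For reference, a single-server zoo.cfg would be reduced to something like this (paths and ports are the common defaults and may differ in your install; with only one server, the server.N lines can be dropped entirely and ZooKeeper runs in standalone mode):

```
tickTime=2000
dataDir=/var/lib/zookeeper
clientPort=2181
```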

Regards,
Jayadevan