On 30/01/2018 16:09, Jan Høydahl wrote:
Hi,
A customer has 10 separate SolrCloud clusters, with the same schema across all,
but different content.
Now they want users in each location to be able to federate a search across all
locations.
Each location is 100% independent, with separate ZK etc. Ban
Many years ago, in a different universe, when Federated Search was a buzzword we
used Unity from FAST FDS (which is now MS ESP). It worked pretty well across
many systems like FAST FDS, Google, Gigablast, ...
Very flexible with different mixers, parsers, query transformers.
Was written in Python an
"I am getting the records matching the full name sorted by distance.
If the input string (for ex. Dae Kim) is provided, I am getting the records
other than Dae Kim (for ex. Rodney Kim) too at the top of the search results,
including Dae Kim
just before the next Dae Kim, because Kim is matching in all
Hi Wendy,
I see several issues, but not sure if any of them is the reason why you are not
getting what you expect:
* there are no spaces around OR, which sometimes results in OR being parsed
as part of a term, e.g. (pdb_id:OR\"Solution)^5
* wildcard in quotes - it is not handled as you expected - the
Hello,
I'm trying to get the documents which got indexed on calling DIH, and I want to
differentiate such documents from the ones which are added using a SolrJ atomic
update.
Is it possible to get the document primary keys which got indexed thru
"onImportEnd" Eventlistener?
Any alternative way I
I worked personally on the SimpleFacets class which does the facet method
selection :
FacetMethod appliedFacetMethod =
    selectFacetMethod(field, sf, requestedMethod, mincount, exists);
RTimer timer = null;
if (fdebug != null)
Hi Srinivas,
I guess you can add some field that will be set in your DIH config - something
like:
And you can use the 'dih' field to filter out docs that are imported using DIH.
HTH,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training -
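A minimal sketch of the kind of DIH config Emir describes, using the stock TemplateTransformer; the entity name and SQL are hypothetical, only the 'dih' marker field matters:

```xml
<!-- hypothetical data-config.xml fragment: stamp every DIH-imported doc -->
<entity name="item" query="SELECT id, title FROM items"
        transformer="TemplateTransformer">
  <!-- constant value added to each document by TemplateTransformer -->
  <field column="dih" template="true"/>
</entity>
```

Docs imported by DIH can then be selected with fq=dih:true, assuming a matching field exists in the schema.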
Yes, that is correct. Collection 'features' stores mapping between features
and their scores.
For simplicity, I tried to keep the level of detail about these collections
to a minimum.
Both collections contain thousands of records and are updated by (lily)
hbase-indexer. Therefore storing scores/we
Hi Emir,
Thanks for the reply,
As I'm doing atomic updates on the existing documents (already indexed from DIH)
as well, with the suggested approach I might end up doing an atomic update on a
DIH-imported document and committing the same.
So, I wanted to get the document values which were indexed when i
We are currently deploying Solr in war mode (yes, the recommendation is not war,
but this is something I can't change now; planned for the future). I am setting
up authentication for Solr. As Solr-provided basic authentication is not
working in Solr 6.4.2, I am setting up digest authentication in Tomcat for
So
So all fields are DIH imported? And you just want to know which are from the
last run? Can you add a date field, track when DIH started and ended, and
filter based on that?
Emir
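A sketch of that date-based variant, assuming a date field with a NOW default so every indexed doc gets a timestamp (the field name is hypothetical):

```xml
<!-- hypothetical schema.xml field: stamped at index time -->
<field name="index_time" type="date" indexed="true" stored="true" default="NOW"/>
```

After noting when the DIH run started, the imported docs can be selected with a range filter such as fq=index_time:[2018-01-31T10:00:00Z TO *].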
I am not sure I fully understood your use case, but let me suggest a few
possible solutions:
1) Query-time join approach: you keep 2 collections, one static with all
the pages, one that just stores lightweight documents describing the crawling
interactions:
1) id, content -> Pages
2) pageId
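Under that layout, a query-time join might look like this; the collection and field names are loose guesses from the sketch above:

```
q={!join fromIndex=crawl_log from=pageId to=id}crawler:simulationA
```

This returns Pages whose id appears as pageId in matching crawl_log docs. Note that in SolrCloud a fromIndex join requires the "from" collection to be single-shard and replicated alongside the "to" collection.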
Ah thanks, I just submitted a patch fixing it.
Anyway, in the end it appears this is not the problem we are seeing, as our
timeouts were already at 30 seconds.
All I know is that at some point nodes start to lose ZK connections due to
timeouts (logs say so, but all within 30 seconds); the logs a
Hello guys,
I want to add an option to search documents by size. For example, find the
top categories with the biggest documents. I thought about creating a new
update processor which would count the bytes of all fields in the document,
but I think it won't work well, because some fields are stored
With any generic solution there will always be the question of what the
document size is: should you count the same field twice if it is indexed in two
different ways? Does the size in the index count, or the size of the response?
If a simplified version works for you, approximate the doc size by the size of
the largest f
Hi,
We are using solr for our movie title search.
As it is a "title search", this should be treated differently than normal
document search.
Hence, we use a modified version of TFIDFSimilarity with the following
changes:
- disabled TF & IDF so they will only have 1 as the value.
- disabled norms by spe
Hi Luigi,
What about using an updatable DocValue [1] for the field x? You could
initially set it to -1,
and then update it for the docs in step j. Range queries should still work,
and the update should be fast.
Cheers
[1] http://shaierera.blogspot.com/2014/04/updatable-docvalues-under-ho
Luigi
Is there a reason for not indexing all of your on-disk pages? That seems to be
the first step. But I do not understand what your goal is.
Cheers -- Rick
On January 30, 2018 1:33:27 PM EST, Luigi Caiazza wrote:
>Hello,
>
>I am working on a project that simulates a selective, large-scale
>cr
Eddy
Maybe your request is getting through twice. Check your logs to see.
Cheers -- Rick
On January 31, 2018 5:59:53 AM EST, ddramireddy wrote:
>We are currently deploying Solr in war mode(Yes, recommendation is not
>war.
>But this is something I can't change now. Planned for future). I am
>setti
Hi all,
According to the documentation, the 'shard' parameter for the CLUSTERSTATUS
action should allow a comma-delimited list of shards. However, passing
'shard1,shard2' as the value results in a shard-not-found error where it
was looking for a single shard named 'shard1,shard2', not a search for 'shard1' and 'shard2
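For reference, the documented multi-shard request (reported above to fail) versus a per-shard workaround; the host and collection name are placeholders:

```
# documented form - here treated as one shard named "shard1,shard2":
curl "http://localhost:8983/solr/admin/collections?action=CLUSTERSTATUS&collection=mycoll&shard=shard1,shard2"
# workaround: one request per shard
curl "http://localhost:8983/solr/admin/collections?action=CLUSTERSTATUS&collection=mycoll&shard=shard1"
```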
Hi,
We are using Solr Cloud 6.1. We have around 20 collections on 4 nodes (we have 2
shards and each shard has 2 replicas). We have allocated 40 GB RAM to each
shard.
Intermittently we see long GC pauses (60 sec to 200 sec) due to which Solr
stops responding and hence collections go in rec
Hi
I'm using SolrJ 6.6.1 as found in spring-data-solr 3.0.3.RELEASE; Solr is
7.2.1.
I'm currently able to upload a SolrDocument via spring-data but would like to
add the equivalent of the Tika call
new AutoDetectParser().parse(stream, new BodyContentHandler(-1),
    new Metadata())
as a content field. Witho
Hi Maulin,
To clarify, when you said "...allocated 40 GB RAM to each shard." above,
I'm going to assume you meant "to each node" instead. If you actually did
mean "to each shard" above, please correct me and anyone who chimes in
afterward.
Firstly, it's really hard to even take guesses about pot
Am 29.01.18 um 18:05 schrieb Erick Erickson:
Try searching with the word "and" lowercased. Somehow you have to allow
the parser to distinguish the two.
Oh yeah, the biggest unsolved problem in the ~80 years history of
programming languages... NOT ;-)
You _might_ be able to try "AND~2" (with qu
Hi,
I'm using Solr 6.6.2 and I use ZooKeeper to handle SolrCloud. In the Java
client I use SolrJ this way:
*client = new CloudSolrClient.Builder().withZkHost(zkHostString).build();*
In the log I see the followings:
*WARN [org.apache.zookeeper.SaslClientCallbackHandler] Could not login:
the Clie
You can use a separate field for title aliases. That is what I did for Netflix
search.
Why disable idf? Disabling tf for titles can be a good idea, for example the
movie “New York, New York” is not twice as much about New York as some other
film that just lists it once.
Also, consider using a
Or use a boost for the phrase, something like
"beauty and the beast"^5
On Wed, Jan 31, 2018 at 8:43 AM, Walter Underwood wrote:
> You can use a separate field for title aliases. That is what I did for
> Netflix search.
>
> Why disable idf? Disabling tf for titles can be a good idea, for example
On 1/31/2018 9:07 AM, Tamás Barta wrote:
I'm using Solr 6.6.2 and I use ZooKeeper to handle SolrCloud. In the Java
client I use SolrJ this way:
*client = new CloudSolrClient.Builder().withZkHost(zkHostString).build();*
In the log I see the followings:
*WARN [org.apache.zookeeper.SaslClientCall
Just to double check: when you say you're seeing 60-200 sec GC pauses,
are you looking at the GC logs (or using some kind of monitor) or is
that the time it takes the query to respond to the client? Because a
single GC pause that long on 40G is unusual no matter what. Another
take on Jason's questi
Hi Emir,
Thank you so much for following up with your ticket.
Listed below are the parts of debugQuery outputs via /search request
handler. The reason I used * in the query term is that there are a couple of
methods starting with "x-ray". When I used space surrounding the "OR"
boolean search opera
Hi Wendy,
With OR with spaces, OR is interpreted as another search term. Can you try
without OR - just a space between the two parts? If you need AND, use +
before each part.
HTH,
Emir
On Jan 31, 2018 6:24 PM, "Wendy2" wrote:
Hi Emir,
Thank you so much for following up with your ticket.
Listed belo
Hi,
first of all, thank you for your answers.
@ Rick: the reason is that the set of pages that are stored into the disk
represents just a static view of the Web, in order to let my experiments be
fully replicable. My need is to run simulations of different crawlers on
top of it, each working on t
Hiya,
So I have some nested documents in my index with this kind of structure:
{
  "id": "parent",
  "gridcell_rpt": "POLYGON((30 10, 40 40, 20 40, 10 20, 30 10))",
  "density": "30",
  "_childDocuments_": [
    {
      "id": "child1",
      "gridcell_rpt": "MULTI
Hi Emir,
Listed below are the debugQuery outputs from query without "OR" operator. I
really appreciate your help! --Wendy
===DebugQuery Outputs for case 1f-a, 1f-b without "OR"
operator=
*1f-a (/search?q=+method:"x-ray*" +method:"Solution NMR") result counts = 0:
*
We did some basic load testing on our 7.1.0 and 7.2.1 clusters.
And that came out all right.
We saw a performance improvement of about 30% in read latencies between 6.6.0
and 7.1.0
And then we saw a performance degradation of about 10% between 7.1.0 and
7.2.1 in many metrics.
But overall, it still see
Hey Maulin,
I hope you are using some tool to look at your gc.log file (there are a
couple available online) or grepping for pauses.
Do you mind sharing your G1GC settings and some screenshots from your
gc.log analyzer's output ?
-SG
On Wed, Jan 31, 2018 at 9:16 AM, Erick Erickson
wrote:
> Jus
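For reference, a common Java 8 baseline for getting pause data into a gc.log (not Solr-specific; paths are placeholders):

```
-Xloggc:/var/solr/logs/gc.log
-XX:+PrintGCDetails
-XX:+PrintGCDateStamps
-XX:+PrintGCApplicationStoppedTime
-XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=9 -XX:GCLogFileSize=20M
```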
On my AWS t2.micro instance, which only has 1 GB of memory, I installed Solr
(4.7.1 - please don't ask) and tried to run it in the sample directory as java
-jar start.jar. It exited shortly afterwards due to lack of memory.
How much memory does Solr require to run, with empty core?
TK
Hello S.G.
We do not complain about speed improvements at all, it is clear 7.x is faster
than its predecessor. The problem is stability and not recovering from weird
circumstances. In general, it is our high load cluster containing user
interaction logs that suffers the most. Our main text sear
On 1/31/2018 1:54 PM, TK Solr wrote:
On my AWS t2.micro instance, which only has 1 GB memory, I installed
Solr (4.7.1 - please don't ask) and tried to run it in sample
directory as java -jar start.jar. It exited shortly due to lack of
memory.
How much memory does Solr require to run, with emp
Hi Wendy,
I was thinking of the query q=method:"x-ray*" "Solution NMR".
This should be equivalent to one with OR between them. If you want AND
between those two, the query would be q=+method:"x-ray*" +"Solution NMR"
Emir
Thanks for the reply.
I see that the child doctransformer
(https://lucene.apache.org/solr/guide/6_6/transforming-result-documents.html#TransformingResultDocuments-_child_-ChildDocTransformerFactory)
has a childFilter= option which, when used, solves the issue/bug.
But such a childFilter does not
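For context, the transformer-plus-childFilter usage under discussion looks roughly like this, with field values taken from the earlier nested-document example and the filters themselves hypothetical:

```
fl=id,gridcell_rpt,[child parentFilter=density:[* TO *] childFilter=id:child1]
```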
Hi,
I am an ex-FAST employee and actually used Unity a lot myself, even hacking the
code,
writing custom mixers etc. :)
That is all cool if you want to write a generic federation layer. In our case
we only ever
need to talk to Solr instances with exactly the same schema and document types,
compat
Thanks Alessandro. Totally agree that from the logic I can't see why the
requested facet.method=uif is not accepted. I don't see anything in
solr.log also. However I find that the uif method somehow works with json
facet api in cloud mode, e.g:
curl http://mysolrcloud:8983/solr/mycollection/sele
Erick:
> ...one for each cluster and just merged the docs when it got them back
This would be the logical way. I'm afraid that "just merged the docs" is the
crux here; that would
make this an expensive task. You'd have to merge docs, facets, highlights etc.,
handle the
different search phases (
@Walter: We have 6 fields declared in schema.xml for title, each with a
different type of analyzer: one without processing symbols, another stemmed,
another removing symbols, etc. So if we have separate fields for each alias, it
will be that many times the number of final fields declared in the schema
For smaller documents, TFIDFSimilarity will weight towards shorter
documents. Another way to say this: if your documents are 5-10 terms, the
5-term ones are going to win.
You might think about having per-token, or token-pair, weights. I would be
surprised if there was not something similar out t
Hi,
Have you managed to get the regex for this string in Chinese: 预支款管理及账务处理办法?
Regards,
Edwin
On 4 January 2018 at 18:04, Zheng Lin Edwin Yeo
wrote:
> Hi Emir,
>
> An example of the string in Chinese is 预支款管理及账务处理办法
>
> The number of characters is 12, but the expected length should be 36.
>
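The 12-versus-36 discrepancy is a character count versus a UTF-8 byte count; in Python terms:

```python
s = "预支款管理及账务处理办法"
print(len(s))                  # prints 12: twelve characters (code points)
print(len(s.encode("utf-8")))  # prints 36: each of these CJK characters takes 3 bytes in UTF-8
```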
@Tim Casey: Yeah... TFIDFSimilarity weighs towards shorter documents. This
is done through the fieldNorm component in the class. The issue is when the
field is multivalued. Consider a field that has two strings, each of 4 tokens.
The fieldNorm from the Lucene TFIDFSimilarity class considers the total su
I was the first search engineer at Netflix and moved their search from a
home-grown engine to Solr. It worked very well with a single title field and
aliases.
I think your schema is too complicated for movie search.
Stemming is not useful. It doesn’t help search and it can hurt. You don’t want
@Walter: Perhaps you are right not to consider stemming. Instead, fuzzy
search will cover these along with the misspellings.
In case of symbols, we want the titles matching the symbols ranked higher
than the others. Perhaps we can use this field only for boosting.
Certain movies have around 4-6
You need to tokenize the full name in several different ways and then
search both (all) tokenization versions with different boosts.
This way you can tokenize as a full string (perhaps lowercased), and then
also on whitespace, and then maybe even with phonetic mapping to catch
alternate spellings.
You can see
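A sketch of that multi-tokenization setup as schema fields fed by copyField; all field and type names are hypothetical:

```xml
<!-- hypothetical schema.xml fragment: index the same name several ways -->
<field name="name" type="string_lowercase" indexed="true" stored="true"/>
<field name="name_ws" type="text_ws" indexed="true" stored="false"/>
<field name="name_phonetic" type="text_phonetic" indexed="true" stored="false"/>
<copyField source="name" dest="name_ws"/>
<copyField source="name" dest="name_phonetic"/>
```

At query time the versions can then be weighted with something like edismax qf=name^5 name_ws^2 name_phonetic.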