Hi Li,
If you could supply some more info from your logs, that would help.
We also had a similar issue. There were some bugs related to SolrCloud
that were fixed in Solr 4.10.4 and further in Solr 5.x.
I would suggest you compare your logs with the defects in the 4.10.4 release
notes to see if they are the s
Yes, it definitely seems to be the main problem for us. I did some simple tests
of the encoding and decoding calculations in DefaultSimilarity, and my findings
are:
* For input between 1.0 and 0.5, a difference of 0.01 in the input causes the
output to change by a value of 0 or 0.125 depending
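For anyone following along: the coarseness comes from Lucene packing each norm into a single byte via SmallFloat.floatToByte315 (3 mantissa bits, 5 exponent bits). Here is a rough Python port of that encode/decode for experimenting, assuming I've read the Lucene source correctly:

```python
import math
import struct

def float_to_byte315(f: float) -> int:
    """Compress a float into 8 bits (3 mantissa bits, 5 exponent bits),
    mirroring Lucene's SmallFloat.floatToByte315. Truncates, never rounds up."""
    bits = struct.unpack('>i', struct.pack('>f', f))[0]
    small = bits >> (24 - 3)
    if small <= ((63 - 15) << 3):
        return 0 if bits <= 0 else 1
    if small >= ((63 - 15) << 3) + 0x100:
        return 0xFF
    return small - ((63 - 15) << 3)

def byte315_to_float(b: int) -> float:
    """Inverse of the above, mirroring SmallFloat.byte315ToFloat."""
    if b == 0:
        return 0.0
    bits = ((b & 0xFF) << (24 - 3)) + ((63 - 15) << 24)
    return struct.unpack('>f', struct.pack('>i', bits))[0]

# fieldNorm for a 2-term vs. 3-term field, before and after encoding:
for terms in (2, 3):
    raw = 1.0 / math.sqrt(terms)
    stored = byte315_to_float(float_to_byte315(raw))
    print(f"{terms} terms: raw={raw:.4f} stored={stored}")
```

With this sketch, 1/sqrt(2) = 0.7071 is stored as 0.625 and 1/sqrt(3) = 0.5774 as 0.5, which shows how two nearby norms land on coarse steps after the round trip.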
Yes, we do edismax per field boosting, with explicit boosting of the title
field. So it sure makes length normalization less relevant. But not
*completely* irrelevant, which is why I still want to have it as part of the
scoring, just with much less impact than it currently has.
/Jimi
__
Yes, the example was contrived. Partly because our documents are mostly in
Swedish text, but mostly because I thought that the example should be simple
enough so it focused on the thing discussed (even though I simplified it to
such a degree that I left out the current main problem with the fiel
Hi
I am trying to facet results on my nested documents. The Solr documentation
does not say much on how to pivot with the JSON API on nested documents.
Could someone show me some examples? Thanks very much.
Yangrui
OK, I understand that. So you would say documents traverse the network.
If I specify some 100 docs to be displayed on my first page, will it affect
performance? While docs get traversed, will there be any high-volume
traffic that affects the performance of the application?
And what's the time sol
Hi Shawn,
Yes, I'm using the Extracting Request Handler.
The 0.7 GB/hr is the indexing rate, measured by the size of the original
documents that get ingested into Solr. This means that every hour,
only 0.7 GB of my documents gets ingested into Solr. It will require 10
hours just to index documents
I have 5 ZooKeeper and 2 Solr machines, and after a month or two the whole
cluster shuts down; I don't know why. The logs I get in ZooKeeper are attached
below. Otherwise I don't get any errors. All this runs on Linux VMs.
2016-03-11 16:50:18,159 [myid:5] - WARN [SyncThread:5:FileTxnLog@334] -
fsync-ing
On 4/20/2016 8:10 PM, Zheng Lin Edwin Yeo wrote:
> I'm currently running 4 threads concurrently to run the indexing, which
> means I run the script in command prompt in 4 different command windows.
> The ID has been configured in such a way that it will not overwrite each
> other during the indexin
Or should this be rated higher for NY, since it's shorter:
* New York
Another thought on length norms: with the advent of multi-field dismax with
per-field boosting, people tend to explicitly boost the title field so that
the traditional length normalization is less relevant.
-- Jack Krupansky
Hi Shawn,
I'm currently running 4 threads concurrently to run the indexing, which
means I run the script in command prompt in 4 different command windows.
The ID has been configured in such a way that it will not overwrite each
other during the indexing. Is that considered multi-threading?
The ra
Thanks for your reply.
I have managed to solve the problem. The reason is that we have to use "/"
instead of "\", even on Windows, and to include the data folder as
well.
This is the working one:
dataDir=D:/collection1/data
Regards,
Edwin
On 20 April 2016 at 21:39, Bram Van Dam wrot
Sure, here are some real world examples from my time at Netflix.
Is this movie twice as much about “new york”?
* New York, New York
Which one of these is the best match for “blade runner”:
* Blade Runner: The Final Cut
* Blade Runner: Theatrical & Director’s Cut
* Blade Runner: Workprint
http:
Maybe it's a cultural difference, but I can't imagine why on a query for
"John", any of those titles would be treated as anything other than equals
- namely, that they are all about John. Maybe the issue is that this seems
like a contrived example, and I'm asking for a realistic example. Or, maybe
Hi Jim,
The fieldNorm encode/decode step causes some precision loss.
This may be a problem when dealing with very short documents.
You can find many discussions on this topic.
ahmet
On Thursday, April 21, 2016 3:10 AM, "jimi.hulleg...@svensktnaringsliv.se"
wrote:
Ok sure, I can try and give som
Ok sure, I can try and give some examples :)
Let's say that we have the following documents:
Id: 1
Title: John Doe
Id: 2
Title: John Doe Jr.
Id: 3
Title: John Lennon: The Life
Id: 4
Title: John Thompson's Modern Course for the Piano: First Grade Book
Id: 5
Title: I Rode With Stonewall: Being C
I'm not sure I fully follow what distinction you're trying to focus on. I
mean, traditionally length normalization has simply tried to distinguish a
title field (rarely more than a dozen words) from a full body of text, or
maybe an abstract, not things like exactly how many words were in a title.
O
I am talking about the title field. And for the title field, a sweet-spot
interval of 1 to 50 makes very little sense. I want a fieldNorm value that
differentiates between, for example, 2, 3, 4 and 5 terms in the title, but
only very little.
The 20% number I got by simply calculating the d
The driver documentation talks about "sessionVariables" that might be
possible to pass through the connection URL:
https://dev.mysql.com/doc/connector-j/5.1/en/connector-j-reference-configuration-properties.html
Alternatively, there might be a way to configure driver via JNDI and
set some variable
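If the sessionVariables connection property works as documented on that page, the DIH dataSource URL might look like this (an untested sketch; host, database and credentials are placeholders):

```xml
<dataSource type="JdbcDataSource"
            driver="com.mysql.jdbc.Driver"
            url="jdbc:mysql://localhost:3306/mydb?sessionVariables=group_concat_max_len=100"
            user="solr"
            password="..."/>
```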
Hi All,
We are using SolrCloud 4.6.1. We have observed the following behaviors
recently. A Solr node in a SolrCloud cluster is up, but some of the cores
on the node are marked as down in ZooKeeper. If the cores are part of a
multi-sharded collection with one replica, the queries to that collectio
Hi Jimi,
Please define a meaningful document-length range, like min=1 max=50.
By the way you need to reindex every time you change something.
Regarding the 20% score change, I am not sure how you calculated that number,
and I assume it is correct.
What really matters is the relative order of documents
Hi Jimi,
Contribution to the documentation is very important.
It would be great if you can prepare a good text explaining things with common
sense and easy to understand. Please include your documentation proposal as a
comment on the Confluence wiki [1].
[1] https://cwiki.apache.org/confluence
Well, it's been a long time since I took any data structures and algorithms
course (2000, basically), and after the recent Solr 6 feature chat, I was very
curious whether there was real computational goodness behind the move towards a
JDBC interface based on Streaming Expressions. This led me
Hang on... It didn't work out as I wanted. But the problem seems to be in the
encoding of the fieldNorm value. The decoded value is so coarse that two
values that were quite close to each other originally can become quite far
apart after encoding and de
Thanks Ahmet! The second I read that part about the "albino elephant" query I
remembered that I had read that before, but just forgotten about it. That
explanation is really good, and really should be part of the regular
documentation if you ask me. :)
/Jimi
-Original Message-
From: Ah
Hi Ahmet,
SweetSpotSimilarity seems quite nice. Some simple testing by throwing some
different values at the class gives quite good results. Setting ln_min=1,
ln_max=2, steepness=0.1 and discountOverlaps=true should give me more or less
what I want. At least for the title field. I'm not sure wh
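For anyone curious what those settings actually do: per the SweetSpotSimilarity javadoc, the length norm is 1/sqrt(steepness*(|len-min|+|len-max|-(max-min))+1), i.e. flat at 1.0 inside [min, max] and gently decaying outside. A quick Python sketch with the values above (ln_min=1, ln_max=2, steepness=0.1); note the raw value would still pass through the one-byte norm encoding before being stored:

```python
import math

def sweetspot_length_norm(length: int, ln_min: int = 1, ln_max: int = 2,
                          steepness: float = 0.1) -> float:
    """SweetSpotSimilarity-style length norm: 1.0 for lengths inside
    [ln_min, ln_max], decaying outside at a rate set by steepness."""
    slack = abs(length - ln_min) + abs(length - ln_max) - (ln_max - ln_min)
    return 1.0 / math.sqrt(steepness * slack + 1.0)

for n in range(1, 6):
    print(n, round(sweetspot_length_norm(n), 4))
```

This prints 1.0 for lengths 1 and 2, then roughly 0.9129, 0.8452, 0.7906 for 3, 4, 5 terms: a small, gradual penalty rather than the steep 1/sqrt(n) drop.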
Thank you all for your very valuable suggestions.
I will try out the options shared once our set up is ready and probably get
back on my experience once it is done.
Thanks!
Mark.
On Wed, Apr 20, 2016 at 9:54 AM, Bram Van Dam wrote:
> > I have a requirement to index (mainly updates) 700 docs pe
Hi, thanks for answering. My problem is that users do not indicate which
field a color belongs to in the query. For example, in "which black driver
has a white Mercedes", it is difficult to distinguish which color belongs
to which field, because there can be thousands of car brands and
professions.
Hi all,
I have been stretching some of Solr's capabilities for nested document
handling, and I've come up with the following issue...
Let's say I have the following structure:
{
"blog-posts":{ //level 1
"leaf-fields":[
"date",
"author"],
"title":{
Hi Jimi,
Field based scoring, where you query multiple fields (title,body,keywords etc)
with multiple query terms, is an unsolved problem.
(E)dismax is a heuristic approach to attack the problem.
Please see the javadoc of DisjunctionMaxQuery :
https://lucene.apache.org/core/6_0_0/core/org/apac
Hi Jimi,
SweetSpotSimilarity allows you to define a document length range, so that all
documents in that range will get the same fieldNorm value.
In your case, you can say that from 1 word up to 100 words, no document-length
punishment is applied. If a document is longer than 100 words, apply some punishment.
By
Yangrui,
First, have you indexed your documents with proper nested document structure
[https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Index+Handlers#UploadingDatawithIndexHandlers-NestedChildDocuments]?
From the piece of data you showed, it seems that you just put it righ
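For reference, the JSON update format described on that wiki page nests children under the `_childDocuments_` key. A minimal hypothetical example (the field names here are made up for illustration):

```json
[
  {
    "id": "post1",
    "type_s": "blog-post",
    "title_t": "My first post",
    "_childDocuments_": [
      { "id": "post1-c1", "type_s": "comment", "author_s": "alice" },
      { "id": "post1-c2", "type_s": "comment", "author_s": "bob" }
    ]
  }
]
```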
Viday,
No, not all of those 500 result docs will be brought to your client (browser,
etc.) Only as many documents as fit into the 1st "search result page" will be
brought.
There is a notion of "pagination" in Solr (as in most search engines).
The counts of occurrence might be appro
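In other words, only `rows` documents per request cross the wire, and the client walks through pages by bumping `start`. A tiny sketch of the offset arithmetic (the helper name is mine, not a Solr API):

```python
def page_params(page: int, rows: int = 10) -> dict:
    """Solr-style pagination: 'start' is the zero-based index of the first
    document to return, 'rows' caps how many documents come back."""
    return {"start": (page - 1) * rows, "rows": rows}

print(page_params(1))      # first page: {'start': 0, 'rows': 10}
print(page_params(3, 20))  # third page of 20: {'start': 40, 'rows': 20}
```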
OK. Well, still, the fact that the score increases almost 20% because of just
one extra term in the field, is not really reasonable if you ask me. But you
seem to say that this is expected, reasonable and wanted behavior for most use
cases?
I'm not sure that I feel comfortable replacing the defa
Hi,
I have been looking a bit at the tie parameter, and I think I understand how it
works, but I still have a few questions about it.
1. It is not documented anywhere (as far as I have seen) what the default value
is. Some testing indicates that the default value is 0, and it makes perfect
sen
FWIW, length for normalization is measured in terms (tokens), not
characters.
With TF-IDF similarity (the default before 6.0), the normalization is based
on the inverse square root of the number of terms in the field:
return state.getBoost() * ((float) (1.0 / Math.sqrt(numTerms)));
That code is i
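To put numbers on that formula (boost omitted), here is a quick sketch comparing a 2-term and a 3-term title:

```python
import math

def length_norm(num_terms: int) -> float:
    # ClassicSimilarity lengthNorm with the boost factor left out:
    return 1.0 / math.sqrt(num_terms)

ratio = length_norm(2) / length_norm(3)
print(f"2-term vs 3-term norm ratio: {ratio:.4f}")  # ~1.2247, i.e. ~22% higher
```

So one extra term in a short title shifts the raw norm by roughly 22% before the one-byte encoding quantizes it, which is in the same ballpark as the ~20% figure discussed in this thread.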
Thanks to those that watched live. If you missed it, here's the audio
recording if you'd like to listen in
http://opensourceconnections.com/blog/2016/04/19/solr-6-release/
Best
-Doug
On Tue, Apr 19, 2016 at 12:32 PM Doug Turnbull <
dturnb...@opensourceconnections.com> wrote:
> Doh! Thanks Yonik
Hi,
In general I think that the fieldNorm factor in the score calculation is quite
good. But when the text is short, I think that the effect is too big.
I.e., with two documents that have a short text in the same field, just a few
extra characters in one of the documents lower the fieldNorm factor too
Hi
When I query a word in Solr, let's say 500 documents containing that keyword
match. Will all those documents traverse the network?
Or how does it happen?
Please help me on this.
--
View this message in context:
http://lucene.472066.n3.nabble.com/Traversal-of-documents-th
> I have a requirement to index (mainly updates) 700 docs per second.
> Suppose I have a 128GB RAM, 32 CPU machine, with each doc around 260
> bytes in size (6 fields, out of which only 2 will undergo updates at the
> above rate). This collection has around 122 million docs and that count is pretty
> m
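As a sanity check on the raw volume implied by those numbers (taken from the quoted post):

```python
docs_per_sec = 700
doc_bytes = 260

# Raw input bandwidth of the update stream, ignoring indexing overhead:
ingest_mb_per_sec = docs_per_sec * doc_bytes / 1e6
print(f"raw ingest: {ingest_mb_per_sec:.3f} MB/s")  # 0.182 MB/s
```

So the raw document stream itself is tiny; any bottleneck at that rate would come from update handling and commits, not input bandwidth.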
On 4/19/2016 10:12 PM, Zheng Lin Edwin Yeo wrote:
> Thanks for the information Shawn.
>
> I believe it could be due to the types of files being indexed.
> Currently, I'm indexing EML files which are in HTML format, and they
> are more rich in content (with in line images and full text),
On 4/20/2016 6:01 AM, Zaccheo Bagnati wrote:
> I configured an ImportHandler on a MySQL table using jdbc driver. I'm
> wondering if it is possible to set a session variable in the MySQL connection
> before executing queries. e. g. "SET SESSION group_concat_max_len =
> 100;"
Normally the MySQL JDB
Have you considered simply mounting different disks under different
paths? It looks like you're using Windows, so I'm not sure if that's
possible, but it seems like a relatively basic task, so who knows.
You could mount Disk 1 as /path/to/collection1 and Disk 2 as
/path/to/collection2. That way yo
Hi,
I would like to load Solr documents (based on certain criteria) into an
application cache (Hazelcast).
Is there any better way to do it than firing paginated queries? Thanks.
Regards,
Anil
Hi all,
I configured an ImportHandler on a MySQL table using the JDBC driver. I'm
wondering if it is possible to set a session variable in the MySQL connection
before executing queries, e.g. "SET SESSION group_concat_max_len =
100;"
Thanks
Bye
Zaccheo
Thanks for your answer, David, and have a good vacation.
It seems a more detailed heatmap is not a good solution in my case, because I
need to display a cluster icon with the number of items inside the cluster. So
if I get a very large number of cells on the map, some of the cells will
overlap.