custom shard or auto shard for SolrCloud?

2015-09-01 Thread Scott Chu
I posted this question on Stack Overflow and would like some suggestions: 

solr - Custom sharding or auto Sharding on SolrCloud? - Stack Overflow
http://stackoverflow.com/questions/32343813/custom-sharding-or-auto-sharding-on-solrcloud


Scott Chu,scott@udngroup.com
2015/9/2 


concept and choice: custom sharding or auto sharding?

2015-09-02 Thread scott chu
I posted a question on Stack Overflow 
(http://stackoverflow.com/questions/32343813/custom-sharding-or-auto-sharding-on-solrcloud).
Since this is a mailing list, I repost the question below to ask for 
suggestions and a clearer picture of SolrCloud's behavior on document routing.

I want to establish a SolrCloud cluster for over 10 million news articles. 
After reading the article "Shards and Indexing Data in SolrCloud" in the 
Apache Solr Reference Guide, I have a plan as follows:

1. Add a prefix such as ED2001! to the document ID, where ED is a newspaper 
source and 2001 is the year from the article's publication date, i.e. I want 
to put all news articles of a specific newspaper source published in a 
specific year into one shard.
2. Create the collection with router.name set to compositeId.
3. Add documents?
4. Query the collection?

Practically, I have some questions:

1. How do I add documents under this plan? Do I have to specify special 
parameters when updating the collection/core?
2. Is this called "custom sharding"? If not, what is "custom sharding"?
3. Is auto sharding a better choice for my case, since there is a 
shard-splitting feature for auto sharding when a shard grows too big?
4. Can I query without the _router_ parameter?

EDIT @ 2015/9/2:
This is what I think SolrCloud will do: "The number of news articles from a 
specific newspaper source in a specific year tends to stay around a fixed 
level, e.g. every year ED has around 80,000 articles, so each shard's size 
won't grow dramatically. For the next year's ED articles, I only have to 
prefix the document IDs with 'ED2016!'; SolrCloud will create a new shard for 
me (containing all the ED2016 articles), and later the leader will spread 
replicas of this new shard to other nodes (one replica per node other than 
the leader?)". Am I right? If yes, it seems there is no need for 
shard-splitting.


Re: Re: concept and choice: custom sharding or auto sharding?

2015-09-02 Thread scott chu
 
Hello solr-user,

Do you mean I only have to put 10M documents in one index and copy it to many 
slaves in a classic Solr master-slave architecture to provide a query service 
on the internet, without an obvious degradation of query performance? But I 
did add 1M documents to one index on a master with 2 slaves serving queries on 
the internet, and the query performance is kinda sad. Why do you say "at 10M 
documents there's rarely a need to shard at all"? Did I provide too few 
slaves? What document count makes sharding in SolrCloud necessary?

- Original Message - 
From: Erick Erickson 
To: solr-user 
Date: 2015-09-02, 23:00:29
Subject: Re: concept and choice: custom sharding or auto sharding?


Frankly, at 10M documents there's rarely a need to shard at all.
Why do you think you need to? This seems like adding
complexity for no good reason. Sharding should only really
be used when you have too many documents to fit on a single
shard as it adds some overhead, restricts some
possibilities (cross-core join for instance, a couple of
grouping options don't work in distributed mode etc.).

You can still run SolrCloud and have it manage multiple
_replicas_ of a single shard for HA/DR.

So this seems like an XY problem, you're asking specific
questions about shard routing because you think it'll
solve some problem without telling us what the problem
is.

Best,
Erick

On Wed, Sep 2, 2015 at 7:47 AM, scott chu  wrote:
> [original question quoted in full; snipped]






 


Re: Re: Re: concept and choice: custom sharding or auto sharding?

2015-09-02 Thread scott chu
 
Hello solr-user,

Thanks! I'll go back to check my old environment; that article is really 
helpful.

BTW, I think I got compositeId wrong. The reference guide says compositeId 
requires numShards. That means what I describe in question 5 (the EDIT of my 
original question) seems wrong, because I intended to plan one shard per whole 
year of news articles and thought SolrCloud would create a new shard for me by 
itself when I add a new year's articles. But since compositeId needs numShards 
specified first, there's no way I can know in advance how many years I will 
put into SolrCloud. It looks like if I want to use SolrCloud after all, I may 
have to use auto sharding (i.e. the implicit router).
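
As a concrete sketch of what the reference guide implies, creating such a 
collection through SolrJ 5.x's Collections API wrapper might look like this 
(collection and config names are hypothetical; the point is that numShards 
must be fixed up front when router.name is compositeId):

    import org.apache.solr.client.solrj.impl.CloudSolrClient;
    import org.apache.solr.client.solrj.request.CollectionAdminRequest;

    public class CreateNewsCollection {
        public static void main(String[] args) throws Exception {
            CloudSolrClient client = new CloudSolrClient("localhost:9983"); // zkHost
            CollectionAdminRequest.Create create = new CollectionAdminRequest.Create();
            create.setCollectionName("news");
            create.setConfigName("news_conf");
            create.setRouterName("compositeId"); // hash routing on the id prefix
            create.setNumShards(4);              // has to be decided in advance
            create.setReplicationFactor(2);
            create.process(client);
            client.close();
        }
    }

With the implicit router you instead name the shards yourself at creation time 
and tell Solr, per update, which shard each document belongs to.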
- Original Message - 
From: Erick Erickson 
To: solr-user 
Date: 2015-09-02, 23:30:53
Subject: Re: Re: concept and choice: custom sharding or auto sharding?


bq: Why do you say: "at 10M documents there's rarely a need to shard at all?"

Because I routinely see 50M docs on a single node, and I've seen over 300M docs
on a single node with sub-second responses. So if you're saying that you see
poor performance at 1M docs, then I suspect there's something radically wrong
with your setup: too little memory, very bad query patterns, whatever. If my
suspicion is true, then sharding will just mask the underlying problem.

You need to quantify your performance concerns. It's one thing to say
"my node satisfies 50 queries-per-second with 500ms response time" and
another to say "My queries take 5,000 ms".

In the first case, you do indeed need to add more servers to increase QPS if
you need 500 QPS. And adding more slaves is the best way to do that.
In the second, you need to understand the slowdown because sharding
will be a band-aid.

This might help:
https://wiki.apache.org/solr/SolrPerformanceProblems

Best,
Erick



On Wed, Sep 2, 2015 at 8:19 AM, scott chu  wrote:
> [earlier messages quoted in full; snipped]

Re: Re: Re: Re: concept and choice: custom sharding or auto sharding?

2015-09-02 Thread scott chu
 
Hello solr-user,

Sorry, wrong again: auto sharding is not the implicit router.
- Original Message - 
From: scott chu 
To: solr-user 
Date: 2015-09-02, 23:50:20
Subject: Re: Re: Re: concept and choice: custom sharding or auto sharding?


 
[previous messages quoted in full; snipped]

Re: Re: concept and choice: custom sharding or auto sharding?

2015-09-03 Thread Scott Chu
Do you use master-slave or SolrCloud for that single shard? Erick suggests 
that I can still use SolrCloud for HA/DR purposes, because ZooKeeper does that 
work for me. Should I give up the master-slave option even though there's only 
one shard?

Scott Chu,scott@udngroup.com
2015/9/3 
- Original Message - 
From: Toke Eskildsen 
To: solr-user 
Date: 2015-09-03, 17:46:22
Subject: Re: Re: concept and choice: custom sharding or auto sharding?


On Wed, 2015-09-02 at 08:30 -0700, Erick Erickson wrote:
> Because I routinely see 50M docs on a single node and I've seen over 300M docs
> on a single node with sub-second responses.

For what it's worth, we also do article-based search of newspaper material
(old OCR'ed papers). We use a single replicated shard for that and it works
fine (response times < 1s for 98.5% of the searches), with faceting on 4
fields as well as grouping. There are 66M articles in a 340GB shard.

It is always hard to compare indexes, but I agree with Erick that having
performance problems with 10M documents calls for locating the
bottlenecks, before trying to scale the problem away.

- Toke Eskildsen, State and University Library, Denmark






Re: Re: Re: concept and choice: custom sharding or auto sharding?

2015-09-03 Thread scott chu
 
Hello solr-user,

If you switch to SolrCloud, will you still keep the numShards parameter at 1? 
If you migrate to SolrCloud and split that single shard into multiple shards, 
wouldn't you have to reindex the data? Or is it possible to just put that 
single shard into SolrCloud and call the SPLITSHARD API to split it?

I ask this because I'd like to try a master-slave architecture first, since 
Erick suggests that 10M is not a "vast" number. Later I might migrate to 
SolrCloud, possibly because I want to take advantage of the ZooKeeper 
functionality for HA/DR.
- Original Message - 
From: Toke Eskildsen 
To: solr-user 
Date: 2015-09-03, 18:33:39
Subject: Re: Re: concept and choice: custom sharding or auto sharding?


On Thu, 2015-09-03 at 18:24 +0800, Scott Chu wrote:
> Do you use master-slave or SolrCloud for that single shard?

Due to legacy reasons we are just using 2 fully independent Solrs, each
indexing independently, with an Apache load balancer in front for the
searches. It does give us the occasional hiccup, so we'll be switching
to SolrCloud at some point.

- Toke Eskildsen, State and University Library, Denmark







 


Re: Re: Re: Re: concept and choice: custom sharding or auto sharding?

2015-09-03 Thread scott chu
 
Hello solr-user,

I keep forgetting to mention one thing in this discussion. Our data is Chinese 
news articles and we currently use the CJK tokenizer (i.e. 2-gram). The time 
spent indexing is long compared to indexing English articles. That's why I 
worry so much about indexing performance on 10M Chinese docs and turned to 
studying SolrCloud. It could also be why we index 1M docs kinda slowly. 
Frankly, we didn't delve into writing a better-performing Chinese tokenizer in 
past years for policy reasons (however, we do plan to write one next year 
using the MMSeg algorithm or 1-gram plus a query preprocessor). 

- Original Message - 
From: Erick Erickson 
To: solr-user 
Date: 2015-09-04, 00:07:43
Subject: Re: Re: Re: concept and choice: custom sharding or auto sharding?


bq: If you switch to SolrCloud, will you still keep numShards parameter to 1

yes. Although if you want to add more replicas you might want to specify that.

For 10M documents, I wouldn't be very fancy. Indexing them shouldn't take
very long, and I think your time would be better spent on other things than
trying to get fancy with SPLITSHARD and the like. Just create a SolrCloud
cluster with as many replicas as you want and index from scratch, unless
that's prohibitively expensive.

I can index 200M docs on my local Mac Pro in a couple of hours. Is it really
worth trying to do something you'll probably never do again (i.e. SPLITSHARD)?

If you really don't want to re-index _and_ you have only one shard in the
master/slave setup, here's what I'd do to migrate
1> create a new SolrCloud cluster with exactly one node (i.e. the "leader").
2> shut it down
3> copy the index from your master/slave to the new node, completely
 replacing the data directory
4> bring the node back up and check it.
5> use the Collections API ADDREPLICA command to bring up as many
replicas as you want; they'll pull down the index and from that point on
you should be good.
5a> In this case, it'll actually do a complete replication from the leader to
 the followers, but thereafter incremental updates will be sent to all
 the nodes in the cluster rather than the older-style master/slave
 occasional replication.

Best,
Erick
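
For reference, step 5's ADDREPLICA is a plain HTTP call to the Collections 
API, repeated once per extra replica; the collection and shard names below are 
hypothetical, and an optional node parameter pins the new replica to a 
specific node:

    http://localhost:8983/solr/admin/collections?action=ADDREPLICA&collection=mycollection&shard=shard1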

On Thu, Sep 3, 2015 at 8:54 AM, scott chu  wrote:
> [earlier messages quoted in full; snipped]






 


Re: Re: Re: Re: Re: concept and choice: custom sharding or auto sharding?

2015-09-03 Thread scott chu
 
Hello solr-user,

No, both. But first I have to face the indexing performance problem. Where can 
I find information about concurrent/parallel indexing in Solr? Thanks in 
advance.
- Original Message - 
From: Toke Eskildsen 
To: solr_user lucene_apache 
Date: 2015-09-04, 00:57:51
Subject: Re: Re: Re: Re: concept and choice: custom sharding or auto sharding?


scott chu  wrote:
> I keep forgeting to mention one thing along the discussion session.
> Our data is Chinese news articles and we use CJK tokenizer
> (i.e. 2-gram) currently. The time spent to indexing is quite slow,
> compared to indexing english articles. That's why I am so
> worrying about indexing performance on 10M Chinese docs
> and turn to study SolrCloud.

The performance problem is indexing and not searching? Solr supports concurrent 
indexing, so if you are able to send the data in parallel, just start as many 
indexing threads as you have cores. Of course that does not help if you are 
already doing that.

Also sanity check that you are not doing commits all the time.

- Toke Eskildsen
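
To illustrate "as many indexing threads as you have cores", here is a minimal 
SolrJ 5.x sketch; the collection name, document count and fields are 
hypothetical. HttpSolrClient is thread-safe, so the workers can share one 
instance:

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    public class ParallelIndexer {
        public static void main(String[] args) throws Exception {
            HttpSolrClient client = new HttpSolrClient("http://localhost:8983/solr/news");
            int threads = Runtime.getRuntime().availableProcessors();
            ExecutorService pool = Executors.newFixedThreadPool(threads);
            for (int t = 0; t < threads; t++) {
                final int offset = t;
                pool.submit(() -> {
                    // each worker sends an interleaved slice of the documents
                    for (int i = offset; i < 1000000; i += threads) {
                        SolrInputDocument doc = new SolrInputDocument();
                        doc.addField("id", "doc-" + i);
                        doc.addField("text", "body of article " + i);
                        try {
                            client.add(doc);
                        } catch (Exception e) {
                            e.printStackTrace();
                        }
                    }
                });
            }
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.HOURS);
            client.commit(); // a single commit at the end, per the advice above
            client.close();
        }
    }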






 


Re: Re: Re: Re: Re: Re: concept and choice: custom sharding or auto sharding?

2015-09-03 Thread scott chu
 
Hello solr-user,

Thanks for the info. I also found this: Parallel indexing for Apache Solr 
Search Integration | Nick Veenhof
http://nickveenhof.be/blog/parallel-indexing-apache-solr-search-integration
- Original Message - 
From: Toke Eskildsen 
To: solr_user lucene_apache 
Date: 2015-09-04, 01:26:38
Subject: Re: Re: Re: Re: Re: concept and choice: custom sharding or auto 
sharding?


scott chu  wrote:
> No, both. But first I have to face the indexing performance problem.
> Where can I see information about concurrent/parallel indexing on Solr?

Depends on how you index. If you use a Java program,
http://lucene.apache.org/solr/5_2_0/solr-solrj/index.html?org/apache/solr/client/solrj/impl/ConcurrentUpdateSolrServer.html
seems to do the trick (I haven't tried that one myself).

If you are sending updates using curl or similar, you just need to start more 
processes doing that.

If you are using DataImportHandler, I think you are out of luck. As far as I 
know, it does not support multiple index threads.

- Toke Eskildsen
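
The linked class does the thread management for you; in SolrJ 5.x it also 
exists under the newer name ConcurrentUpdateSolrClient (ConcurrentUpdateSolrServer 
is the deprecated alias). A minimal sketch, with hypothetical queue and thread 
sizes:

    import org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    public class ConcurrentIndexer {
        public static void main(String[] args) throws Exception {
            // buffer up to 10000 documents and drain the queue with 4 background threads
            ConcurrentUpdateSolrClient client =
                    new ConcurrentUpdateSolrClient("http://localhost:8983/solr/news", 10000, 4);
            for (int i = 0; i < 1000000; i++) {
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", "doc-" + i);
                client.add(doc); // returns quickly; the update is sent asynchronously
            }
            client.blockUntilFinished(); // wait for the queue to drain
            client.commit();
            client.close();
        }
    }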






 


Re: concept and choice: custom sharding or auto sharding?

2015-09-23 Thread scott chu
 
Too busy these days at work; I'd like to continue this topic. 

I've got 10M+ traditional-Chinese news articles. Due to lack of time to write 
my own traditional-Chinese tokenizer, I use the CJK tokenizer. However, CJK 
uses bigrams and will thus create a very large index in my case. I don't know 
if I should jump directly to SolrCloud or, as Erick suggests, just build a 
master-slave architecture with better performance tuning for a 10M+ 
traditional-Chinese Solr base.

Furthermore, does anyone have a good recommendation for a traditional Chinese 
tokenizer?

- Original Message - 
From: Erick Erickson 
To: solr-user 
Date: 2015-09-04, 01:47:23
Subject: Re: Re: Re: Re: Re: concept and choice: custom sharding or auto 
sharding?


Ah, that may make my suggestions unworkable re: just reindexing.

Still, how much time are we talking about here? I've very often found
that indexing performance isn't gated by the Solr processing, but by
whatever is feeding Solr. A quick test is to fire up your indexing
and see if the CPU utilization by Solr is very high. As Toke said,
though, if you're using DIH you're out of luck.

Here's an article to get you started with SolrJ:
http://lucidworks.com/blog/indexing-with-solrj/

Best,
Erick

On Thu, Sep 3, 2015 at 10:26 AM, Toke Eskildsen  
wrote:
> [Toke's message quoted in full; snipped]






 


Re: Highlighting content field problem when using JiebaTokenizerFactory

2015-10-19 Thread Scott Chu
Hi Edwin,

I didn't use Jieba for Chinese (I use only CJK, very fundamental, I know), so 
I haven't run into this problem. 

I'd suggest you post your schema.xml so we can see how you define your content 
field and the field type it uses.

In the meantime, refer to these articles; maybe the answer or a workaround can 
be deduced from them:

https://issues.apache.org/jira/browse/SOLR-3390

http://qnalist.com/questions/661133/solr-is-highlighting-wrong-words

http://qnalist.com/questions/667066/highlighting-marks-wrong-words

Good luck!




Scott Chu,scott@udngroup.com
2015/10/20 
- Original Message - 
From: Zheng Lin Edwin Yeo 
To: solr-user 
Date: 2015-10-13, 17:04:29
Subject: Highlighting content field problem when using JiebaTokenizerFactory


Hi,

I'm trying to use the JiebaTokenizerFactory to index Chinese characters in
Solr. It works fine with the segmentation when I'm using the Analysis function
on the Solr Admin UI.

However, when I tried to do the highlighting in Solr, it is not highlighting
in the correct place. For example, when I search for 自然環境与企業本身, it
highlights 認為自然環境与企業本身的

Even when I search for an English word like responsibility, it highlights
*responsibilit*y.

Basically, the highlighting goes off by 1 character/space consistently.

This problem only happens in content field, and not in any other fields.
Does anyone knows what could be causing the issue?

I'm using jieba-analysis-1.0.0, Solr 5.3.0 and Lucene 5.3.0.


Regards,
Edwin





Re: Highlighting content field problem when using JiebaTokenizerFactory

2015-10-22 Thread Scott Chu
Hi solr-user,

Can't judge the cause from a quick glimpse of your definition, but here are 
some suggestions:

1. I took a look at Jieba. It uses a dictionary and seems to do a good job on 
CJK. I suspect this problem may come from the filters (note: I can understand 
you may use CJKWidthFilter to normalize Japanese, but I don't understand why 
you use CJKBigramFilter and EdgeNGramFilter). Have you tried commenting out 
those filters, leaving only Jieba and StopFilter, to see if the problem 
disappears?

2. Does this problem occur only on Chinese search words, or does it happen on 
English search words too?

3. To use FastVectorHighlighter, you seem to have to enable 3 term* 
parameters in the field declaration, and I see only one is enabled. Please 
refer to the answer in this Stack Overflow question: 
http://stackoverflow.com/questions/25930180/solr-how-to-highlight-the-whole-search-phrase-only
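
For reference, a declaration with all three attributes switched on would look 
roughly like this in schema.xml (the field and type names just mirror your 
mail; the three term* attributes are what FastVectorHighlighter needs):

    <field name="content" type="text_chinese" indexed="true" stored="true"
           termVectors="true" termPositions="true" termOffsets="true"/>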


Scott Chu,scott@udngroup.com
2015/10/22 
- Original Message - 
From: Zheng Lin Edwin Yeo 
To: solr-user 
Date: 2015-10-20, 12:04:11
Subject: Re: Highlighting content field problem when using JiebaTokenizerFactory


Hi Scott,

Here's my schema.xml for content and title, which uses text_chinese. The
problem only occurs in content, and not in title.

[schema.xml excerpt lost in transit: the mail archive stripped the XML tags. 
From the attributes that survive in later quotes, content and title are 
declared with omitNorms="true" termVectors="true", and the text_chinese field 
type (positionIncrementGap="100") chains JiebaTokenizerFactory 
(segMode="SEARCH") with CJK width/bigram filters, a StopFilter on 
org/apache/lucene/analysis/cn/smart/stopwords.txt and, at index time only, an 
EdgeNGramFilter with maxGramSize="15".]

Here's my solrconfig.xml on the highlighting portion:

[solrconfig.xml excerpt lost in transit: the mail archive stripped the XML 
tags, leaving only the values. The readable remains suggest a request handler 
with defaults of explicit, 10, json, true and text, an fl of id, title, 
content_type, last_modified, url, score, highlighting switched on for id, 
title, content, author, tag (html formatting, fragment size 200), and a 
boundary scanner configured with WORD, en, SG.]


Meanwhile, I'll take a look at the articles too.

Thank you.

Regards,
Edwin


On 20 October 2015 at 11:32, Scott Chu  wrote:

> [earlier message quoted in full; snipped]





Is it possible to specify only one-character term synonym for 2-gram tokenizer?

2015-10-22 Thread Scott Chu
Hi solr-user,

I always use the CJKTokenizer on a fair amount of Chinese news articles. Say 
that in Chinese, character C1 has the same meaning as character C2 (e.g. 
台=臺). Is it possible that I only add this line to synonym.txt:

C1,C2 (in a real example: 台,臺)

and, by applying CJKTokenizer and SynonymFilter, I only have to query 
"C1Cm..." (where Cm is an arbitrary Chinese character) and Solr will return 
documents that match either "C1Cm" or "C2Cm"?

Scott Chu,scott@udngroup.com
2015/10/22 


Re: Is it possible to specify only one-character term synonym for 2-gram tokenizer?

2015-10-22 Thread Scott Chu
Hi solr-user,

Yeah, I thought about replacing C1 with C2 in the underlying raw data. 
However, it's a huge data set (over 10M news articles), so I gave up that 
strategy earlier. My current temporary solution is going back to a 1-gram 
tokenizer (i.e. StandardTokenizer) so I only have to set 1 rule. But it is 
kinda ugly, especially when applying highlighting: e.g. searching "C1C2" 
returns highlight snippets such as "...C1C2...".

Scott Chu,scott@udngroup.com
2015/10/22 
- Original Message - 
From: Emir Arnautovic 
To: solr-user 
Date: 2015-10-22, 17:08:26
Subject: Re: Is it possible to specify only one-character term synonym 
for 2-gram tokenizer?


Hi Scott,
I don't have experience with Chinese, but SynonymFilter works on tokens, 
so if CJKTokenizer recognizes C1 and Cm as tokens, it should work. If 
not, then you can try configuring PatternReplaceCharFilter to replace C1 
with C2 during indexing and searching and get a match.

Thanks,
Emir

On 22.10.2015 10:53, Scott Chu wrote:
> [original question quoted in full; snipped]
>

-- 
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/






Re: Is it possible to specify only one-character term synonym for 2-gram tokenizer?

2015-10-22 Thread Scott Chu
Hi Emir,

Very weird: I replied to your email from home several times yesterday, but 
the replies never showed up on the solr-user mailing list. I don't know why, 
so I'm replying again from the office; I hope this one shows up.

Thanks for your explanation. I'll look into PatternReplaceCharFilter as a 
workaround. (As I understand it, character filters process the input stream 
before the tokenizer, so in a way the indexed data no longer has the original 
C1 if I do the replacement.) What I deal with are published news articles, 
and I don't know how the authors of those articles will feel when they see C1 
in their articles become C2, since some terms containing C1 are proper nouns 
or terminologies. I'll talk to them to see if this is OK. Thanks anyway.

Scott Chu,scott@udngroup.com
2015/10/23 
- Original Message - 
From: Emir Arnautovic 
To: solr-user 
Date: 2015-10-22, 18:20:38
Subject: Re: Is it possible to specify only one-character term 
synonym for 2-gram tokenizer?


Hi Scott,
Using PatternReplaceCharFilter is not the same as replacing raw data 
(replacing raw data is not a proper solution, as it does not fix searches 
that use the "other" character). This is part of token standardization, no 
different from lowercasing; it is a standard approach for Latin characters 
as well.


Quick search of "MappingCharFilterFactory chinese" shows it is used - 
you should check if suitable for your case.

Thanks,
Emir
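
A sketch of that approach, with a hypothetical mapping file name, shown on the 
StandardTokenizer + CJKBigramFilter chain that replaces the old CJKTokenizer; 
the char filter has to appear in both the index-time and query-time analyzers 
so that 台 and 臺 are unified before the bigrams are formed:

    <analyzer>
      <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-chinese.txt"/>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.CJKWidthFilterFactory"/>
      <filter class="solr.CJKBigramFilterFactory"/>
    </analyzer>

where mapping-chinese.txt contains the single rule:

    "台" => "臺"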

On 22.10.2015 11:48, Scott Chu wrote:
> [earlier messages quoted in full; snipped]
>

-- 
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/






Re: Highlighting content field problem when using JiebaTokenizerFactory

2015-10-22 Thread Scott Chu
Hi Edwin,

Since you've tested all my suggestions and the problem is still there, I can't 
think of anything wrong with your configuration. Now I can only suspect two 
things:

1. You said the problem only happens on the "content" field, so maybe there's 
something wrong with the contents of that field. Does it contain anything 
special, e.g. HTML tags or symbols? I recall SOLR-42 mentions that HTML 
stripping can cause highlight problems. Maybe you can try purifying that field 
until it is close to pure text and see if the highlighting comes out OK.

2. Maybe something is incompatible between JiebaTokenizer and the Solr 
highlighter. You could switch to other tokenizers, e.g. Standard, CJK, 
SmartChinese (I don't use this since I'm dealing with traditional Chinese, but 
I see you are dealing with simplified Chinese), or the 3rd-party MMSeg, and 
see if the problem goes away. However, when I googled for similar problems, I 
saw you asked the same question in August at huaban/jieba-analysis, and 
somebody said he also uses JiebaTokenizer without hitting your problem. So I 
see this as the lesser suspect.

The theory would be that something in the indexing process produces wrong 
position info for that field, and when Solr does the highlighting it retrieves 
that wrong position info and marks the wrong spans for the target terms.

Scott Chu,scott@udngroup.com
2015/10/23 
- Original Message - 
From: Zheng Lin Edwin Yeo 
To: solr-user 
Date: 2015-10-22, 22:22:14
Subject: Re: Highlighting content field problem when using JiebaTokenizerFactory


Hi Scott,

Thank you for your response and suggestions.

In response to your questions, here are the answers:

1. I took a look at Jieba. It uses a dictionary and seems to do a good job on
CJK. I suspect this problem may come from the filters (note: I can understand
you may use CJKWidthFilter to normalize Japanese, but I don't understand why
you use CJKBigramFilter and EdgeNGramFilter). Have you tried commenting out
those filters, leaving only Jieba and StopFilter, to see
if this problem disappears?
*A) Yes, I have tried commenting out the other filters and only left with
Jieba and StopFilter. The problem is still there.*

2. Does this problem occur only on Chinese search words? Does it happen on
English search words?
*A) Yes, the same problem occurs on English words. For example, when I
search for "word", it will highlight in this way:  word*

3. To use FastVectorHighlighter, you seem to have to enable 3 term*
parameters in the field declaration? I see only one is enabled. Please refer to
the answer in this Stack Overflow question:
http://stackoverflow.com/questions/25930180/solr-how-to-highlight-the-whole-search-phrase-only
*A) I have tried to enable all 3 terms in the FastVectorHighlighter too,
but the same problem persists as well.*


Regards,
Edwin


On 22 October 2015 at 16:25, Scott Chu  wrote:

> [earlier message quoted in full; snipped]

Re: Highlighting content field problem when using JiebaTokenizerFactory

2015-10-26 Thread Scott Chu
Hi Edwin,

It took me a while to find anything that might help pin down the cause of 
your problem. Maybe this helps a bit: 

[SOLR-4722] Highlighter which generates a list of query term position(s) for 
each item in a list of documents, or returns null if highlighting is disabled. 
- AS...
https://issues.apache.org/jira/browse/SOLR-4722

This one is modified from FastVectorHighlighter, so ensure those 3 term* 
attributes are on.

Scott Chu,scott@udngroup.com
2015/10/27 
- Original Message - 
From: Zheng Lin Edwin Yeo 
To: solr-user 
Date: 2015-10-23, 10:42:32
Subject: Re: Highlighting content field problem when using JiebaTokenizerFactory


Hi Scott,

Thank you for your response.

1. You said the problem only happens on the "content" field, so maybe there's
something wrong with the contents of that field. Does it contain anything
special, e.g. HTML tags or symbols? I recall SOLR-42 mentions that HTML
stripping can cause highlight problems. Maybe you can try purifying that field
until it is close to pure text and see if the highlighting comes out OK.
*A) I checked: SOLR-42 is about the HTMLStripWhiteSpaceTokenizerFactory, 
which I'm not using; I believe that tokenizer is already deprecated too. I've 
tried all kinds of content from rich-text documents, and all of them have the 
same problem.*

2. Maybe something is incompatible between JiebaTokenizer and the Solr
highlighter. If you switch to other tokenizers, e.g. Standard, CJK,
SmartChinese (I don't use this since I am dealing with traditional Chinese,
but I see you are dealing with simplified Chinese), or the 3rd-party MMSeg,
see if the problem goes away. However, when I googled for similar problems, I
saw you asked the same question in August at huaban/jieba-analysis, and
somebody said he also uses JiebaTokenizer but doesn't have your problem. So I
see this as the lesser suspect.
*A) I was thinking about an incompatibility too, as I previously thought 
that JiebaTokenizer was optimised for Solr 4.x, so it might have issues in 
5.x. But the person from huaban/jieba-analysis said that he doesn't have this 
problem in Solr 5.1. I also faced the same problem in Solr 5.1, and although 
I'm using Solr 5.3.0 now, the same problem persists.*

I'm looking at the indexing process too, to see if there's any problem 
there. But I just can't figure out why it only happens with JiebaTokenizer, 
and only for the content field.


Regards,
Edwin


On 23 October 2015 at 09:41, Scott Chu  wrote:

> [earlier message quoted in full; snipped]

Re: Highlighting content field problem when using JiebaTokenizerFactory

2015-10-26 Thread Scott Chu

Take a look at Michael McCandless's two articles; they might help clarify the 
idea of highlighting in Solr:

Changing Bits: Lucene's TokenStreams are actually graphs!
http://blog.mikemccandless.com/2012/04/lucenes-tokenstreams-are-actually.html

Also take a look at the 4th paragraph of another article of his:

Changing Bits: A new Lucene highlighter is born
http://blog.mikemccandless.com/2012/12/a-new-lucene-highlighter-is-born.html

Currently, I can't figure out the possible cause of your problem unless I get 
spare time to test it on my own, which is not available these days (got some 
projects to close)!

If you find a solution or workaround, please let us know. Good luck again!

Scott Chu,scott@udngroup.com
2015/10/27 
- Original Message - 
From: Scott Chu 
To: solr-user 
Date: 2015-10-27, 10:27:45
Subject: Re: Highlighting content field problem when using JiebaTokenizerFactory


[previous messages quoted in full; snipped]

Re: Kate Winslet vs Winslet Kate

2015-11-03 Thread scott chu
Hello solr-user,

With respect to querying, dismax makes Solr query syntax quite like Google's: 
you type simple keywords, you can boost them, and you can use +/- just like in 
Google. That gives users a lot of convenience and requires less boolean 
knowledge to build the intended query string. Normal Lucene query syntax is 
treated as escaped characters, except for AND and OR. You could say dismax 
leaves some room for phrase querying. eDismax improves on dismax in several 
ways, but whether they matter depends on whether you need those improvements. 
You can read about it here: 
https://cwiki.apache.org/confluence/display/solr/The+Extended+DisMax+Query+Parser
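
For this thread's example, a minimal edismax request could look like this 
(field names are hypothetical): qf spreads the loose terms over the listed 
fields with per-field boosts, while pf additionally boosts documents where the 
terms occur as a phrase in name:

    q=kate winslet&defType=edismax&qf=name^10 text&pf=name^20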


- Original Message - 
From: Yangrui Guo 
To: solr-user@lucene.apache.org 
Date: 2015-11-01, 23:58:27
Subject: Re: Kate Winslet vs Winslet Kate


Could you tell me more about the edismax approach? I'm new to it. Thanks a lot

On Sunday, November 1, 2015, Erick Erickson  wrote:

> If your goal is to have docs with "kate" and "winslet"
> in the _name_ field be scored higher, just make that
> explicit as
> name:(kate AND winslet)
> perhaps boosting as
> name:(kate AND winslet)^10
> or add it as a clause
> q=kate AND winslet OR name:(kate AND winslet)^10
> or even
> q=kate AND winslet OR name:(kate AND winslet)^10 OR name:"kate winslet"^20
>
>
> Or use edismax to do this kind of thing for you, that's
> its purpose.
>
> Best,
> Erick
>
> On Sun, Nov 1, 2015 at 7:06 AM, Yangrui Guo  > wrote:
> > I debugged the query and found the query has been translated into
> > _text_:Kate AND _text_:Winslet, which _text_ is the default search field.
> > Because my documents use parent/child relation it appeared that if
> there's
> > no exact match of Kate Winslet, solr will return all documents contains
> > "Kate" and "Winslet" in anywhere. However it will more sense if solr can
> > rank docs that have "Kate" and "Winslet" in the same field higher. Of
> > course I can use some NLP tricks with named entity recognition but it
> would
> > be more expensive to develop.
> >
> > On Sunday, November 1, 2015, Paul Libbrecht  > wrote:
> >
> >> Alexandre,
> >>
> >> I guess you are talking about that post:
> >>
> >>
> >>
> http://lucidworks.com/blog/2015/06/06/query-autofiltering-extended-language-logic-search/
> >>
> >> I think it is very often impossible to solve properly.
> >>
> >> Words such as "direction" have very many meanings and would come in
> >> different fields.
> >> In IMDB, words such as the names of persons would come in at least
> >> different roles; similarly, the actors' role's name is likely to match
> >> the family name of persons...
> >>
> >> Paul
> >>
> >>
> >>
> >> > As others indicated having intelligence to recognize the terms (e.g.
> >> > Kate should be in name) or some user indication to do so can make
> thing
> >> > more precise but is rarely done.
> >> > Alexandre Rafalovitch  >
> >> > 1 novembre 2015 13:07
> >> > Which is what I believe Ted Sullivan is working on and presented at

> >> > the latest Lucene/Solr Revolution. His presentation does not seem to
> >> > be up, but he was writing about it on:
> >> > http://lucidworks.com/blog/author/tedsullivan/
> >>
> >> > Erick Erickson  >
> >> > 1 novembre 2015 07:40
> >> > Yeah, that's actually a tough one. You have no control over what the
> >> > user types,
> >> > you have to try to guess what they meant.
> >>
> >>
>





  


[scottchu] Can I migrate SolrCloud by just copying the whole package folder?

2016-05-16 Thread Scott Chu
On my office PC, I installed Solr 5 in d:\solr5 and created myconfigsets and 
mynodes under it. Then I ran a SolrCloud with 2 nodes and an embedded ZK node 
by executing these commands:

cd /d d:\solr5
bin\solr start -c -s mynode\node1
bin\solr start -c -s mynode\node2 -p 7973 -z localhost:9983
bin\solr create_collection -c cugna -d myconfigsets\cugna -shards 1 
-replicationFactor 2

They ran well, and I then inserted 50k docs into cugna. Everything sits under 
the d:\solr5 folder where I unzipped the Solr 5 package.

~~~
Later I copied d:\solr5 to a USB drive and brought it home, then copied it to 
d:\solr5 on my home PC (i.e. the same path). When I try to start the nodes, 
the server log shows:
2016-05-16 14:57:18.830 ERROR (qtp5592464-15) [   ] o.a.s.s.HttpSolrCall 
null:org.apache.solr.common.SolrException: Error loading config name for 
collection cugna
...
2016-05-16 14:57:18.829 INFO  (qtp5592464-15) [   ] o.a.s.s.HttpSolrCall 
[admin] webapp=null path=/admin/collections 
params={action=CLUSTERSTATUS&wt=json} status=500 QTime=9
2016-05-16 14:57:18.830 ERROR (qtp5592464-15) [   ] o.a.s.s.HttpSolrCall 
null:org.apache.solr.common.SolrException: Error loading config name for 
collection cugna
...
Caused by: org.apache.zookeeper.KeeperException$NoNodeException: 
KeeperErrorCode = NoNode for /collections/cugna

Does this mean a SolrCloud cannot be migrated by copying the whole Solr package
folder? How can I do it?

Note: I want to do this because, after I add all docs in the lab, I wish to just
copy the whole folder to the production environment at go-live without adding
all the docs again.


Re: [scottchu] Can I migrate SolrCloud by just copying the whole package folder?

2016-05-16 Thread Scott Chu

But I'm using the embedded zk nodes provided by the solr start command. I thought
they were all under d:\solr5. How can I run that embedded zk node independently?

Scott Chu,scott@udngroup.com
2016/5/16 (Mon)
- Original Message - 
From: Binoy Dalal 
To: scott.chu ; solr-user 
CC: 
Date: 2016/5/16 (Mon) 23:29
Subject: Re: [scottchu] Can I migrate SolrCloud by just copying the whole package
folder?


What you copied is just the index. 
Your configurations are stored on zookeeper. 
You need to upload these to your zookeeper on your other machine and link 
it to your collection. 
Then it'll work. 
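
(A sketch of that upload-and-link step with the zkcli script that ships with
Solr 5, assuming the home pc's first node is started the same way so its
embedded ZK listens on localhost:9983, and the configset still lives under
d:\solr5\myconfigsets\cugna:)

cd /d d:\solr5
rem upload the configset into Zookeeper
server\scripts\cloud-scripts\zkcli.bat -zkhost localhost:9983 -cmd upconfig -confdir myconfigsets\cugna -confname cugna
rem link the uploaded configset to the existing collection
server\scripts\cloud-scripts\zkcli.bat -zkhost localhost:9983 -cmd linkconfig -collection cugna -confname cugna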

On Mon, 16 May 2016, 20:55 Scott Chu wrote:

> On my office pc, I install Solr 5 on d:\solr5 and create myconfigsets and
> mynode under it. Then I run a SolrCloud with 2 nodes and embedded zk nodes
> by executing these commands:
> 
> cd /d d:\solr5 
> bin\solr start -c -s mynode\node1 
> bin\solr start -c -s mynode\node2 -p 7973 -z localhost:9983 
> bin\solr create_collection -c cugna -d myconfigsets\cugna -shards 1 
> -replicationFactor 2 
> 
> They run well, and then I insert 50k docs into cugna. They are all
> under the d:\solr5 folder where I unzipped the Solr 5 package.
> 
> ~~~ 
> Later I copy d:\solr5 to a USB drive and bring it home. I copy it to d:\solr5 on
> my home pc (i.e. the same path). When I try to start the nodes, the server log
> shows:
> 
> 2016-05-16 14:57:18.830 ERROR (qtp5592464-15) [ ] 
> o.a.s.s.HttpSolrCall null:org.apache.solr.common.SolrException: Error 
> loading config name for collection cugna 
> ... 
> 2016-05-16 14:57:18.829 INFO (qtp5592464-15) [ ] 
> o.a.s.s.HttpSolrCall [admin] webapp=null path=/admin/collections 

> params={action=CLUSTERSTATUS&wt=json} status=500 QTime=9 
> 2016-05-16 14:57:18.830 ERROR (qtp5592464-15) [ ] 
> o.a.s.s.HttpSolrCall null:org.apache.solr.common.SolrException: Error 
> loading config name for collection cugna 
> ... 
> Caused by: org.apache.zookeeper.KeeperException$NoNodeException: 
> KeeperErrorCode = NoNode for /collections/cugna 
> 
> Does this mean a SolrCloud cannot be migrated by copying the whole Solr package
> folder? How can I do it?
> 
> Note: I want to do this because, after I add all docs in the lab, I wish to just
> copy the whole folder to the production environment at go-live without
> adding all the docs again.
> 
-- 
Regards, 
Binoy Dalal 





Re: [scottchu] Can I migrate SolrCloud by just copying the whole package folder?

2016-05-16 Thread Scott Chu

Thanks to Binoy and Erick. I'll go use external zk tomorrow and do what you 
suggest.

Scott Chu,scott@udngroup.com
2016/5/16 (Mon)
- Original Message - 
From: Erick Erickson 
To: solr-user ; scott.chu 
CC: 
Date: 2016/5/16 (Mon) 23:41
Subject: Re: [scottchu] Can I migrate SolrCloud by just copying the whole package
folder?


bq: How can I run that embedded zk node independently? 

You don't. You run ZK independently. It _seems_ like 
you should be able to copy the zk_data directory over "to 
the right place" and have it found; I suspect somehow 
you're not. Take a look at where it is on your source 
machine and see that it's there on the destination machine. 

Embedded ZK was never intended as anything other 
than a way to get started; pretty soon you'll need to get 
familiar with running ZK externally anyway so... 

Download Zookeeper. I'm on a Mac so the commands 
are slightly different, but it's just going to the ZK install 
directory and running './zkServer.sh start'. 
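
(A sketch of that on Windows, to match the d:\solr5 setup above; assuming
Zookeeper has been unpacked under a hypothetical d:\zookeeper and
conf\zoo_sample.cfg copied to conf\zoo.cfg first:)

cd /d d:\zookeeper
rem start a standalone Zookeeper (default clientPort is 2181)
bin\zkServer.cmd
rem then point the Solr nodes at it, e.g.
rem bin\solr start -c -s mynode\node1 -z localhost:2181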

Best, 
Erick 

On Mon, May 16, 2016 at 8:32 AM, Scott Chu wrote: 

> 
> But I'm using the embedded zk nodes provided by the solr start command. I thought 
> they were all under d:\solr5. How can I run that embedded zk node independently? 
> 
> Scott Chu,scott@udngroup.com 
> 2016/5/16 (Mon) 
> - Original Message - 
> From: Binoy Dalal 
> To: scott.chu ; solr-user 
> CC: 
> Date: 2016/5/16 (Mon) 23:29 
> Subject: Re: [scottchu] Can I migrate SolrCloud by just copying the whole package 
> folder? 
> 
> 
> What you copied is just the index. 
> Your configurations are stored on zookeeper. 
> You need to upload these to your zookeeper on your other machine and link 
> it to your collection. 
> Then it'll work. 
> 
On Mon, 16 May 2016, 20:55 Scott Chu wrote: 
> 
>> On my office pc, I install Solr 5 on d:\solr5 and create myconfigsets and 
>> mynode under it. Then I run a SolrCloud with 2 nodes and embedded zk nodes 
>> by executing these commands: 
>> 
>> cd /d d:\solr5 
>> bin\solr start -c -s mynode\node1 
>> bin\solr start -c -s mynode\node2 -p 7973 -z localhost:9983 
>> bin\solr create_collection -c cugna -d myconfigsets\cugna -shards 1 

>> -replicationFactor 2 
>> 
>> They run well, and then I insert 50k docs into cugna. They are all 
>> under the d:\solr5 folder where I unzipped the Solr 5 package. 
>> 
>> ~~~ 
>> Later I copy d:\solr5 to a USB drive and bring it home. I copy it to d:\solr5 on 
>> my home pc (i.e. the same path). When I try to start the nodes, the server log 
>> shows: 
>> 
>> 2016-05-16 14:57:18.830 ERROR (qtp5592464-15) [ ] 
>> o.a.s.s.HttpSolrCall null:org.apache.solr.common.SolrException: Error 
>> loading config name for collection cugna 
>> ... 
>> 2016-05-16 14:57:18.829 INFO (qtp5592464-15) [ ] 
>> o.a.s.s.HttpSolrCall [admin] webapp=null path=/admin/collections 
> 
>> params={action=CLUSTERSTATUS&wt=json} status=500 QTime=9 
>> 2016-05-16 14:57:18.830 ERROR (qtp5592464-15) [ ] 
>> o.a.s.s.HttpSolrCall null:org.apache.solr.common.SolrException: Error 
>> loading config name for collection cugna 
>> ... 
>> Caused by: org.apache.zookeeper.KeeperException$NoNodeException: 
>> KeeperErrorCode = NoNode for /collections/cugna 
>> 
>> Does this mean a SolrCloud cannot be migrated by copying the whole Solr package 
>> folder? How can I do it? 
>> 
>> Note: I want to do this because, after I add all docs in the lab, I wish to just 
>> copy the whole folder to the production environment at go-live without 
>> adding all the docs again. 
>> 
> -- 
> Regards, 
> Binoy Dalal 
> 
> 
> 


Switching zk node cause load conf error

2016-05-20 Thread Scott Chu

I initially start up 3 zk nodes as an ensemble and upload the config 'cugna'. Then
I start 2 SolrCloud nodes, create a collection with 2 shards, and add a lot of docs OK.
Later I find the 3 zk nodes consume too much CPU and memory, so I start a standalone
zk node and use Solr's zkcli to upload the same config 'cugna'.
This time I start the same 2 SolrCloud nodes but point them to this standalone zk node.
They seem to start OK.
However, when I open localhost:8983, it shows it can't load the config for shard1 and
shard2.
Does this mean I can't switch to other zk nodes once I've created the index?
Does the same config differ between the ensemble and the standalone node?

Scott Chu,scott@udngroup.com
2016/5/21 (Sat)


Re: Switching zk node cause load conf error

2016-05-20 Thread Scott Chu

Even worse, when I start up the SolrCloud nodes and point them back at the original
zk ensemble, it shows it can't load the conf for shard 2. It seems that previously
pointing to the standalone zk node already damaged some config data, so the SolrCloud
nodes can no longer talk to the original ensemble properly. Is it possible to repair
the config data?

Scott Chu,scott@udngroup.com
2016/5/21 (Sat)
- Original Message - 
From: scott.chu 
To: solr-user 
CC: 
Date: 2016/5/21 (Sat) 00:44
Subject: Switching zk node cause load conf error



I initially start up 3 zk nodes as an ensemble and upload the config 'cugna'. Then 
I start 2 SolrCloud nodes, create a collection with 2 shards, and add a lot of docs OK. 
Later I find the 3 zk nodes consume too much CPU and memory, so I start a standalone 
zk node and use Solr's zkcli to upload the same config 'cugna'. 
This time I start the same 2 SolrCloud nodes but point them to this standalone zk node. 
They seem to start OK. 
However, when I open localhost:8983, it shows it can't load the config for shard1 and 
shard2. 
Does this mean I can't switch to other zk nodes once I've created the index? 
Does the same config differ between the ensemble and the standalone node? 

Scott Chu,scott@udngroup.com 
2016/5/21 (Sat) 




Re: Import html data in mysql and map schemas using only SolrCELL+TIKA+DIH [scottchu]

2016-05-20 Thread Scott Chu

For this project, I intend to use Solr 5.5 or Solr 6. I know how to modify the
config to go back to using ClassicIndexSchemaFactory, i.e. a manually maintained schema.xml.

Scott Chu,scott@udngroup.com
2016/5/21 (Sat)
- Original Message - 
From: Siddhartha Singh Sandhu 
To: solr-user ; scott.chu 
CC: 
Date: 2016/5/21 (Sat) 03:33
Subject: Re: Import html data in mysql and map schemas using only 
SolrCELL+TIKA+DIH [scottchu]


You will have to configure your schema.xml in Solr. 

What version are you using? 

On Fri, May 20, 2016 at 2:17 AM, scott.chu wrote: 


> 
> I have a mysql table with over 300M blog articles. The records are in html 
> format. Is it possible to import these records using only Solr 
> Cell+Tika+DIH into a Solr collection with a schema? I mean, when importing, 
> can I map the mysql schema to the schema in Solr? 
> 
> scott.chu,scott@udngroup.com 
> 2016/5/20 (Fri) 
> 





Re: How to use "fq"

2016-05-23 Thread Scott Chu

Yonik has a very good article about the terms query parser:

Solr Terms Query for matching many terms - Solr 'n Stuff
http://yonik.com/solr-terms-query/
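
(For reference, a minimal sketch of the {!terms} form for Steve's case below,
assuming the field is category:)

fq={!terms f=category}1,2,3,4,200

It parses the comma-separated list directly instead of building a boolean
query, so it sidesteps the maxBooleanClauses limit that the OR form hits.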


Scott Chu,scott@udngroup.com
2016/5/23 (Mon)
- Original Message - 
From: Erik Hatcher 
To: solr-user 
CC: 
Date: 2016/5/23 (Mon) 21:14
Subject: Re: How to use "fq"


Try the {!terms} query parser. That should make it work well for you. Let us 
know how it does. 

   Erik 

> On May 23, 2016, at 08:52, Steven White wrote: 
> 
> Hi everyone, 
> 
> I'm trying to figure out what's the best way for me to use "fq" when the 
> list of items is large (up to 200, but I have a few cases with up to 1000). 
> 
> My current usage is like so: &fq=category:(1 OR 2 OR 3 OR 4 ... 200) 
> 
> When I tested with up to 1000, I hit the "too many boolean clauses", so my 
> fix was to increase the value of maxBooleanClauses. However, reading [1] 
> warns that increasing the value of maxBooleanClauses has negative impact. 
> The link offers an alternative usage like so: 
> fq=category:1&fq=category:2... But I cannot use it because I need my "fq" 
> to be treated as OR (my default is set to AND). 
> 
> I'm trying to understand what's the best way for me to code this so I 
> don't take a performance or memory hit. 
> 
> Thanks 
> 
> Steve 
> 
> [1] 
> http://solr.pl/en/2011/12/19/do-i-have-to-look-for-maxbooleanclauses-when-using-filters/
>  




What to do best when expanding from 2 nodes to 4 nodes? [scottchu]

2016-05-23 Thread Scott Chu
I just created a 90gb-index collection with 1 shard and 2 replicas on 2 nodes.
I am about to migrate from 2 nodes to 4 nodes. I am wondering what's the best strategy
to split this single shard? Furthermore, if I am OK to reindex, what are the best
experience-based values of numShards and replicationFactor? Lastly, I think
there's no other way but to reindex if I want my data to be evenly distributed
across every shard I create, right?

Scott Chu,scott@udngroup.com
2016/5/23 (Mon)

P.S. For those who are curious why I add [scottchu] to the subject: I want my
email filter to route the emails that answer my question to a specific folder.


What to do best when expanding from 2 nodes to 4 nodes? (fix typo) [scottchu]

2016-05-23 Thread Scott Chu

Sorry for the typo. I rewrite my question again:

I just created a 90gb-index collection with 1 shard and 2 replicas on 2 nodes.
I am about to migrate from 2 nodes to 4 nodes. I am wondering what's the best
strategy to split this single shard? Furthermore, if I am OK to reindex, what are
the best experience-based values of numShards and replicationFactor?
Lastly, if I add new shard(s), I think there's no other way but to reindex if I
want my data to be evenly distributed across every shard, right?


Scott Chu,scott@udngroup.com
2016/5/23 (Mon)
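
(For the split option, a sketch of the Collections API call; the collection
name cugna here is an assumption carried over from the earlier threads, and
shard1 is the default name of a collection's single shard:)

curl "http://localhost:8983/solr/admin/collections?action=SPLITSHARD&collection=cugna&shard=shard1"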


Re: Import html data in mysql and map schemas using only SolrCELL+TIKA+DIH [scottchu]

2016-05-24 Thread Scott Chu

Just to let everybody know: I use DIH + the template transformer (without Tika and
Solr Cell; I really don't understand that part of the reference guide) to achieve
what I want. But I still need to test more varied forms of HTML source.
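
(For reference, a sketch of how such an import is then triggered through the
DIH request handler, assuming it is registered at /dataimport on a
hypothetical collection named blogs:)

curl "http://localhost:8983/solr/blogs/dataimport?command=full-import&clean=true"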

Scott Chu,scott@udngroup.com
2016/5/24 (Tue)

p.s. There are really many extensive, worthwhile features in Solr. If the project
team could provide some "dictionary" of them, it would be a "Santa Claus" gift for
us Solr users. Ha! Just a Christmas wish! Sigh! I know it's quite impossible. I
would really like to study them one after another and learn about all of them.
However, Internet IT moves too fast to leave time to digest all of the great
features in Solr.
- Original Message - 
From: scott.chu 
To: solr-user 
CC: 
Date: 2016/5/21 (Sat) 03:39
Subject: Re: Import html data in mysql and map schemas using only
SolrCELL+TIKA+DIH [scottchu]



For this project, I intend to use Solr 5.5 or Solr 6. I know how to modify the 
config to go back to using ClassicIndexSchemaFactory, i.e. a manually maintained schema.xml. 

Scott Chu,scott@udngroup.com 
2016/5/21 (Sat) 
- Original Message - 
From: Siddhartha Singh Sandhu 
To: solr-user ; scott.chu 
CC: 
Date: 2016/5/21 (Sat) 03:33 
Subject: Re: Import html data in mysql and map schemas using only 
SolrCELL+TIKA+DIH [scottchu] 


You will have to configure your schema.xml in Solr. 

What version are you using? 

On Fri, May 20, 2016 at 2:17 AM, scott.chu wrote: 


> 
> I have a mysql table with over 300M blog articles. The records are in html 
> format. Is it possible to import these records using only Solr 
> Cell+Tika+DIH into a Solr collection with a schema? I mean, when importing, 
> can I map the mysql schema to the schema in Solr? 
> 
> scott.chu,scott@udngroup.com 
> 2016/5/20 (Fri) 
> 





What if adding 3rd node exceeds replication Factor? [scottchu]

2016-05-25 Thread Scott Chu
I start 2 nodes and a zk ensemble to manage these nodes. Then I create a
collection with numShards=1 and replicationFactor=2 on the 1st node. It spreads onto
the 2 nodes (meaning 1 leader, 1 replica). Now I want to add a 3rd node but not do
shard splitting. Before I try, I want to ask some questions: if I just start node3
and join the same zk ensemble, will SolrCloud automatically create a replica on the
3rd node? Or do I have to manually call some API to add a replica to the 3rd node?
Either way, doesn't this exceed the replicationFactor?

Scott Chu,scott@udngroup.com
2016/5/25 (Wed)
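
(On the manual-API option mentioned above, a sketch of the Collections API
call, assuming a hypothetical collection named cugna and the new node
registered as host3:8983_solr; note that replicationFactor is only consulted
at creation time, so adding a third replica later does not conflict with it:)

curl "http://localhost:8983/solr/admin/collections?action=ADDREPLICA&collection=cugna&shard=shard1&node=host3:8983_solr"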


Re: Recommended api/lib to search Solr using PHP

2016-06-02 Thread Scott Chu
3. Call '$client->createSelect()' with no arguments to create a 'query' object.
Say the query object's variable name is $query.
4. Use the $query->setXXX methods (please see the official documentation for what
XXX can be) to set the necessary values.
5. Issue '$result = $client->select($query);' to execute the query.
6. You can check whether any document(s) were returned by checking the value of
$result->getNumFound().
7. If any document(s) were returned, $result will essentially be an array
of document objects. Use 'foreach' to iterate over it.
8. A document object is essentially an array of fields. Say its variable name is
$doc. You can access a field in 3 ways:
a> Use 'foreach' to iterate over it.
b> Use $doc['fieldname'] to get a specific field, e.g. $doc['ID'].
c> Use $doc->fieldname to get a specific field, e.g. $doc->ID.

How to paginate in PHP
===
1. Basically, just use $query->setStart(...)->setRows(...) in a loop, setting the
appropriate start-row offset and page size for your pagination.
2. More advanced: you can try writing object-reuse code as shown in the official
documentation and paginate that way to save memory and cause less GC in the php engine.
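
(At the Solr HTTP level those two setters simply map onto the start and rows
parameters; e.g. page 3 with 10 rows per page, against a hypothetical
collection named mycoll:)

curl "http://localhost:8983/solr/mycoll/select?q=*:*&start=20&rows=10"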

That's all for now. The rest will be your own homework.

Scott Chu,scott@udngroup.com
2016/6/2 (Thu)
- Original Message - 
From: Shawn Heisey 
To: solr-user 
CC: 
Date: 2016/5/31 (Tue) 02:57
Subject: Re: Recommended api/lib to search Solr using PHP


On 5/30/2016 12:32 PM, GW wrote: 
> I would say look at the URLs for the searches you build in the query tool. 
> 
> In my case 
> 
> http://172.16.0.1:8983/solr/#/products/query 
> 
> When you build queries with the Query tool, for example an edismax query, 
> the URL is there for you to copy. 
> Use the url structure with curl in your programming/scripting. The results 
> come back as REST data. 
> 
> This is what I do with PHP and it's pretty tight. 

Be careful with URLs in the admin UI. 

URLs with "#" in them will *only* work in a browser. They are not the 
REST endpoints. 

When you run a query in the admin UI, it will give you a URL to make the 
same query, but it will NOT be the URL in the address bar of the 
browser. There is a link right above the query results. 
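
(Concretely, for the example above: http://172.16.0.1:8983/solr/#/products/query
only works in a browser, while the REST endpoint to call from curl or code is
the collection's /select handler:)

curl "http://172.16.0.1:8983/solr/products/select?q=*:*&wt=json"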

Thanks, 
Shawn 





Re(2): Are there issues with the use of SolrCloud / embedded Zookeeper in non-HA deployments?

2016-07-28 Thread Scott Chu

Indeed, embedded Zookeeper is not stable even when I use it on my development
servers. In a production environment, always use different servers for Zookeeper
and SolrCloud. Also, avoid using Windows systems as much as you can. One reason
is that zkServer.cmd identifies the process running Zookeeper by the DOS window
title; however, depending on which version of Windows you use and how you start
the DOS window, it can get this wrong. I had to modify the script to make it run
OK under Windows (note: I've posted some detailed information/steps on this mailing
list before).

I've done the same thing as Markus did. No matter whether it's a single or multiple
Solr instances, I always use SolrCloud 5, although we have some old Solr 3.5
installations using an HA proxy and a master/slave configuration. But they all run
under Linux.


Scott Chu,scott@udngroup.com
2016/7/29 (Fri)
- Original Message - 
From: Markus Jelsma 
To: solr-user 
CC: 
Date: 2016/7/28 (Thu) 23:44
Subject: RE: Are there issues with the use of SolrCloud / embedded Zookeeper in
non-HA deployments?


Hello - all our production environments are deployed as a cloud, even when just 
a single Solr instance is used. We did this for the purpose of having a single 
method of deployment / provisioning, and just because we have the option to add 
replicas with ease if we need to. 

We never use embedded Zookeeper. 

Markus 


-Original message- 
> From: Andy C 
> Sent: Thursday 28th July 2016 17:38 
> To: solr-user@lucene.apache.org 
> Subject: Are there issues with the use of SolrCloud / embedded Zookeeper in 
> non-HA deployments? 
> 
> We have integrated Solr 5.3.1 into our product. During installation 
> customers have the option of setting up a single Solr instance, or for high 
> availability deployments, multiple Solr instances in a master/slave 
> configuration. 
> 
> We are looking at migrating to SolrCloud for HA deployments, but are 
> wondering if it makes sense to also use SolrCloud in non-HA deployments? 

> 
> Our thought is that this would simplify things. We could use the same 
> approach for deploying our schema.xml and other configuration files on all 
> systems, we could always use the SolrJ CloudSolrClient class to communicate 
> with Solr, etc. 
> 
> Would it make sense to use the embedded Zookeeper instance in this 
> situation? I have seen warning that the embedded Zookeeper should not be 

> used in production deployments, but the reason generally given is that if 
> Solr goes down Zookeeper will also go down, which doesn't seem relevant 
> here. Are there other reasons not to use the embedded Zookeeper? 
> 
> More generally, are there downsides to using SolrCloud with a single 
> Zookeeper node and single Solr node? 
> 
> Would appreciate any feedback. 
> 
> Thanks, 
> Andy 
> 




Re: Question about Simple Post tool

2016-08-01 Thread Scott Chu

I don't think it's possible purely with the out-of-the-box post.jar. But why not
disassemble post.jar (or get the source from the internet) and modify it yourself?
It doesn't seem that hard.
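
(That said, it may be worth checking the post tool's built-in web mode before
patching it; a hedged sketch for Solr 5, assuming a hypothetical collection
named mycoll - verify the exact flags against "java -jar post.jar -h" on your
version:)

java -Dc=mycoll -Ddata=web -jar example\exampledocs\post.jar http://example.com/some.doc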

Scott Chu,scott@udngroup.com
2016/8/1 (Mon)
- Original Message - 
From: Jamal, Sarfaraz 
To: solr-user 
CC: 
Date: 2016/8/1 (Mon) 22:05
Subject: Question about Simple Post tool


Hi Guys, 

I have a quick question. 

I read the appropriate documentation and it seems that it is possible, but I 
might be getting the syntax wrong. 

I wish to use the simple Post Tool to pass in a URL that brings back a Word 
document, and I want to index the return of that URL using Tika. 

Is that possible? Or do I have to get the file onto my file system first? 

Thanks, 

Sas 




Re: Doing Shingle but also keep special single word

2010-08-20 Thread scott chu

Hi, Brendan,

   Thanks for the reply. The real case is that I can't predict when there will be a 
new important special word that users are interested in, because I am building a 
daily news-article database. Therefore, I don't know when and which single words 
I should include in that new field. I once thought about manually maintaining a 
special-word dictionary, but it costs too much effort, so I gave up that idea.


However, your suggestion still sounds like a good trade-off to me; I'll take it 
into account seriously.


Scott

- Original Message - 
From: "Brendan Grainger" 

To: 
Sent: Friday, August 20, 2010 10:06 PM
Subject: Re: Doing Shingle but also keep special single word


Hi Scott,

Is there a reason why you wouldn't just index these special words into 
another field and then search over both fields? That would also have the 
nice property of being able to boost on the special word field if you 
wanted.


HTH
Brendan

On Aug 20, 2010, at 6:19 AM, scott chu (朱炎詹) wrote:

I am building an index with the Shingle filter. We know its minimum is 2-grams, but 
I also want to keep some special single words, e.g. IBM, Microsoft, etc. I.e., I 
want to do a minimum 2-gram but also want to have these single words in my 
index. Is it possible?


Scott





Re: Doing Shingle but also keep special single word

2010-08-22 Thread scott chu
Won't setting outputUnigrams="true" make the index size about twice as large as 
when it's set to false?


Scott

- Original Message - 
From: "Ahmet Arslan" 

To: 
Sent: Saturday, August 21, 2010 1:15 AM
Subject: Re: Doing Shingle but also keep special single word



I am building an index with the Shingle
filter. We know its minimum is 2-grams, but I also want to keep
some special single words, e.g. IBM, Microsoft, etc. I.e., I
want to do a minimum 2-gram but also want to have these
single words in my index. Is it possible?


Does the outputUnigrams="true" parameter not work for you?

After that you can add <filter class="solr.KeepWordFilterFactory"
words="keepwords.txt" ignoreCase="true"/> with keepwords.txt = IBM, Microsoft.









Re: Hardware Specs Question

2010-09-03 Thread scott chu

well balanced system
=
Agree. We'll start a performance & load test here this month. I've defined test 
criteria of 'qps', 'RTpQ' & worst case according to our use case & past 
experience. Our goal is to pursue these criteria & adjust hardware & system 
configuration to find a well-balanced, scalable Solr architecture.


The earlier discussion in this thread also had several good suggestions for 
our test. Thanks to all who provided their experience & suggestions.


Scott

- Original Message - 
From: "Toke Eskildsen" 

To: 
Sent: Friday, September 03, 2010 6:43 PM
Subject: Re: Hardware Specs Question



On Fri, 2010-09-03 at 11:07 +0200, Dennis Gearon wrote:

If you really want to see performance, try external DRAM disks.
Whew! 800X faster than a disk.


As sexy as they are, the DRAM drives do not buy much extra
performance. At least not at the search stage. For searching, SSDs are
not that far from holding the index fully in RAM (about 3/4 the speed in
our tests, but YMMV). The CPU is the bottleneck.

That was with Lucene 2.4 so the relative numbers might have changed, but
the old lesson still stands: A well balanced system is key.






Has anyone noticed this site?

2010-10-25 Thread scott chu

I happened to bump into this site: http://www.solr.biz/

They say they are also developing a search engine. Does it have any connection 
to the open-source "Solr"?



How can I do this in Solr?

2010-03-25 Thread scott chu
I have an input xml data file & it has a 'Reporters' tag that looks like this:

<Reporters>
  <Reporter>
    <Name>AAA</Name>
    <Title>manager</Title>
  </Reporter>
  <Reporter>
    <Name>BBB</Name>
    <Title>coordinator</Title>
  </Reporter>
</Reporters>

You see, name & title are paired. As far as I know, Solr only supports a field with 
multiple values of a primitive type, e.g. string. But in my case, it's a field 
with multiple values of a paired name-title structure. How can I configure 
Solr to deal with this case?

Best Regards,

Scott Chu