Solr.cmd cannot create collection in Solr 5.2.1

2015-10-26 Thread Adrian Liew
Hi all,

I have set up a 3-server ZooKeeper cluster by following the instructions
provided on the ZooKeeper site:

I have been testing connections with zkCli.bat to the ZooKeeper services on
the 3 EC2 instances after starting the ZK services on all 3 servers.

For example, I have set up my three servers with the following IPs:
Server1 - 172.18.111.111:2181
Server2 - 172.18.111.112:2182
Server3 - 172.18.112.112:2183

I am using Solr v.5.2.1.

When I start Solr against the ZooKeeper services and attempt to upload the
configuration to ZooKeeper, I get the following failure reported by solr.cmd
when creating a collection:

D:\Solr-5.2.1-Instance\bin>solr.cmd create_collection -c sitecore_core_index -n sitecore_common_config -shards 1 -replicationFactor 3
Connecting to ZooKeeper at 172.18.111.111:2181,172.18.111.112:2182,172.18.112.112:2183
Re-using existing configuration directory sitecore_common_config

Creating new collection 'sitecore_core_index' using command:
http://172.18.112.112:8983/solr/admin/collections?action=CREATE&name=sitecore_core_index&numShards=1&replicationFactor=3&maxShardsPerNode=1&collection.configName=sitecore_common_config

{
  "responseHeader":{
    "status":0,
    "QTime":1735},
  "failure":{"":"org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException:Error from server at http://172.18.111.112:8983/solr: Error CREATEing SolrCore 'sitecore_core_index_shard1_replica2': Unable to create core [sitecore_core_index_shard1_replica2] Caused by: Can't find resource 'solrconfig.xml' in classpath or '/configs/sitecore_common_config', cwd=D:\\Solr-5.2.1-Instance\\server"}
}

I did a check to see whether solrconfig.xml is present in ZooKeeper: if I run
zkCli.bat -cmd list on each of the servers, I can see that solrconfig.xml is
listed:

DATA:

/configs (1)
  /configs/sitecore_common_config (1)
   /configs/sitecore_common_config/conf (8)
/configs/sitecore_common_config/conf/currency.xml (0)
DATA: ...supressed...
/configs/sitecore_common_config/conf/protwords.txt (0)
DATA: ...supressed...
/configs/sitecore_common_config/conf/solrconfig.xml (0)
DATA: ...supressed...
/configs/sitecore_common_config/conf/synonyms.txt (0)
DATA: ...supressed...
/configs/sitecore_common_config/conf/stopwords.txt (0)
DATA: ...supressed...
/configs/sitecore_common_config/conf/schema.xml (0)
DATA: ...supressed...
/configs/sitecore_common_config/conf/_rest_managed.json (0)
DATA:
{"initArgs":{},"managedList":[]}

/configs/sitecore_common_config/conf/lang (1)
 /configs/sitecore_common_config/conf/lang/stopwords_en.txt (0)
 DATA: ...supressed...
/zookeeper (1)
DATA:

/overseer (6)
DATA:

Has anyone come across this issue before for Solr 5.2.1?

Regards,
Adrian


Re: getting cached terms inside UpdateRequestProcessor...

2015-10-26 Thread Roxana Danger
Sorry for the delay in my reply. I have tried using the Documents screen to
execute my update request processor (with JSON), but it requires the
document id to be specified. So this is not a good solution, as I need to
update all indexed documents... I will try the onImportEnd event. Any other
ideas?


On 23 October 2015 at 07:49, Roxana Danger 
wrote:

> Hi Alexandre,
>
> No, this is what is missing... I assume I can select the docs with a query
> (e.g. *:*, in the content.stream?), but I haven't found which parameter is
> the right one to use...
>
> Thanks,
> Roxana
>
>
> On 22 October 2015 at 18:27, Alexandre Rafalovitch 
> wrote:
>
>> You need to tell the second call which documents to update. Are you doing
>> that?
>>
>> There may also be a wrinkle in the URP order, but let's get the first step
>> working first.
>> On 22 Oct 2015 12:59 pm, "Roxana Danger" 
>> wrote:
>>
>> > yes, it's working now... but I cannot use the update processor chain. I
>> > need to run the DIH first and then the URP, but I am not having luck
>> > updating my docs with the URL:
>> > http://localhost:8983/solr/reed_jobs/update/jtdetails?commit=true
>> >
>> > Did you manage to use an updateProcessor chain after the DIH without
>> > using the update.chain parameter?
>> >
>> > Cheers,
>> > Roxana
>> >
>> >
>> > On 22 October 2015 at 17:42, Shawn Heisey  wrote:
>> >
>> > > On 10/22/2015 10:32 AM, Erik Hatcher wrote:
>> > > > Setting “update.chain” in the DataImportHandler handler defined in
>> > > solrconfig.xml should allow you to specify which update chain is used.
>> > Can
>> > > you confirm that works, Shawn?
>> > >
>> > > I tried this a couple of years ago without luck.  Does it work now?
>> > >
>> > >
>> > >
>> >
>> http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201308.mbox/%3c6c93c1a4-63ac-4cad-9f5b-c74f497c6...@gmail.com%3E
>> > >
>> > > In the first email of the thread, I indicated I had tried 4.4 and
>> > > 4.5-SNAPSHOT.
>> > >
>> > > Thanks,
>> > > Shawn
>> > >
>> > >
>> >
>> >
>>
>
>
>
>





Re: EdgeNGramFilterFactory for Chinese characters

2015-10-26 Thread Zheng Lin Edwin Yeo
Hi Tomoko,

Thank you for your advice. Will look into the java source code of the Token
Filters.

Regards,
Edwin


On 26 October 2015 at 13:16, Tomoko Uchida 
wrote:

> > Will try to see if there is any way to manage it with only a single field?
>
> Of course you can try to create a custom Tokenizer or TokenFilter that
> perfectly meets your needs.
> I would copy the source code of EdgeNGramTokenFilter and modify the
> incrementToken() method. That seems a reasonable way to me.
> incrementToken() of EdgeNGramTokenFilter cannot be overridden, as it is
> declared "final" in Solr 5, so subclassing will not work.
> A corresponding custom TokenFilterFactory class is also needed (see
> EdgeNGramFilterFactory).
>
> If you are not familiar with both Java and the internal architecture of
> Lucene/Solr,
> custom classes can bring intricate bugs/problems into your system. Be
> sure to keep them under control.
>
> Anyway, check out and look into the Java sources of the TokenFilters
> included in Solr if you have not yet.
>
> Thanks,
> Tomoko
>
> 2015-10-26 11:19 GMT+09:00 Zheng Lin Edwin Yeo :
>
> > Hi Tomoko,
> >
> > Thank you for your recommendation.
> >
> > I wasn't in favour of using copyField at first to have 2 separate fields
> > for English and Chinese tokens, as it not only increases the index size,
> > but also slows down the performance for both indexing and querying.
> >
> > Will try to see if there is any way to manage it with only a single field?
> >
> > Regards.
> > Edwin
> >
> >
> > On 25 October 2015 at 22:59, Tomoko Uchida  >
> > wrote:
> >
> > > Hi, Edwin,
> > >
> > > > This means it is better to have 2 separate fields for English and
> > Chinese
> > > words?
> > >
> > > Yes. I mean,
> > > 1. Define FIELD_1 that use HMMChineseTokenizerFactory to extract
> English
> > > and Chinese tokens.
> > > 2. Define FIELD_2 that use PatternTokenizerFactory to extract English
> > > tokens and EdgeNGramFilter to break up tokens to sub-strings.
> > > There might be some possible tokenizer/filter chains to extract
> > English
> > > tokens, please try and find the best way ;)
> > > 3. Index original text to FIELD_1 to search tokens as they are. (for
> both
> > > of English and Chinese words)
> > > 4. Index original text to FIELD_2 to perform prefix match. (for English
> > > words)
> > > 5. Search FIELD_1 and FIELD_2 by using edismax query parser, etc.
> > >
> > > You can use copyField to index the original text data to FIELD_1 and
> > > FIELD_2.
> > > The downside of this method is that it increases index size, as you know.
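> > >
> > > As a sketch of the schema side of this (the type names and the 'text'
> > > source field are made up):
> > >
> > >   <field name="FIELD_1" type="text_hmm_chinese" indexed="true" stored="false"/>
> > >   <field name="FIELD_2" type="text_en_edge" indexed="true" stored="false"/>
> > >   <copyField source="text" dest="FIELD_1"/>
> > >   <copyField source="text" dest="FIELD_2"/>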
> > >
> > > If you want to manage that *in one field*, I think you can create a
> > > custom token filter on your own... but it may be slightly advanced.
> > >
> > > Thanks,
> > > Tomoko
> > >
> > > 2015-10-25 22:48 GMT+09:00 Zheng Lin Edwin Yeo :
> > >
> > > > Hi Tomoko,
> > > >
> > > > Thank you for your reply.
> > > >
> > > > > If you need to perform partial (prefix) match for **only English
> > > words**,
> > > > > you can create a separate field that keeps only English words (I've
> > > never
> > > > > tried that, but might be possible by PatternTokenizerFactory or
> other
> > > > > tokenizer/filter chains...,) and apply EdgeNGramFilterFactory to
> the
> > > > field.
> > > >
> > > > This means it is better to have 2 separate fields for English and
> > Chinese
> > > > words?
> > > > Not quite sure what you mean by that.
> > > >
> > > > Regards,
> > > > Edwin
> > > >
> > > >
> > > >
> > > > On 25 October 2015 at 11:42, Tomoko Uchida <
> > tomoko.uchida.1...@gmail.com
> > > >
> > > > wrote:
> > > >
> > > > > > I have rich-text documents that are in both English and Chinese,
> > and
> > > > > > currently I have EdgeNGramFilterFactory enabled during indexing,
> > as I
> > > > > need
> > > > > > it for partial matching for English words. But this means it will
> > > also
> > > > > > break up each of the Chinese characters into different tokens.
> > > > >
> > > > > EdgeNGramFilterFactory creates sub-strings (prefixes) from each
> > token.
> > > > Its
> > > > > behavior is independent of language.
> > > > > If you need to perform partial (prefix) match for **only English
> > > words**,
> > > > > you can create a separate field that keeps only English words (I've
> > > never
> > > > > tried that, but might be possible by PatternTokenizerFactory or
> other
> > > > > tokenizer/filter chains...,) and apply EdgeNGramFilterFactory to
> the
> > > > field.
> > > > >
> > > > > Hope it helps,
> > > > > Tomoko
> > > > >
> > > > > 2015-10-23 13:04 GMT+09:00 Zheng Lin Edwin Yeo <
> edwinye...@gmail.com
> > >:
> > > > >
> > > > > > Hi,
> > > > > >
> > > > > > Would like to check, is it good to use EdgeNGramFilterFactory for
> > > > indexes
> > > > > > that contains Chinese characters?
> > > > > > Will it affect the accuracy of the search for Chinese words?
> > > > > >
> > > > > > I have rich-text documents that are in both English and Chinese,
> > and
> > > > > > currently I have EdgeNGramFilterFactory enabled during indexing,
> > as I
> > > > > 

copy data between collection

2015-10-26 Thread Chaushu, Shani
Hi,
Is there an API to copy all the documents from one collection to another 
collection in the same solr server simply?
I'm using solr cloud 4.10
Thanks,
Shani



Re: copy data between collection

2015-10-26 Thread Upayavira
Hi Shani,

There isn't a SolrCloud way to do it. A proper 'clone this collection'
feature would be a very useful thing.

However, I have managed to do it, in a way that involves some caveats:
 * you should only do this on a collection that has no replicas. Add
 replicas *after* cloning the index
 * if you must do it on a sharded index, then you will need to do it
 once for each shard. No guarantees though

All SolrCloud nodes are already enabled as 'replication masters' so
that new replicas can pull a full index from the current leader. We're
gonna use this feature to pull our index (assuming a single shard):

http://<new_node>:8983/solr/<new_collection>_shard1_replica1/replication?command=fetchindex&masterUrl=http://<old_node>:8983/solr/<old_collection>_shard1_replica1/replication

This basically says to the core behind your new collection: "Go to the
core behind the old collection, and pull its entire index".
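
As a concrete sketch with made-up names (cloning the single-shard
collection 'oldcoll' on host1 into 'newcoll' on host2):

curl 'http://host2:8983/solr/newcoll_shard1_replica1/replication?command=fetchindex&masterUrl=http://host1:8983/solr/oldcoll_shard1_replica1/replication'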

This worked for me. I added a replica afterwards, and the index cloned
correctly. However, when I did it against a collection that had a
replica already, the replica *didn't* notice, meaning the leader/replica
were now out of sync, i.e: Really make sure you do this replication
before you add replicas to your new collection.

Hope this helps.

Upayavira

On Mon, Oct 26, 2015, at 11:21 AM, Chaushu, Shani wrote:
> Hi,
> Is there an API to copy all the documents from one collection to another
> collection in the same solr server simply?
> I'm using solr cloud 4.10
> Thanks,
> Shani
> 


Re: Does docValues impact termfreq ?

2015-10-26 Thread Emir Arnautovic
If I got it right, you are running a term query, using a function to get TF
as the score, iterating over all documents in the results, and summing up
the total number of occurrences of a specific term in the index? Is this the
only way you use the index, or is this side functionality?


Thanks,
Emir

On 24.10.2015 22:28, Aki Balogh wrote:

Certainly, yes. I'm just doing a word count, ie how often does a specific
term come up in the corpus?

On Oct 24, 2015 4:20 PM, "Upayavira" wrote:

yes, but what do you want to do with the TF? What problem are you
solving with it? If you are able to share that...

On Sat, Oct 24, 2015, at 09:05 PM, Aki Balogh wrote:

Yes, sorry, I am not being clear.

We are not even doing scoring, just getting the raw TF values. We're
doing this in solr because it can scale well.

But with large corpora, retrieving the word counts takes some time, in
part because solr is splitting up word count by document and generating a
large request. We then get the request and just sum it all up. I'm
wondering if there's a more direct way.

On Oct 24, 2015 4:00 PM, "Upayavira" wrote:

Can you explain more what you are using TF for? Because it sounds rather
like scoring. You could disable field norms and IDF and scoring would be
mostly TF, no?

Upayavira

On Sat, Oct 24, 2015, at 07:28 PM, Aki Balogh wrote:

Thanks, let me think about that.

We're using termfreq to get the TF score, but we don't know which term
we'll need the TF for. So we'd have to do a corpuswide summing of
termfreq for each potential term across all documents in the corpus. It
seems like it'd require some development work to compute that, and our
code would be fragile.

Let me think about that more.

It might make sense to just move to solrcloud, it's the right
architectural decision anyway.

On Sat, Oct 24, 2015 at 1:54 PM, Upayavira wrote:

If you just want word length, then do work during indexing - index a
field for the word length. Then, I believe you can do faceting - e.g.
with the json faceting API I believe you can do a sum() calculation on a
field rather than the more traditional count.

Thinking aloud, there might be an easier way - index a field that is the
same for all documents, and facet on it. Instead of counting the number
of documents, calculate the sum() of your word count field.

I *think* that should work.

Upayavira

On Sat, Oct 24, 2015, at 04:24 PM, Aki Balogh wrote:

Hi Jack,

I'm just using solr to get word count across a large number of
documents. It's somewhat non-standard, because we're ignoring relevance,
but it seems to work well for this use case otherwise.

My understanding then is:
1) since termfreq is pre-processed and fetched, there's no good way to
speed it up (except by caching earlier calculations)
2) there's no way to have solr sum up all of the termfreqs across all
documents in a search and just return one number for total termfreqs

Are these correct?

Thanks,
Aki

On Sat, Oct 24, 2015 at 11:20 AM, Jack Krupansky wrote:

That's what a normal query does - Lucene takes all the terms used in the
query and sums them up for each document in the response, producing a
single number, the score, for each document. That's the way Solr is
designed to be used. You still haven't elaborated why you are trying to
use Solr in a way other than it was intended.

-- Jack Krupansky

On Sat, Oct 24, 2015 at 11:13 AM, Aki Balogh wrote:

Gotcha - that's disheartening.

One idea: when I run termfreq, I get all of the termfreqs for each
document one-by-one.

Is there a way to have solr sum it up before creating the request, so I
only receive one number in the response?

On Sat, Oct 24, 2015 at 11:05 AM, Upayavira wrote:

If you mean using the term frequency function query, then I'm not sure
there's a huge amount you can do to improve performance.

The term frequency is a number that is used often, so it is stored in
the index pre-calculated. Perhaps, if your data is not changing,
optimising your index would reduce it to one segment, and thus might
ever so slightly speed the aggregation of term frequencies, but I doubt
it'd make enough difference to make it worth doing.

Upayavira

On Sat, Oct 24, 2015, at 03:37 PM, Aki Balogh wrote:

Thanks, Jack. I did some more research and found similar results.

In our application, we are making multiple (think: 50) concurrent
requests to calculate term frequency on a set of documents in
"real-time". The faster that results return, the better.

Most of these requests are unique, so cache only helps slightly.

This analysis is happening on a single solr instance.

Other than moving to solr cloud and splitting out the processing onto
multiple servers, do you have any suggestions for what might speed up
termfreq at query time?

Thanks,
Aki

On Fri, Oct 23, 2015 at 7:21 PM, Jack Krupansky wrote:

Term frequency applies only to the indexed terms of a tokenized field.
DocValues is really just a copy of the o

Re: Does docValues impact termfreq ?

2015-10-26 Thread Aki Balogh
Hi Emir,

This is correct. This is the only way we use the index.

Thanks,
Aki

On Mon, Oct 26, 2015 at 9:31 AM, Emir Arnautovic <
emir.arnauto...@sematext.com> wrote:

> If I got it right, you are using term query, use function to get TF as
> score, iterate all documents in results and sum up total number of
> occurrences of specific term in index? Is this only way you use index or
> this is side functionality?
>
> Thanks,
> Emir
>
>
> On 24.10.2015 22:28, Aki Balogh wrote:
>
>> Certainly, yes. I'm just doing a word count, ie how often does a specific
>> term come up in the corpus?
>> On Oct 24, 2015 4:20 PM, "Upayavira"  wrote:
>>
>> yes, but what do you want to do with the TF? What problem are you
>>> solving with it? If you are able to share that...
>>>
>>> On Sat, Oct 24, 2015, at 09:05 PM, Aki Balogh wrote:
>>>
 Yes, sorry, I am not being clear.

 We are not even doing scoring, just getting the raw TF values. We're
 doing
 this in solr because it can scale well.

 But with large corpora, retrieving the word counts takes some time, in
 part
 because solr is splitting up word count by document and generating a
 large
 request. We then get the request and just sum it all up. I'm wondering
 if
 there's a more direct way.
 On Oct 24, 2015 4:00 PM, "Upayavira"  wrote:

 Can you explain more what you are using TF for? Because it sounds
>
 rather
>>>
 like scoring. You could disable field norms and IDF and scoring would
>
 be
>>>
 mostly TF, no?
>
> Upayavira
>
> On Sat, Oct 24, 2015, at 07:28 PM, Aki Balogh wrote:
>
>> Thanks, let me think about that.
>>
>> We're using termfreq to get the TF score, but we don't know which
>>
> term
>>>
 we'll need the TF for. So we'd have to do a corpuswide summing of
>> termfreq
>> for each potential term across all documents in the corpus. It seems
>>
> like
>>>
 it'd require some development work to compute that, and our code
>>
> would be
>>>
 fragile.
>>
>> Let me think about that more.
>>
>> It might make sense to just move to solrcloud, it's the right
>> architectural
>> decision anyway.
>>
>>
>> On Sat, Oct 24, 2015 at 1:54 PM, Upayavira  wrote:
>>
>> If you just want word length, then do work during indexing - index
>>>
>> a
>>>
 field for the word length. Then, I believe you can do faceting -
>>>
>> e.g.
>>>
 with the json faceting API I believe you can do a sum()
>>>
>> calculation on
>>>
 a
>
>> field rather than the more traditional count.
>>>
>>> Thinking aloud, there might be an easier way - index a field that
>>>
>> is
>>>
 the
>
>> same for all documents, and facet on it. Instead of counting the
>>>
>> number
>>>
 of documents, calculate the sum() of your word count field.
>>>
>>> I *think* that should work.
>>>
>>> Upayavira
>>>
>>> On Sat, Oct 24, 2015, at 04:24 PM, Aki Balogh wrote:
>>>
 Hi Jack,

 I'm just using solr to get word count across a large number of

>>> documents.
>
>> It's somewhat non-standard, because we're ignoring relevance,

>>> but it
>>>
 seems
 to work well for this use case otherwise.

 My understanding then is:
 1) since termfreq is pre-processed and fetched, there's no good

>>> way
>>>
 to
>
>> speed it up (except by caching earlier calculations)

 2) there's no way to have solr sum up all of the termfreqs

>>> across all
>>>
 documents in a search and just return one number for total

>>> termfreqs
>>>

 Are these correct?

 Thanks,
 Aki


 On Sat, Oct 24, 2015 at 11:20 AM, Jack Krupansky
 
 wrote:

 That's what a normal query does - Lucene takes all the terms
>
 used
>>>
 in
>
>> the
>>>
 query and sums them up for each document in the response,
>
 producing a
>
>> single number, the score, for each document. That's the way
>
 Solr is
>>>
 designed to be used. You still haven't elaborated why you are
>
 trying
>
>> to use
>>>
 Solr in a way other than it was intended.
>
> -- Jack Krupansky
>
> On Sat, Oct 24, 2015 at 11:13 AM, Aki Balogh <
>
 a...@marketmuse.com>
>>>
 wrote:
>>>
 Gotcha - that's disheartening.
>>
>> One idea: when I run termfreq, I get all of the termfreqs for
>>
> each
>
>> document
>
>> one-by-one.
>>
>> Is there a way to have solr sum it up before creating the
>>
> request,
>

Re: Does docValues impact termfreq ?

2015-10-26 Thread Scott Stults
Aki, does the sumtotaltermfreq function do what you need?
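
For example, a sketch with placeholder field and term names ('body' and
'solr'):

http://localhost:8983/solr/collection1/select?q=*:*&rows=1&fl=ttf(body,'solr'),sttf(body)

ttf() (totaltermfreq) returns a single term's frequency summed across the
whole index, and sttf() (sumtotaltermfreq) sums that over every term in the
field.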


On Mon, Oct 26, 2015 at 9:43 AM, Aki Balogh  wrote:

> Hi Emir,
>
> This is correct. This is the only way we use the index.
>
> Thanks,
> Aki
>
> On Mon, Oct 26, 2015 at 9:31 AM, Emir Arnautovic <
> emir.arnauto...@sematext.com> wrote:
>
> > If I got it right, you are using term query, use function to get TF as
> > score, iterate all documents in results and sum up total number of
> > occurrences of specific term in index? Is this only way you use index or
> > this is side functionality?
> >
> > Thanks,
> > Emir
> >
> >
> > On 24.10.2015 22:28, Aki Balogh wrote:
> >
> >> Certainly, yes. I'm just doing a word count, ie how often does a
> specific
> >> term come up in the corpus?
> >> On Oct 24, 2015 4:20 PM, "Upayavira"  wrote:
> >>
> >> yes, but what do you want to do with the TF? What problem are you
> >>> solving with it? If you are able to share that...
> >>>
> >>> On Sat, Oct 24, 2015, at 09:05 PM, Aki Balogh wrote:
> >>>
>  Yes, sorry, I am not being clear.
> 
>  We are not even doing scoring, just getting the raw TF values. We're
>  doing
>  this in solr because it can scale well.
> 
>  But with large corpora, retrieving the word counts takes some time, in
>  part
>  because solr is splitting up word count by document and generating a
>  large
>  request. We then get the request and just sum it all up. I'm wondering
>  if
>  there's a more direct way.
>  On Oct 24, 2015 4:00 PM, "Upayavira"  wrote:
> 
>  Can you explain more what you are using TF for? Because it sounds
> >
>  rather
> >>>
>  like scoring. You could disable field norms and IDF and scoring would
> >
>  be
> >>>
>  mostly TF, no?
> >
> > Upayavira
> >
> > On Sat, Oct 24, 2015, at 07:28 PM, Aki Balogh wrote:
> >
> >> Thanks, let me think about that.
> >>
> >> We're using termfreq to get the TF score, but we don't know which
> >>
> > term
> >>>
>  we'll need the TF for. So we'd have to do a corpuswide summing of
> >> termfreq
> >> for each potential term across all documents in the corpus. It seems
> >>
> > like
> >>>
>  it'd require some development work to compute that, and our code
> >>
> > would be
> >>>
>  fragile.
> >>
> >> Let me think about that more.
> >>
> >> It might make sense to just move to solrcloud, it's the right
> >> architectural
> >> decision anyway.
> >>
> >>
> >> On Sat, Oct 24, 2015 at 1:54 PM, Upayavira  wrote:
> >>
> >> If you just want word length, then do work during indexing - index
> >>>
> >> a
> >>>
>  field for the word length. Then, I believe you can do faceting -
> >>>
> >> e.g.
> >>>
>  with the json faceting API I believe you can do a sum()
> >>>
> >> calculation on
> >>>
>  a
> >
> >> field rather than the more traditional count.
> >>>
> >>> Thinking aloud, there might be an easier way - index a field that
> >>>
> >> is
> >>>
>  the
> >
> >> same for all documents, and facet on it. Instead of counting the
> >>>
> >> number
> >>>
>  of documents, calculate the sum() of your word count field.
> >>>
> >>> I *think* that should work.
> >>>
> >>> Upayavira
> >>>
> >>> On Sat, Oct 24, 2015, at 04:24 PM, Aki Balogh wrote:
> >>>
>  Hi Jack,
> 
>  I'm just using solr to get word count across a large number of
> 
> >>> documents.
> >
> >> It's somewhat non-standard, because we're ignoring relevance,
> 
> >>> but it
> >>>
>  seems
>  to work well for this use case otherwise.
> 
>  My understanding then is:
>  1) since termfreq is pre-processed and fetched, there's no good
> 
> >>> way
> >>>
>  to
> >
> >> speed it up (except by caching earlier calculations)
> 
>  2) there's no way to have solr sum up all of the termfreqs
> 
> >>> across all
> >>>
>  documents in a search and just return one number for total
> 
> >>> termfreqs
> >>>
> 
>  Are these correct?
> 
>  Thanks,
>  Aki
> 
> 
>  On Sat, Oct 24, 2015 at 11:20 AM, Jack Krupansky
>  
>  wrote:
> 
>  That's what a normal query does - Lucene takes all the terms
> >
>  used
> >>>
>  in
> >
> >> the
> >>>
>  query and sums them up for each document in the response,
> >
>  producing a
> >
> >> single number, the score, for each document. That's the way
> >
>  Solr is
> >>>
>  designed to be used. You still haven't elaborated why you are
> >
>  trying
> >
> >> to use
> >>>
>  Solr in a way other than it was intended.
> >

solr-user-subscribe

2015-10-26 Thread Margherita Di Leo
-- 
Margherita Di Leo


Re: Does docValues impact termfreq ?

2015-10-26 Thread Emir Arnautovic

Hi Aki,
IMO this is an underuse of Solr (not to mention SolrCloud). I would
recommend doing in-memory document parsing (if you need something from the
Lucene/Solr analysis classes, use it) and using some cache-like solution to
store term/total-frequency pairs (you can try Redis).


That way you will have updatable, fast total-frequency lookups.
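
(Just a sketch of the idea; the key naming is made up:

redis-cli INCRBY tf:solr 42   # add one document's count for the term "solr"
redis-cli GET tf:solr         # corpus-wide total in a single fast lookup
)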

Thanks,
Emir

On 26.10.2015 14:43, Aki Balogh wrote:

Hi Emir,

This is correct. This is the only way we use the index.

Thanks,
Aki

On Mon, Oct 26, 2015 at 9:31 AM, Emir Arnautovic <
emir.arnauto...@sematext.com> wrote:


If I got it right, you are using term query, use function to get TF as
score, iterate all documents in results and sum up total number of
occurrences of specific term in index? Is this only way you use index or
this is side functionality?

Thanks,
Emir


On 24.10.2015 22:28, Aki Balogh wrote:


Certainly, yes. I'm just doing a word count, ie how often does a specific
term come up in the corpus?
On Oct 24, 2015 4:20 PM, "Upayavira"  wrote:

yes, but what do you want to do with the TF? What problem are you

solving with it? If you are able to share that...

On Sat, Oct 24, 2015, at 09:05 PM, Aki Balogh wrote:


Yes, sorry, I am not being clear.

We are not even doing scoring, just getting the raw TF values. We're
doing
this in solr because it can scale well.

But with large corpora, retrieving the word counts takes some time, in
part
because solr is splitting up word count by document and generating a
large
request. We then get the request and just sum it all up. I'm wondering
if
there's a more direct way.
On Oct 24, 2015 4:00 PM, "Upayavira"  wrote:

Can you explain more what you are using TF for? Because it sounds
rather
like scoring. You could disable field norms and IDF and scoring would
be
mostly TF, no?

Upayavira

On Sat, Oct 24, 2015, at 07:28 PM, Aki Balogh wrote:


Thanks, let me think about that.

We're using termfreq to get the TF score, but we don't know which


term

we'll need the TF for. So we'd have to do a corpuswide summing of

termfreq
for each potential term across all documents in the corpus. It seems


like

it'd require some development work to compute that, and our code

would be

fragile.

Let me think about that more.

It might make sense to just move to solrcloud, it's the right
architectural
decision anyway.


On Sat, Oct 24, 2015 at 1:54 PM, Upayavira  wrote:

If you just want word length, then do work during indexing - index
a

field for the word length. Then, I believe you can do faceting -

e.g.

with the json faceting API I believe you can do a sum()

calculation on

a

field rather than the more traditional count.

Thinking aloud, there might be an easier way - index a field that


is

the

same for all documents, and facet on it. Instead of counting the
number

of documents, calculate the sum() of your word count field.

I *think* that should work.

Upayavira

On Sat, Oct 24, 2015, at 04:24 PM, Aki Balogh wrote:


Hi Jack,

I'm just using solr to get word count across a large number of


documents.

It's somewhat non-standard, because we're ignoring relevance,

but it

seems

to work well for this use case otherwise.

My understanding then is:
1) since termfreq is pre-processed and fetched, there's no good


way

to

speed it up (except by caching earlier calculations)

2) there's no way to have solr sum up all of the termfreqs


across all

documents in a search and just return one number for total

termfreqs

Are these correct?

Thanks,
Aki


On Sat, Oct 24, 2015 at 11:20 AM, Jack Krupansky

wrote:

That's what a normal query does - Lucene takes all the terms
used

in

the

query and sums them up for each document in the response,
producing a

single number, the score, for each document. That's the way

Solr is

designed to be used. You still haven't elaborated why you are

trying

to use

Solr in a way other than it was intended.

-- Jack Krupansky

On Sat, Oct 24, 2015 at 11:13 AM, Aki Balogh <


a...@marketmuse.com>

wrote:

Gotcha - that's disheartening.

One idea: when I run termfreq, I get all of the termfreqs for


each

document

one-by-one.

Is there a way to have solr sum it up before creating the


request,

so I

only receive one number in the response?


On Sat, Oct 24, 2015 at 11:05 AM, Upayavira 


wrote:

If you mean using the term frequency function query, then

I'm

not

sure

there's a huge amount you can do to improve performance.

The term frequency is a number that is used often, so it is


stored

in

the index pre-calculated. Perhaps, if your data is not

changing,

optimising your index would reduce it to one segment, and

thus

might

ever so slightly speed the aggregation of term frequencies,

but I

doubt

it'd make enough difference to make it worth doing.

Upayavira

On Sat, Oct 24, 2015, at 03:37 PM, Aki Balogh wrote:


Thanks, Jack. I did some more research and found similar


results.

In our application, we are making multiple (think: 50)

concurrent

requests

to calculate te

Re: Solr Pagination

2015-10-26 Thread Upayavira


On Sun, Oct 25, 2015, at 05:43 PM, Salman Ansari wrote:
> Thanks guys for your responses.
> 
> That's a very very large cache size.  It is likely to use a VERY large
> amount of heap, and autowarming up to 4096 entries at commit time might
> take many *minutes*.  Each filterCache entry is maxDoc/8 bytes.  On an
> index core with 70 million documents, each filterCache entry is at least
> 8.75 million bytes.  Multiply that by 16384, and a completely full cache
> would need about 140GB of heap memory.  4096 entries will require 35GB.
>  I don't think this cache is actually storing that many entries, or you
> would most certainly be running into OutOfMemoryError exceptions.
> 
> True; however, I have tried with the default filterCache at the beginning,
> and the problem was still there. So I don't think that is how I should
> increase the performance of my Solr. Moreover, as you mentioned, when I
> change the configuration I should be running out of memory, but that did
> not happen. Do you think my Solr has not picked up the latest configs? I
> have restarted Solr, btw.
> 
> Lately I have been trying different ways to improve this, and I have
> created a brand new index on the same machine using 2 shards with few
> entries (about 5), and the performance was booming; I got results back in
> 42 ms sometimes. What concerns me is that maybe I am loading too much into
> one index and that is what is killing the performance. Is there a
> recommended index size / document count that I should be looking at to
> tune this? Any ideas other than increasing the memory size, which I have
> already tried?

The optimal index size is down to the size of segments on disk. New
segments are created when hard commits occur, and existing on-disk
segments may get merged in the background when the segment count gets
too high. Now, if those on-disk segments get too large, copying them
around at merge time can get prohibitive, especially if your index is
changing frequently.

Splitting such an index into shards is one approach to dealing with this
issue.
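
(If segment size itself becomes the problem, the knob involved is the merge
policy's maximum merged-segment size; as a sketch for solrconfig.xml's
<indexConfig>, with an illustrative value:

<mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
  <double name="maxMergedSegmentMB">5000</double>
</mergePolicy>
)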

Upayavira


using a custom update for all documents

2015-10-26 Thread Roxana Danger
Hello everyone,
Is there a way to update all the documents in the solr index using a custom
update processor?
Thank you,
Roxana


Re: using a custom update for all documents

2015-10-26 Thread Upayavira


On Mon, Oct 26, 2015, at 02:58 PM, Roxana Danger wrote:
> Hello everyone,
> Is there a way to update all the documents in the solr index using a
> custom
> update processor?


You want to re-index all documents?

If so, that's not really how update processors work. They trigger when a
new document is posted. You would need, somehow, to post every document
again in order to trigger the update processor's ability to do its work.
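
One sketch of the "post every document again" part is a separate DIH config
that reads from the index itself via SolrEntityProcessor (the core URL and
query here are placeholders):

<entity processor="SolrEntityProcessor"
        url="http://localhost:8983/solr/mycore"
        query="*:*"/>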

Upayavira


Re: using a custom update for all documents

2015-10-26 Thread Roxana Danger
Thank you very much, Upayavira.

I am indexing my documents with a DIH, but I need to run my custom update
processor after the commit, not in the update.chain.
Do update processors only work while adding new documents?

In this case, I have tried two alternatives:
1) using the onImportEnd event. However, I need to restart solr to see my
current updates. Is there any way to inform solr that it needs to reload
the searcher?
2) re-importing the indexed documents using SolrEntityProcessor and adding
my custom chain processor.
Which is the best approach?

Best,
Roxana


On 26 October 2015 at 15:06, Upayavira  wrote:

>
>
> On Mon, Oct 26, 2015, at 02:58 PM, Roxana Danger wrote:
> > Hello everyone,
> > Is there a way to update all the documents in the solr index using a
> > custom
> > update processor?
>
>
> You want to re-index all documents?
>
> If so, that's not really how update processors work. They trigger when a
> new document is posted. You would need, somehow, to post every document
> again in order to trigger the update processor's ability to do its work.
>
> Upayavira
>





Re: using a custom update for all documents

2015-10-26 Thread Alexandre Rafalovitch
Roxana,

You've been asked a couple of times by several people to explain your
business needs (at a level higher than Solr itself). As it is, you are
slowly getting deeper and deeper into Solr's internals, where there
might be an easier answer if we knew what you are trying to achieve.

It is your choice of course, but it might be easier to step back and
reconfirm what you are actually trying to achieve beyond the Solr
technicalities.

Regards,
   Alex.

Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
http://www.solr-start.com/


On 26 October 2015 at 11:29, Roxana Danger
 wrote:
> Thank you very much, Upayavira.
>
> I am indexing my documents with a DIH, but I need to use my custom update
> processor after the commit, not in the update.chain.
> Do update processors only work while adding new documents?
>
> In this case, I have tried two alternatives:
> 1) using the onImportEnd event. However I need to restart solr to see my
> current updates. Is there any way to inform solr that it needs to reload
> the searcher?
> 2) re-importing the indexed documents using SolrEntityProcessor and adding
> my custom chain processor. Which is the best approach?
>
> Best,
> Roxana
>
>
> On 26 October 2015 at 15:06, Upayavira  wrote:
>
>>
>>
>> On Mon, Oct 26, 2015, at 02:58 PM, Roxana Danger wrote:
>> > Hello everyone,
>> > Is there a way to update all the documents in the solr index using a
>> > custom
>> > update processor?
>>
>>
>> You want to re-index all documents?
>>
>> If so, that's not really how update processors work. They trigger when a
>> new document is posted. You would need, somehow, to post every document
>> again in order to trigger the update processor's ability to do its work.
>>
>> Upayavira
>>
>
>
>


Re: Solr.cmd cannot create collection in Solr 5.2.1

2015-10-26 Thread Shawn Heisey
On 10/26/2015 2:23 AM, Adrian Liew wrote:
> {
>   "responseHeader":{
> "status":0,
> "QTime":1735},
>   
> "failure":{"":"org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrExce
> ption:Error from server at http://172.18.111.112:8983/solr: Error CREATEing 
> Solr
> Core 'sitecore_core_index_shard1_replica2': Unable to create core 
> [sitecore_core
> _index_shard1_replica2] Caused by: Can't find resource 'solrconfig.xml' in 
> class
> path or '/configs/sitecore_common_config', 
> cwd=D:\\Solr-5.2.1-Instance\\server"}
> }
>
> I do a  check to see if solrconfig.xml is present in the Zookeeper, if I run 
> zkCli.bat -cmd list on the each of the server, I can see that solrconfig.xml 
> is listed:
>
> DATA:
>
> /configs (1)
>   /configs/sitecore_common_config (1)
>/configs/sitecore_common_config/conf (8)
> /configs/sitecore_common_config/conf/currency.xml (0)

I think the problem is that you included the conf directory in what you
uploaded to zookeeper.  The config files (solrconfig.xml, schema.xml,
etc) should be sitting right in the directory you upload, not inside a
conf subdirectory.  This is somewhat counterintuitive when compared to
what happens when NOT running in cloud mode, but the logic is fairly
simple:  The conf directory is what gets uploaded to zookeeper.
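
For example, something like this sketch (zkcli.bat ships under
server\scripts\cloud-scripts; the -confdir path is wherever your config
files actually live):

D:\Solr-5.2.1-Instance\server\scripts\cloud-scripts\zkcli.bat ^
  -zkhost 172.18.111.111:2181,172.18.111.112:2182,172.18.112.112:2183 ^
  -cmd upconfig -confname sitecore_common_config -confdir D:\configs\sitecore\conf

With -confdir pointing at the conf directory itself, solrconfig.xml lands
directly at /configs/sitecore_common_config/solrconfig.xml instead of under
a conf subdirectory.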

A question for fellow committers:  Is it too much handholding for us to
look in a conf directory in zookeeper?  My bias is that we should not do
that, but I do not see it as particularly harmful.

Thanks,
Shawn



Re: Solr.cmd cannot create collection in Solr 5.2.1

2015-10-26 Thread Upayavira
On Mon, Oct 26, 2015, at 04:10 PM, Shawn Heisey wrote:
> On 10/26/2015 2:23 AM, Adrian Liew wrote:
> > {
> >   "responseHeader":{
> > "status":0,
> > "QTime":1735},
> >   
> > "failure":{"":"org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrExce
> > ption:Error from server at http://172.18.111.112:8983/solr: Error CREATEing 
> > Solr
> > Core 'sitecore_core_index_shard1_replica2': Unable to create core 
> > [sitecore_core
> > _index_shard1_replica2] Caused by: Can't find resource 'solrconfig.xml' in 
> > class
> > path or '/configs/sitecore_common_config', 
> > cwd=D:\\Solr-5.2.1-Instance\\server"}
> > }
> >
> > I do a  check to see if solrconfig.xml is present in the Zookeeper, if I 
> > run zkCli.bat -cmd list on the each of the server, I can see that 
> > solrconfig.xml is listed:
> >
> > DATA:
> >
> > /configs (1)
> >   /configs/sitecore_common_config (1)
> >/configs/sitecore_common_config/conf (8)
> > /configs/sitecore_common_config/conf/currency.xml (0)
> 
> I think the problem is that you included the conf directory in what you
> uploaded to zookeeper.  The config files (solrconfig.xml, schema.xml,
> etc) should be sitting right in the directory you upload, not inside a
> conf subdirectory.  This is somewhat counterintuitive when compared to
> what happens when NOT running in cloud mode, but the logic is fairly
> simple:  The conf directory is what gets uploaded to zookeeper.
> 
> A question for fellow committers:  Is it too much handholding for us to
> look in a conf directory in zookeeper?  My bias is that we should not do
> that, but I do not see it as particularly harmful.

Or to have the upconfig command barf if there isn't a solrconfig.xml
file in the directory concerned. That'd give quick feedback that
something is being done wrong.

Upayavira


Payload doesn't apply to WordDelimiterFilterFactory-generated tokens

2015-10-26 Thread Jamie Johnson
I came across this post (
http://lucene.472066.n3.nabble.com/Payload-doesn-t-apply-to-WordDelimiterFilterFactory-generated-tokens-td3136748.html)
and tried to find a JIRA for this task.  Was one ever created?  If not, I'd
be happy to create it, if this is still something that makes sense, or if
instead there is another recommended approach for supporting cloning
attributes like payload from the source token stream in the
WordDelimiterFilterFactory.


Re: copy data between collection

2015-10-26 Thread Jeff Wartes

The “copy” command in this tool automatically does what Upayavira
describes, including bringing the replicas (if any) up to date.
https://github.com/whitepages/solrcloud_manager


I’ve been using it as a mechanism for copying a collection into a new
cluster (different ZK), but it should work within
a cluster too. The same caveats apply - see the entry in the README.

I’ve also been doing some collection backup/restore stuff that could be
used to copy a collection within a cluster, (back up your collection, then
restore into a new collection with a different name) but I only just
pushed that, and haven’t bundled it into a release yet.

In all cases, you’re responsible for managing the actual collection
definitions yourself.

An alternative tool I’m aware of is this one:
https://github.com/bloomreach/solrcloud-haft

This says it’s only tested with Solr 4.6, but I’d think it should work.
The Solr APIs for replication haven’t changed much. I haven’t used it, but
it looks like it has some stuff around saving ZK data that could be
useful, and that’s one thing I haven’t focused on myself yet.



On 10/26/15, 4:46 AM, "Upayavira"  wrote:

>Hi Shani,
>
>There isn't a SolrCloud way to do it. A proper 'clone this collection'
>feature would be a very useful thing.
>
>However, I have managed to do it, in a way that involves some caveats:
> * you should only do this on a collection that has no replicas. Add
> replicas *after* cloning the index
> * if you must do it on a sharded index, then you will need to do it
> once for each shard. No guarantees though
>
>All SolrCloud nodes are all already enabled as 'replication masters' so
>that new replicas can pull a full index from the current leader. We're
>gonna use this feature to pull our index (assuming single shard):
>
>http://<new_node>:8983/solr/<new_collection>_shard1_replica1/replication?command=fetchindex&masterUrl=http://<old_node>:8983/solr/<old_collection>_shard1_replica1/replication
>
>This basically says to the core behind your new collection: "Go to the
>core behind the old collection, and pull its entire index".
>
>This worked for me. I added a replica afterwards, and the index cloned
>correctly. However, when I did it against a collection that had a
>replica already, the replica *didn't* notice, meaning the leader/replica
>were now out of sync, i.e: Really make sure you do this replication
>before you add replicas to your new collection.
>
>Hope this helps.
>
>Upayavira
>
>On Mon, Oct 26, 2015, at 11:21 AM, Chaushu, Shani wrote:
>> Hi,
>> Is there an API to copy all the documents from one collection to another
>> collection in the same solr server simply?
>> I'm using solr cloud 4.10
>> Thanks,
>> Shani
>> 



Zookeeper issue causing all nodes to fail

2015-10-26 Thread philippa griggs
Hello all,


We have been experiencing some major Solr issues.


Solr 5.2.1, 10 shards, each with a replica (20 nodes in total).

Three external ZooKeepers, 3.4.6.


Node 19 went down, and a short while after this occurred, all our nodes were
wiped out.  The cloud diagram, live_nodes and clusterstate.json all showed
different nodes as being down/active, and it changed on every refresh.


Looking at the logs across all the nodes, there were ZooKeeper errors (a
couple of examples below); however, there were no out-of-the-ordinary errors
in ZooKeeper itself.



WARN  - 2015-10-22 09:39:48.536; [   ] org.apache.solr.cloud.ZkController$4; 
listener throws error

org.apache.solr.common.SolrException: 
org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = 
Session expired for /configs/XXX/params.json

   at 
org.apache.solr.core.RequestParams.getFreshRequestParams(RequestParams.java:163)

   at 
org.apache.solr.core.SolrConfig.refreshRequestParams(SolrConfig.java:926)

   at org.apache.solr.core.SolrCore$11.run(SolrCore.java:2580)

   at org.apache.solr.cloud.ZkController$4.run(ZkController.java:2376)

Caused by: org.apache.zookeeper.KeeperException$SessionExpiredException: 
KeeperErrorCode = Session expired for /configs/XXX/params.json

   at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)

   at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)

   at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1045)

   at 
org.apache.solr.common.cloud.SolrZkClient$4.execute(SolrZkClient.java:294)

   at 
org.apache.solr.common.cloud.SolrZkClient$4.execute(SolrZkClient.java:291)

   at 
org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:61)

   at 
org.apache.solr.common.cloud.SolrZkClient.exists(SolrZkClient.java:291)

   at 
org.apache.solr.core.RequestParams.getFreshRequestParams(RequestParams.java:153)

   ... 3 more


ERROR - 2015-10-26 11:28:13.141; [XXX shard6  ] 
org.apache.solr.common.SolrException; There was a problem trying to register as 
the leader:org.apache.solr.common.SolrException: Could not register as the 
leader because creating the ephemeral registration node in ZooKeeper failed

   at 
org.apache.solr.cloud.ShardLeaderElectionContextBase.runLeaderProcess(ElectionContext.java:154)

   at 
org.apache.solr.cloud.ShardLeaderElectionContext.runLeaderProcess(ElectionContext.java:330)

   at 
org.apache.solr.cloud.LeaderElector.runIamLeaderProcess(LeaderElector.java:198)

   at 
org.apache.solr.cloud.LeaderElector.checkIfIamLeader(LeaderElector.java:159)

   at 
org.apache.solr.cloud.LeaderElector.joinElection(LeaderElector.java:348)

   at 
org.apache.solr.cloud.ZkController.joinElection(ZkController.java:1075)

   at org.apache.solr.cloud.ZkController.register(ZkController.java:888)

   at 
org.apache.solr.cloud.ZkController$RegisterCoreAsync.call(ZkController.java:226)

   at java.util.concurrent.FutureTask.run(Unknown Source)

   at 
org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor$1.run(ExecutorUtil.java:148)

   at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)

   at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)

   at java.lang.Thread.run(Unknown Source)



ERROR - 2015-10-26 11:29:45.757; [XXX shard6  ] 
org.apache.solr.common.SolrException; Error while trying to recover. 
core=XXX:org.apache.zookeeper.KeeperException$SessionExpiredException: 
KeeperErrorCode = Session expired for /overseer/queue/qn-

   at org.apache.zookeeper.KeeperException.create(KeeperException.java:127)

   at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)

   at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:783)

   at 
org.apache.solr.common.cloud.SolrZkClient$9.execute(SolrZkClient.java:380)

   at 
org.apache.solr.common.cloud.SolrZkClient$9.execute(SolrZkClient.java:377)

   at 
org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:61)

   at 
org.apache.solr.common.cloud.SolrZkClient.create(SolrZkClient.java:377)

   at 
org.apache.solr.cloud.DistributedQueue.createData(DistributedQueue.java:380)

   at 
org.apache.solr.cloud.DistributedQueue.offer(DistributedQueue.java:364)

   at org.apache.solr.cloud.ZkController.publish(ZkController.java:1219)

   at org.apache.solr.cloud.ZkController.publish(ZkController.java:1129)

   at org.apache.solr.cloud.ZkController.publish(ZkController.java:1125)

   at 
org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:348)

   at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:229)




WARN  - 2015-10-26 11:32:03.116; [   XXX] org.apache.solr.cloud.ZkController$4; 
listener throws error

org.apache.solr.common.SolrException: Unable to reload core [XXX]

   at org.apache.solr.core.CoreContainer.reload(Co

Re: Solr.cmd cannot create collection in Solr 5.2.1

2015-10-26 Thread Shalin Shekhar Mangar
+1 let's have upconfig complain loudly if the directory being uploaded
doesn't have a solrconfig.xml

On Mon, Oct 26, 2015 at 9:56 PM, Upayavira  wrote:
> On Mon, Oct 26, 2015, at 04:10 PM, Shawn Heisey wrote:
>> On 10/26/2015 2:23 AM, Adrian Liew wrote:
>> > {
>> >   "responseHeader":{
>> > "status":0,
>> > "QTime":1735},
>> >   
>> > "failure":{"":"org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrExce
>> > ption:Error from server at http://172.18.111.112:8983/solr: Error 
>> > CREATEing Solr
>> > Core 'sitecore_core_index_shard1_replica2': Unable to create core 
>> > [sitecore_core
>> > _index_shard1_replica2] Caused by: Can't find resource 'solrconfig.xml' in 
>> > class
>> > path or '/configs/sitecore_common_config', 
>> > cwd=D:\\Solr-5.2.1-Instance\\server"}
>> > }
>> >
>> > I do a  check to see if solrconfig.xml is present in the Zookeeper, if I 
>> > run zkCli.bat -cmd list on the each of the server, I can see that 
>> > solrconfig.xml is listed:
>> >
>> > DATA:
>> >
>> > /configs (1)
>> >   /configs/sitecore_common_config (1)
>> >/configs/sitecore_common_config/conf (8)
>> > /configs/sitecore_common_config/conf/currency.xml (0)
>>
>> I think the problem is that you included the conf directory in what you
>> uploaded to zookeeper.  The config files (solrconfig.xml, schema.xml,
>> etc) should be sitting right in the directory you upload, not inside a
>> conf subdirectory.  This is somewhat counterintuitive when compared to
>> what happens when NOT running in cloud mode, but the logic is fairly
>> simple:  The conf directory is what gets uploaded to zookeeper.
>>
>> A question for fellow committers:  Is it too much handholding for us to
>> look in a conf directory in zookeeper?  My bias is that we should not do
>> that, but I do not see it as particularly harmful.
>
> Or to have the upconfig command barf if there isn't a solrconfig.xml
> file in the directory concerned. That'd give quick feedback that
> something is being done wrong.
>
> Upayavira



-- 
Regards,
Shalin Shekhar Mangar.


Two separate instances of Solr on the same machine

2015-10-26 Thread Steven White
Hi,

For reasons I have no control over, I'm required to run 2 (maybe more)
instances of Solr on the same server (Windows and Linux).  To be more
specific, I will need to start each instance like so:

  > solr\bin\solr start -p 8983 -s ..\instance_one
  > solr\bin\solr start -p 8984 -s ..\instance_two
  > solr\bin\solr start -p 8985 -s ..\instance_three

Each of those instances is a stand alone Solr (no ZK here at all).

I have tested this over and over and did not see any issues.  However, I did
notice that each instance is writing to the same solr\server\logs\ files
(will this be an issue?).

Is the above something I should avoid?  If so, why?

Thanks in advance!

Steve


Re: Solr.cmd cannot create collection in Solr 5.2.1

2015-10-26 Thread Shawn Heisey
On 10/26/2015 11:27 AM, Shalin Shekhar Mangar wrote:
> +1 let's have upconfig complain loudly if the directory being uploaded
> doesn't have a solrconfig.xml

+1

I like this idea, as long as there's some kind of "force" option to
bypass that check.

I can imagine situations where somebody might want to upconfig a subset
of the files in a config, like stopwords.txt, as an update rather than a
full upload.  Do we ask people in that situation to switch to putfile
and explicitly name the znode path, or just have them add an option to
their existing command to force the upload even though it's incomplete? 
If I'm writing that script, I would prefer to have the latter option
available.
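
(For reference, the putfile route today looks something like this sketch,
with made-up paths:

server\scripts\cloud-scripts\zkcli.bat -zkhost localhost:2181 ^
  -cmd putfile /configs/myconf/stopwords.txt C:\local\stopwords.txt
)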

Thanks,
Shawn



Re: Solr.cmd cannot create collection in Solr 5.2.1

2015-10-26 Thread Upayavira


On Mon, Oct 26, 2015, at 06:04 PM, Shawn Heisey wrote:
> On 10/26/2015 11:27 AM, Shalin Shekhar Mangar wrote:
> > +1 let's have upconfig complain loudly if the directory being uploaded
> > doesn't have a solrconfig.xml
> 
> +1
> 
> I like this idea, as long as there's some kind of "force" option to
> bypass that check.
> 
> I can imagine situations where somebody might want to upconfig a subset
> of the files in a config, like stopwords.txt, as an update rather than a
> full upload.  Do we ask people in that situation to switch to putfile
> and explicitly name the znode path, or just have them add an option to
> their existing command to force the upload even though it's incomplete? 
> If I'm writing that script, I would prefer to have the latter option
> available.

In the end, this needs to be replaced by HTTP APIs. In the meantime, a
switch to disable the solrconfig check sounds reasonable.

Wanna create the ticket?

Upayavira


Query differently or change fieldtype

2015-10-26 Thread Brian Narsi
I have the following field type on a field ClientName:

<fieldType name="..." class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="2"
            maxGramSize="25"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
  </analyzer>
</fieldType>


For data where

ClientName = st jude medical inc

When querying I get the following:

1) st --> result = st jude medical inc (works correctly)
2) st j  --> No results are returned (NOT correct) - Expect to find st jude
medical inc
3) st ju m --> No results are returned (NOT correct) - Expect to find st
jude medical inc
4) st ju me --> result = st jude medical inc (works correctly)
5) st ju inc --> No results are returned (NOT correct) - Expect to find st
jude medical inc

Is my field type definition correct? Or do I need to query differently?

Thanks


Re: Query differently or change fieldtype

2015-10-26 Thread Upayavira
Use the analysis tab on the admin UI to see what analysis is doing to
your terms.

Then bear in mind that a query parser will split on space. So, you might
want to do clientName:"st ju me" to make the tokenisation happen within
the analysis chain rather than the query parser.
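
(As a sketch, with a made-up core name and URL-encoding for the quotes:
http://localhost:8983/solr/mycore/select?q=ClientName:%22st%20ju%20me%22 )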

Upayavira

On Mon, Oct 26, 2015, at 06:24 PM, Brian Narsi wrote:
> I have the following field type on a field ClientName:
> 
>  positionIncrementGap="100">
> 
> 
> 
>  maxGramSize="25"/>
> 
> 
> 
> 
> 
>   
> 
> 
> For data where
> 
> ClientName = st jude medical inc
> 
> When querying I get the following:
> 
> 1) st --> result = st jude medical inc (works correctly)
> 2) st j  --> No results are returned (NOT correct) - Expect to find st
> jude
> medical inc
> 3) st ju m --> No results are returned (NOT correct) - Expect to find st
> jude medical inc
> 4) st ju me --> result = st jude medical inc (works correctly)
> 5) st ju inc --> No results are returned (NOT correct) - Expect to find
> st
> jude medical inc
> 
> Is my field type definition correct? Or do I need to query differently?
> 
> Thanks


Re: Query differently or change fieldtype

2015-10-26 Thread Ray Niu
I found the config minGramSize="2", which means only grams of at least
2 chars get indexed, so "j" will not match.
Also, StandardTokenizerFactory will tokenize "st j" into "st" and "j".
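If you want single-letter prefixes like "j" to match too, one option is to
apply the edge n-grams only at index time, with minGramSize="1", and leave
the query side as plain tokenize-plus-lowercase. A rough sketch (untested;
the type name and filter chain are my guesses, not your exact config):

<fieldType name="text_prefix" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- minGramSize="1" so single-letter prefixes like "j" get indexed -->
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25"/>
  </analyzer>
  <analyzer type="query">
    <!-- no EdgeNGram at query time: each query token must match a whole gram -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

With q.op=AND (or mm=100%), "st j" then requires both the "st" gram and the
"j" gram, which is the prefix-match behavior you describe.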

On Monday, October 26, 2015, Brian Narsi wrote:

> I have the following field type on a field ClientName:
>
>  positionIncrementGap="100">
> 
> 
> 
>  maxGramSize="25"/>
> 
> 
> 
> 
> 
>   
>
>
> For data where
>
> ClientName = st jude medical inc
>
> When querying I get the following:
>
> 1) st --> result = st jude medical inc (works correctly)
> 2) st j  --> No results are returned (NOT correct) - Expect to find st jude
> medical inc
> 3) st ju m --> No results are returned (NOT correct) - Expect to find st
> jude medical inc
> 4) st ju me --> result = st jude medical inc (works correctly)
> 5) st ju inc --> No results are returned (NOT correct) - Expect to find st
> jude medical inc
>
> Is my field type definition correct? Or do I need to query differently?
>
> Thanks
>


Re: Solr.cmd cannot create collection in Solr 5.2.1

2015-10-26 Thread Shawn Heisey
On 10/26/2015 12:08 PM, Upayavira wrote:
> In the end, this needs to be replaced by HTTP APIs. In the meantime, a
> switch to disable the solrconfig check sounds reasonable.
>
> Wanna create the ticket?

Created:

https://issues.apache.org/jira/browse/SOLR-8214

Thanks,
Shawn



Re: Query differently or change fieldtype

2015-10-26 Thread Brian Narsi
That is right, Ray; that is exactly what I found out, and that is why I am
asking the question.

On Mon, Oct 26, 2015 at 2:19 PM, Ray Niu  wrote:

> I found the config minGramSize="2", which means only grams of at least
> 2 chars get indexed, so "j" will not match.
> Also, StandardTokenizerFactory will tokenize "st j" into "st" and "j".
>
> On Monday, October 26, 2015, Brian Narsi wrote:
>
> > I have the following field type on a field ClientName:
> >
> >  > positionIncrementGap="100">
> > 
> > 
> > 
> >  > maxGramSize="25"/>
> > 
> > 
> > 
> > 
> > 
> >   
> >
> >
> > For data where
> >
> > ClientName = st jude medical inc
> >
> > When querying I get the following:
> >
> > 1) st --> result = st jude medical inc (works correctly)
> > 2) st j  --> No results are returned (NOT correct) - Expect to find st
> jude
> > medical inc
> > 3) st ju m --> No results are returned (NOT correct) - Expect to find st
> > jude medical inc
> > 4) st ju me --> result = st jude medical inc (works correctly)
> > 5) st ju inc --> No results are returned (NOT correct) - Expect to find
> st
> > jude medical inc
> >
> > Is my field type definition correct? Or do I need to query differently?
> >
> > Thanks
> >
>


Re: Google didn't help on this one!

2015-10-26 Thread Mikhail Khludnev
Mark, the wiki and default configs are formally correct and just might not be
specific enough for the older spellcheckers. I closed SOLR-8063.
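For the record, the shape that worked for Mark is roughly this (a sketch, not
the exact wiki config):

<lst name="spellchecker">
  <str name="name">default</str>
  <str name="field">text</str>
  <str name="classname">solr.IndexBasedSpellChecker</str>
  <!-- the older Lucene-based checkers cast this value to Float,
       so declaring it with <str> blows up -->
  <float name="accuracy">0.5</float>
</lst>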



On Wed, Sep 16, 2015 at 5:30 PM, Mikhail Khludnev <
mkhlud...@griddynamics.com> wrote:

> Raised https://issues.apache.org/jira/browse/SOLR-8063
>
>
> On Wed, Sep 16, 2015 at 3:35 PM, Mark Fenbers 
> wrote:
>
>> Indeed!  <str name="accuracy"> should be changed to <float name="accuracy"> in the "Spell Checking"
>> document (https://cwiki.apache.org/confluence/display/solr/Spell+Checking)
>> and in all the baseline solrconfig.xml files provided in the distribution.
>> In addition, '<str name="distanceMeasure">internal</str>' should be
>> removed/changed in the same document and same solrconfig.xml files because
>> "internal" is not defined in AbstractLuceneSpellChecker.java!  Once I
>> edited these two problems in my own solrconfig.xml, the stacktrace errors
>> went away!!  Yay!
>>
>> But I'm not out of the woods yet!  I'll resume later, after our system
>> upgrade today.
>>
>> Thanks!
>> Mark
>>
>> On 9/16/2015 8:03 AM, Mikhail Khludnev wrote:
>>
>>>
>>> https://github.com/apache/lucene-solr/blob/trunk/solr/core/src/java/org/apache/solr/spelling/AbstractLuceneSpellChecker.java#L97
>>> this means that
>>>
>>> <str name="accuracy">0.5</str>
>>>
>>> should be replaced with
>>>
>>> <float name="accuracy">0.5</float>
>>>
>>>
>>>
>>>
>>>
>>
>
>
> --
> Sincerely yours
> Mikhail Khludnev
> Principal Engineer,
> Grid Dynamics
>
> 
> 
>



-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics





Re: Query differently or change fieldtype

2015-10-26 Thread Ray Niu
I think this is how StandardTokenizerFactory works; if you want different
behavior, you should try a different tokenizer. Also, like Upayavira said,
use the analysis tab on the admin UI to see what analysis is doing to your
terms.

2015-10-26 12:33 GMT-07:00 Brian Narsi :

> That is right Ray, that is exactly what I found out and that is why I am
> asking the question.
>
> On Mon, Oct 26, 2015 at 2:19 PM, Ray Niu  wrote:
>
> > I found the config minGramSize="2", which means only grams of at least
> > 2 chars get indexed, so "j" will not match.
> > Also, StandardTokenizerFactory will tokenize "st j" into "st" and "j".
> >
> > On Monday, October 26, 2015, Brian Narsi wrote:
> >
> > > I have the following field type on a field ClientName:
> > >
> > >  > > positionIncrementGap="100">
> > > 
> > > 
> > > 
> > >  > > maxGramSize="25"/>
> > > 
> > > 
> > > 
> > > 
> > > 
> > >   
> > >
> > >
> > > For data where
> > >
> > > ClientName = st jude medical inc
> > >
> > > When querying I get the following:
> > >
> > > 1) st --> result = st jude medical inc (works correctly)
> > > 2) st j  --> No results are returned (NOT correct) - Expect to find st
> > jude
> > > medical inc
> > > 3) st ju m --> No results are returned (NOT correct) - Expect to find
> st
> > > jude medical inc
> > > 4) st ju me --> result = st jude medical inc (works correctly)
> > > 5) st ju inc --> No results are returned (NOT correct) - Expect to find
> > st
> > > jude medical inc
> > >
> > > Is my field type definition correct? Or do I need to query differently?
> > >
> > > Thanks
> > >
> >
>


CloudSolrClient query /admin/info/system

2015-10-26 Thread Kevin Risden
I am trying to use CloudSolrClient to query information about the Solr
server including version information. I found /admin/info/system and it
seems to provide the information I am looking for. However, it looks like
CloudSolrClient cannot query /admin/info since INFO_HANDLER_PATH [1] is not
part of the ADMIN_PATHS in CloudSolrClient.java [2]. Was this possibly
missed as part of SOLR-4943 [3]?

Is this an issue or is there a better way to query this information?

As a side note, ZK_PATH also isn't listed in ADMIN_PATHS. I'm not sure what
issues that could cause. Is there a reason that ADMIN_PATHS in
CloudSolrClient would be different than the paths in CommonParams [1]?

[1]
https://github.com/apache/lucene-solr/blob/trunk/solr/solrj/src/java/org/apache/solr/common/params/CommonParams.java#L168
[2]
https://github.com/apache/lucene-solr/blob/trunk/solr/solrj/src/java/org/apache/solr/client/solrj/impl/CloudSolrClient.java#L808
[3] https://issues.apache.org/jira/browse/SOLR-4943
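In the meantime, one workaround should be to point a plain HttpSolrClient at
a single node and set the handler path by hand. A sketch (untested; the node
URL is an example, and error handling is omitted):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.request.QueryRequest;
import org.apache.solr.common.util.NamedList;

public class SystemInfo {
  public static void main(String[] args) throws Exception {
    // Talk to one node directly, bypassing CloudSolrClient's ADMIN_PATHS routing
    HttpSolrClient client = new HttpSolrClient("http://localhost:8983/solr");
    QueryRequest req = new QueryRequest(new SolrQuery());
    req.setPath("/admin/info/system");  // override the default handler path
    NamedList<Object> info = client.request(req);
    System.out.println(info.get("lucene"));  // solr-spec-version and friends
    client.close();
  }
}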

Kevin Risden
Hadoop Tech Lead | Avalon Consulting, LLC 
M: 732 213 8417
LinkedIn | Google+ | Twitter



Querying Dynamic Fields

2015-10-26 Thread Patrick Hoeffel
I have a simple Solr schema that uses dynamic fields to create most of my 
fields. This works great. Unfortunately, I now need to ask Solr to give me the 
names of the fields in the schema. I'm using:

http://localhost:8983/solr/core/schema/fields

This returns the statically defined fields, but does not return the ones that 
were created matching my dynamic definitions, such as *_s, *_i, *_txt, etc.

I know Solr is aware of these fields, because I can query against them.

What is the secret sauce to query their names and data types?

Thanks,

Patrick Hoeffel
Senior Software Engineer
Intelligent Software Solutions (www.issinc.com)
(719) 452-7371 (direct)
(719) 210-3706 (mobile)

"Bringing Knowledge to Light"



Best strategy for indexing multiple tables with multiple fields

2015-10-26 Thread Daniel Valdivia
Hi, I’m new to the Solr world and in need of some experienced advice. I see I 
can do a lot of cool stuff with Solr, but I’m not sure which path to take, so 
that I don’t shoot myself in the foot with all this power :P

I have several tables (225) in my application, which I’d like to add into a 
single index (multiple types of documents in the same index with a unique id); 
however, each table has a different number of columns, from 5 to 30. Do you 
recommend indexing each column separately or joining all columns into a 
single “big document”?

I’m trying to provide my users with a simple experience where they type their 
search query in a simple search box and I list all the possible documents 
across different tables that match their query, not sure if that strategy is 
the best, or perhaps a core per table?

So far these are my considered strategies:

unique_id, table, megafield: All of the columns in the record get mixed into 
a single megafield and indexed (cons: no faceting?)
a core per table: Each table gets a core, all the fields get indexed (except 
numbers and foreign keys), I’m not sure if having 200 cores will play nice with 
Solr
Single core, all fields get indexed (possibly 1,000s of columns); this sounds 
expensive and not so efficient to me

My application has around 2M records

Thanks in advance for any advice.

Cheers

RE: Querying Dynamic Fields

2015-10-26 Thread Matt Kuiper (Springblox)
Give the following a try -

http://localhost:8983/solr/core_name/admin/luke?numTerms=0 
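If you're on SolrJ, LukeRequest wraps the same handler. Something like this
(untested sketch; the core name is an example):

import java.util.Map;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.request.LukeRequest;
import org.apache.solr.client.solrj.response.LukeResponse;

public class ListFields {
  public static void main(String[] args) throws Exception {
    HttpSolrClient client = new HttpSolrClient("http://localhost:8983/solr/core_name");
    LukeRequest luke = new LukeRequest();
    luke.setNumTerms(0);  // skip per-term stats; we only want the field list
    LukeResponse rsp = luke.process(client);
    // Unlike /schema/fields, this includes fields created via dynamicField rules
    for (Map.Entry<String, LukeResponse.FieldInfo> e : rsp.getFieldInfo().entrySet()) {
      System.out.println(e.getKey() + " -> " + e.getValue().getType());
    }
    client.close();
  }
}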

Matt

Matt Kuiper

-Original Message-
From: Patrick Hoeffel [mailto:patrick.hoef...@issinc.com] 
Sent: Monday, October 26, 2015 4:56 PM
To: solr-user@lucene.apache.org
Subject: Querying Dynamic Fields

I have a simple Solr schema that uses dynamic fields to create most of my 
fields. This works great. Unfortunately, I now need to ask Solr to give me the 
names of the fields in the schema. I'm using:

http://localhost:8983/solr/core/schema/fields

This returns the statically defined fields, but does not return the ones that 
were created matching my dynamic definitions, such as *_s, *_i, *_txt, etc.

I know Solr is aware of these fields, because I can query against them.

What is the secret sauce to query their names and data types?

Thanks,

Patrick Hoeffel
Senior Software Engineer
Intelligent Software Solutions (www.issinc.com)
(719) 452-7371 (direct)
(719) 210-3706 (mobile)

"Bringing Knowledge to Light"



Solr hard commit

2015-10-26 Thread Rallavagu

All,

Are memory mapped files (mmap) flushed to disk during "hard commit"? If 
yes, should we disable OS level (Linux for example) memory mapped flush?


I am referring to following for mmap files for Lucene/Solr

http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html

Linux level flush

http://www.cyberciti.biz/faq/linux-stop-flushing-of-mmaped-pages-to-disk/

Solr's hard and soft commit

https://lucidworks.com/blog/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/

Thanks in advance.


Re: Two seperate intance of Solr on the same machine

2015-10-26 Thread Pushkar Raste
It depends on your case. If you don't mind logs from the 3 different
instances intermingled with each other, you should be fine.
You can add "-Dsolr.log=<log directory>" to make the logs go to different
directories. If you want the logs to go to the same directory but different
files, try updating log4j.properties.
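For example, with the start commands from your mail, that might look like
this (the log directories are just examples; if your version of the bin
script doesn't pass -D options through, set the property via SOLR_OPTS in
solr.in.cmd / solr.in.sh instead):

  > solr\bin start -p 8983 -s ..\instance_one -Dsolr.log=..\instance_one\logs
  > solr\bin start -p 8984 -s ..\instance_two -Dsolr.log=..\instance_two\logs
  > solr\bin start -p 8985 -s ..\instance_three -Dsolr.log=..\instance_three\logs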

On 26 October 2015 at 13:33, Steven White  wrote:

> Hi,
>
> For reasons I have no control over, I'm required to run 2 (maybe more)
> instances of Solr on the same server (Windows and Linux).  To be more
> specific, I will need to start each instance like so:
>
>   > solr\bin start -p 8983 -s ..\instance_one
>   > solr\bin start -p 8984 -s ..\instance_two
>   > solr\bin start -p 8985 -s ..\instance_three
>
> Each of those instances is a stand alone Solr (no ZK here at all).
>
> I have tested this over and over and did not see any issue.  However, I did
> notice that each instance is writing to the same solr\server\logs\ files
> (will this be an issue?!!)
>
> Is the above something I should avoid?  If so, why?
>
> Thanks in advance!!
>
> Steve
>


Re: Highlighting content field problem when using JiebaTokenizerFactory

2015-10-26 Thread Scott Chu
Hi Edwin,

I took some time to see if there's anything that can help you pin down the 
cause of your problem. Maybe this might help you a bit: 

[SOLR-4722] Highlighter which generates a list of query term position(s) for 
each item in a list of documents, or returns null if highlighting is disabled. 
- AS...
https://issues.apache.org/jira/browse/SOLR-4722

This one is modified from FastVectorHighlighter, so make sure those 3 term* 
attributes are turned on.
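In schema terms that means something like this (a sketch; the field and type
names are placeholders, not your schema):

<field name="content" type="text_jieba" indexed="true" stored="true"
       termVectors="true" termPositions="true" termOffsets="true"/>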

Scott Chu,scott@udngroup.com
2015/10/27 
- Original Message - 
From: Zheng Lin Edwin Yeo 
To: solr-user 
Date: 2015-10-23, 10:42:32
Subject: Re: Highlighting content field problem when using JiebaTokenizerFactory


Hi Scott,

Thank you for your response.

1. You said the problem only happens on the "contents" field, so maybe there's
something wrong with the contents of that field. Does it contain anything
special, e.g. HTML tags or symbols? I recall SOLR-42 mentions HTML stripping
causing a highlight problem. Maybe you can try purifying that field down to
near-pure text and see if the highlighting comes out ok.
*A) I checked, and SOLR-42 is about the HTMLStripWhiteSpaceTokenizerFactory,
which I'm not using. I believe that tokenizer is already deprecated too. I've
tried all kinds of content for rich-text documents, and all of them have the
same problem.*

2. Maybe something is incompatible between JiebaTokenizer and the Solr
highlighter. If you switch to other tokenizers, e.g. Standard, CJK,
SmartChinese (I don't use this since I am dealing with Traditional Chinese,
but I see you are dealing with Simplified Chinese), or the 3rd-party MMSeg,
see if the problem goes away. However, when I was googling for a similar
problem, I saw you asked the same question in August at Huaban/Jieba-analysis,
and somebody said he also uses JiebaTokenizer but doesn't have your problem.
So I see this as a less likely suspect.
*A) I was thinking about the incompatibility too, as I previously thought
that JiebaTokenizer was optimised for Solr 4.x, so it might have issues in
5.x. But the person from Huaban/Jieba-analysis said that he doesn't have this
problem in Solr 5.1. I also faced the same problem in Solr 5.1, and although
I'm using Solr 5.3.0 now, the same problem persists.*

I'm looking at the indexing process too, to see if there's any problem
there. But I just can't figure out why it only happens with JiebaTokenizer,
and only for the content field.


Regards,
Edwin


On 23 October 2015 at 09:41, Scott Chu  wrote:

> Hi Edwin,
>
> Since you've tested all my suggestions and the problem is still there, I

> can't think of anything wrong with your configuration. Now I can only
> suspect two things:
>
> 1. You said the problem only happens on the "contents" field, so maybe
> there's something wrong with the contents of that field. Does it contain
> anything special, e.g. HTML tags or symbols? I recall SOLR-42 mentions
> HTML stripping causing a highlight problem. Maybe you can try purifying
> that field down to near-pure text and see if the highlighting comes out ok.
>
> 2. Maybe something is incompatible between JiebaTokenizer and the Solr
> highlighter. If you switch to other tokenizers, e.g. Standard, CJK,
> SmartChinese (I don't use this since I am dealing with Traditional Chinese
> but I see you are dealing with Simplified Chinese), or 3rd-party MMSeg and
> see if the problem goes away. However when I'm googling similar problem, I
> saw you asked same question on August at Huaban/Jieba-analysis and somebody
> said he also uses JiebaTokenizer but he doesn't have your problem. So I see
> this could be less suspect.
>
> My theory of your problem: something in the indexing process produces wrong
> position info for that field, so when Solr does the highlighting it
> retrieves the wrong position info and marks the wrong positions of the
> highlight target terms.
>
> Scott Chu,scott@udngroup.com
> 2015/10/23
>
> - Original Message -
> *From: *Zheng Lin Edwin Yeo 
> *To: *solr-user 
> *Date: *2015-10-22, 22:22:14
> *Subject: *Re: Highlighting content field problem when using
> JiebaTokenizerFactory
>
> Hi Scott,
>
> Thank you for your response and suggestions.
>
> With respond to your questions, here are the answers:
>
> 1. I take a look at Jieba. It uses a dictionary and it seems to do a good
> job on CJK. I doubt this problem may be from those filters (note: I can
> understand you may use CJKWidthFilter to convert Japanese but doesn't
> understand why you use CJKBigramFilter and EdgeNGramFilter). Have you tried
> commenting out those filters, say leave only Jieba and StopFilter, and see
>
> if this problem disappears?
> *A) Yes, I have tried commenting out the other filters and only left with
> Jieba and StopFilter. The problem is still there.*
>
> 2.Does this problem occur only on Chinese search words? Does it happen on
> English search words?
> *A) Yes, the same problem occurs on English words. For example, when I
> search for "word", it will highligh

Re: Best strategy for indexing multiple tables with multiple fields

2015-10-26 Thread Walter Underwood
Most of the time, the best approach is to denormalize everything into one big 
virtual table. Think about making a view, where each row is one document in 
Solr. That row needs everything that will be searched and everything that will 
be displayed, but nothing else.

I’ve heard of installations with tens of thousands of fields. A thousand fields 
might be cumbersome, but it won’t break Solr.

If the tables contain different kinds of things, you might have different 
collections (one per kind of document), or one collection with a “type” field 
for each kind of document. 
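For example, one flattened document per source row, with the source table as
the type field (the field names here are made up):

{
  "id": "orders-12345",
  "table_s": "orders",
  "customer_name_txt": "St Jude Medical Inc",
  "order_date_dt": "2015-10-26T00:00:00Z",
  "order_total_f": 129.99
}

Prefixing the id with the table name keeps it unique across all 225 tables,
and table_s gives you fq=table_s:orders style filtering plus faceting.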

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Oct 26, 2015, at 4:08 PM, Daniel Valdivia  wrote:
> 
> Hi, I’m new to the solr world, I’m in need of some experienced advice as I 
> see I can do a lot of cool stuff with Solr, but I’m not sure which path to 
> take so I don’t shoot myself in the foot with all this power :P
> 
> I have several tables (225) in my application, which I’d like to add into a 
> single index (multiple types of documents in the same index with a unique id); 
> however, each table has a different number of columns, from 5 to 30. Do you 
> recommend indexing each column separately or joining all columns into a 
> single “big document”?
> 
> I’m trying to provide my users with a simple experience where they type their 
> search query in a simple search box and I list all the possible documents 
> across different tables that match their query, not sure if that strategy is 
> the best, or perhaps a core per table?
> 
> So far these are my considered strategies:
> 
> unique_id , table , megafield: All of the columns in the record get mixed 
> into a single megafield and indexed (cons: no faceting?)
> a core per table: Each table gets a core, all the fields get indexed (except 
> numbers and foreign keys), I’m not sure if having 200 cores will play nice 
> with Solr
> Single core, all fields get indexed ( possible 1,000’s of columns), this 
> sounds expensive and not so efficient to me
> 
> My application has around 2M records
> 
> Thanks in advance for any advice.
> 
> Cheers



Solr Code structure Documentation

2015-10-26 Thread G.Sarwar
Hi all,

I am new to Solr and I would like to understand how its search mechanism
works. I know it uses Apache Lucene as a backend, but how exactly do the
calls work from the Query page, which code/algorithm gets called, and how? I
have successfully configured it and am running it in my Eclipse by following
the instructions at this link:
https://wiki.apache.org/solr/HowToContribute

I have also looked into the appendix section (working with the Solr codebase)
of the "Solr in Action" book, and it's almost the same as the above link. Now
I need to know how exactly the code works on the backend for searching and
getting the results back (front-end to backend calls). I know that the Solr
admin frontend is based on AngularJS, but I couldn't find any
help/documentation about how it calls the Java backend to retrieve results.

Any help in this regard would be great. 

Many thanks,
Sarwar



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-Code-structure-Documentation-tp4236634.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr Code structure Documentation

2015-10-26 Thread Alexandre Rafalovitch
Well, the source code is all there, if you need to know _exactly_. Run it
under Debug. Run it under paid IntelliJ with Chronos if you will be doing
it a lot.

Same with the Admin UI's calls to Solr: just open the developer console in
the browser and you can see every web call, right when it happens.

Also, the javadoc has some documentation, e.g. on Lucene file formats. It is
part of the distribution, or you can use the copy on my solr-start.com site.

Regards,
Alex
On 26 Oct 2015 10:43 pm, "G.Sarwar"  wrote:

> Hi all,
>
> I am new at SOLR and i would like to understand how its searching mechanism
> is working. I know its using apache lucene as a backend but how the calls
> are exactly working from Query page and which code/algorithm is being
> called
> and how. I have sucessfully configured it and running it on my eclipse by
> following the insturctions on this link.
> https://wiki.apache.org/solr/HowToContribute
>
> I have also looked into the appendix section (working with solr codebase)
> of
> "SOLR in action" book and its almost the same as above link. Now i need to
> know how exactly the code is working on the backend for searching and
> getting the results back(front end to backend calls). I know that the solr
> frontend is based on Angular JS, but i couldn't find any help/documentation
> about how it's calling the Java backend for retrieving results back.
>
> Any help in this regard would be great.
>
> Many thanks,
> Sarwar
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Solr-Code-structure-Documentation-tp4236634.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: Highlighting content field problem when using JiebaTokenizerFactory

2015-10-26 Thread Scott Chu

Take a look at Michael McCandless's two articles; they might help you clarify 
the idea of highlighting in Solr:

Changing Bits: Lucene's TokenStreams are actually graphs!
http://blog.mikemccandless.com/2012/04/lucenes-tokenstreams-are-actually.html

Also take a look at the 4th paragraph in another of his articles:

Changing Bits: A new Lucene highlighter is born
http://blog.mikemccandless.com/2012/12/a-new-lucene-highlighter-is-born.html

Currently, I can't figure out the possible cause of your problem unless I get 
spare time to test it on my own, which I don't have these days (got some 
projects to close)!

If you find the solution or workaround, pls. let us know. Good luck again!

Scott Chu,scott@udngroup.com
2015/10/27 
- Original Message - 
From: Scott Chu 
To: solr-user 
Date: 2015-10-27, 10:27:45
Subject: Re: Highlighting content field problem when using JiebaTokenizerFactory


Hi Edwin,

I took some time to see if there's anything that can help you pin down the 
cause of your problem. Maybe this might help you a bit: 

[SOLR-4722] Highlighter which generates a list of query term position(s) for 
each item in a list of documents, or returns null if highlighting is disabled. 
- AS...
https://issues.apache.org/jira/browse/SOLR-4722

This one is modified from FastVectorHighlighter, so make sure those 3 term* 
attributes are turned on.

Scott Chu,scott@udngroup.com
2015/10/27 
- Original Message - 
From: Zheng Lin Edwin Yeo 
To: solr-user 
Date: 2015-10-23, 10:42:32
Subject: Re: Highlighting content field problem when using JiebaTokenizerFactory


Hi Scott,

Thank you for your response.

1. You said the problem only happens on the "contents" field, so maybe there's
something wrong with the contents of that field. Does it contain anything
special, e.g. HTML tags or symbols? I recall SOLR-42 mentions HTML stripping
causing a highlight problem. Maybe you can try purifying that field down to
near-pure text and see if the highlighting comes out ok.
*A) I checked, and SOLR-42 is about the HTMLStripWhiteSpaceTokenizerFactory,
which I'm not using. I believe that tokenizer is already deprecated too. I've
tried all kinds of content for rich-text documents, and all of them have the
same problem.*

2. Maybe something is incompatible between JiebaTokenizer and the Solr
highlighter. If you switch to other tokenizers, e.g. Standard, CJK,
SmartChinese (I don't use this since I am dealing with Traditional Chinese,
but I see you are dealing with Simplified Chinese), or the 3rd-party MMSeg,
see if the problem goes away. However, when I was googling for a similar
problem, I saw you asked the same question in August at Huaban/Jieba-analysis,
and somebody said he also uses JiebaTokenizer but doesn't have your problem.
So I see this as a less likely suspect.
*A) I was thinking about the incompatibility too, as I previously thought
that JiebaTokenizer was optimised for Solr 4.x, so it might have issues in
5.x. But the person from Huaban/Jieba-analysis said that he doesn't have this
problem in Solr 5.1. I also faced the same problem in Solr 5.1, and although
I'm using Solr 5.3.0 now, the same problem persists.*

I'm looking at the indexing process too, to see if there's any problem
there. But I just can't figure out why it only happens with JiebaTokenizer,
and only for the content field.


Regards,
Edwin


On 23 October 2015 at 09:41, Scott Chu  wrote:

> Hi Edwin,
>
> Since you've tested all my suggestions and the problem is still there, I

> can't think of anything wrong with your configuration. Now I can only
> suspect two things:
>
> 1. You said the problem only happens on the "contents" field, so maybe
> there's something wrong with the contents of that field. Does it contain
> anything special, e.g. HTML tags or symbols? I recall SOLR-42 mentions
> HTML stripping causing a highlight problem. Maybe you can try purifying
> that field down to near-pure text and see if the highlighting comes out ok.
>
> 2. Maybe something is incompatible between JiebaTokenizer and the Solr
> highlighter. If you switch to other tokenizers, e.g. Standard, CJK,
> SmartChinese (I don't use this since I am dealing with Traditional Chinese
> but I see you are dealing with Simplified Chinese), or 3rd-party MMSeg and
> see if the problem goes away. However when I'm googling similar problem, I
> saw you asked same question on August at Huaban/Jieba-analysis and somebody
> said he also uses JiebaTokenizer but he doesn't have your problem. So I see
> this could be less suspect.
>
> My theory of your problem: something in the indexing process produces wrong
> position info for that field, so when Solr does the highlighting it
> retrieves the wrong position info and marks the wrong positions of the
> highlight target terms.
>
> Scott Chu,scott@udngroup.com
> 2015/10/23
>
> - Original Message -
> *From: *Zheng Lin Edwin Yeo 
> *To: *solr-user 
> *Date: *2015-10-22, 22:22:14
> *Subject: *Re: Highlighting content field problem when using
>

Re: Best strategy for indexing multiple tables with multiple fields

2015-10-26 Thread Erick Erickson
Well, all I can say is "if you find yourself trying to do lots of
joins in Solr, go
back to the drawing board" ;).

Solr is a great search engine, but its ability to be used like an RDBMS
is...er...limited.
RDBMSs are great at what they do, but they make pretty rotten search engines.

Rather than think of this as porting your tables over to Solr, you
probably want to start by analyzing what you need the end Solr-powered app
to _do_ and design your Solr architecture from there.

If I were going to offer the single most important (IMO) piece of
advice, it'd be
"define the use cases, then extract data from your DB to support them". And
really push back on silly use-cases ;)

Best,
Erick

On Mon, Oct 26, 2015 at 7:35 PM, Walter Underwood  wrote:
> Most of the time, the best approach is to denormalize everything into one big 
> virtual table. Think about making a view, where each row is one document in 
> Solr. That row needs everything that will be searched and everything that 
> will be displayed, but nothing else.
>
> I’ve heard of installations with tens of thousands of fields. A thousand 
> fields might be cumbersome, but it won’t break Solr.
>
> If the tables contain different kinds of things, you might have different 
> collections (one per document), or one collection with a “type” field for 
> each kind of document.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
>
>> On Oct 26, 2015, at 4:08 PM, Daniel Valdivia  wrote:
>>
>> Hi, I’m new to the solr world, I’m in need of some experienced advice as I 
>> see I can do a lot of cool stuff with Solr, but I’m not sure which path to 
>> take so I don’t shoot myself in the foot with all this power :P
>>
>> I have several tables (225) in my application, which I’d like to add into a 
>> single index (multiple types of documents in the same index with a unique id); 
>> however, each table has a different number of columns, from 5 to 30. Do 
>> you recommend indexing each column separately or joining all columns into 
>> a single “big document”?
>>
>> I’m trying to provide my users with a simple experience where they type 
>> their search query in a simple search box and I list all the possible 
>> documents across different tables that match their query, not sure if that 
>> strategy is the best, or perhaps a core per table?
>>
>> So far these are my considered strategies:
>>
>> unique_id , table , megafield: All of the columns in the record get mixed 
>> into a single megafield and indexed (cons: no faceting?)
>> a core per table: Each table gets a core, all the fields get indexed (except 
>> numbers and foreign keys), I’m not sure if having 200 cores will play nice 
>> with Solr
>> Single core, all fields get indexed ( possible 1,000’s of columns), this 
>> sounds expensive and not so efficient to me
>>
>> My application has around 2M records
>>
>> Thanks in advance for any advice.
>>
>> Cheers
>


Re: Solr hard commit

2015-10-26 Thread Erick Erickson
You're really looking at this backwards. The MMapDirectory stuff is
for Solr (Lucene, really) _reading_ data from closed segment files.

When indexing, there are internal memory structures that are flushed
to disk on commit, but these have nothing to do with MMapDirectory.
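Those flushes happen on hard commits, and the usual knob is the autoCommit
block in solrconfig.xml, e.g. something like:

<autoCommit>
  <maxTime>60000</maxTime>           <!-- hard commit at most every 60 seconds -->
  <openSearcher>false</openSearcher> <!-- durability only; don't open a new searcher -->
</autoCommit>

The Lucidworks article you linked covers how to tune these.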

So the question is really moot ;)

Best,
Erick

On Mon, Oct 26, 2015 at 5:47 PM, Rallavagu  wrote:
> All,
>
> Are memory mapped files (mmap) flushed to disk during "hard commit"? If yes,
> should we disable OS level (Linux for example) memory mapped flush?
>
> I am referring to following for mmap files for Lucene/Solr
>
> http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html
>
> Linux level flush
>
> http://www.cyberciti.biz/faq/linux-stop-flushing-of-mmaped-pages-to-disk/
>
> Solr's hard and soft commit
>
> https://lucidworks.com/blog/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/
>
> Thanks in advance.


Re: Does docValues impact termfreq ?

2015-10-26 Thread Erick Erickson
Do be aware that docValues can only be used for non-text types,
i.e. numerics, strings and the like. Specifically, docValues are
_not_ possible for solr.textField and docValues don't support
analysis chains because the underlying primitive types don't. You'll
get an error if you try to specify docValues on a solr.TextField
type.
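In schema terms, e.g. (field names made up):

<!-- fine: docValues on a primitive type -->
<field name="term_s" type="string" indexed="true" stored="true" docValues="true"/>

<!-- fails at core load: docValues on an analyzed solr.TextField type -->
<field name="term_txt" type="text_general" indexed="true" stored="true" docValues="true"/>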

Does that change the discussion?

Best,
Erick

On Mon, Oct 26, 2015 at 7:36 AM, Emir Arnautovic
 wrote:
> Hi Aki,
> IMO this is underuse of Solr (not to mention SolrCloud). I would recommend
> doing in memory document parsin (if you need something from Lucene/Solr
> analysis classes, use it) and use some other cache like solution to store
> term/total frequency pairs (you can try Redis).
>
> That way you will have updatable, fast total frequency lookups.
>
> Thanks,
> Emir
>
> On 26.10.2015 14:43, Aki Balogh wrote:
>>
>> Hi Emir,
>>
>> This is correct. This is the only way we use the index.
>>
>> Thanks,
>> Aki
>>
>> On Mon, Oct 26, 2015 at 9:31 AM, Emir Arnautovic <
>> emir.arnauto...@sematext.com> wrote:
>>
>>> If I got it right, you are using term query, use function to get TF as
>>> score, iterate all documents in results and sum up total number of
>>> occurrences of specific term in index? Is this only way you use index or
>>> this is side functionality?
>>>
>>> Thanks,
>>> Emir
>>>
>>>
>>> On 24.10.2015 22:28, Aki Balogh wrote:
>>>
 Certainly, yes. I'm just doing a word count, ie how often does a
 specific
 term come up in the corpus?
 On Oct 24, 2015 4:20 PM, "Upayavira"  wrote:

 yes, but what do you want to do with the TF? What problem are you
>
> solving with it? If you are able to share that...
>
> On Sat, Oct 24, 2015, at 09:05 PM, Aki Balogh wrote:
>
>> Yes, sorry, I am not being clear.
>>
>> We are not even doing scoring, just getting the raw TF values. We're
>> doing
>> this in solr because it can scale well.
>>
>> But with large corpora, retrieving the word counts takes some time, in
>> part
>> because solr is splitting up word count by document and generating a
>> large
>> request. We then get the request and just sum it all up. I'm wondering
>> if
>> there's a more direct way.
>> On Oct 24, 2015 4:00 PM, "Upayavira"  wrote:
>>
>> Can you explain more what you are using TF for? Because it sounds
>> rather
>> like scoring. You could disable field norms and IDF and scoring would
>> be
>> mostly TF, no?
>>>
>>> Upayavira
>>>
>>> On Sat, Oct 24, 2015, at 07:28 PM, Aki Balogh wrote:
>>>
 Thanks, let me think about that.

 We're using termfreq to get the TF score, but we don't know which

>>> term
>>
>> we'll need the TF for. So we'd have to do a corpuswide summing of

 termfreq
 for each potential term across all documents in the corpus. It seems

>>> like
>>
>> it'd require some development work to compute that, and our code
>>>
>>> would be
>>
>> fragile.

 Let me think about that more.

 It might make sense to just move to solrcloud, it's the right
 architectural
 decision anyway.


 On Sat, Oct 24, 2015 at 1:54 PM, Upayavira  wrote:

 If you just want word length, then do work during indexing - index
 a
>>
>> field for the word length. Then, I believe you can do faceting -

 e.g.
>>
>> with the json faceting API I believe you can do a sum()

 calculation on
>>
>> a

 field rather than the more traditional count.
>
> Thinking aloud, there might be an easier way - index a field that
>
 is
>>
>> the

 same for all documents, and facet on it. Instead of counting the
 number
>>
>> of documents, calculate the sum() of your word count field.
>
> I *think* that should work.
>
> Upayavira
>
> On Sat, Oct 24, 2015, at 04:24 PM, Aki Balogh wrote:
>
>> Hi Jack,
>>
>> I'm just using solr to get word count across a large number of
>>
> documents.

 It's somewhat non-standard, because we're ignoring relevance,
>
> but it
>>
>> seems
>>
>> to work well for this use case otherwise.
>>
>> My understanding then is:
>> 1) since termfreq is pre-processed and fetched, there's no good
>>
> way
>>
>> to

 speed it up (except by caching earlier calculations)
>>
>> 2) there's no way to have solr sum up all of the termfreqs
>>
> across all
>>
>> documents in a search and just return one number for total
>
> termfreqs
>>
>> Are the

Re: Solr hard commit

2015-10-26 Thread Rallavagu
Erick, thanks for the clarification. I was under the impression that 
MMapDirectory was used for both read and write operations. Now I see 
how it is being used. Essentially, Solr only reads via MMapDirectory and 
writes directly to disk. So are the updated file(s) on disk 
automatically read into memory, since they are memory mapped?


On 10/26/15 8:43 PM, Erick Erickson wrote:

You're really looking at this backwards. The MMapDirectory stuff is
for Solr (Lucene, really) _reading_ data from closed segment files.

When indexing, there are internal memory structures that are flushed
to disk on commit, but these have nothing to do with MMapDirectory.

So the question is really moot ;)

Best,
Erick

On Mon, Oct 26, 2015 at 5:47 PM, Rallavagu  wrote:

All,

Are memory mapped files (mmap) flushed to disk during "hard commit"? If yes,
should we disable OS level (Linux for example) memory mapped flush?

I am referring to following for mmap files for Lucene/Solr

http://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html

Linux level flush

http://www.cyberciti.biz/faq/linux-stop-flushing-of-mmaped-pages-to-disk/

Solr's hard and soft commit

https://lucidworks.com/blog/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/

Thanks in advance.


Re: Two seperate intance of Solr on the same machine

2015-10-26 Thread Jack Krupansky
Each instance should be installed in a separate directory. IOW, don't try
running multiple Solr processes for the same data.

-- Jack Krupansky

On Mon, Oct 26, 2015 at 1:33 PM, Steven White  wrote:

> Hi,
>
> For reasons I have no control over, I'm required to run 2 (maybe more)
> instances of Solr on the same server (Windows and Linux).  To be more
> specific, I will need to start each instance like so:
>
>   > solr\bin start -p 8983 -s ..\instance_one
>   > solr\bin start -p 8984 -s ..\instance_two
>   > solr\bin start -p 8985 -s ..\instance_three
>
> Each of those instances is a stand alone Solr (no ZK here at all).
>
> I have tested this over and over and did not see any issue.  However, I did
> notice that each instance is writing to the same solr\server\logs\ files
> (will this be an issue?!!)
>
> Is the above something I should avoid?  If so, why?
>
> Thanks in advance!!
>
> Steve
>