A problem of tracking the commits of Lucene using SHA num

2017-11-09 Thread TOM
Thanks for your patience and help.

 Recently, I acquired a batch of commit SHA data for Lucene, spanning
2010 to 2015. To retrieve the original commit info, I tried to use these
SHAs to look up the commits. First, I cloned the Lucene repository to my local
host with the command git clone https://github.com/apache/lucene-solr.git.
Then, I used git show [commit SHA] to view each commit's history record, but it
failed with CMD output like this:

>> git show be5672c0c242d658b7ce36f291b74c344de925c7

>> fatal: bad object be5672c0c242d658b7ce36f291b74c344de925c7

 

After that, I cloned another mirror of Apache Lucene & Solr 
(https://github.com/mdodsworth/lucene-solr, last updated 2014-08-30), 
and got the expected record, like this:



Moreover, I tried to track a commit using its commit-message title. However, for 
the same commit, e.g. "LUCENE-5909: Fix stupid bug", I found different SHAs in the 
two mirror repositories above 
(https://github.com/apache/lucene-solr/commit/3c0d111d07184e96a73ca6dc05c6227d839724e2
 and 
https://github.com/mdodsworth/lucene-solr/commit/4bc8dde26371627d11c299f65c399ecb3240a34c),
 which confused me.

In summary: 1) has the method used to generate commit SHAs ever changed? 
2) since the second mirror repository stopped updating in 2014, how can I 
track all of the commits in my dataset?
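
For reference, these are the kinds of checks that can be run in a local clone
(the SHA and message below are just the ones quoted above):

# verify that an object with this SHA exists in the clone
git cat-file -e be5672c0c242d658b7ce36f291b74c344de925c7 && echo present

# search all branches for a commit by its message
git log --all --oneline --grep='LUCENE-5909'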

 

Thanks so much!


Re: A problem of tracking the commits of Lucene using SHA num

2017-11-20 Thread TOM
Dear Shawn and Chris,
Thanks very much for your replies and help.
And sorry for my mistakes as a first-time user of the mailing lists.

On 11/9/2017 5:13 PM, Shawn wrote:
> Where did this information originate?

My SHA data come from the paper "On the Naturalness of Buggy Code" (Baishakhi Ray 
et al., ICSE '16), downloaded from
http://odd-code.github.io/Data.html.


On 11/9/2017 6:10 PM, Chris wrote:
> Also -- What exactly are you trying to do? what is your objective?

I want to analyze the statistical properties of buggy code with some
learning models on Ray's experimental dataset. Because of its large size,
Ray did not put the entire dataset online. What I could acquire is a batch
of commit SHAs and some other info. So, I need to pick out
the old commits that correspond to these SHAs.


On 17/9/2017 1:47 PM, Shawn wrote:
> The commit data you're using is nearly useless, because the repository
> where it originated has been gone for nearly two years. If you can find
> out how it was generated, you can build a new version from the current
> repository -- either on github or from Apache's official servers.


Thanks for all of your suggestions and help; I am going to try other approaches.
Thanks so much.
 
Best,
Xian

dataimport handler

2014-01-22 Thread tom
Hi,
I am trying to use the DataImportHandler (Solr 4.6) with an Oracle database, but I
have some issues mapping the data.
I have 3 columns in the test_table,
 column1,
 column2,
 id

dataconfig.xml

  


   


The issue is:
- if I remove the id column from the table, indexing fails; Solr looks for an
id column even though it is not mapped in dataconfig.xml.
- if I add it, Solr maps the id column from the db directly to the Solr id and
ignores column1, even though it is mapped.

My problem is that I don't have an ID in every table; I need to be able to map the
column I choose from the table to the Solr id.  Any solution will be greatly
appreciated.
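
For reference, the data-config.xml is roughly of this shape (a sketch; the real
file was stripped by the mail archive, and the connection details are
placeholders). The intent is to map a column of my choosing onto the Solr id:

  <dataConfig>
    <dataSource name="db" driver="oracle.jdbc.OracleDriver"
                url="jdbc:oracle:thin:@host:1521:sid" user="user" password="pass"/>
    <document>
      <entity name="test" dataSource="db"
              query="select column1, column2 from test_table">
        <field column="COLUMN1" name="id"/>
        <field column="COLUMN2" name="column2"/>
      </entity>
    </document>
  </dataConfig>

(Note that Oracle's JDBC driver usually reports column names in upper case,
which may matter for the column= attribute.)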

`Tom






TemplateTransformer returns null values

2014-01-30 Thread tom
Hi,
I am trying a simple transformer on data input using DIH, Solr 4.6. When I
run the query below during DIH, I get null values for new_url. What is wrong?
I even tried with "${document_solr.id}".

the name is 

data-config.xml:




   






Below is the log trace:
8185946 [Thread-29] INFO  org.apache.solr.search.SolrIndexSearcher  -
Opening Searcher@5a5f4cb7 realtime
8185960 [Thread-29] INFO  org.apache.solr.handler.dataimport.JdbcDataSource 
- Creating a connection for entity document_solr with URL:
jdbc:oracle:thin:@vluedb01:1521:iedwdev
8186225 [Thread-29] INFO  org.apache.solr.handler.dataimport.JdbcDataSource 
- Time taken for getConnection(): 265
8186226 [Thread-29] DEBUG org.apache.solr.handler.dataimport.JdbcDataSource 
- Executing SQL: select DOC_IDN as id, BILL_IDN as bill_id from
document_solr
8186291 [Thread-29] TRACE org.apache.solr.handler.dataimport.JdbcDataSource 
- Time taken for sql: 64
8186301 [Thread-29] DEBUG org.apache.solr.handler.dataimport.LogTransformer 
- The name is
8186303 [Thread-29] DEBUG org.apache.solr.handler.dataimport.LogTransformer 
- The name is
8186303 [Thread-29] DEBUG org.apache.solr.handler.dataimport.LogTransformer 
- The name is
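
For context, the entity definition is roughly of this shape (a sketch; the real
data-config.xml was stripped by the mail archive, and the URL template here is
only illustrative):

  <entity name="document_solr"
          transformer="TemplateTransformer,LogTransformer"
          query="select DOC_IDN as id, BILL_IDN as bill_id from document_solr"
          logTemplate="the name is ${document_solr.id}" logLevel="debug">
    <!-- TemplateTransformer is meant to build new_url from columns of this row -->
    <field column="new_url" template="http://example.com/docs/${document_solr.id}"/>
  </entity>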


`Tom






Re: TemplateTransformer returns null values

2014-01-30 Thread tom
Thanks Alexandre for the quick response.

I tried both ways but I still get null values. Is there anything I am doing
fundamentally wrong?
 
query="select DOC_IDN, BILL_IDN from document_fact" >


and

query="select DOC_IDN as id ,BILL_IDN as bill_id from document_fact" >
   






Faceting return value of a function query?

2014-11-03 Thread Tom
Hi,

I'm new to Solr, and I'm having a problem with faceting. I would really
appreciate it if you could help :)

I have a set of documents in JSON format, which I could post to my Solr
core using the post.jar tool. Each document contains two fields, namely
"startDate" and "endDate", both of which are of type "date".

Conceptually, I would like to have a third field "timeSpan" that is
automatically generated from the return value of the function query
"ms(endDate, startDate)", and to do a range facet on it, i.e. compute the
distribution of "timeSpan" among either all of the documents or a filtered
subset of them.

I have tried to find ways both of directly faceting on the function's return
values and of automatically generating the "timeSpan" field during indexing,
but without luck so far.

Suggestions are greatly appreciated!

Best,
Yubing


Re: Faceting return value of a function query?

2014-11-03 Thread Tom
Hi Erik,

Thanks for the reply! Do you mean parse and modify the documents before
sending them to Solr?

Cheers,
Yubing

On Mon, Nov 3, 2014 at 8:48 PM, Erick Erickson 
wrote:

> Wouldn't it be easiest to compute the span at index time? Then it's
> very straight-forward.
>
> Best,
> Erick
>
> On Mon, Nov 3, 2014 at 8:18 PM, Yubing (Tom) Dong 董玉冰
>  wrote:
> > Hi,
> >
> > I'm new to Solr, and I'm having a problem with faceting. I would really
> > appreciate it if you could help :)
> >
> > I have a set of documents in JSON format, which I could post to my Solr
> > core using the post.jar tool. Each document contains two fields, namely
> > "startDate" and "endDate", both of which are of type "date".
> >
> > Conceptually, I would like to have a third field "timeSpan" that is
> > automatically generated from the return value of function query
> > "ms(endDate, startDate)", and do range facet on it, i.e. compute the
> > distribution of "timeSpan", among either all of or a filtered subset of
> the
> > documents.
> >
> > I have tried to find ways of both directly faceting the function return
> > values and automatically generate the "timeSpan" field during indexing,
> but
> > without luck yet.
> >
> > Suggestions are greatly appreciated!
> >
> > Best,
> > Yubing
>


Re: Faceting return value of a function query?

2014-11-03 Thread Tom
I see. Thank you! :-)

Sent from my Android phone
On Nov 3, 2014 9:35 PM, "Erick Erickson"  wrote:

> Yep. It's almost always easier and faster if you can pre-compute as
> much as possible during indexing time. It'll take longer to   index of
> course, but the ratio of writing to the index to searching is usually
> hugely in favor of doing the work during indexing.
>
> Best,
> Erick
>
> On Mon, Nov 3, 2014 at 8:52 PM, Yubing (Tom) Dong 董玉冰
>  wrote:
> > Hi Erik,
> >
> > Thanks for the reply! Do you mean parse and modify the documents before
> > sending them to Solr?
> >
> > Cheers,
> > Yubing
> >
> > On Mon, Nov 3, 2014 at 8:48 PM, Erick Erickson 
> > wrote:
> >
> >> Wouldn't it be easiest to compute the span at index time? Then it's
> >> very straight-forward.
> >>
> >> Best,
> >> Erick
> >>
> >> On Mon, Nov 3, 2014 at 8:18 PM, Yubing (Tom) Dong 董玉冰
> >>  wrote:
> >> > Hi,
> >> >
> >> > I'm new to Solr, and I'm having a problem with faceting. I would
> really
> >> > appreciate it if you could help :)
> >> >
> >> > I have a set of documents in JSON format, which I could post to my
> Solr
> >> > core using the post.jar tool. Each document contains two fields,
> namely
> >> > "startDate" and "endDate", both of which are of type "date".
> >> >
> >> > Conceptually, I would like to have a third field "timeSpan" that is
> >> > automatically generated from the return value of function query
> >> > "ms(endDate, startDate)", and do range facet on it, i.e. compute the
> >> > distribution of "timeSpan", among either all of or a filtered subset
> of
> >> the
> >> > documents.
> >> >
> >> > I have tried to find ways of both directly faceting the function
> return
> >> > values and automatically generate the "timeSpan" field during
> indexing,
> >> but
> >> > without luck yet.
> >> >
> >> > Suggestions are greatly appreciated!
> >> >
> >> > Best,
> >> > Yubing
> >>
>


Re: Faceting return value of a function query?

2014-11-05 Thread Tom
Turns out that update processors perfectly suit my needs. I ended up using
the StatelessScriptUpdateProcessor with a simple js script :-)
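
For anyone who finds this thread later, a minimal sketch of that approach (the
chain name, script name, and exact field handling are only illustrative, not my
exact config). In solrconfig.xml:

  <updateRequestProcessorChain name="add-timespan">
    <processor class="solr.StatelessScriptUpdateProcessorFactory">
      <str name="script">timespan.js</str>
    </processor>
    <processor class="solr.LogUpdateProcessorFactory"/>
    <processor class="solr.RunUpdateProcessorFactory"/>
  </updateRequestProcessorChain>

and a timespan.js in the conf directory along the lines of:

  function processAdd(cmd) {
    var doc = cmd.solrDoc;  // the SolrInputDocument being added
    var start = doc.getFieldValue("startDate");
    var end = doc.getFieldValue("endDate");
    if (start != null && end != null) {
      // assumes the incoming values are already Date objects; if they arrive
      // as strings, parse them first
      doc.setField("timeSpan", end.getTime() - start.getTime());
    }
  }

The chain is selected on the update request with update.chain=add-timespan, and
the resulting timeSpan field can then be range-faceted like any other numeric
field.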

On Mon Nov 03 2014 at 10:40:52 PM Yubing (Tom) Dong 董玉冰 <
tom.tung@gmail.com> wrote:

> I see. Thank you! :-)
>
> Sent from my Android phone
> On Nov 3, 2014 9:35 PM, "Erick Erickson"  wrote:
>
>> Yep. It's almost always easier and faster if you can pre-compute as
>> much as possible during indexing time. It'll take longer to   index of
>> course, but the ratio of writing to the index to searching is usually
>> hugely in favor of doing the work during indexing.
>>
>> Best,
>> Erick
>>
>> On Mon, Nov 3, 2014 at 8:52 PM, Yubing (Tom) Dong 董玉冰
>>  wrote:
>> > Hi Erik,
>> >
>> > Thanks for the reply! Do you mean parse and modify the documents before
>> > sending them to Solr?
>> >
>> > Cheers,
>> > Yubing
>> >
>> > On Mon, Nov 3, 2014 at 8:48 PM, Erick Erickson > >
>> > wrote:
>> >
>> >> Wouldn't it be easiest to compute the span at index time? Then it's
>> >> very straight-forward.
>> >>
>> >> Best,
>> >> Erick
>> >>
>> >> On Mon, Nov 3, 2014 at 8:18 PM, Yubing (Tom) Dong 董玉冰
>> >>  wrote:
>> >> > Hi,
>> >> >
>> >> > I'm new to Solr, and I'm having a problem with faceting. I would
>> really
>> >> > appreciate it if you could help :)
>> >> >
>> >> > I have a set of documents in JSON format, which I could post to my
>> Solr
>> >> > core using the post.jar tool. Each document contains two fields,
>> namely
>> >> > "startDate" and "endDate", both of which are of type "date".
>> >> >
>> >> > Conceptually, I would like to have a third field "timeSpan" that is
>> >> > automatically generated from the return value of function query
>> >> > "ms(endDate, startDate)", and do range facet on it, i.e. compute the
>> >> > distribution of "timeSpan", among either all of or a filtered subset
>> of
>> >> the
>> >> > documents.
>> >> >
>> >> > I have tried to find ways of both directly faceting the function
>> return
>> >> > values and automatically generate the "timeSpan" field during
>> indexing,
>> >> but
>> >> > without luck yet.
>> >> >
>> >> > Suggestions are greatly appreciated!
>> >> >
>> >> > Best,
>> >> > Yubing
>> >>
>>
>


possible spellcheck bug in 3.5 causing erroneous suggestions

2012-03-22 Thread tom

hi folks,

i think i found a bug in the spellchecker but am not quite sure:
this is the query i send to solr:

http://lh:8983/solr/CompleteIndex/select?
&rows=0
&echoParams=all
&spellcheck=true
&spellcheck.onlyMorePopular=true
&spellcheck.extendedResults=no
&q=a+bb+ccc++

and this is the result (the response XML tags were stripped by the mail
archive; the values that remain, grouped, are):

responseHeader:  0, 4
echoed params:   all, true, all, no, "a bb ccc ", 0, true
suggestions:
  1, 2, 4    ->  abb
  1, 5, 8    ->  ccc
  1, 5, 8    ->  ccc
  1, 10, 14  ->  dvd


now, i know this is just a synthetic query; i ran it for a test regarding
suggestions, and i discovered the oddity just by chance; it is unrelated to
the test itself. my question is how the suggestions 1 and 2 come
about: from what i understand from the wiki, the entries in
spellcheck/suggestions should only be (misspelled) substrings of the user query.


the setup/context is thus:
- the words a ccc exists 11 times in the index but 1 and 2 dont

http://lh:8983/solr/CompleteIndex/terms?terms=on&terms.fl=spell&terms.prefix=ccc&terms.mincount=0 

status=0, QTime=1, ccc=11
-  analyzer for the spellchecker yields the terms as entered, i.e. 
a|bb|ccc|

-  the config is thus



textSpell


default
spell
./spellchecker




does anyone have a clue what's going on?



Re: possible spellcheck bug in 3.5 causing erroneous suggestions

2012-03-22 Thread tom

same

On 22.03.2012 10:00, Markus Jelsma wrote:

Can you try spellcheck.q ?


On Thu, 22 Mar 2012 09:57:19 +0100, tom  wrote:

hi folks,

i think i found a bug in the spellchecker but am not quite sure:
this is the query i send to solr:

http://lh:8983/solr/CompleteIndex/select?
&rows=0
&echoParams=all
&spellcheck=true
&spellcheck.onlyMorePopular=true
&spellcheck.extendedResults=no
&q=a+bb+ccc++

and this is the result:




0
4

all
true
all
no
a bb ccc 
0
true






1
2
4

abb



1
5
8

ccc



1
5
8

ccc



1
10
14

dvd






now, i know  this is just a technical query and i have done it for a
test regarding suggestions and i discovered the oddity just by chance
and was not regarding the test i did:
my question is regarding, how the suggestions 1 and 2 come
about. from what i understand from the wiki, that the entries in
spellcheck/suggestions are only (misspelled) substrings from the user
query.

the setup/context is thus:
- the words a ccc exists 11 times in the index but 1 and 2 dont


http://lh:8983/solr/CompleteIndex/terms?terms=on&terms.fl=spell&terms.prefix=ccc&terms.mincount=0 




0111
-  analyzer for the spellchecker yields the terms as entered, i.e.
a|bb|ccc|
-  the config is thus



textSpell


default
spell
./spellchecker




does anyone have a clue what's going on?







Re: possible spellcheck bug in 3.5 causing erroneous suggestions

2012-03-27 Thread tom

so, does anyone have a clue what might be going wrong?

or do i have to debug it myself and post a jira issue?

PS: unfortunately i cant give anyone the index for testing due to NDA.

cheers

On 22.03.2012 10:17, tom wrote:

same

On 22.03.2012 10:00, Markus Jelsma wrote:

Can you try spellcheck.q ?


On Thu, 22 Mar 2012 09:57:19 +0100, tom  wrote:

hi folks,

i think i found a bug in the spellchecker but am not quite sure:
this is the query i send to solr:

http://lh:8983/solr/CompleteIndex/select?
&rows=0
&echoParams=all
&spellcheck=true
&spellcheck.onlyMorePopular=true
&spellcheck.extendedResults=no
&q=a+bb+ccc++

and this is the result:




0
4

all
true
all
no
a bb ccc 
0
true






1
2
4

abb



1
5
8

ccc



1
5
8

ccc



1
10
14

dvd






now, i know  this is just a technical query and i have done it for a
test regarding suggestions and i discovered the oddity just by chance
and was not regarding the test i did:
my question is regarding, how the suggestions 1 and 2 come
about. from what i understand from the wiki, that the entries in
spellcheck/suggestions are only (misspelled) substrings from the user
query.

the setup/context is thus:
- the words a ccc exists 11 times in the index but 1 and 2 dont


http://lh:8983/solr/CompleteIndex/terms?terms=on&terms.fl=spell&terms.prefix=ccc&terms.mincount=0 




0111
-  analyzer for the spellchecker yields the terms as entered, i.e.
a|bb|ccc|
-  the config is thus



textSpell


default
spell
./spellchecker




does anyone have a clue what's going on?









solrj and replication

2012-06-20 Thread tom

hi,

i was just wondering if i need to do smth special if i want to have an 
embedded slave to get replication working ?


my setup is like so:
- in my clustered application that uses embedded solr(j) (for 
performance). the cores are configured as slaves that should connect to 
a master which runs in a jetty.

- the embedded codes dont expose any of the solr servlets

note: that the slave config, if started in jetty, does proper 
replication, while when embedded it doesnt.


using solr 3.5
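
for reference, the slave side is the usual replication handler config, roughly
like this (master url and poll interval here are just examples, not my exact
config):

<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="slave">
    <str name="masterUrl">http://master-host:8983/solr/corename/replication</str>
    <str name="pollInterval">00:00:60</str>
  </lst>
</requestHandler>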

thx

tom


Re: solrj and replication

2012-06-21 Thread tom
ok, tested it myself and a slave running embedded works, just not within 
my application -- yet...


On 20.06.2012 18:14, tom wrote:

hi,

i was just wondering if i need to do smth special if i want to have an 
embedded slave to get replication working ?


my setup is like so:
- in my clustered application that uses embedded solr(j) (for 
performance). the cores are configured as slaves that should connect 
to a master which runs in a jetty.

- the embedded codes dont expose any of the solr servlets

note: that the slave config, if started in jetty, does proper 
replication, while when embedded it doesnt.


using solr 3.5

thx

tom






suggester/autocomplete locks file preventing replication

2012-06-21 Thread tom

hi,

i'm using the suggester with a file like so:

  

  suggest
  name="classname">org.apache.solr.spelling.suggest.Suggester
  name="lookupImpl">org.apache.solr.spelling.suggest.fst.FSTLookup

  
  
  
  content
  0.05
  true
  100
  autocomplete.dictionary

  

when trying to replicate i get the following error message on the slave 
side:


 2012-06-21 14:34:50,781 ERROR 
[pool-3-thread-1  ] 
handler.ReplicationHandler- SnapPull failed
org.apache.solr.common.SolrException: Unable to rename:   
autocomplete.dictionary.20120620120611
at 
org.apache.solr.handler.SnapPuller.copyTmpConfFiles2Conf(SnapPuller.java:642)
at 
org.apache.solr.handler.SnapPuller.downloadConfFiles(SnapPuller.java:526)
at 
org.apache.solr.handler.SnapPuller.fetchLatestIndex(SnapPuller.java:299)
at 
org.apache.solr.handler.ReplicationHandler.doFetch(ReplicationHandler.java:268)

at org.apache.solr.handler.SnapPuller$1.run(SnapPuller.java:159)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
at 
java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:317)

at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:150)
at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$101(ScheduledThreadPoolExecutor.java:98)
at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.runPeriodic(ScheduledThreadPoolExecutor.java:181)
at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:205)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:885)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:907)

at java.lang.Thread.run(Thread.java:619)

so i dug around and found out that solr's java process holds a 
lock on the autocomplete.dictionary file. any reason why this is so?


thx,

running:
solr 3.5
win7


Re: suggester/autocomplete locks file preventing replication

2012-06-21 Thread tom

BTW: a core unload doesnt release the lock either ;(


On 21.06.2012 14:39, tom wrote:

hi,

i'm using the suggester with a file like so:

  

  suggest
  name="classname">org.apache.solr.spelling.suggest.Suggester
  name="lookupImpl">org.apache.solr.spelling.suggest.fst.FSTLookup

  
  
  
  content
  0.05
  true
  100
  autocomplete.dictionary

  

when trying to replicate i get the following error message on the 
slave side:


 2012-06-21 14:34:50,781 ERROR 
[pool-3-thread-1  ] 
handler.ReplicationHandler- SnapPull failed
org.apache.solr.common.SolrException: Unable to rename:   
autocomplete.dictionary.20120620120611
at 
org.apache.solr.handler.SnapPuller.copyTmpConfFiles2Conf(SnapPuller.java:642)
at 
org.apache.solr.handler.SnapPuller.downloadConfFiles(SnapPuller.java:526)
at 
org.apache.solr.handler.SnapPuller.fetchLatestIndex(SnapPuller.java:299)
at 
org.apache.solr.handler.ReplicationHandler.doFetch(ReplicationHandler.java:268)

at org.apache.solr.handler.SnapPuller$1.run(SnapPuller.java:159)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
at 
java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:317)

at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:150)
at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$101(ScheduledThreadPoolExecutor.java:98)
at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.runPeriodic(ScheduledThreadPoolExecutor.java:181)
at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:205)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:885)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:907)

at java.lang.Thread.run(Thread.java:619)

so i dug around it and found out that the solr's java process holds a 
lock on the autocomplete.dictionary file. any reason why this is so?


thx,

running:
solr 3.5
win7






Re: suggester/autocomplete locks file preventing replication

2012-06-21 Thread tom

poking into the code, i think the FileDictionary class is the culprit:
it takes an InputStream as a ctor argument but never releases the 
stream. what puzzles me is that the class seems to allow only a one-time 
iteration, after which the stream is useless, unless i'm missing smth here.


is there a good reason for this or rather a bug?
should i move the topic to the dev list?


On 21.06.2012 14:49, tom wrote:

BTW: a core unload doesnt release the lock either ;(


On 21.06.2012 14:39, tom wrote:

hi,

i'm using the suggester with a file like so:

  

  suggest
  name="classname">org.apache.solr.spelling.suggest.Suggester
  name="lookupImpl">org.apache.solr.spelling.suggest.fst.FSTLookup

  
  
  
  content
  0.05
  true
  100
  autocomplete.dictionary

  

when trying to replicate i get the following error message on the 
slave side:


 2012-06-21 14:34:50,781 ERROR 
[pool-3-thread-1  ] 
handler.ReplicationHandler- SnapPull failed
org.apache.solr.common.SolrException: Unable to rename:   
autocomplete.dictionary.20120620120611
at 
org.apache.solr.handler.SnapPuller.copyTmpConfFiles2Conf(SnapPuller.java:642)
at 
org.apache.solr.handler.SnapPuller.downloadConfFiles(SnapPuller.java:526)
at 
org.apache.solr.handler.SnapPuller.fetchLatestIndex(SnapPuller.java:299)
at 
org.apache.solr.handler.ReplicationHandler.doFetch(ReplicationHandler.java:268)

at org.apache.solr.handler.SnapPuller$1.run(SnapPuller.java:159)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
at 
java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:317)

at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:150)
at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$101(ScheduledThreadPoolExecutor.java:98)
at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.runPeriodic(ScheduledThreadPoolExecutor.java:181)
at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:205)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:885)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:907)

at java.lang.Thread.run(Thread.java:619)

so i dug around it and found out that the solr's java process holds a 
lock on the autocomplete.dictionary file. any reason why this is so?


thx,

running:
solr 3.5
win7










Fwd: suggester/autocomplete locks file preventing replication

2012-06-22 Thread tom

FYI
matter has been dealt with on the dev list


 Original Message 
Subject:Re: Re: suggester/autocomplete locks file preventing replication
Date:   Fri, 22 Jun 2012 12:16:35 +0200
From:   Simon Willnauer 
Reply-To:   d...@lucene.apache.org, simon.willna...@gmail.com
To: d...@lucene.apache.org



here is the issue https://issues.apache.org/jira/browse/SOLR-3570

On Fri, Jun 22, 2012 at 11:55 AM, Simon Willnauer 
mailto:simon.willna...@googlemail.com>> 
wrote:




   On Fri, Jun 22, 2012 at 11:47 AM, Simon Willnauer
   mailto:simon.willna...@googlemail.com>> wrote:



   On Fri, Jun 22, 2012 at 10:37 AM, tom mailto:dev.tom.men...@gmx.net>> wrote:

   cross posting this issue to the dev list in the hope to get
   a response here...


   I think you are right. Closing the Stream / Reader is the
   responsibility of the caller not the FileDictionary IMO but solr
   doesn't close it so that might cause your problems. Are you
   running on windows by any chance?
   I will create an issue and fix it.


   hmm I just looked at it and I see a IOUtils.close call in FileDictionary

   
https://svn.apache.org/repos/asf/lucene/dev/branches/lucene_solr_3_6/lucene/contrib/spellchecker/src/java/org/apache/lucene/search/suggest/FileDictionary.java

   are you using solr 3.6?


   simon



    Original Message 
   Subject: Re: suggester/autocomplete locks file preventing
   replication
   Date:Thu, 21 Jun 2012 17:11:40 +0200
       From:tom 
   <mailto:dev.tom.men...@gmx.net>
   Reply-To:solr-user@lucene.apache.org
   <mailto:solr-user@lucene.apache.org>
   To:  solr-user@lucene.apache.org
   <mailto:solr-user@lucene.apache.org>



   pocking into the code i think the FileDictionary class is the 
culprit:
   It takes an InputStream as a ctor argument but never releases the
   stream. what puzzles me is that the class seems to allow a one-time
   iteration and then the stream is useless, unless i'm missing smth. 
here.

   is there a good reason for this or rather a bug?
   should i move the topic to the dev list?


   On21.06.2012 14  :49, tom wrote:
   > BTW: a core unload doesnt release the lock either ;(
   >
   >
   > On21.06.2012 14  :39, tom wrote:
   >> hi,
   >>
   >> i'm using the suggester with a file like so:
   >>
   >>   
   >> 
   >>   suggest
   >>   > name="classname">org.apache.solr.spelling.suggest.Suggester
   >>   > 
name="lookupImpl">org.apache.solr.spelling.suggest.fst.FSTLookup
   >>   
   >>   
   >>   
   >>   content
   >>   0.05
   >>   true
   >>   100
   >>   autocomplete.dictionary
   >> 
   >>   
   >>
   >> when trying to replicate i get the following error message on the
   >> slave side:
   >>
   >>  2012-06-21 14:34:50,781 ERROR
   >> [pool-3-thread-1  ]
   >> handler.ReplicationHandler- SnapPull failed
   >> org.apache.solr.common.SolrException: Unable to rename: 
   >> autocomplete.dictionary.20120620120611
   >> at
   >> 
org.apache.solr.handler.SnapPuller.copyTmpConfFiles2Conf(SnapPuller.java:642)
   >> at
   >> 
org.apache.solr.handler.SnapPuller.downloadConfFiles(SnapPuller.java:526)
   >> at
   >> 
org.apache.solr.handler.SnapPuller.fetchLatestIndex(SnapPuller.java:299)
   >> at
   >> 
org.apache.solr.handler.ReplicationHandler.doFetch(ReplicationHandler.java:268)
   >> at 
org.apache.solr.handler.SnapPuller$1.run(SnapPuller.java:159)
   >> at
   >> 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
   >> at
   >> 
java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:317)
   >> at 
java.util.concurrent.FutureTask.runAndReset(FutureTask.java:150)
   >> at
   >> 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$101(ScheduledThreadPoolExecutor.java:98)
   >> at
   >> 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.runPeriodic(ScheduledThreadPoolE

Re: Restrict access to localhost

2010-12-03 Thread Tom

If you are using another app to create the index, I think you can remove the
update servlet mapping in the web.xml.


copyField for big indexes

2011-08-22 Thread Tom
Is it a good rule of thumb that, when dealing with large indexes, copyField
should not be used?  It seems to duplicate the indexing of data.

You don't need copyField to be able to search on multiple fields.  For example,
if I have two fields, title and post, and I want to search on both, I could
just query
title:<term> OR post:<term>

So it seems to me that if you have lots of data and large indexes, copyField
should be avoided.

Any thoughts?
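
For concreteness, the setup I have in mind is the usual schema.xml pattern,
something like (field names are just from my example above):

  <field name="title" type="text_general" indexed="true" stored="true"/>
  <field name="post"  type="text_general" indexed="true" stored="true"/>
  <field name="text"  type="text_general" indexed="true" stored="false" multiValued="true"/>
  <copyField source="title" dest="text"/>
  <copyField source="post"  dest="text"/>

versus simply querying both fields, e.g. q=title:(lucene) OR post:(lucene).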



Re: copyField for big indexes

2011-08-22 Thread Tom
Thanks Erick



Re: copyField for big indexes

2011-08-22 Thread Tom
Bill,

  I was using it as a simple default search field.  I realise now that's not
a good reason to use copyField.  As I see it now, it should be used if you
want to search in a way that is different: use different analyzers, etc; not
for just searching on multiple fields in a single query.

Thanks



Re: Solr indexing process: keep a persistent Mysql connection throu all the indexing process

2011-08-23 Thread Tom
10K documents? Why not just batch them?

You could read the 10K rows from your database, load them into a list of
SolrInputDocuments, and then post them all at once to the Solr server. Or do
them in 1K increments if they are really big.
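
A rough sketch of what I mean, using SolrJ (readRowsFromDatabase() is just a
stand-in for your JDBC code, and the exact SolrServer class varies by version):

  SolrServer solr = ...; // e.g. new CommonsHttpSolrServer("http://localhost:8983/solr")
  List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
  for (Map<String, Object> row : readRowsFromDatabase()) {
      SolrInputDocument doc = new SolrInputDocument();
      doc.addField("id", row.get("id"));
      doc.addField("title", row.get("title"));
      batch.add(doc);
      if (batch.size() == 1000) {   // post in 1K increments
          solr.add(batch);
          batch.clear();
      }
  }
  if (!batch.isEmpty()) {
      solr.add(batch);
  }
  solr.commit();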



Trimming the list of docs returned.

2006-10-30 Thread Tom

Hi -

I'd like to be able to limit the number of documents returned from 
any particular group of documents, much as Google only shows a max of 
two results from any one website.


The docs are all marked as to which group they belong to. There will 
probably be multiple groups returned from any search. Documents 
belong to only one group


I could just examine each returned document, and discard documents 
from groups I have seen before, but that seems slow (but I'm not sure 
there is a better alternative).


The number of groups is fairly high percentage of the number of 
documents (maybe 5% of all documents), so building something like a 
filter for each group doesn't seem feasible.


CustomHitCollector of some sort could work, but there is the comment 
in the javadoc about "should not call  Searcher.doc(int) 
or  IndexReader.document(int) on every  document number encountered." 
which would seem to be necessary to get the group id.


Does Solr add anything to Lucene in this regard?

Thanks,

Tom



Re: Trimming the list of docs returned.

2006-11-08 Thread Tom

Hi -

On 10/30/06, Yonik Seeley <[EMAIL PROTECTED]> wrote:
> Yes, a custom hit collector would work.  Searcher.doc() would be
> deadly... but since each doc has at most one category, the FieldCache
> could be used (it quickly maps id to field value and was historically
> used for sorting).

Not to be dense, but how do I use a custom HitCollector with Solr?

I've checked the wiki, and searched the mailing list, and don't see 
anything. Is there a way to configure this, or do I just build a 
custom version of Solr?


I have no problems doing this in Lucene, but I'm not quite sure where 
to configure/code this in Solr.


Thanks,

Tom


On 10/30/06, Yonik Seeley <[EMAIL PROTECTED]> wrote:
> Hi Tom, I moderated your email in... you need to subscribe to prevent
> your emails being blocked in the future.

Thanks. That's fixed, I hope. I was using the wrong address.

> http://incubator.apache.org/solr/mailing_lists.html
>
> On 10/30/06, Tom <[EMAIL PROTECTED]> wrote:
> > I'd like to be able to limit the number of documents returned from
> > any particular group of documents, much as Google only shows a max of
> > two results from any one website.
>
> You bring up an interesting problem that may be of general use.
> Solr doesn't currently do this, but it should be possible (with some
> work in the internals).
>
> > The docs are all marked as to which group they belong to. There will
> > probably be multiple groups returned from any search. Documents
> > belong to only one group
>
> Documents belonging to only one group does make things easier.
>
> > I could just examine each returned document, and discard documents
> > from groups I have seen before, but that seems slow (but I'm not sure
> > there is a better alternative).
> >
> > The number of groups is fairly high percentage of the number of
> > documents (maybe 5% of all documents), so building something like a
> > filter for each group doesn't seem feasible.
> >
> > CustomHitCollector of some sort could work, but there is the comment
> > in the javadoc about "should not call  Searcher.doc(int)
> > or  IndexReader.document(int) on every  document number encountered."
> > which would seem to be necessary to get the group id.
>
> Yes, a custom hit collector would work.  Searcher.doc() would be
> deadly... but since each doc has at most one category, the FieldCache
> could be used (it quickly maps id to field value and was historically
> used for sorting).
>
> It might be useful to see what Nutch does in this regard too.
>
> -Yonik
> 



Re: Trimming the list of docs returned.

2006-11-15 Thread Tom

Hi -

Recap:

 > > I'd like to be able to limit the number of documents returned from
 > > any particular group of documents, much as Google only shows a max of
 > > two results from any one website.
 > >
 > > The docs are all marked as to which group they belong to. There will
 > > probably be multiple groups returned from any search. Documents
 > > belong to only one group



It looks like that for trimming, the places I want to modify are in 
ScorePriorityQueue and FieldSortedHitQueue. When trimming, I want to 
return the top item in the group that matches, whether by score or 
sort, not just the first one that goes through the HitCollector.


But since I want to enable this per request basis, I need some way to 
get the parameters from the original request, and pass it down to my 
implementation of ScorePriorityQueue.


I'm trying to minimize the number of changes I'd have to make, so 
I've defined another flag (like SolrIndexHandler.GET_SCORES), and I 
check and set it in a modified version of StandardRequestHandler. 
This seems to work, and doesn't require me to change any method 
signatures. Suggestions for other implementations welcome!


Index: src/java/org/apache/solr/request/StandardRequestHandler.java
===
--- 
src/java/org/apache/solr/request/StandardRequestHandler.java 
(revision 470495)
+++ 
src/java/org/apache/solr/request/StandardRequestHandler.java 
(working copy)

@@ -97,6 +97,10 @@
   // find fieldnames to return (fieldlist)
   String fl = p.get(SolrParams.FL);
   int flags = 0;
+  String trim = p.get("trim");
+  if ((trim == null) || !trim.equals("0"))
+   flags |= SolrIndexSearcher.TRIM_RESULTS;
+
   if (fl != null) {
 flags |= U.setReturnFields(fl, rsp);
   }

But, unsurprisingly, trimming vs. not trimming is being ignored with 
regard to caching. How would I indicate that a query with trim=0 is 
not the same as trim=1? I do still want to cache. But obviously, my 
implementation won't work at the moment, since all queries will cache 
the value generated using the results generated by the value of trim 
on the initial query.


Any suggestions for where to go poking around to fix this vs. caching?

Thanks,

Tom





At 11:10 AM 11/8/2006, you wrote:

On 11/8/06, Tom <[EMAIL PROTECTED]> wrote:

On 10/30/06, Yonik Seeley <[EMAIL PROTECTED]> wrote:
 > Yes, a custom hit collector would work.  Searcher.doc() would be
 > deadly... but since each doc has at most one category, the FieldCache
 > could be used (it quickly maps id to field value and was historically
 > used for sorting).

Not to be dense, but how do I use a custom HitCollector with Solr?


You would need a custom request handler, then just use the
SolrIndexSearcher you get with a request... it exposes all of the
Lucene IndexSearcher methods.

-Yonik



On 10/30/06, Yonik Seeley <[EMAIL PROTECTED]> wrote:
 > Hi Tom, I moderated your email in... you need to subscribe to prevent
 > your emails being blocked in the future.

Thanks. That's fixed, I hope. I was using the wrong address.

 > http://incubator.apache.org/solr/mailing_lists.html
 >
 > On 10/30/06, Tom <[EMAIL PROTECTED]> wrote:
 > > I'd like to be able to limit the number of documents returned from
 > > any particular group of documents, much as Google only shows a max of
 > > two results from any one website.
 >
 > You bring up an interesting problem that may be of general use.
 > Solr doesn't currently do this, but it should be possible (with some
 > work in the internals).
 >
 > > The docs are all marked as to which group they belong to. There will
 > > probably be multiple groups returned from any search. Documents
 > > belong to only one group
 >
 > Documents belonging to only one group does make things easier.
 >
 > > I could just examine each returned document, and discard documents
 > > from groups I have seen before, but that seems slow (but I'm not sure
 > > there is a better alternative).
 > >
 > > The number of groups is fairly high percentage of the number of
 > > documents (maybe 5% of all documents), so building something like a
 > > filter for each group doesn't seem feasible.
 > >
 > > CustomHitCollector of some sort could work, but there is the comment
 > > in the javadoc about "should not call  Searcher.doc(int)
 > > or  IndexReader.document(int) on every  document number encountered."
 > > which would seem to be necessary to get the group id.
 >
 > Yes, a custom hit collector would work.  Searcher.doc() would be
 > deadly... but since each doc has at most one category, the FieldCache
 > could be used (it quickly maps id to field value and was historically
 > used for sorting).
 >
 > It might be useful to see what Nutch does in this regard too.
 >
 > -Yonik




Re: Trimming the list of docs returned.

2006-11-15 Thread Tom

At 01:35 PM 11/15/2006, you wrote:

On 11/15/06, Tom <[EMAIL PROTECTED]> wrote:

It looks like that for trimming, the places I want to modify are in
ScorePriorityQueue and FieldSortedHitQueue. When trimming, I want to
return the top item in the group that matches, whether by score or
sort, not just the first one that goes through the HitCollector.


Wouldn't you actually need a priority queue per group?


I'm still playing with implementations, but I think you just need a 
max score for each group.


You can't just do a PriorityQueue (of either max scores, or PriorityQueues), 
since I don't think the Lucene PriorityQueue handles entries whose 
value changes after insertion.




But, unsurprisingly, trimming vs. not trimming is being ignored with
regard to caching. How would I indicate that a query with trim=0 is
not the same as trim=1? I do still want to cache.


One hack: implement a simple query that delegates to another query and
encapsulates the trim value... that way hashCode/equals won't match
unless the trim does.


Not sure what you mean by "delegates to another query". Could you 
clarify or give me a pointer?


I was thinking in terms of just adding some guaranteed true clause to 
the end when trimming, is that similar to what you were talking about?


Thanks,

Tom




-Yonik


But obviously, my
implementation won't work at the moment, since all queries will cache
the value generated using the results generated by the value of trim
on the initial query.

Any suggestions for where to go poking around to fix this vs. caching?

Thanks,

Tom




MatchAllDocsQuery in solr?

2006-11-21 Thread Tom

Is there a way to do a match all docs query in solr?

I mean is there something I can put in a solr URL that will get 
recognized by the SolrQueryParser as meaning a "match all"?


Why? Because I'm porting unit tests from our internal Lucene 
container to Solr, and the tests usually run such a query,  upon 
completion, to make sure the index is in the expected state (nothing 
missing, nothing extra).


Yes, I can create a query that will match all my docs, there are a 
few fields that have a relatively small range of values. I was just 
looking for a standard way to do it first.


Thanks,

Tom




Re: MatchAllDocsQuery in solr?

2006-11-21 Thread Tom

Thanks for the quick response.

I thought about a range query on the ID, but was wondering what the 
implications were for a large range query. (e.g. Number of docs > 
maxBooleanClauses). But this approach will work for me, as my test 
indicies are generally small.


For a large data set, would it be faster to do that on a field with 
fewer values (but the same number of documents)


e.g. type:[* TO *] where the type field has a small number of values.

Or does that not matter?

Thanks,

Tom

At 02:49 PM 11/21/2006, you wrote:


: > I mean is there something I can put in a solr URL that will get
: > recognized by the SolrQueryParser as meaning a "match all"?
:
: No, but there should be.

if you use the uniqueKey feature, then you can do id:[* TO *] ... that
acctually works on any field to find all docs that have "a" value, but on
a uniqueKey field it by definition returns all docs since all docs have a
uniequeKey.




-Hoss




Re: MatchAllDocsQuery in solr?

2006-11-27 Thread Tom

At 03:18 PM 11/21/2006, Hoss wrote:


It would would be really cool is if you could say something like...

field:[low TO high]^0  other clauses XXX^0

...and SolrIndexSearcher recognised that teh score contributions from the
range query and the XXX TermQuery weren't going to contribute to the
score, so it pulled the DocSets for them explicitly, and replaced their
spots in the orriginal query with ConstantScoreQueries containing their
DocSets ... that way they could be cached independently and reused.


Just checking my understanding here.

Right now, if I have ranges that I don't want to affect the score, 
but I would like to have cached, I should use Filter Queries, right? 
(SolrParams.FQ)
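
i.e. (if I understand it right) something along the lines of:

  .../select?q=type:book&fq=date:[2006 TO 2007]&fq=group:electronics

where each fq clause is computed and cached as a DocSet in the filterCache,
independently of the main query, and does not contribute to the score.
(Field names here are just examples.)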


Thanks,

Tom



Cache stats

2006-11-29 Thread Tom

Hi -

I'm starting to try to tune my installation a bit, and I'm looking 
for cache statistics. Is there a way to peek into a running 
installation, and see what my cache stats are?


I'm looking for the usual cache hits/cache misses sort of things.

Also, on a related note, I was looking for solr info via mbeans. I 
fired up jconsole, and I can see all sort of tomcat mbeans, but 
nothing for solr. Is there something extra I have to do to turn this 
on? I see things implementing SolrInfoMBean, so I'm assuming there is 
something there.


(Off topic, but suggestions for anything better than JConsole also welcome).

Thanks,

Tom



boosts?

2006-12-27 Thread Tom

Hi -

I'm having a problem getting boosts to work the way I think they are 
supposed to.


What I want is for documents to be returned in doc boost order, when 
all the queries are constant scoring range queries. (e.g. date:[2006 TO 2007])


I believe (but am not certain) that this is supposed to be what 
happens. If that's not the case, you can probably skip the rest :-)


As an example, I grabbed solr-1.1, and ran it (java -jar start.jar).

Then I modified the hd.xml example doc, to add a boost on the first 
document (SP2514N)




Then I loaded monitor.xml, and hd.xml

./post.sh monitor.xml
./post.sh hd.xml

I then went to the solr admin interface and queried on

id:[* TO *]

Which I believe gets mapped to a ConstantScoreRangeQuery.

So, given

http://fred:8983/solr/select/?q=id%3A%5B*+TO+*%5D&version=2.2&start=0&rows=10&indent=on&debugQuery=1

I get the result below. Note that all the results list "boost=1.0"

I would expect to see a boost of 100 on the SP2514N, in the 
explanation. Should I get that? I would also expect it to be at the 
head of the list, but I think I'm seeing the docs in insertion order. 
(if I insert hd.xml before monitor.xml, I get them in insertion order 
in that case as well.)


Please let me know if my assumptions or my methods aren't correct.

Thanks,

Tom






 0
 4
 
  10
  0


  on
  id:[* TO *]
  1
  2.2
 


 


  electronicsmonitor
  30" TFT active matrix LCD, 2560 x 1600, 
.25mm dot pitch, 700:1 contrast

  3007WFP
  true
  USB cable
  Dell, Inc.


  Dell Widescreen UltraSharp 3007WFP
  6
  2199.0
  3007WFP
  401.6
 


 
  electronicshard drive
  7200RPM, 8MB cache, IDE Ultra 
ATA-133NoiseGuard, SilentSeek technology, Fluid Dynamic 
Bearing (FDB) motor

  SP2514N
  true
  Samsung Electronics Co. Ltd.


  Samsung SpinPoint P120 SP2514N - hard drive - 250 
GB - ATA-133

  6
  92.0
  SP2514N
 
 
  electronicshard drive


  SATA 3.0Gb/s, NCQ8.5ms 
seek16MB cache

  6H500F0
  true
  Maxtor Corp.
  Maxtor DiamondMax 11 - hard drive - 500 GB - SATA-300


  6
  350.0
  6H500F0
 


 id:[* TO *]
 id:[* TO *]


 id:[* TO *]
 id:[* TO *]
 
  
1.0 = (MATCH) ConstantScoreQuery(id:[-}), product of:
  1.0 = boost
  1.0 = queryNorm

  
1.0 = (MATCH) ConstantScoreQuery(id:[-}), product of:
  1.0 = boost
  1.0 = queryNorm

  


1.0 = (MATCH) ConstantScoreQuery(id:[-}), product of:
  1.0 = boost
  1.0 = queryNorm

 







Re: boosts?

2006-12-28 Thread Tom

Hi Yonik,

Thanks for the quick response.

At 07:45 AM 12/28/2006, you wrote:

On 12/27/06, Tom <[EMAIL PROTECTED]> wrote:

I'm having a problem getting boosts to work the way I think they are
supposed to.


Do you have a specific relevance problem you are trying to solve, or
just testing things out?


Specific problem.

Frequently our users will start by specifying a facet, such as a date 
range, geo location, etc. At this point I don't have any positive 
query terms, just constant score range queries that are used to 
eliminate things the user is not interested in.  So at this point, 
there's nothing to be relevant to, so I need to pick some ordering. 
Since I have information about which results tend to be more 
interesting in the general case, I've set boosts on the documents. 
I'd like to order by that, until the user gives me more information.


For an example, think of amazon ordering by "best selling", when the 
user asks for books published since Dec. 1st. You don't yet know what 
is relevant to this user's query, since all you have is "since Dec 
1st", but you want to give an order more reasonable than "doc 
number", or "date published".



What I want is for documents to be returned in doc boost order, when
all the queries are constant scoring range queries. (e.g. 
date:[2006 TO 2007])


They are *constant scoring* range queries :-)  Index-time boosts
currently don't factor in.


Gotcha. I think I misinterpreted an earlier post (which did say 
"query boost"). I was thinking it would include index time boost, too.




I'd recommend only using index-time boosting when you can't get the
relevance you want with query boosting and scoring.


I'm not sure how I'd do it that way.

What I want (what I _think_ I want :-) is a way to specify a default 
order for results, for the cases where the user has only provided 
exclusion information. In this case, I'm doing a match all docs, with 
filter queries.


Tom



Re: boosts?

2006-12-28 Thread Tom

At 12:03 PM 12/28/2006, you wrote:

On 12/28/06, Tom <[EMAIL PROTECTED]> wrote:
Could you index  your documents in the desired order?  This is the
default sort order.


I don't think I can control document order, as documents may get 
edited after creation.



If not, you can add a field that is present in all documents, and add
this as part of the query.  Then you can fiddle with the index-time
field boost to alter the results (without skewing queries that have a
meaningful relevancy score as using document boosts would do).


That seems to work. Thanks!

I'll probably do it that way, but... :-)

I was looking at how I would write a modified version of 
MatchAllDocsQuery that would simply return the documents boost as the 
score. But I haven't really figured out Lucene scoring.


Could someone explain how one would do something like this? I'm just 
trying to understand how one might do custom scoring in Lucene, so 
I'm more looking for concepts than code.


Thanks!

Tom



Re: boosts?

2006-12-30 Thread Tom

At 06:03 PM 12/28/2006, you wrote:

maybe i'm missing something, but it sounds like what you want is a simple
sort on a numeric field -- whatever value you are tyring to use as the
index time boost, you can just set as a field value instead and then sort
on it right?


Yes.

I had been just been thinking about it in terms of how to use the 
info I already had in the index. But making another field works, too, 
and is probably simpler.




: I was looking at how I would write a modified version of
: MatchAllDocsQuery that would simply return the documents boost as the
: score. But I haven't really figured out Lucene scoring.

document boosts aren't maintained in the index ... they are multiplied by
the various field boosts and lengthNorms and stored on a per field basis.


Thanks! I had seen comments that the doc boost wasn't stored, but 
didn't know how it worked.


Tom



SolrCloud, DIH, and XPathEntityProcessor

2016-01-12 Thread Tom Evans
Hi all, trying to move our Solr 4 setup to SolrCloud (5.4). Having
some problems with a DIH config that attempts to load an XML file and
iterate through the nodes in that file; it tries to load the file from
disk instead of from zookeeper. The entity is defined like this:

<entity dataSource="lookup_conf"
        rootEntity="false"
        name="lookups"
        processor="XPathEntityProcessor"
        url="lookup_conf.xml"
        forEach="/lookups/lookup">

The file exists in zookeeper, adjacent to the data_import.conf in the
lookups_config conf folder.

The exception:

2016-01-12 12:59:47.852 ERROR (Thread-44) [c:lookups s:shard1
r:core_node6 x:lookups_shard1_replica2] o.a.s.h.d.DataImporter Full
Import failed:java.lang.RuntimeException: java.lang.RuntimeException:
org.apache.solr.handler.dataimport.DataImportHandlerException:
java.lang.RuntimeException: java.io.FileNotFoundException: Could not
find file: lookup_conf.xml (resolved to:
/mnt/solr/server/lookups_shard1_replica2/conf/lookup_conf.xml
at 
org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:271)
at 
org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:417)
at 
org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:481)
at 
org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:462)
Caused by: java.lang.RuntimeException:
org.apache.solr.handler.dataimport.DataImportHandlerException:
java.lang.RuntimeException: java.io.FileNotFoundException: Could not
find file: lookup_conf.xml (resolved to:
/mnt/solr/server/lookups_shard1_replica2/conf/lookup_conf.xml
at 
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:417)
at 
org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:330)
at 
org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:233)
... 3 more
Caused by: org.apache.solr.handler.dataimport.DataImportHandlerException:
java.lang.RuntimeException: java.io.FileNotFoundException: Could not
find file: lookup_conf.xml (resolved to:
/mnt/solr/server/lookups_shard1_replica2/conf/lookup_conf.xml
at 
org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:62)
at 
org.apache.solr.handler.dataimport.XPathEntityProcessor.initQuery(XPathEntityProcessor.java:287)
at 
org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow(XPathEntityProcessor.java:225)
at 
org.apache.solr.handler.dataimport.XPathEntityProcessor.nextRow(XPathEntityProcessor.java:202)
at 
org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:244)
at 
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:476)
at 
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:415)
... 5 more
Caused by: java.lang.RuntimeException: java.io.FileNotFoundException:
Could not find file: lookup_conf.xml (resolved to:
/mnt/solr/server/lookups_shard1_replica2/conf/lookup_conf.xml
at 
org.apache.solr.handler.dataimport.FileDataSource.getFile(FileDataSource.java:127)
at 
org.apache.solr.handler.dataimport.FileDataSource.getData(FileDataSource.java:86)
at 
org.apache.solr.handler.dataimport.FileDataSource.getData(FileDataSource.java:48)
at 
org.apache.solr.handler.dataimport.XPathEntityProcessor.initQuery(XPathEntityProcessor.java:284)
... 10 more
Caused by: java.io.FileNotFoundException: Could not find file:
lookup_conf.xml (resolved to:
/mnt/solr/server/lookups_shard1_replica2/conf/lookup_conf.xml
at 
org.apache.solr.handler.dataimport.FileDataSource.getFile(FileDataSource.java:123)
... 13 more


Any hints gratefully accepted

Cheers

Tom


Re: SolrCloud, DIH, and XPathEntityProcessor

2016-01-12 Thread Tom Evans
On Tue, Jan 12, 2016 at 2:32 PM, Shawn Heisey  wrote:
> On 1/12/2016 6:05 AM, Tom Evans wrote:
>> Hi all, trying to move our Solr 4 setup to SolrCloud (5.4). Having
>> some problems with a DIH config that attempts to load an XML file and
>> iterate through the nodes in that file, it trys to load the file from
>> disk instead of from zookeeper.
>>
>> > dataSource="lookup_conf"
>> rootEntity="false"
>> name="lookups"
>> processor="XPathEntityProcessor"
>> url="lookup_conf.xml"
>> forEach="/lookups/lookup">
>>
>> The file exists in zookeeper, adjacent to the data_import.conf in the
>> lookups_config conf folder.
>
> SolrCloud puts all the *config* for Solr into zookeeper, and adds a new
> abstraction for indexes (the collection), but other parts of Solr like
> DIH are not really affected.  The entity processors in DIH cannot
> retrieve data from zookeeper.  They do not know how.

That makes no sense whatsoever. DIH loads the data_import.conf from ZK
just fine, or is that provided to DIH from another module that does
know about ZK?

Either way, it is entirely sub-optimal to have SolrCloud store "all"
its configuration in ZK, but still require manually storing and
updating files on specific nodes in order to influence DIH. If a
server is mistakenly not updated, or manually modified locally on
disk, that node would start indexing documents differently than other
replicas, which sounds dangerous and scary!

If there is not a ZkFileDataSource, it shouldn't be too tricky to add
one... I'll see how much I dislike having config files on the host...

Cheers

Tom


Re: SolrCloud, DIH, and XPathEntityProcessor

2016-01-12 Thread Tom Evans
On Tue, Jan 12, 2016 at 3:00 PM, Shawn Heisey  wrote:
> On 1/12/2016 7:45 AM, Tom Evans wrote:
>> That makes no sense whatsoever. DIH loads the data_import.conf from ZK
>> just fine, or is that provided to DIH from another module that does
>> know about ZK?
>
> This is accomplished indirectly through a resource loader in the
> SolrCore object that is responsible for config files.  Also, the
> dataimport handler is created by the main Solr code which then hands the
> configuration to the dataimport module.  DIH itself does not know about
> zookeeper.

ZkPropertiesWriter seems to know a little..

>
>> Either way, it is entirely sub-optimal to have SolrCloud store "all"
>> its configuration in ZK, but still require manually storing and
>> updating files on specific nodes in order to influence DIH. If a
>> server is mistakenly not updated, or manually modified locally on
>> disk, that node would start indexing documents differently than other
>> replicas, which sounds dangerous and scary!
>
> The entity processor you are using accesses files through a Java
> interface for mounted filesystems.  As already mentioned, it does not
> know about zookeeper.
>
>> If there is not a ZkFileDataSource, it shouldn't be too tricky to add
>> one... I'll see how much I dislike having config files on the host...
>
> Creating your own DIH class would be the only solution available right now.
>
> I don't know how useful this would be in practice.  Without special
> config in multiple places, Zookeeper limits the size of the files it
> contains to 1MB.  It is not designed to deal with a large amount of data
> at once.

This is not large amounts of data, it is a 5kb XML file containing
configuration of what tables to query for what fields and how to map
them in to the document.

>
> You could submit a feature request in Jira, but unless you supply a
> complete patch that survives the review process, I do not know how
> likely an implementation would be.

We've already started an implementation, based around FileDataSource and
using SolrZkClient, which we will deploy as an additional library while
that process is ongoing, or if the patch doesn't survive review.
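For reference, a minimal sketch of that idea (not production code; the
configset path layout and the "configset" init property are assumptions, and
error handling is kept to a minimum):

import java.io.Reader;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

import org.apache.solr.common.cloud.SolrZkClient;
import org.apache.solr.handler.dataimport.Context;
import org.apache.solr.handler.dataimport.DataImportHandlerException;
import org.apache.solr.handler.dataimport.DataSource;

// Sketch of a DIH DataSource that reads the entity's url from zookeeper
// instead of from local disk.
public class ZkFileDataSource extends DataSource<Reader> {

  private SolrZkClient zkClient;
  private String configSetPath;

  @Override
  public void init(Context context, Properties initProps) {
    // reuse the zookeeper client the core container already holds
    zkClient = context.getSolrCore().getCoreDescriptor().getCoreContainer()
        .getZkController().getZkClient();
    // assumption: the files live under the collection's configset node
    configSetPath = "/configs/" + initProps.getProperty("configset", "lookups_config");
  }

  @Override
  public Reader getData(String query) {
    try {
      byte[] data = zkClient.getData(configSetPath + "/" + query, null, null, true);
      return new StringReader(new String(data, StandardCharsets.UTF_8));
    } catch (Exception e) {
      throw new DataImportHandlerException(DataImportHandlerException.SEVERE,
          "Could not read " + query + " from zookeeper", e);
    }
  }

  @Override
  public void close() {
    // the zk client is owned by the core container; nothing to close here
  }
}

The entity would then point at it via a <dataSource type="com.example.ZkFileDataSource" .../>
declaration in the DIH config, like any other DataSource.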

Cheers

Tom


Shard allocation across nodes

2016-02-01 Thread Tom Evans
Hi all

We're setting up a solr cloud cluster, and unfortunately some of our
VMs may be physically located on the same VM host. Is there a way of
ensuring that all copies of a shard are not located on the same
physical server?

If they do end up in that state, is there a way of rebalancing them?

Cheers

Tom


Re: Shard allocation across nodes

2016-02-02 Thread Tom Evans
Thank you both, those are exactly what I was looking for!

If I'm reading it right, if I specify a "-Dvmhost=foo" when starting
SolrCloud, and then specify a snitch rule like this when creating the
collection:

  sysprop.vmhost:*,replica:<2

then this would ensure that on each vmhost there is at most one
replica. I'm assuming that a shard leader and a replica are both
treated as replicas in this scenario.
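For reference, the whole flow would look roughly like this (a sketch; host
names, collection name and counts are illustrative):

  # start each node with a system property identifying its VM host
  bin/solr start -cloud -z zk1:2181 -Dvmhost=vmhost01

  # create the collection with a rule limiting replicas per vmhost
  http://localhost:8983/solr/admin/collections?action=CREATE&name=mycollection
      &numShards=1&replicationFactor=3
      &rule=sysprop.vmhost:*,replica:<2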

Thanks

Tom

On Mon, Feb 1, 2016 at 8:34 PM, Erick Erickson  wrote:
> See the createNodeset and node parameters for the Collections API CREATE and
> ADDREPLICA commands, respectively. That's more a manual process, there's
> nothing OOB but Jeff's suggestion is sound.
>
> Best,
> Erick
>
>
>
> On Mon, Feb 1, 2016 at 11:00 AM, Jeff Wartes  wrote:
>>
>> You could write your own snitch: 
>> https://cwiki.apache.org/confluence/display/solr/Rule-based+Replica+Placement
>>
>> Or, it would be more annoying, but you can always add/remove replicas 
>> manually and juggle things yourself after you create the initial collection.
>>
>>
>>
>>
>> On 2/1/16, 8:42 AM, "Tom Evans"  wrote:
>>
>>>Hi all
>>>
>>>We're setting up a solr cloud cluster, and unfortunately some of our
>>>VMs may be physically located on the same VM host. Is there a way of
>>>ensuring that all copies of a shard are not located on the same
>>>physical server?
>>>
>>>If they do end up in that state, is there a way of rebalancing them?
>>>
>>>Cheers
>>>
>>>Tom


Change in EXPLAIN info since Solr 5

2016-02-04 Thread Burgmans, Tom
Hi group, 

While exploring Solr 5.4.0, I noticed a subtle difference in the EXPLAIN debug 
information, compared to the version we currently use (4.10.1).

Solr 4.10.1:

2.0739748 = (MATCH) max plus 1.0 times others of:
  2.0739748 = (MATCH) weight(text:test in 30) [DefaultSimilarity], result of:
2.0739748 = score(doc=30,freq=3.0), product of:
  0.3556181 = queryWeight, product of:
3.3671236 = idf(docFreq=17, maxDocs=192)
0.105614804 = queryNorm
  5.832029 = fieldWeight in 30, product of:
1.7320508 = tf(freq=3.0), with freq of:
  3.0 = termFreq=3.0
3.3671236 = idf(docFreq=17, maxDocs=192)
1.0 = fieldNorm(doc=30)

Solr 5.4.0:

2.0739748 = max plus 1.0 times others of:
  2.0739748 = weight(text:test in 30) [ClassicSimilarity], result of:
2.0739748 = score(doc=30,freq=3.0), product of:
  0.3556181 = queryWeight, product of:
3.3671236 = idf(docFreq=17, maxDocs=192)
0.105614804 = queryNorm
  5.832029 = fieldWeight in 30, product of:
1.7320508 = tf(freq=3.0), with freq of:
  3.0 = termFreq=3.0
3.3671236 = idf(docFreq=17, maxDocs=192)
1.0 = fieldNorm(doc=30)

The difference is the removal of (MATCH) in some of the EXPLAIN lines. That is 
causing issues for us since we have developed an EXPLAIN parser that leans on 
the presence of (MATCH) in the EXPLAIN.
Does anyone have a suggestion on how to insert (MATCH) back into the explain
info (i.e. which file should we patch)?

Thanks, Tom


fq in SolrCloud

2016-02-05 Thread Tom Evans
I have a small question about fq in cloud mode that I couldn't find an
explanation for in confluence. If I specify a query with an fq, where
is that cached, is it just on the nodes/replicas that process that
specific query, or will it exist on all replicas?

We have a subset of queries that specify an expensive join condition in
the fq, so that subsequent requests with the same fq won't have to redo
the expensive query, and I was wondering whether we need to ensure that
such queries go to the same node when we move to cloud.

Cheers

Tom


Re: Json faceting, aggregate numeric field by day?

2016-02-10 Thread Tom Evans
On Wed, Feb 10, 2016 at 10:21 AM, Markus Jelsma
 wrote:
> Hi - if we assume the following simple documents:
>
> <doc>
>   <field name="date">2015-01-01T00:00:00Z</field>
>   <field name="value">2</field>
> </doc>
> <doc>
>   <field name="date">2015-01-01T00:00:00Z</field>
>   <field name="value">4</field>
> </doc>
> <doc>
>   <field name="date">2015-01-02T00:00:00Z</field>
>   <field name="value">3</field>
> </doc>
> <doc>
>   <field name="date">2015-01-02T00:00:00Z</field>
>   <field name="value">7</field>
> </doc>
>
> Can i get a daily average for the field 'value' by day? e.g.
>
> <lst>
>   <double name="2015-01-01">3.0</double>
>   <double name="2015-01-02">5.0</double>
> </lst>
>
> Reading the documentation, i don't think i can, or i am missing it 
> completely. But i just want to be sure.

Yes, you can facet by day, and use the stats component to calculate
the mean average. This blog post explains it:

https://lucidworks.com/blog/2015/01/29/you-got-stats-in-my-facets/

Cheers

Tom


Re: Json faceting, aggregate numeric field by day?

2016-02-11 Thread Tom Evans
On Wed, Feb 10, 2016 at 12:13 PM, Markus Jelsma
 wrote:
> Hi Tom - thanks. But judging from the article and SOLR-6348 faceting stats 
> over ranges is not yet supported. More specifically, SOLR-6352 is what we 
> would need.
>
> [1]: https://issues.apache.org/jira/browse/SOLR-6348
> [2]: https://issues.apache.org/jira/browse/SOLR-6352
>
> Thanks anyway, at least we found the tickets :)
>

No problem - as I was reading this I was thinking "But wait, I *know*
we do this ourselves for average price vs month published". In fact, I
was forgetting that we index the ranges that we will want to facet
over as part of the document - so a document with a date_published of
"2010-03-29T00:00:00Z" also has a date_published.month of "201003"
(and a bunch of other ranges that we want to facet by). The frontend
then converts those fields in to the appropriate values for display.

This might be an acceptable solution for you guys too, depending on
how many ranges that you require, and how much larger it would make
your index.
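For reference, the request then looks roughly like this (a sketch; the price
field name is illustrative):

  stats=true
  &stats.field={!tag=piv}price
  &facet=true
  &facet.pivot={!stats=piv}date_published.month

Each month bucket of the pivot then carries the full stats block (mean, min,
max, etc.) for the tagged field.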

Cheers

Tom


Solr and Nutch integration

2016-02-16 Thread Tom Running
I am having problems configuring Solr to read Nutch data or integrate with
Nutch.
Has anyone been able to get Solr 5.4.x to work with Nutch?

I went through a lot of articles found via Google and still am not able to get
Solr 5.4.1 to search Nutch content.

Any howto or working configuration sample that you can share would be
greatly appreciated.

Thanks,
Toom


Display entire string containing query string

2016-02-17 Thread Tom Running
Hello,

I am working on a project using Solr to search data retrieved from
Nutch.

I have successfully integrated Nutch with Solr, and Solr is able to search
Nutch's data.

However I am having a bit of a problem. If I query Solr, it will bring back
the numfound and which document the query string was found in, but it will
not display the string that contains the query string.

Can anyone help with how to display the entire string that contains the query?


I appreciate your time and guidance. Thank you so much!

-T


Re: Display entire string containing query string

2016-02-18 Thread Tom Running
Hello
Thank you for your reply.
I am wondering if you can clarify a bit more for me. Is
field_where_string_may_be_present something that I have to specify? I am
searching HTML page.
For example if I search for the word "name" I am trying to display the
entire sentence containing  "name = T" or maybe "name: T". Ultimately by
searching for the string "name" I am trying to find the value of name.

Thanks for your time. I appreciate your help
-T
On Feb 18, 2016 1:18 AM, "Binoy Dalal"  wrote:

> Append &fl=<field_where_string_may_be_present>
>
> On Thu, 18 Feb 2016, 11:35 Tom Running  wrote:
>
> > Hello,
> >
> > I am working on a project using Solr to search data from retrieved from
> > Nutch.
> >
> > I have successfully integrated Nutch with Solr, and Solr is able to
> search
> > Nutch's data.
> >
> > However I am having a bit of a problem. If I query Solr, it will bring
> back
> > the numfound and which document the query string was found in, but it
> will
> > not display the string that contains the query string.
> >
> > Can anyone help on how to display the entire string that contains the
> > query.
> >
> >
> > I appreciate your time and guidance. Thank you so much!
> >
> > -T
> >
> --
> Regards,
> Binoy Dalal
>


Re: docValues error

2016-02-29 Thread Tom Evans
On Mon, Feb 29, 2016 at 11:43 AM, David Santamauro
 wrote:
> You will have noticed below, the field definition does not contain
> multiValues=true

What version of the schema are you using? In pre 1.1 schemas,
multiValued="true" is the default if it is omitted.

Cheers

Tom


Separating cores from Solr home

2016-03-03 Thread Tom Evans
Hi all

I'm struggling to configure solr cloud to put the index files and
core.properties in the correct places in SolrCloud 5.5. Let me explain
what I am trying to achieve:

* solr is installed in /opt/solr
* the user who runs solr only has read only access to that tree
* the solr home files - custom libraries, log4j.properties, solr.in.sh
and solr.xml - live in /data/project/solr/releases/, which
is then the target of a symlink /data/project/solr/releases/current
* releasing a new version of the solr home (eg adding/changing
libraries, changing logging options) is done by checking out a fresh
copy of the solr home, switching the symlink and restarting solr
* the solr core.properties and any data live in /data/project/indexes,
so they are preserved when new solr home is released

Setting core specific dataDir with absolute paths in solrconfig.xml
only gets me part of the way, as the core.properties for each shard is
created inside the solr home.

This is obviously no good, as when releasing a new version of the solr
home, they will no longer be in the current solr home.

Cheers

Tom


Re: Separating cores from Solr home

2016-03-03 Thread Tom Evans
Hmm, I've worked around this by setting the directory where the
indexes should live to be the actual solr home, and symlink the files
from the current release in to that directory, but it feels icky.

Any better ideas?

Cheers

Tom

On Thu, Mar 3, 2016 at 11:12 AM, Tom Evans  wrote:
> Hi all
>
> I'm struggling to configure solr cloud to put the index files and
> core.properties in the correct places in SolrCloud 5.5. Let me explain
> what I am trying to achieve:
>
> * solr is installed in /opt/solr
> * the user who runs solr only has read only access to that tree
> * the solr home files - custom libraries, log4j.properties, solr.in.sh
> and solr.xml - live in /data/project/solr/releases/, which
> is then the target of a symlink /data/project/solr/releases/current
> * releasing a new version of the solr home (eg adding/changing
> libraries, changing logging options) is done by checking out a fresh
> copy of the solr home, switching the symlink and restarting solr
> * the solr core.properties and any data live in /data/project/indexes,
> so they are preserved when new solr home is released
>
> Setting core specific dataDir with absolute paths in solrconfig.xml
> only gets me part of the way, as the core.properties for each shard is
> created inside the solr home.
>
> This is obviously no good, as when releasing a new version of the solr
> home, they will no longer be in the current solr home.
>
> Cheers
>
> Tom


mergeFactor/maxMergeDocs is deprecated

2016-03-03 Thread Tom Evans
Hi all

Updating to Solr 5.5.0, and getting these messages in our error log:

Beginning with Solr 5.5, <mergeFactor> is deprecated, configure it on
the relevant <mergePolicyFactory> instead.

Beginning with Solr 5.5, <maxMergeDocs> is deprecated, configure it on
the relevant <mergePolicyFactory> instead.

However, mergeFactor is only mentioned in a commented-out section of
our solrconfig.xml files, and maxMergeDocs is not mentioned at all.

> $ ack -B 1 -A 1 '<mergeFactor>'
> 211-  <!-- <mergeFactor>10</mergeFactor>
212-  -->

> $ ack --all maxMergeDocs
> $
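For reference, the syntax that warning points at looks roughly like this in
solrconfig.xml - a sketch assuming the TieredMergePolicyFactory introduced in
5.5, with illustrative values:

  <indexConfig>
    <mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory">
      <int name="maxMergeAtOnce">10</int>
      <int name="segmentsPerTier">10</int>
    </mergePolicyFactory>
  </indexConfig>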

Any ideas?

Cheers

Tom


Ping handler in SolrCloud mode

2016-03-19 Thread Tom Evans
Hi all

I have a cloud setup with 8 nodes and 3 collections, products, items
and skus. All collections have just one shard, products has 6
replicas, items has 2 replicas, skus has 8 replicas. No node has both
products and items, all nodes have skus

Some of our queries join from sku to either products or items. If the
query is directed at a node without the appropriate shard on them, we
obviously get an error, so we have separate balancers for products and
items.

The problem occurs when we attempt to query a node to see if products
or items is active on that node. The balancer (haproxy) requests the
ping handler for the appropriate collection, however all the nodes
return OK for all the collections(!)

Eg, on node01, it has replicas for products and skus, but the ping
handler for /solr/items/admin/ping returns 200!

This means that as far as the balancer is concerned, node01 is a valid
destination for item queries, and inevitably it blows up as soon as
such a query is made to it.

As I understand it, this is because the URL we are checking is for the
collection ("items") rather than a specific core
("items_shard1_replica1")

Is there a way to make the ping handler only check local shards? I
have tried with distrib=false&preferLocalShards=false, but it still
returns a 200.

The option I'm trying now is to make two ping handler for skus that
join to one of items/products, which should fail on the servers which
do not support it, but I am concerned that this is a little
heavyweight for a status check to see whether we can direct requests
at this server or not.
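For reference, the per-collection check is just a normal ping handler whose
query contains the join - roughly this sketch in solrconfig.xml (the join
fields are illustrative):

  <requestHandler name="/admin/ping-items" class="solr.PingRequestHandler">
    <lst name="invariants">
      <str name="q">{!join fromIndex=items from=item_id to=item_id}*:*</str>
    </lst>
  </requestHandler>

Since a cross-collection join can only be resolved against a local replica,
the handler fails on nodes that do not host "items", which is what the
balancer check needs.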

Cheers

Tom


Re: Ping handler in SolrCloud mode

2016-03-19 Thread Tom Evans
On Wed, Mar 16, 2016 at 2:14 PM, Tom Evans  wrote:
> Hi all
>
> [ .. ]
>
> The option I'm trying now is to make two ping handler for skus that
> join to one of items/products, which should fail on the servers which
> do not support it, but I am concerned that this is a little
> heavyweight for a status check to see whether we can direct requests
> at this server or not.

This worked, I would still be interested in a lighter-weight approach
that doesn't involve joins to see if a given collection has a shard on
this server. I suspect that might require a custom ping handler plugin
however.

Cheers

Tom


Re: Ping handler in SolrCloud mode

2016-03-19 Thread Tom Evans
On Wed, Mar 16, 2016 at 4:10 PM, Shawn Heisey  wrote:
> On 3/16/2016 8:14 AM, Tom Evans wrote:
>> The problem occurs when we attempt to query a node to see if products
>> or items is active on that node. The balancer (haproxy) requests the
>> ping handler for the appropriate collection, however all the nodes
>> return OK for all the collections(!)
>>
>> Eg, on node01, it has replicas for products and skus, but the ping
>> handler for /solr/items/admin/ping returns 200!
>
> This returns OK because as long as one replica for every shard in
> "items" is available somewhere in the cloud, you can make a request for
> "items" on that node and it will work.  Or at least it *should* work,
> and if it's not working, that's a bug.  I remember that one of the older
> 4.x versions *did* have a bug where queries for a collection would only
> work if the node actually contained shards for that collection.

Sorry, this is Solr 5.5, I should have said.

Yes, we can absolutely make a request of "items", and it will work
correctly. However, we are making requests of "skus" that join to
"products", and the query is routed to a node which has only "skus"
and "items", and the request fails because joins can only work over
local replicas.

To fix this, we now have two additional balancers:

solr: has all the nodes, all nodes are valid backends
solr-items: has all the nodes in the cluster, but nodes are only valid
backends if it has "items" and "skus" replicas.
solr-products: has all the nodes in the cluster, but nodes are only
valid backends if it has "products" and "skus" replicas

(I'm simplifying things a bit, there are another 6 collections that
are on all nodes, hence the main balancer.)

The new balancers need a cheap way of checking what nodes are valid,
and ideally I'd like that check to not involve a query with a join
clause!

Cheers

Tom


Paging and cursorMark

2016-03-22 Thread Tom Evans
Hi all

With Solr 5.5.0, we're trying to improve our paging performance. When
we are delivering results using infinite scrolling, cursorMark is
perfectly fine - one page is followed by the next. However, we also
offer traditional paging of results, and this is where it gets a
little tricky.

Say we have 10 results per page, and a user wants to jump from page 1
to page 20, and then wants to view page 21, there doesn't seem to be a
simple way to get the nextCursorMark. We can make an inefficient
request for page 20 (start=190, rows=10), but we cannot give that
request a cursorMark=* as it contains start=190.

Consequently, if the user clicks to page 21, we have to continue along
using start=200, as we have no cursorMark. The only way I can see to
get a cursorMark at that point is to omit the start=200, and instead
say rows=210, and ignore the first 200 results on the client side.
Obviously, this gets more and more inefficient the deeper we page - I
know that internally to Solr, using start=200&rows=10 has to do the
same work as rows=210, but less data is sent over the wire to the
client.

As I understand it, the cursorMark is a hash of the sort values of the
last document returned, so I don't really see why it is forbidden to
specify start=190&rows=10&cursorMark=* - why is it not possible to
calculate the nextCursorMark from the last document returned?

I was also thinking a possible temporary workaround would be to
request start=190&rows=10, note the last document returned, and then
make a subsequent query for q=id:""&rows=1&cursorMark=*.
This seems to work, but means an extra Solr query for no real reason.
Is there any other problem to doing this?
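For reference, the two requests in that workaround would look roughly like
this (a sketch; the sort fields are illustrative, and the sort must be
identical in both requests and end on the uniqueKey):

  q=*:*&sort=price asc,id asc&start=190&rows=10

  q=id:"<last id from the previous response>"&sort=price asc,id asc&rows=1&cursorMark=*

The nextCursorMark from the second response can then be used with the
original query to fetch page 21.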

Is there some other simple trick I am missing that we can use to get
both the page of results we want and a nextCursorMark for the
subsequent page?

Cheers

Tom


Re: Re: Paging and cursorMark

2016-03-23 Thread Tom Evans
On Wed, Mar 23, 2016 at 12:21 PM, Vanlerberghe, Luc
 wrote:
> I worked on something similar a couple of years ago, but didn’t continue work 
> on it in the end.
>
> I've included the text of my original mail.
> If you're interested, I could try to find the sources I was working on at the 
> time
>
> Luc
>

Thanks both Luc and Steve. I'm not sure if we will have time to deploy
patched versions of things to production - time is always the enemy :(
, and we're not a Java shop so there is non trivial time investment in
just building replacement jars, let alone getting that integrated in
to our RPMs - but I'll definitely try it out on my dev server.

The change seems excessively complex imo, but maybe I'm not seeing the
use cases for skip.

To my mind, calculating a nextCursorMark is cheap and only relies on
having a strict sort ordering, which is also cheap to check. If that
condition is met, you should get a nextCursorMark in your response
regardless of whether you specified a cursorMark in the request, to
allow you to efficiently get the next page.

This would still leave slightly pathological performance if you skip
to page N, and then iterate back to page 0, which Luc's idea of a
previousCursorMark can solve. cursorMark is easy to implement, you can
ignore docs which sort lower than that mark. Can you do something similar
with previousCursorMark? Would it not require keeping a buffer of rows
documents, and stopping when a document that sorts higher than the
supplied mark appears? It seems more complex, but maybe I'm not
understanding the internals correctly.

Fortunately for us, 90% of our users prefer infinite scroll, and 97%
of them never go beyond page 3.

Cheers

Tom


Re: Creating new cluster with existing config in zookeeper

2016-03-23 Thread Tom Evans
On Wed, Mar 23, 2016 at 3:43 PM, Robert Brown  wrote:
> So I setup a new solr server to point to my existing ZK configs.
>
> When going to the admin UI on this new server I can see the shards/replica's
> of the existing collection, and can even query it, even tho this new server
> has no cores on it itself.
>
> Is this all expected behaviour?
>
> Is there any performance gain with what I have at this precise stage?  The
> extra server certainly makes it appear i could balance more load/requests,
> but I guess the queries are just being forwarded on to the servers with the
> actual data?
>
> Am I correct in thinking I can now create a new collection on this host, and
> begin to build up a new cluster?  and they won't interfere with each other
> at all?
>
> Also, that I'll be able to see both collections when using the admin UI
> Cloud page on any of the servers in either collection?
>

I'm confused slightly:

SolrCloud is a (singular) cluster of servers, storing all of its state
and configuration underneath a single zookeeper path. The cluster
contains collections. Collections are tied to a particular config set
within the cluster. Collections are made up of 1 or more shards. Each
shard is a core, and there are 1 or more replicas of each core.

You can add more servers to the cluster, and then create a new
collection with the same config as an existing collection, but it is
still part of the same cluster. Of course, you could think of a set of
servers within a cluster as a "logical" cluster if it just serves
particular collection, but "cluster" to me would be all of the servers
within the same zookeeper tree, because that is where cluster state is
maintained.
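For reference, creating such a collection against the shared zookeeper is a
single Collections API call (a sketch; names, counts and node addresses are
illustrative):

  /solr/admin/collections?action=CREATE&name=newcollection
      &numShards=1&replicationFactor=2
      &collection.configName=existing_config
      &createNodeSet=newhost1:8983_solr,newhost2:8983_solr

The createNodeSet parameter pins the new collection's replicas to the newly
added servers, so the two collections share zookeeper and config but not nodes.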

Cheers

Tom


SolrCloud no leader for collection

2016-04-05 Thread Tom Evans
Hi all, I have an 8 node SolrCloud 5.5 cluster with 11 collections,
most of them in a 1 shard x 8 replicas configuration. We have 5 ZK
nodes.

During the night, we attempted to reindex one of the larger
collections. We reindex by pushing json docs to the update handler
from a number of processes. It seemed this overwhelmed the servers,
and caused all of the collections to fail and end up in either a down
or a recovering state, often with no leader.

Restarting and rebooting the servers brought a lot of the collections
back online, but we are left with a few collections for which all the
nodes hosting those replicas are up, but the replica reports as either
"active" or "down", and with no leader.

Trying to force a leader election has no effect, it keeps choosing a
leader that is in "down" state. Removing all the nodes that are in
"down" state and forcing a leader election also has no effect.


Any ideas? The only viable option I see is to create a new collection,
index it and then remove the old collection and alias it in.
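For reference, the final "alias it in" step is a single Collections API call
(a sketch; names are illustrative), issued once the old collection has been
dropped:

  /solr/admin/collections?action=CREATEALIAS&name=products&collections=products_rebuilt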

Cheers

Tom


Anticipated Solr 5.5.1 release date

2016-04-15 Thread Tom Evans
Hi all

We're currently using Solr 5.5.0 and converting our regular old style
facets into JSON facets, and are running in to SOLR-8155 and
SOLR-8835. I can see these have already been back-ported to 5.5.x
branch, does anyone know when 5.5.1 may be released?

We don't particularly want to move to Solr 6, as we have only just
finished validating 5.5.0 with our original queries!

Cheers

Tom


Re: Anticipated Solr 5.5.1 release date

2016-04-15 Thread Tom Evans
Awesome, thanks :)

On Fri, Apr 15, 2016 at 4:19 PM, Anshum Gupta  wrote:
> Hi Tom,
>
> I plan on getting a release candidate out for vote by Monday. If all goes
> well, it'd be about a week from then for the official release.
>
> On Fri, Apr 15, 2016 at 6:52 AM, Tom Evans  wrote:
>
>> Hi all
>>
>> We're currently using Solr 5.5.0 and converting our regular old style
>> facets into JSON facets, and are running in to SOLR-8155 and
>> SOLR-8835. I can see these have already been back-ported to 5.5.x
>> branch, does anyone know when 5.5.1 may be released?
>>
>> We don't particularly want to move to Solr 6, as we have only just
>> finished validating 5.5.0 with our original queries!
>>
>> Cheers
>>
>> Tom
>>
>
>
>
> --
> Anshum Gupta


Re: Verifying - SOLR Cloud replaces load balancer?

2016-04-18 Thread Tom Evans
On Mon, Apr 18, 2016 at 3:52 PM, John Bickerstaff
 wrote:
> Thanks all - very helpful.
>
> @Shawn - your reply implies that even if I'm hitting the URL for a single
> endpoint via HTTP - the "balancing" will still occur across the Solr Cloud
> (I understand the caveat about that single endpoint being a potential point
> of failure).  I just want to verify that I'm interpreting your response
> correctly...
>
> (I have been asked to provide IT with a comprehensive list of options prior
> to a design discussion - which is why I'm trying to get clear about the
> various options)
>
> In a nutshell, I think I understand the following:
>
> a. Even if hitting a single URL, the Solr Cloud will "balance" across all
> available nodes for searching
>   Caveat: That single URL represents a potential single point of
> failure and this should be taken into account
>
> b. SolrJ's CloudSolrClient API provides the ability to distribute load --
> based on Zookeeper's "knowledge" of all available Solr instances.
>   Note: This is more robust than "a" due to the fact that it
> eliminates the "single point of failure"
>
> c.  Use of a load balancer hitting all known Solr instances will be fine -
> although the search requests may not run on the Solr instance the load
> balancer targeted - due to "a" above.
>
> Corrections or refinements welcomed...

With option a), although queries will be distributed across the
cluster, all queries will be going through that single node. Not only
is that a single point of failure, but you risk saturating the
inter-node network traffic, possibly resulting in lower QPS and higher
latency on your queries.

With option b), as well as SolrJ, recent versions of pysolr have a
ZK-aware SolrCloud client that behaves in a similar way.

With option c), you can use the preferLocalShards so that shards that
are local to the queried node are used in preference to distributed
shards. Depending on your shard/cluster topology, this can increase
performance if you are returning large amounts of data - many or large
fields or many documents.

Cheers

Tom


Re: Indexing 700 docs per second

2016-04-19 Thread Tom Evans
On Tue, Apr 19, 2016 at 10:25 AM, Mark Robinson  wrote:
> Hi,
>
> I have a requirement to index (mainly updation) 700 docs per second.
> Suppose I have a 128GB RAM, 32 CPU machine, with each doc size around 260
> byes (6 fields out of which only 2 will undergo updation at the above
> rate). This collection has around 122Million docs and that count is pretty
> much a constant.
>
> 1. Can I manage this updation rate with a non-sharded ie single Solr
> instance set up?
> 2. Also is atomic update or a full update (the whole doc) of the changed
> records the better approach in this case.
>
> Could some one please share their views/ experience?

Try it and see - everyone's data/schemas are different and can affect
indexing speed. It certainly sounds achievable enough - presumably you
can at least produce the documents at that rate?

Cheers

Tom


User Authentication

2015-08-24 Thread LeZotte, Tom
Hi Solr Community

I have been trying to add user authentication to our Solr 5.3.1 RedHat install. 
I’ve found some examples on user authentication on the Jetty side. But they 
have failed.

Does any one have a step by step example on authentication for the admin 
screen? And a core?


Thanks

Tom LeZotte
Health I.T. - Senior Product Developer
(p) 615-875-8830








Re: User Authentication

2015-08-24 Thread LeZotte, Tom
Alex
I got a super secret release of Solr 5.3.1, wasn't supposed to say anything.

Yes I’m running 5.2.1, I will check out the release notes for 5.3.

Was looking for three types of user authentication, I guess.
1. the Admin Console
2. User auth for each Core ( and select and update) on a server.
3. HTML interface access (example: 
ajax-solr<https://github.com/evolvingweb/ajax-solr>)

Thanks

Tom LeZotte
Health I.T. - Senior Product Developer
(p) 615-875-8830






On Aug 24, 2015, at 10:05 AM, Alexandre Rafalovitch <arafa...@gmail.com> wrote:

Thanks for the email from the future. It is good to start to prepare
for 5.3.1 now that 5.3 is nearly out.

Joking aside (and assuming Solr 5.2.1), what exactly are you trying to
achieve? Solr should not actually be exposed to the users directly. It
should be hiding in a backend only visible to your middleware. If you
are looking for a HTML interface that talks directly to Solr after
authentication, that's not the right way to set it up.

That said, some security features are being rolled out and you should
definitely check the release notes for the 5.3.

Regards,
  Alex.

Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
http://www.solr-start.com/


On 24 August 2015 at 10:01, LeZotte, Tom  wrote:
Hi Solr Community

I have been trying to add user authentication to our Solr 5.3.1 RedHat install. 
I’ve found some examples on user authentication on the Jetty side. But they 
have failed.

Does any one have a step by step example on authentication for the admin 
screen? And a core?


Thanks

Tom LeZotte
Health I.T. - Senior Product Developer
(p) 615-875-8830









Re: User Authentication

2015-08-24 Thread LeZotte, Tom
Bosco,

We use CAS for user authentication, not sure if we have Kerberos working 
anywhere. Also we are not using ZooKeeper, because we are only running one 
server currently.

thanks

Tom LeZotte
Health I.T. - Senior Product Developer
(p) 615-875-8830






On Aug 24, 2015, at 3:12 PM, Don Bosco Durai <bo...@apache.org> wrote:

Just curious, is Kerberos an option for you? If so, mostly all your 3 use
cases will addressed.

Bosco


On 8/24/15, 12:18 PM, "Steven White" <swhite4...@gmail.com> wrote:

Hi Noble,

Is everything in the link you provided applicable to Solr 5.2.1?

Thanks

Steve

On Mon, Aug 24, 2015 at 2:20 PM, Noble Paul <noble.p...@gmail.com> wrote:

did you manage to look at the reference guide?
https://cwiki.apache.org/confluence/display/solr/Securing+Solr

On Mon, Aug 24, 2015 at 9:23 PM, LeZotte, Tom
 wrote:
Alex
I got a super secret release of Solr 5.3.1, wasn't suppose to say
anything.

Yes I'm running 5.2.1, I will check out the release notes for 5.3.

Was looking for three types of user authentication, I guess.
1. the Admin Console
2. User auth for each Core ( and select and update) on a server.
3. HTML interface access (example: ajax-solr<
https://github.com/evolvingweb/ajax-solr>)

Thanks

Tom LeZotte
Health I.T. - Senior Product Developer
(p) 615-875-8830






On Aug 24, 2015, at 10:05 AM, Alexandre Rafalovitch <arafa...@gmail.com> wrote:

Thanks for the email from the future. It is good to start to prepare
for 5.3.1 now that 5.3 is nearly out.

Joking aside (and assuming Solr 5.2.1), what exactly are you trying to
achieve? Solr should not actually be exposed to the users directly. It
should be hiding in a backend only visible to your middleware. If you
are looking for a HTML interface that talks directly to Solr after
authentication, that's not the right way to set it up.

That said, some security features are being rolled out and you should
definitely check the release notes for the 5.3.

Regards,
 Alex.

Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
http://www.solr-start.com/


On 24 August 2015 at 10:01, LeZotte, Tom <tom.lezo...@vanderbilt.edu> wrote:
Hi Solr Community

I have been trying to add user authentication to our Solr 5.3.1 RedHat
install. I've found some examples on user authentication on the Jetty
side.
But they have failed.

Does any one have a step by step example on authentication for the
admin
screen? And a core?


Thanks

Tom LeZotte
Health I.T. - Senior Product Developer
(p) 615-875-8830










--
-
Noble Paul



Solr 5.3 Faceting on Children with Block Join Parser

2015-08-31 Thread Tom Devel
Apologies for cross posting a question from SO here.

I am very interested in the new faceting on child documents feature of Solr
5.3 and would like to know if somebody has figured out how to do it as
asked in the question on
http://stackoverflow.com/questions/32212949/solr-5-3-faceting-on-children-with-block-join-parser

Thanks for any hints,
Tom

The question is:

Solr 5.3 supports faceting on nested documents [1], with a great tutorial
from Yonik [2].

In the tutorial example, the query to get the documents for faceting is
directly performed on the child documents:

$ curl http://localhost:8983/solr/demo/query -d '
q=author_s:yonik&fl=id,comment_t&
json.facet={
genres : {
type: terms,
field: cat_s,
domain: { blockParent : "type_s:book" }
}
}'

What I do not know is how to facet on child documents returned from a Block
Join Parent Query Parser [3] and provided through ExpandComponent [4].

What I have working so far is the same as in the example from the
ExpandComponent [4]: Query the child fields to return the parent documents
(see 1.), then expand the result to get the relevant child documents (see
2.)


1. q={!parent which="type_s:parent" v='text_t:solr'}

2. &expand=true&expand.field=ISBN_s&expand.q=*:*

What I need:

Having steps 1.) and 2.) already working, how can we facet on some field
(does not matter which) of the returned child documents from (2.) ?
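For reference, the plain child-domain form of the json.facet example above
would look roughly like this (a sketch; the faceted child field name is
illustrative, and it facets over all children of the matched parents rather
than only those surfaced by the ExpandComponent - which is exactly the gap
being asked about):

  q={!parent which="type_s:parent" v='text_t:solr'}&
  json.facet={
    child_values : {
      type: terms,
      field: some_child_field_s,
      domain: { blockChildren : "type_s:parent" }
    }
  }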

  [1]: http://yonik.com/solr-5-3/
  [2]: http://yonik.com/solr-nested-objects/
  [3]: https://cwiki.apache.org/confluence/display/solr/Other+Parsers
  [4]: http://heliosearch.org/expand-block-join/


tmp directory over load

2015-09-08 Thread LeZotte, Tom
HI

Solr/Tika uses the /tmp directory to process documents. At times the directory 
hits 100%. This causes alarms from Nagios for us. Is there a way in Solr/Tika 
to limit the amount of space used in /tmp? Value could be 80% or 570MB.
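For what it's worth, one knob that definitely exists is the JVM's
java.io.tmpdir system property - it relocates the temp files rather than
capping their size, but it can at least move them off a small root volume.
A sketch for solr.in.sh (path illustrative):

  SOLR_OPTS="$SOLR_OPTS -Djava.io.tmpdir=/data/solr/tmp"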

thanks

Tom LeZotte
Health I.T. - Senior Product Developer
(p) 615-875-8830








Re: org.apache.solr.common.SolrException: Document is missing mandatory uniqueKey field: id

2015-11-02 Thread Tom Evans
On Mon, Nov 2, 2015 at 1:38 PM, fabigol  wrote:
> Thank
> All works.
> I have 2 last questions:
> How can i put 0 by defaults " clean" during a indexation?
>
> To conclure, i wand to understand:
>
>
> Requests: 7 (1/s), Fetched: 452447 (45245/s), Skipped: 0, Processed: 17433
> (1743/s)
>
> What is the "requests"?
> What is 'Fetched"?
> What is "Processed"?
>
> Thank again for your answer
>

Depends upon how DIH is configured - different things return different
numbers. For a SqlEntityProcessor, "Requests" is the number of SQL
queries, "Fetched" is the number of rows read from those queries, and
"Processed" is the number of documents processed by SOLR.

> For the second question, i try:
> 
> false
> 
>
> and
> true
> false
>

Putting things in "invariants" overrides whatever is passed for that
parameter in the request parameters. By putting "false" in invariants, you are making it impossible
to clean + index as part of DIH, because "clean" is always false.
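For reference, the usual way to get clean=false as a default that can still be
overridden per request is to put it in "defaults" rather than "invariants" -
roughly this sketch (the config file name is illustrative):

  <requestHandler name="/dataimport"
                  class="org.apache.solr.handler.dataimport.DataImportHandler">
    <lst name="defaults">
      <str name="config">data-config.xml</str>
      <str name="clean">false</str>
    </lst>
  </requestHandler>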

Cheers

Tom


Best way to track cumulative GC pauses in Solr

2015-11-13 Thread Tom Evans
Hi all

We have some issues with our Solr servers spending too much time
paused doing GC. From turning on gc debug, and extracting numbers from
the GC log, we're getting an idea of just how much of a problem it is.

I'm currently doing this in a hacky, inefficient way:

grep -h 'Total time for which application threads were stopped:' solr_gc* \
| awk '($11 > 0.3) { print $1, $11 }' \
| sed 's#:.*:##' \
| sort -n \
| sum_by_date.py

(Yes, I really am using sed, grep and awk all in one line. Just wrong :)

The "sum_by_date.py" program simply adds up all the values with the
same first column, and remembers the largest value seen. This is
giving me the cumulative GC time for extended pauses (over 0.5s), and
the maximum pause seen in a given time period (hourly), eg:

2015-11-13T11 119.124037 2.203569
2015-11-13T12 184.683309 3.156565
2015-11-13T13 65.934526 1.978202
2015-11-13T14 63.970378 1.411700
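For reference, the same aggregation can be folded into the awk stage itself -
a sketch equivalent to what sum_by_date.py is described as doing, assuming the
same log format:

grep -h 'Total time for which application threads were stopped:' solr_gc* \
  | awk '($11 > 0.3) { h = substr($1, 1, 13); sum[h] += $11; if ($11 > max[h]) max[h] = $11 }
         END { for (h in sum) print h, sum[h], max[h] }' \
  | sort -n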


This is fine for seeing that we have a problem. However, really I need
to get this in to our monitoring systems - we use munin. I'm
struggling to work out the best way to extract this information for
our monitoring systems, and I think this might be my naivety about
Java, and working out what should be logged.

I've turned on JMX debugging, and looking at the different beans
available using jconsole, but I'm drowning in information. What would
be the best thing to monitor?

Ideally, like the stats above, I'd like to know the cumulative time
spent paused in GC since the last poll, and the longest GC pause that
we see. munin polls every 5 minutes, are there suitable counters
exposed by JMX that it could extract?

Thanks in advance

Tom


Re: Best way to track cumulative GC pauses in Solr

2015-11-16 Thread Tom Evans
On Fri, Nov 13, 2015 at 4:50 PM, Walter Underwood  wrote:
> Also, what GC settings are you using? We may be able to make some suggestions.
>
> Cumulative GC pauses aren’t very interesting to me. I’m more interested in 
> the longest ones, 90th percentile, 95th, etc.
>

Any advice would be great, but what I'm primarily interested in is how
people are monitoring these statistics in real time, for all time, on
production servers. Eg, for looking at the disk or RAM usage of one of
my servers, I can look at the historical usage in the last week, last
month, last year and so on.

I need to get these stats in to the same monitoring tools as we use
for monitoring every other vital aspect of our servers. Looking at log
files can be useful, but I don't want to keep arbitrarily large log
files on our servers, nor extract data from them, I want to record it
for posterity in one system that understands sampling.

We already use and maintain our own munin systems, so I'm not
interested in paid-for equivalents of munin - regardless of how simple
to set up they are, they don't integrate with our other performance
monitoring stats, and I would never get budget anyway.

So really:

1) Is it OK to turn JMX monitoring on on production systems? The
comments in solr.in.sh suggest not.

2) What JMX beans and attributes should I be using to monitor GC
pauses, particularly maximum length of a single pause in a period, and
the total length of pauses in that period?
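For what it's worth, the standard JVM GarbageCollector MXBeans only expose
cumulative figures - a collection count and a total collection time in
milliseconds per collector, with no per-pause maximum. With Solr's default
ParNew/CMS GC tuning the beans would be roughly:

  java.lang:type=GarbageCollector,name=ParNew               -> CollectionCount, CollectionTime
  java.lang:type=GarbageCollector,name=ConcurrentMarkSweep  -> CollectionCount, CollectionTime

A munin plugin can graph the delta of CollectionTime between polls; the
longest individual pause still has to come from the GC log.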

Cheers

Tom


Re: Defining SOLR nested fields

2015-12-14 Thread Tom Evans
On Sun, Dec 13, 2015 at 6:40 PM, santosh sidnal
 wrote:
> Hi All,
>
> I want to define nested fileds in SOLR using schema.xml. we are using Apache
> Solr 4.7.0.
>
> i see some links which says how to do, but not sure how can i do it in
> schema.xml
> https://cwiki.apache.org/confluence/display/solr/Other+Parsers#OtherParsers-BlockJoinQueryParsers
>
>
> any help over here is appreciable.
>

With nested documents, it is better to not think of them as
"children", but as related documents. All the documents in your index
will follow exactly the same schema, whether they are "children" or
"parents", and the nested aspect of a a document simply allows you to
restrict your queries based upon that relationship.

Solr is extremely efficient dealing with sparse documents (docs with
only a few fields defined), so one way is to define all your fields
for "parent" and "child" in the schema, and only use the appropriate
ones in the right document. Another way is to use a schema-less
structure, although I'm not a fan of that for error checking reasons.
You can also define a suffix or prefix for fields that you use as part
of your methodology, so that you know what domain it belongs in, but
that would just be for your benefit, Solr would not complain if you
put a "child" field in a parent or vice-versa.

Cheers

Tom

PS:

I would not use Solr 4.7 for this. Nested docs are a new-ish feature,
you may encounter bugs that have been fixed in later versions, and
performance has certainly been improved in later versions. Faceting on
a specific domain (eg, on children or parents) is only supported by
the JSON facet API, which was added in 5.2, and the current stable
version of Solr is 5.4.


Moving to SolrCloud, specifying dataDir correctly

2015-12-14 Thread Tom Evans
Hi all

We're currently in the process of migrating our distributed search
running on 5.0 to SolrCloud running on 5.4, and setting up a test
cluster for performance testing etc.

We have several cores/collections, and in each core's solrconfig.xml,
we were specifying an empty <dataDir/>, and specifying the same
core.baseDataDir in core.properties.

When I tried this in SolrCloud mode, specifying
"-Dsolr.data.dir=/mnt/solr/" when starting each node, it worked fine
for the first collection, but then the second collection tried to use
the same directory to store its index, which obviously failed. I fixed
this by changing solrconfig.xml in each collection to specify a
specific directory, like so:

  <dataDir>${solr.data.dir:}products</dataDir>

Looking back after the weekend, I'm not a big fan of this. Is there a
way to add a core.properties to ZK, or a way to specify
core.baseDatadir on the command line, or just a better way of handling
this that I'm not aware of?

Cheers

Tom


Re: Moving to SolrCloud, specifying dataDir correctly

2015-12-14 Thread Tom Evans
On Mon, Dec 14, 2015 at 1:22 PM, Shawn Heisey  wrote:
> On 12/14/2015 10:49 AM, Tom Evans wrote:
>> When I tried this in SolrCloud mode, specifying
>> "-Dsolr.data.dir=/mnt/solr/" when starting each node, it worked fine
>> for the first collection, but then the second collection tried to use
>> the same directory to store its index, which obviously failed. I fixed
>> this by changing solrconfig.xml in each collection to specify a
>> specific directory, like so:
>>
>>   <dataDir>${solr.data.dir:}products</dataDir>
>>
>> Looking back after the weekend, I'm not a big fan of this. Is there a
>> way to add a core.properties to ZK, or a way to specify
>> core.baseDatadir on the command line, or just a better way of handling
>> this that I'm not aware of?
>
> Since you're running SolrCloud, just let Solr handle the dataDir, don't
> try to override it.  It will default to "data" relative to the
> instanceDir.  Each instanceDir is likely to be in the solr home.
>
> With SolrCloud, your cores will not contain a "conf" directory (unless
> you create it manually), therefore the on-disk locations will be *only*
> data, there's not really any need to have separate locations for
> instanceDir and dataDir.  All active configuration information for
> SolrCloud is in zookeeper.
>

That makes sense, but I guess I was asking the wrong question :)

We have our SSDs mounted on /data/solr, which is where our indexes
should go, but our solr install is on /opt/solr, with the default solr
home in /opt/solr/server/solr. How do we change where the indexes get
put so they end up on the fast storage?

Cheers

Tom


Search over a multiValued field

2015-03-03 Thread Tom Devel
Hi,

I am running Solr 5.0.0 and have a question about proximity search and
multiValued fields.

I am indexing xml files of the following form, with foundField being a field
defined as multiValued, of type text_en, in my schema.xml.



<doc>
  <field name="id">8</field>
  <field name="foundField">"Oranges from South California - ordered"</field>
  <field name="foundField">"Green Apples - available"</field>
  <field name="foundField">"Black Report Books - ordered"</field>
</doc>

There are several such documents, and for instance, I would like to query
all documents having in the foundField "Oranges" and "ordered". The
following proximity query takes care of it:

q=foundField:("oranges AND ordered"~2)

However, a field could have more words, and I also cannot know the
proximity of the desired query words in advance. Setting the proximity
value too high results in false positives, the following query also returns
the document (although "available" was in the entry about Apples):

foundField:("oranges AND available"~200)

I do not think that tweaking a proximity value is the correct approach.

How can I search to match contents in a multiValued field per Value as
described above, without running into the problem?

Many thanks for any help


Re: Search over a multiValued field

2015-03-03 Thread Tom Devel
Jack,

This is exactly what I was looking for, thanks. I found the
positionIncrementGap attribute in the schema.xml for the text_en
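For reference, the change is just that one attribute on the field type (a
sketch, using the value of 1000 Jack suggested and leaving the analyzer chain
untouched):

  <fieldType name="text_en" class="solr.TextField" positionIncrementGap="1000">
    <!-- existing analyzer definitions unchanged -->
  </fieldType>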

I was putting in "AND" because I read in the Solr documentation that "The
OR operator is the default conjunction operator."

Does it mean that words between " symbols, such as "Orange ordered" are
treated as a single term, with (implicitly) AND conjunction between them?

Where could I find more info about this?

I am currently reading
https://cwiki.apache.org/confluence/display/solr/The+Standard+Query+Parser

Thanks again

On Tue, Mar 3, 2015 at 3:58 PM, Jack Krupansky 
wrote:

> Just set the positionIncrementGap for the multivalued field to a much
> higher value, like 1000 or 5000. That's the purpose of this attribute, to
> assure that reasonable proximity matches don't match across multiple
> values.
>
> Also, leave "AND" out of the query phrases - you're just trying to match
> the product name and availability.
>
>
> -- Jack Krupansky
>
> On Tue, Mar 3, 2015 at 4:51 PM, Tom Devel  wrote:
>
> > Hi,
> >
> > I am running Solr 5.0.0 and have a question about proximity search and
> > multiValued fields.
> >
> > I am indexing xml files of the following form with foundField being a
> field
> > defined as multiValued and text_en my in schema.xml.
> >
> > 
> > 
> > 8
> > "Oranges from South California -
> ordered"
> > "Green Apples - available"
> > "Black Report Books - ordered"
> > 
> >
> > There are several such documents, and for instance, I would like to query
> > all documents having in the foundField "Oranges" and "ordered". The
> > following proximity query takes care of it:
> >
> > q=foundField:("oranges AND ordered"~2)
> >
> > However, a field could have more words, and I also cannot know the
> > proximity of the desired query words in advance. Setting the proximity
> > value too high results in false positives, the following query also
> returns
> > the document (although "available" was in the entry about Apples):
> >
> > foundField:("oranges AND available"~200)
> >
> > I do not think that tweaking a proximity value is the correct approach.
> >
> > How can I search to match contents in a multiValued field per Value as
> > described above, without running into the problem?
> >
> > Many thanks for any help
> >
>


Re: Search over a multiValued field

2015-03-03 Thread Tom Devel
Erick,

Thanks a lot for the explanation, makes sense now.

Tom

On Tue, Mar 3, 2015 at 5:54 PM, Erick Erickson 
wrote:

> bq: Does it mean that words between " symbols, such as "Orange ordered" are
> treated as a single term, with (implicitly) AND conjunction between them?
>
> not at all. When you quote things, you're getting a "phrase query",
> perhaps one
> with slop. So something like
> "a b" means that 'a' must appear right next to 'b'. This is something
> like an AND
> in the sense that both terms must appear, but it is far more
> restrictive since it takes into
> account the position of the terms in the field.
>
> "a b"~10 means that both words must appear within 10 transpositions in
> the same field.
> You can think of "transposition" as how many intervening terms there
> are, so something
> like "a b"~2 would match docs with "a x b", but not "a x y z b".
>
> And this is where positionIncrementGap comes in. By putting 1000 in
> for it, you guarantee
> "a b"~999 won't match 'a' in one field and 'b' in another.
>
> whereas a AND b would match across successive MV entries no matter what the
> gap.
>
> HTH,
> Erick
>
> On Tue, Mar 3, 2015 at 2:22 PM, Tom Devel  wrote:
> > Jack,
> >
> > This is exactly what I was looking for, thanks. I found the
> > positionIncrementGap attribute in the schema.xml for the text_en
> >
> > I was putting in "AND" because I read in the Solr documentation that "The
> > OR operator is the default conjunction operator."
> >
> > Does it mean that words between " symbols, such as "Orange ordered" are
> > treated as a single term, with (implicitly) AND conjunction between them?
> >
> > Where could I found more info about this?
> >
> > I am currently reading
> >
> https://cwiki.apache.org/confluence/display/solr/The+Standard+Query+Parser
> >
> > Thanks again
> >
> > On Tue, Mar 3, 2015 at 3:58 PM, Jack Krupansky  >
> > wrote:
> >
> >> Just set the positionIncrementGap for the multivalued field to a much
> >> higher value, like 1000 or 5000. That's the purpose of this attribute,
> to
> >> assure that reasonable proximity matches don't match across multiple
> >> values.
> >>
> >> Also, leave "AND" out of the query phrases - you're just trying to match
> >> the product name and availability.
> >>
> >>
> >> -- Jack Krupansky
> >>
> >> On Tue, Mar 3, 2015 at 4:51 PM, Tom Devel  wrote:
> >>
> >> > Hi,
> >> >
> >> > I am running Solr 5.0.0 and have a question about proximity search and
> >> > multiValued fields.
> >> >
> >> > I am indexing xml files of the following form with foundField being a
> >> field
> >> > defined as multiValued and text_en my in schema.xml.
> >> >
> >> > 
> >> > 
> >> > 8
> >> > "Oranges from South California -
> >> ordered"
> >> > "Green Apples - available"
> >> > "Black Report Books - ordered"
> >> > 
> >> >
> >> > There are several such documents, and for instance, I would like to
> query
> >> > all documents having in the foundField "Oranges" and "ordered". The
> >> > following proximity query takes care of it:
> >> >
> >> > q=foundField:("oranges AND ordered"~2)
> >> >
> >> > However, a field could have more words, and I also cannot know the
> >> > proximity of the desired query words in advance. Setting the proximity
> >> > value too high results in false positives, the following query also
> >> returns
> >> > the document (although "available" was in the entry about Apples):
> >> >
> >> > foundField:("oranges AND available"~200)
> >> >
> >> > I do not think that tweaking a proximity value is the correct
> approach.
> >> >
> >> > How can I search to match contents in a multiValued field per Value as
> >> > described above, without running into the problem?
> >> >
> >> > Many thanks for any help
> >> >
> >>
>


Order of defining fields and dynamic fields in schema.xml

2015-03-06 Thread Tom Devel
Hi,

I am running Solr 5 using basic_configs and have a question about the
order of defining fields and dynamic fields in the schema.xml file.

For example, there is a field "hierarchy.of.fields.Project" that I am
capturing as "text_en_splitting" (as below), but the rest of the fields in
this hierarchy I would like captured as "text_en".

Since the dynamicField with * is technically spanning over the Project
field, should its definition go above, or below the Project field?

<field name="hierarchy.of.fields.Project" type="text_en_splitting"
  indexed="true"  stored="true"  multiValued="true" required="false" />
<dynamicField name="hierarchy.of.fields.*" type="text_en"
  indexed="true"  stored="true"  multiValued="true" required="false" />
Or in this case, where I have a hierarchy in which currently only one field
should be captured, "another.hierarchy.of.fields.Description", and the rest
for now should just be ignored. Is there any significance to which definition
comes first?

<dynamicField name="another.hierarchy.of.fields.*" type="text_en"
  indexed="false"  stored="false"  multiValued="true" required="false" />
<field name="another.hierarchy.of.fields.Description" type="text_en"
  indexed="true"  stored="true"  multiValued="true" required="false" />
Thanks for any hints,
Tom


Re: Order of defining fields and dynamic fields in schema.xml

2015-03-06 Thread Tom Devel
Thats good to know.

On http://wiki.apache.org/solr/SchemaXml it also states about dynamicFields
that "you can create field rules that Solr will use to understand what
datatype should be used whenever it is given a field name that is not
explicitly defined, but matches a prefix or suffix used in a dynamicField. "

Thanks

On Fri, Mar 6, 2015 at 10:43 AM, Alexandre Rafalovitch 
wrote:

> I don't believe the order in file matters for anything apart from
> initParams section. The longer - more specific one - matches first.
>
>
> Regards,
>Alex.
> 
> Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
> http://www.solr-start.com/
>
>
> On 6 March 2015 at 11:21, Tom Devel  wrote:
> > Hi,
> >
> > I am running solr 5 using basic_configs and have a questions about the
> > order of defining fields and dynamic fields in the schema.xml file?
> >
> > For example, there is a field "hierarchy.of.fields.Project" I am
> capturing
> > as below as "text_en_splitting", but the rest of the fields in this
> > hierarchy, I would like as "text_en"
> >
> > Since the dynamicField with * is technically spanning over the Project
> > field, should its definition go above, or below the Project field?
> >
> >  > indexed="true"  stored="true"  multiValued="true" required="false" />
> >  > indexed="true"  stored="true"  multiValued="true" required="false" />
> >
> >
> > Or this case, I have a hierarchy where currently only one field should be
> > captured "another.hierarchy.of.fields.Description", the rest for now
> should
> > be just ignored. Is here any significance of which definition comes
> first?
> >
> >  > indexed="false"  stored="false"  multiValued="true" required="false" />
> >  > type="text_en"indexed="true"  stored="true"  multiValued="true"
> > required="false" />
> >
> > Thanks for any hints,
> > Tom
>


Setting up SOLR 5 from an RPM

2015-03-24 Thread Tom Evans
Hi all

We're migrating to SOLR 5 (from 4.8), and our infrastructure guys
would prefer we installed SOLR from an RPM rather than extracting the
tarball where we need it. They are creating the RPM file themselves,
and it installs an init.d script and the equivalent of the tarball to
/opt/solr.

We're having problems running SOLR from the installed files, as SOLR
wants to (I think) extract the WAR file and create various temporary
files below /opt/solr/server.

We currently have this structure:

/data/solr - root directory of our solr instance
/data/solr/{logs,run} - log/run directories
/data/solr/cores - configuration for our cores and solr.in.sh
/opt/solr - the RPM installed solr 5

The user running solr can modify anything under /data/solr, but
nothing under /opt/solr.

Is this sort of configuration supported? Am I missing some variable in
our solr.in.sh that sets where temporary files can be extracted? We
currently set:

SOLR_PID_DIR=/data/solr/run
SOLR_HOME=/data/solr/cores
SOLR_LOGS_DIR=/data/solr/logs


Cheers

Tom


Re: Setting up SOLR 5 from an RPM

2015-03-25 Thread Tom Evans
On Tue, Mar 24, 2015 at 4:00 PM, Tom Evans  wrote:
> Hi all
>
> We're migrating to SOLR 5 (from 4.8), and our infrastructure guys
> would prefer we installed SOLR from an RPM rather than extracting the
> tarball where we need it. They are creating the RPM file themselves,
> and it installs an init.d script and the equivalent of the tarball to
> /opt/solr.
>
> We're having problems running SOLR from the installed files, as SOLR
> wants to (I think) extract the WAR file and create various temporary
> files below /opt/solr/server.

From the SOLR 5 reference guide, section "Managing SOLR", sub-section
"Taking SOLR to production", it seems changing the ownership of the
installed files to the user that will run SOLR is an explicit
requirement if you do not wish to run as root.

It would be better if this was not required. With most applications
you do not normally require permission to modify the installed files
in order to run the application, eg I do not need write permission to
/usr/share/vim to run vim, it is a shame I need write permission to
/opt/solr to run solr.

Cheers

Tom


Re: Setting up SOLR 5 from an RPM

2015-03-25 Thread Tom Evans
On Wed, Mar 25, 2015 at 2:40 PM, Shawn Heisey  wrote:
> I think you will only need to change the ownership of the solr home and
> the location where the .war file is extracted, which by default is
> server/solr-webapp.  The user must be able to *read* the program data,
> but should not need to write to it. If you are using the start script
> included with Solr 5 and one of the examples, I believe the logging
> destination will also be located under the solr home, but you should
> make sure that's the case.


Thanks Shawn, this sort of makes sense. The thing which I cannot seem
to do is change the location where the war file is extracted. I think
this is probably because, as of solr 5, I am not supposed to know or
be aware that there is a war file, or that the war file is hosted in
jetty, which makes it tricky to specify the jetty temporary directory.

Our use case is that we want to create a single system image that
would be usable for several projects, each project would check out its
solr home and run solr as their own user (possibly on the same
server). Eg, /data/projectA being a solr home for one project,
/data/projectB being a solr home for another project, both running
solr from the same location.

Also, on a dev server, I want to install solr once, and each member of
my team run it from that single location. Because they cannot change
the temporary directory, and they cannot all own server/solr-webapp,
this does not work and they must each have their own copy of the solr
install.

I think the way we will go for this is in production to run all our
solr instances as the "solr" user, who will own the files in /opt/solr,
and have their solr home directory wherever they choose. In dev, we
will just do something...
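
In production, roughly what we have in mind is something like this (a
sketch only - it assumes bin/solr's -s and -p options; the paths and
ports are just examples):

# every instance runs as the "solr" user, which owns /opt/solr,
# each with its own solr home and port
sudo -u solr /opt/solr/bin/solr start -s /data/projectA -p 8983
sudo -u solr /opt/solr/bin/solr start -s /data/projectB -p 8984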

Cheers

Tom


Confusing SOLR 5 memory usage

2015-04-21 Thread Tom Evans
Hi all

I have two SOLR 5 servers, one is the master and one is the slave.
They both have 12 cores, fully replicated and giving identical results
when querying them. The only difference between configuration on the
two servers is that one is set to slave from the other - identical
core configs and solr.in.sh.

They both run on identical VMs with 16GB of RAM. In solr.in.sh, we are
setting the heap size identically:

SOLR_JAVA_MEM="-Xms512m -Xmx7168m"

The two servers are balanced behind haproxy, and identical numbers and
types of queries flow to both servers. Indexing only happens once a
day.

When viewing the memory usage of the servers, the master server's JVM
has 8.8GB RSS, but the slave only has 1.2GB RSS.

Can someone hit me with the cluebat please? :)

Cheers

Tom


Re: Confusing SOLR 5 memory usage

2015-04-21 Thread Tom Evans
We monitor them with munin, so I have charts if attachments are
acceptable? Having said that, they have only been running for a day
with this memory allocation..

Describing them, the master consistently has 8GB used for apps and
8GB used in cache, whilst the slave consistently only uses ~1.5GB for
apps, with 14GB used in cache.

We are trying to use our SOLR servers to do a lot more facet queries,
previously we were mainly doing searches, and the
SolrPerformanceProblems wiki page mentions that faceting (amongst
other things) requires a lot of JVM heap, so I'm confused why it is not using
the heap we've allocated on one server, whilst it is on the other
server. Perhaps our master server needs even more heap?

Also, my infra guy is wondering why I asked him to add more memory to
the slave server, if it is "just" in cache, although I did try to
explain that ideally, I'd have even more in cache - we have about 35GB
of index data.

Cheers

Tom

On Tue, Apr 21, 2015 at 11:25 AM, Markus Jelsma
 wrote:
> Hi - what do you see if you monitor memory over time? You should see a 
> typical saw tooth.
> Markus
>
> -Original message-
>> From:Tom Evans 
>> Sent: Tuesday 21st April 2015 12:22
>> To: solr-user@lucene.apache.org
>> Subject: Confusing SOLR 5 memory usage
>>
>> Hi all
>>
>> I have two SOLR 5 servers, one is the master and one is the slave.
>> They both have 12 cores, fully replicated and giving identical results
>> when querying them. The only difference between configuration on the
>> two servers is that one is set to slave from the other - identical
>> core configs and solr.in.sh.
>>
>> They both run on identical VMs with 16GB of RAM. In solr.in.sh, we are
>> setting the heap size identically:
>>
>> SOLR_JAVA_MEM="-Xms512m -Xmx7168m"
>>
>> The two servers are balanced behind haproxy, and identical numbers and
>> types of queries flow to both servers. Indexing only happens once a
>> day.
>>
>> When viewing the memory usage of the servers, the master server's JVM
>> has 8.8GB RSS, but the slave only has 1.2GB RSS.
>>
>> Can someone hit me with the cluebat please? :)
>>
>> Cheers
>>
>> Tom
>>


Re: Confusing SOLR 5 memory usage

2015-04-21 Thread Tom Evans
I do apologise for wasting anyone's time on this - it was PEBKAC (my
keyboard and chair, unfortunately). When adding the new server to
haproxy, I updated the label for the balancer entry to the new server,
but left the host name the same, so the server that wasn't using any
RAM... wasn't getting any requests.

Again, sorry!

Tom

On Tue, Apr 21, 2015 at 11:54 AM, Tom Evans  wrote:
> We monitor them with munin, so I have charts if attachments are
> acceptable? Having said that, they have only been running for a day
> with this memory allocation..
>
> Describing them, the master consistently has 8GB used for apps, the
> 8GB used in cache, whilst the slave consistently only uses ~1.5GB for
> apps, 14GB used in cache.
>
> We are trying to use our SOLR servers to do a lot more facet queries,
> previously we were mainly doing searches, and the
> SolrPerformanceProblems wiki page mentions that faceting (amongst
> others) require a lot of JVM heap, so I'm confused why it is not using
> the heap we've allocated on one server, whilst it is on the other
> server. Perhaps our master server needs even more heap?
>
> Also, my infra guy is wondering why I asked him to add more memory to
> the slave server, if it is "just" in cache, although I did try to
> explain that ideally, I'd have even more in cache - we have about 35GB
> of index data.
>
> Cheers
>
> Tom
>
> On Tue, Apr 21, 2015 at 11:25 AM, Markus Jelsma
>  wrote:
>> Hi - what do you see if you monitor memory over time? You should see a 
>> typical saw tooth.
>> Markus
>>
>> -Original message-
>>> From:Tom Evans 
>>> Sent: Tuesday 21st April 2015 12:22
>>> To: solr-user@lucene.apache.org
>>> Subject: Confusing SOLR 5 memory usage
>>>
>>> Hi all
>>>
>>> I have two SOLR 5 servers, one is the master and one is the slave.
>>> They both have 12 cores, fully replicated and giving identical results
>>> when querying them. The only difference between configuration on the
>>> two servers is that one is set to slave from the other - identical
>>> core configs and solr.in.sh.
>>>
>>> They both run on identical VMs with 16GB of RAM. In solr.in.sh, we are
>>> setting the heap size identically:
>>>
>>> SOLR_JAVA_MEM="-Xms512m -Xmx7168m"
>>>
>>> The two servers are balanced behind haproxy, and identical numbers and
>>> types of queries flow to both servers. Indexing only happens once a
>>> day.
>>>
>>> When viewing the memory usage of the servers, the master server's JVM
>>> has 8.8GB RSS, but the slave only has 1.2GB RSS.
>>>
>>> Can someone hit me with the cluebat please? :)
>>>
>>> Cheers
>>>
>>> Tom
>>>


Re: Checking of Solr Memory and Disk usage

2015-04-24 Thread Tom Evans
On Fri, Apr 24, 2015 at 8:31 AM, Zheng Lin Edwin Yeo
 wrote:
> Hi,
>
> So has anyone knows what is the issue with the "Heap Memory Usage" reading
> showing the value -1. Should I open an issue in Jira?

I have solr 4.8.1 and solr 5.0.0 servers, on the solr 4.8.1 servers
the core statistics have values for heap memory, on the solr 5.0.0
ones I also see the value -1. This is with CentOS 6/Java 1.7 OpenJDK
on both versions.

I don't see this issue in the fixed bugs in 5.1.0, but I only looked
at the headlines of the tickets..

http://lucene.apache.org/solr/5_1_0/changes/Changes.html#v5.1.0.bug_fixes

Cheers

Tom


Block Join Query update documents, how to do it correctly?

2015-05-13 Thread Tom Devel
I am using the Block Join Query Parser with success, following the example
on:

https://cwiki.apache.org/confluence/display/solr/Other+Parsers#OtherParsers-BlockJoinQueryParsers

As this example shows, each parent document can have a number of documents
embedded, and each document, be it a parent or a child, has its own unique
identifier.

Now I would like to update some of the parent documents, and read that
there are horror stories with duplicate documents, scrambled data etc., the
two prominent JIRA entries for this are:

https://issues.apache.org/jira/browse/SOLR-6700
https://issues.apache.org/jira/browse/SOLR-6096

My question is, how do you usually update such documents, for example to
update a value for the parent or a value for one of its children?

I tried to repost the whole modified document (the parent and ALL of its
children as one file), and it seems to work on a small toy example, but of
course I cannot be sure for a larger instance with thousands of documents,
and I would like to know if this is the correct way to go or not.

To make it clear, suppose originally I used bin/solr post on the
following file:



<add>
  <doc>
    <field name="id">1</field>
    <field name="title">Solr has block join support</field>
    <field name="content_type">parentDocument</field>
    <doc>
      <field name="id">2</field>
      <field name="comments">SolrCloud supports it too!</field>
    </doc>
  </doc>
</add>

Now I could do bin/solr post on a file:



<add>
  <doc>
    <field name="id">1</field>
    <field name="title">Updated field: Solr has block join support</field>
    <field name="content_type">parentDocument</field>
    <doc>
      <field name="id">2</field>
      <field name="comments">Updated field: SolrCloud supports it too!</field>
    </doc>
  </doc>
</add>


Will this avoid these inconsistent and scrambled or duplicate data on Solr
instances as discussed in the JIRAs? How do you usually do this?
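
For reference, the repost itself would be something along these lines (a
sketch, assuming the stock post tool shipped with 5.x and a collection
named "gettingstarted"; the file name is just an example):

bin/post -c gettingstarted updated_block.xml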

Thanks for any help or hints.

Tom


Solr

2015-06-06 Thread Tom Running
Hello,

I have customized my Solr results so that they display only 3 fields: the
document ID, name and last_modified date. The results are in JSON.

This is a sample of my Javascript function to execute the query:

var query = "";

//set user input to query
query = window.document.box.input.value;

//solr URL
var sol = "
http://localhost:8983/solr/gettingstarted_shard1_replica1/select?q=";;
var sol2 =
"&wt=json&fl=title,id,category,last_modified&rows=1000&indent=true";

//redirect
window.location.href = sol+query+sol2;
//

The output example would look like:

{
"id":"/solr/docs/ISO/Employee Benefits Information/BCN.doc",
"title":["BCN Auto Policy Verbiage:"],
"last_modified":["2014-01-07T15:19:00Z"]},



I want to format my Solr results so that the document ID will be displayed
as a link that users can click on and load the BCN.doc file.

Any tips on how to do this? I am stuck.

All help is appreciated!

Thanks,

T


changing web context and port for SolrCloud Zookeeper

2016-05-11 Thread Tom Gullo
I need to change the web context and the port for a SolrCloud installation.

Example, change:

host:8080/some-api-here/

to this:

host:8983/solr/

Does anyone know how to do this with SolrCloud?  There are values stored in 
clusterstate.json and /leader/elect and I could change them but 
that seems a little messy.

Thanks

Re: changing web context and port for SolrCloud Zookeeper

2016-05-11 Thread Tom Gullo
My Solr installation is running on Tomcat on port 8080 with a  web context name 
that is different than /solr.   We want to move to a basic jetty setup with all 
the defaults.  I haven’t found a clean way to do this.  A lot of the values 
like baseurl and /leader/elect/shard1 have values that need to be updated.  If 
I try shutting down the servers, change the zookeeper settings and then restart 
Solr in Jetty I get issues - like Solr thinks they are replicas.   So I’m 
looking to see if anyone knows what is the cleanest way to move from a 
Tomcat/8080 install to a Jetty/8983 one.

Thanks

> On May 11, 2016, at 1:59 PM, John Bickerstaff  
> wrote:
> 
> I may be answering the wrong question - but SolrCloud goes in by default on
> 8983, yes?  Is yours currently on 8080?
> 
> I don't recall where, but I think I saw a config file setting for the port
> number (In Solr I mean)
> 
> Am I on the right track or are you asking something other than how to get
> Solr on host:8983/solr ?
> 
> On Wed, May 11, 2016 at 11:56 AM, Tom Gullo  wrote:
> 
>> I need to change the web context and the port for a SolrCloud installation.
>> 
>> Example, change:
>> 
>> host:8080/some-api-here/
>> 
>> to this:
>> 
>> host:8983/solr/
>> 
>> Does anyone know how to do this with SolrCloud?  There are values stored
>> in clusterstate.json and /leader/elect and I could change them
>> but that seems a little messy.
>> 
>> Thanks



Re: changing web context and port for SolrCloud Zookeeper

2016-05-11 Thread Tom Gullo
That helps.  I ended up updating the solr.in.sh file in /etc/default and that
was getting picked up.  Thanks
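
For the record, the change was roughly this (a sketch - the host name is
an example and assumes the stock solr.in.sh variables):

# /etc/default/solr.in.sh
SOLR_PORT=8983
SOLR_HOST=solr01.example.com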

> On May 11, 2016, at 2:05 PM, Tom Gullo  wrote:
> 
> My Solr installation is running on Tomcat on port 8080 with a  web context 
> name that is different than /solr.   We want to move to a basic jetty setup 
> with all the defaults.  I haven’t found a clean way to do this.  A lot of the 
> values like baseurl and /leader/elect/shard1 have values that need to be 
> updated.  If I try shutting down the servers, change the zookeeper settings 
> and then restart Solr in Jetty I get issues - like Solr thinks they are 
> replicas.   So I’m looking to see if anyone knows what is the cleanest way to 
> move from a Tomcat/8080 install to a Jetty/8983 one.
> 
> Thanks
> 
>> On May 11, 2016, at 1:59 PM, John Bickerstaff  
>> wrote:
>> 
>> I may be answering the wrong question - but SolrCloud goes in by default on
>> 8983, yes?  Is yours currently on 8080?
>> 
>> I don't recall where, but I think I saw a config file setting for the port
>> number (In Solr I mean)
>> 
>> Am I on the right track or are you asking something other than how to get
>> Solr on host:8983/solr ?
>> 
>> On Wed, May 11, 2016 at 11:56 AM, Tom Gullo  wrote:
>> 
>>> I need to change the web context and the port for a SolrCloud installation.
>>> 
>>> Example, change:
>>> 
>>> host:8080/some-api-here/
>>> 
>>> to this:
>>> 
>>> host:8983/solr/
>>> 
>>> Does anyone know how to do this with SolrCloud?  There are values stored
>>> in clusterstate.json and /leader/elect and I could change them
>>> but that seems a little messy.
>>> 
>>> Thanks
> 



Re: Creating a collection with 1 shard gives a weird range

2016-05-17 Thread Tom Evans
On Tue, May 17, 2016 at 9:40 AM, John Smith  wrote:
> I'm trying to create a collection starting with only one shard
> (numShards=1) using a compositeID router. The purpose is to start small
> and begin splitting shards when the index grows larger. The shard
> created gets a weird range value: 80000000-7fffffff, which doesn't look
> effective. Indeed, if I try to import some documents using a DIH, none
> gets added.
>
> If I create the same collection with 2 shards, the ranges seem more
> logical (0-7fffffff & 80000000-ffffffff). In this case documents are
> indexed correctly.
>
> Is this behavior by design, i.e. is a minimum of 2 shards required? If
> not, how can I create a working collection with a single shard?
>
> This is Solr-6.0.0 in cloud mode with zookeeper-3.4.8.
>

I believe this is as designed, see this email from Shawn:

https://mail-archives.apache.org/mod_mbox/lucene-solr-user/201604.mbox/%3c570d0a03.5010...@elyograg.org%3E
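
As I understand it, 80000000-7fffffff is simply the full 32-bit hash
range written for a single shard (it wraps around), so the range itself
should not be the problem. A quick way to double-check what ranges a
collection ended up with (the collection name is just an example):

curl "http://localhost:8983/solr/admin/collections?action=CLUSTERSTATUS&collection=mycollection&wt=json"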

Cheers

Tom


Re: SolrCloud increase replication factor

2016-05-23 Thread Tom Evans
On Mon, May 23, 2016 at 10:37 AM, Hendrik Haddorp
 wrote:
> Hi,
>
> I have a SolrCloud 6.0 setup and created my collection with a
> replication factor of 1. Now I want to increase the replication factor
> but would like the replicas for the same shard to be on different nodes,
> so that my collection does not fail when one node fails. I tried two
> approaches so far:
>
> 1) When I use the collections API with the MODIFYCOLLECTION action [1] I
> can set the replication factor but that did not result in the creation
> of additional replicas. The Solr Admin UI showed that my replication
> factor changed but otherwise nothing happened. A reload of the
> collection did also result in no change.
>
> 2) Using the ADDREPLICA action [2] from the collections API I have to
> add the replicas to the shard individually, which is a bit more
> complicated but otherwise worked. During testing this did however at
> least once result in the replica being created on the same node. My
> collection was split in 4 shards and for 2 of them all replicas ended up
> on the same node.
>
> So is the only option to create the replicas manually and also pick the
> nodes manually or is the perceived behavior wrong?
>
> regards,
> Hendrik
>
> [1]
> https://cwiki.apache.org/confluence/display/solr/Collections+API#CollectionsAPI-modifycoll
> [2]
> https://cwiki.apache.org/confluence/display/solr/Collections+API#CollectionsAPI-api_addreplica


With ADDREPLICA, you can specify the node to create the replica on. If
you are using a script to add/remove replicas, you can simply
incorporate the logic you desire into your script - you can also use
CLUSTERSTATUS to get a list of nodes/collections/shards etc. in order
to inform that logic. This is the approach we took: we have a fabric
script to add/remove extra nodes to/from the cluster, and it works
well.
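
A sketch of the kind of call the script ends up making (the collection,
shard and node names are just examples):

# create an extra replica of shard1 on a specific node
curl "http://localhost:8983/solr/admin/collections?action=ADDREPLICA&collection=mycollection&shard=shard1&node=192.168.1.12:8983_solr"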

The alternative is to put the logic into Solr itself, using what Solr
calls a "snitch" to define the rules on where replicas are created.
The snitch is specified at collection creation time, or you can use
MODIFYCOLLECTION to set it after the fact. See this wiki page for
details:

https://cwiki.apache.org/confluence/display/solr/Rule-based+Replica+Placement
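
For example, a rule that keeps at most one replica of any given shard on
a node can be supplied at creation time (a sketch based on the examples
on that page; the names and numbers are placeholders):

curl "http://localhost:8983/solr/admin/collections" -G \
  --data-urlencode "action=CREATE" \
  --data-urlencode "name=mycollection" \
  --data-urlencode "numShards=4" \
  --data-urlencode "replicationFactor=2" \
  --data-urlencode "rule=shard:*,replica:<2,node:*"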

Cheers

Tom


Re: Import html data in mysql and map schemas using onlySolrCELL+TIKA+DIH [scottchu]

2016-05-24 Thread Tom Evans
On Tue, May 24, 2016 at 3:06 PM, Scott Chu  wrote:
> p.s. There is really a lot of extensive, worthwhile stuff in Solr. If the
> project team could provide some "dictionary" of it, that would be a "Santa
> Claus" gift for us Solr users. Ha! Just a Christmas wish! Sigh! I know it's
> not really possible. I would really like to study it all, one piece after
> another, but Internet IT moves too fast to leave time to digest all of the
> great stuff in Solr.

The reference guide is both extensive and broadly informative.
Start from the top page and browse away!

https://cwiki.apache.org/confluence/display/solr/Apache+Solr+Reference+Guide

It's also handy to keep the glossary open for any terms that you don't recognise:

https://cwiki.apache.org/confluence/display/solr/Solr+Glossary

Cheers

Tom


Re: result grouping in sharded index

2016-06-15 Thread Tom Evans
Do you have to group, or can you collapse instead?

https://cwiki.apache.org/confluence/display/solr/Collapse+and+Expand+Results
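
Roughly, instead of group=true you would add something like this (a
sketch - the field and collection names are examples; note that, like
grouping, collapse generally expects documents sharing the same key to
live on the same shard, e.g. via composite-id routing):

curl "http://localhost:8983/solr/mycollection/select" \
  --data-urlencode "q=*:*" \
  --data-urlencode "fq={!collapse field=group_id}" \
  --data-urlencode "expand=true"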

Cheers

Tom

On Tue, Jun 14, 2016 at 4:57 PM, Jay Potharaju  wrote:
> Any suggestions on how to handle result grouping in sharded index?
>
>
> On Mon, Jun 13, 2016 at 1:15 PM, Jay Potharaju 
> wrote:
>
>> Hi,
>> I am working on a functionality that would require me to group documents
>> by a id field. I read that the ngroups feature would not work in a sharded
>> index.
>> Can someone recommend how to handle this in a sharded index?
>>
>>
>> Solr Version: 5.5
>>
>>
>> https://cwiki.apache.org/confluence/display/solr/Result+Grouping#ResultGrouping-DistributedResultGroupingCaveats
>>
>> --
>> Thanks
>> Jay
>>
>>
>
>
>
> --
> Thanks
> Jay Potharaju


Strange highlighting on search

2016-06-16 Thread Tom Evans
Hi all

I'm investigating a bug whereby every term in the highlighted field
gets marked for highlighting instead of just the words that match the
fulltext portion of the query. This is on Solr 5.5.0, but I didn't see
any bug fixes related to highlighting in 5.5.1 or 6.0 release notes.

The query that affects it is where we have a not clause on a specific
field (not the fulltext field) and also only include documents where
that field has a value:

q: cosmetics_packaging_fulltext:(Mist) AND ingredient_tag_id:[0 TO *]
AND -ingredient_tag_id:(35223)

This returns the correct results, but the highlighting has matched
every word in the results (see below for debugQuery output). If I
change the query to put the exclusion in to an fq, the highlighting is
correct again (and the results are correct):

q: cosmetics_packaging_fulltext:(Mist)
fq: {!cache=false} ingredient_tag_id:[0 TO *] AND -ingredient_tag_id:(35223)

Is there any way I can make the query and highlighting work as
expected as part of q?

Is there any downside to putting the exclusion part in the fq in terms
of performance? We don't use score at all for our results, we always
order by other parameters.

Cheers

Tom

Query with strange highlighting:

{
  "responseHeader":{
"status":0,
"QTime":314,
"params":{
  "q":"cosmetics_packaging_fulltext:(Mist) AND
ingredient_tag_id:[0 TO *] AND -ingredient_tag_id:(35223)",
  "hl":"true",
  "hl.simple.post":"",
  "indent":"true",
  "fl":"id,product",
  "hl.fragsize":"0",
  "hl.fl":"product",
  "rows":"5",
  "wt":"json",
  "debugQuery":"true",
  "hl.simple.pre":""}},
  "response":{"numFound":10132,"start":0,"docs":[
  {
"id":"2403841-1498608",
"product":"Mist"},
  {
"id":"2410603-1502577",
"product":"Mist"},
  {
"id":"5988531-3882415",
"product":"Ao + Mist"},
  {
"id":"6020805-3904203",
"product":"UV Mist Cushion SPF 50+ PA+++"},
  {
"id":"2617977-1629335",
"product":"Ultra Radiance Facial Re-Hydrating Mist"}]
  },
  "highlighting":{
"2403841-1498608":{
  "product":["Mist"]},
"2410603-1502577":{
  "product":["Mist"]},
"5988531-3882415":{
  "product":["Ao + Mist"]},
"6020805-3904203":{
  "product":["UV Mist Cushion
SPF 50+ PA+++"]},
"2617977-1629335":{
  "product":["Ultra Radiance Facial
Re-Hydrating Mist"]}},
  "debug":{
"rawquerystring":"cosmetics_packaging_fulltext:(Mist) AND
ingredient_tag_id:[0 TO *] AND -ingredient_tag_id:(35223)",
"querystring":"cosmetics_packaging_fulltext:(Mist) AND
ingredient_tag_id:[0 TO *] AND -ingredient_tag_id:(35223)",
"parsedquery":"+cosmetics_packaging_fulltext:mist
+ingredient_tag_id:[0 TO *] -ingredient_tag_id:35223",
"parsedquery_toString":"+cosmetics_packaging_fulltext:mist
+ingredient_tag_id:[0 TO *] -ingredient_tag_id:35223",
"explain":{
  "2403841-1498608":"\n40.082462 = sum of:\n  39.92971 =
weight(cosmetics_packaging_fulltext:mist in 13983)
[ClassicSimilarity], result of:\n39.92971 =
score(doc=13983,freq=39.0), product of:\n  0.9882648 =
queryWeight, product of:\n6.469795 = idf(docFreq=22502,
maxDocs=5342472)\n0.15275055 = queryNorm\n  40.40386 =
fieldWeight in 13983, product of:\n6.244998 = tf(freq=39.0),
with freq of:\n  39.0 = termFreq=39.0\n6.469795 =
idf(docFreq=22502, maxDocs=5342472)\n1.0 =
fieldNorm(doc=13983)\n  0.15275055 = ingredient_tag_id:[0 TO *],
product of:\n1.0 = boost\n0.15275055 = queryNorm\n",
  "2410603-1502577":"\n40.082462 = sum of:\n  39.92971 =
weight(cosmetics_packaging_fulltext:mist in 14023)
[ClassicSimilarity], result of:\n39.92971 =
score(doc=14023,freq=39.0), product of:\n  0.9882648 =
queryWeight, product of:\n6.469795 = idf(docFreq=22502,
maxDocs=5342472)\n0.15275055 = queryNorm\n  40.40386 =
fieldWeight in 14023, product of:\n6.244998 = tf(freq=39.0),
with freq of:\n  39.0 = termFreq=39.0\n6.469795 =
idf(docFreq=22502, maxDocs=5342472)\n1.0 =
fieldNorm(doc=14023)\n  0.15275055 = ingredient_tag

Node not recovering, leader elections not occurring

2016-07-19 Thread Tom Evans
Hi all - problem with a SolrCloud 5.5.0, we have a node that has most
of the collections on it marked as "Recovering" or "Recovery Failed".
It attempts to recover from the leader, but the leader responds with:

Error while trying to recover.
core=iris_shard1_replica1:java.util.concurrent.ExecutionException:
org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException:
Error from server at http://172.31.1.171:3/solr: We are not the
leader
at java.util.concurrent.FutureTask.report(FutureTask.java:122)
at java.util.concurrent.FutureTask.get(FutureTask.java:192)
at 
org.apache.solr.cloud.RecoveryStrategy.sendPrepRecoveryCmd(RecoveryStrategy.java:596)
at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:353)
at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:224)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor$1.run(ExecutorUtil.java:231)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException:
Error from server at http://172.31.1.171:3/solr: We are not the
leader
at 
org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:576)
at 
org.apache.solr.client.solrj.impl.HttpSolrClient$1.call(HttpSolrClient.java:284)
at 
org.apache.solr.client.solrj.impl.HttpSolrClient$1.call(HttpSolrClient.java:280)
... 5 more

and recovery never occurs.

Each collection in this state has plenty (10+) of active replicas, but
stopping the server that is marked as the leader doesn't trigger a
leader election amongst these replicas.

REBALANCELEADERS did nothing.
FORCELEADER complains that there is already a leader.
FORCELEADER with the purported leader stopped took 45 seconds,
reported status of "0" (and no other message) and kept the down node
as the leader (!)
Deleting the failed collection from the failed node and re-adding it
has the same "Leader said I'm not the leader" error message.
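
For reference, the calls we used were along these lines (the shard name
is an example):

curl "http://localhost:8983/solr/admin/collections?action=REBALANCELEADERS&collection=iris"
curl "http://localhost:8983/solr/admin/collections?action=FORCELEADER&collection=iris&shard=shard1"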

Any other ideas?

Cheers

Tom


Re: Node not recovering, leader elections not occurring

2016-07-19 Thread Tom Evans
There are 11 collections, each only has one shard, and each node has
10 replicas (9 collections are on every node, 2 are just on one node).
We're not seeing any OOM errors on restart.

I think we're being patient waiting for the leader election to occur.
We stopped the troublesome "leader that is not the leader" server
about 15-20 minutes ago, but we still have not had a leader election.

Cheers

Tom

On Tue, Jul 19, 2016 at 4:30 PM, Erick Erickson  wrote:
> How many replicas per Solr JVM? And do you
> see any OOM errors when you bounce a server?
> And how patient are you being, because it can
> take 3 minutes for a leaderless shard to decide
> it needs to elect a leader.
>
> See SOLR-7280 and SOLR-7191 for the case
> where lots of replicas are in the same JVM,
> the tell-tale symptom is errors in the log as you
> bring Solr up saying something like
> "OutOfMemory error unable to create native thread"
>
> SOLR-7280 has patches for 6x and 7x, with a 5x one
> being added momentarily.
>
> Best,
> Erick
>
> On Tue, Jul 19, 2016 at 7:41 AM, Tom Evans  wrote:
>> Hi all - problem with a SolrCloud 5.5.0, we have a node that has most
>> of the collections on it marked as "Recovering" or "Recovery Failed".
>> It attempts to recover from the leader, but the leader responds with:
>>
>> Error while trying to recover.
>> core=iris_shard1_replica1:java.util.concurrent.ExecutionException:
>> org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException:
>> Error from server at http://172.31.1.171:3/solr: We are not the
>> leader
>> at java.util.concurrent.FutureTask.report(FutureTask.java:122)
>> at java.util.concurrent.FutureTask.get(FutureTask.java:192)
>> at 
>> org.apache.solr.cloud.RecoveryStrategy.sendPrepRecoveryCmd(RecoveryStrategy.java:596)
>> at 
>> org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:353)
>> at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:224)
>> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>> at 
>> org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor$1.run(ExecutorUtil.java:231)
>> at 
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>> at 
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>> at java.lang.Thread.run(Thread.java:745)
>> Caused by: 
>> org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException:
>> Error from server at http://172.31.1.171:3/solr: We are not the
>> leader
>> at 
>> org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:576)
>> at 
>> org.apache.solr.client.solrj.impl.HttpSolrClient$1.call(HttpSolrClient.java:284)
>> at 
>> org.apache.solr.client.solrj.impl.HttpSolrClient$1.call(HttpSolrClient.java:280)
>> ... 5 more
>>
>> and recovery never occurs.
>>
>> Each collection in this state has plenty (10+) of active replicas, but
>> stopping the server that is marked as the leader doesn't trigger a
>> leader election amongst these replicas.
>>
>> REBALANCELEADERS did nothing.
>> FORCELEADER complains that there is already a leader.
>> FORCELEADER with the purported leader stopped took 45 seconds,
>> reported status of "0" (and no other message) and kept the down node
>> as the leader (!)
>> Deleting the failed collection from the failed node and re-adding it
>> has the same "Leader said I'm not the leader" error message.
>>
>> Any other ideas?
>>
>> Cheers
>>
>> Tom


Re: Node not recovering, leader elections not occurring

2016-07-19 Thread Tom Evans
On the nodes that have the replica in a recovering state we now see:

19-07-2016 16:18:28 ERROR RecoveryStrategy:159 - Error while trying to
recover. core=lookups_shard1_replica8:org.apache.solr.common.SolrException:
No registered leader was found after waiting for 4000ms , collection:
lookups slice: shard1
at 
org.apache.solr.common.cloud.ZkStateReader.getLeaderRetry(ZkStateReader.java:607)
at 
org.apache.solr.common.cloud.ZkStateReader.getLeaderRetry(ZkStateReader.java:593)
at 
org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:308)
at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:224)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor$1.run(ExecutorUtil.java:231)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

19-07-2016 16:18:28 INFO  RecoveryStrategy:444 - Replay not started,
or was not successful... still buffering updates.
19-07-2016 16:18:28 ERROR RecoveryStrategy:481 - Recovery failed -
trying again... (164)
19-07-2016 16:18:28 INFO  RecoveryStrategy:503 - Wait [12.0] seconds
before trying to recover again (attempt=165)


This is with the "leader that is not the leader" shut down.

Issuing a FORCELEADER via collections API doesn't in fact force a
leader election to occur.

Is there any other way to prompt Solr to have an election?

Cheers

Tom

On Tue, Jul 19, 2016 at 5:10 PM, Tom Evans  wrote:
> There are 11 collections, each only has one shard, and each node has
> 10 replicas (9 collections are on every node, 2 are just on one node).
> We're not seeing any OOM errors on restart.
>
> I think we're being patient waiting for the leader election to occur.
> We stopped the troublesome "leader that is not the leader" server
> about 15-20 minutes ago, but we still have not had a leader election.
>
> Cheers
>
> Tom
>
> On Tue, Jul 19, 2016 at 4:30 PM, Erick Erickson  
> wrote:
>> How many replicas per Solr JVM? And do you
>> see any OOM errors when you bounce a server?
>> And how patient are you being, because it can
>> take 3 minutes for a leaderless shard to decide
>> it needs to elect a leader.
>>
>> See SOLR-7280 and SOLR-7191 for the case
>> where lots of replicas are in the same JVM,
>> the tell-tale symptom is errors in the log as you
>> bring Solr up saying something like
>> "OutOfMemory error unable to create native thread"
>>
>> SOLR-7280 has patches for 6x and 7x, with a 5x one
>> being added momentarily.
>>
>> Best,
>> Erick
>>
>> On Tue, Jul 19, 2016 at 7:41 AM, Tom Evans  wrote:
>>> Hi all - problem with a SolrCloud 5.5.0, we have a node that has most
>>> of the collections on it marked as "Recovering" or "Recovery Failed".
>>> It attempts to recover from the leader, but the leader responds with:
>>>
>>> Error while trying to recover.
>>> core=iris_shard1_replica1:java.util.concurrent.ExecutionException:
>>> org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException:
>>> Error from server at http://172.31.1.171:3/solr: We are not the
>>> leader
>>> at java.util.concurrent.FutureTask.report(FutureTask.java:122)
>>> at java.util.concurrent.FutureTask.get(FutureTask.java:192)
>>> at 
>>> org.apache.solr.cloud.RecoveryStrategy.sendPrepRecoveryCmd(RecoveryStrategy.java:596)
>>> at 
>>> org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:353)
>>> at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:224)
>>> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>>> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>>> at 
>>> org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor$1.run(ExecutorUtil.java:231)
>>> at 
>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>> at 
>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>> at java.lang.Thread.run(Thread.java:745)
>>> Caused by: 
>>> org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException:
>>> Error from server at http://172.31.1.171:3/solr: We are not the
>>> leader
>>> at 
>>> org.apache.solr.client.solrj.impl.HttpSolrClient.execu

min()/max() on date fields using JSON facets

2016-07-25 Thread Tom Evans
Hi all

I'm trying to replace a use of the stats module with JSON facets in
order to calculate the min/max date range of documents in a query. For
the same search, "stats.field=date_published" returns this:

{u'date_published': {u'count': 86760,
 u'max': u'2016-07-13T00:00:00Z',
 u'mean': u'2013-12-11T07:09:17.676Z',
 u'min': u'2011-01-04T00:00:00Z',
 u'missing': 0,
 u'stddev': 50006856043.410477,
 u'sum': u'3814570-11-06T00:00:00Z',
 u'sumOfSquares': 1.670619719649826e+29}}

For the equivalent JSON facet - "{'date.max': 'max(date_published)',
'date.min': 'min(date_published)'}" - I'm returned this:

{u'count': 86760, u'date.max': 146836800.0, u'date.min': 129409920.0}

What do these numbers represent - I'm guessing it is milliseconds
since epoch? In UTC?
Is there any way to control the output format or TZ?
Is there any benefit in using JSON facets to determine this, or should
I just continue using stats?
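
For reference, the request I'm making looks roughly like this (a sketch
- the collection name is an example):

curl "http://localhost:8983/solr/mycollection/select" \
  --data-urlencode "q=*:*" \
  --data-urlencode "rows=0" \
  --data-urlencode 'json.facet={"date.min":"min(date_published)","date.max":"max(date_published)"}'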

Cheers

Tom


  1   2   3   4   5   6   >