Re: Fixing corrupted index?

2014-03-25 Thread Dmitry Kan
Oh, somehow missed that in your original e-mail. How do you run
CheckIndex? Do you pass the -fix option? [1]

You may want to try Luke [2] to open the index without opening an IndexReader,
and run the Tools->Check Index tool from Luke.

[1] http://java.dzone.com/news/lucene-and-solrs-checkindex
[2] https://github.com/DmitryKey/luke/releases
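
If you'd rather drive the same check from code, here is a minimal sketch
against the Lucene 4.x API (the index path is an assumption - adjust it to
your core's data/index directory, and back up the index first, since fixing
drops unreadable segments):

import java.io.File;
import org.apache.lucene.index.CheckIndex;
import org.apache.lucene.store.FSDirectory;

public class CheckIndexTool {
    // CLI equivalent: java -cp lucene-core.jar org.apache.lucene.index.CheckIndex <indexDir> -fix
    public static void main(String[] args) throws Exception {
        FSDirectory dir = FSDirectory.open(new File("/path/to/solr/collection1/data/index"));
        CheckIndex checker = new CheckIndex(dir);
        CheckIndex.Status status = checker.checkIndex(); // read-only diagnosis
        if (!status.clean) {
            checker.fixIndex(status); // drops broken segments; their docs are lost
        }
        dir.close();
    }
}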




On Mon, Mar 24, 2014 at 10:52 PM, zqzuk  wrote:

> Hi
> Thanks.
>
> But I am already using CheckIndex and the error is given by the CheckIndex
> utility: it could not even continue after reporting "could not read any
> segments file in directory".
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Fixing-corrupted-index-tp4126644p4126687.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>



-- 
Dmitry
Blog: http://dmitrykan.blogspot.com
Twitter: http://twitter.com/dmitrykan


intersect query

2014-03-25 Thread cmd.ares
my_index(one core):
id,dealer,productName,amount,region 
1,A1,iphone4,400,east
2,A1,iphone4s,450,east
3,A2,iphone5s,550,east
..
4,A1,iphone4,400,west
5,A1,iphone4s,450,west
6,A3,iphone5s,550,west
..


I'd like to find which dealers sell the 'iphone' in both the 'east' and the
'west'.
PL/SQL reference implementation:
1:
select dealer from my_index where region='east' and productName like
'%iphone%'
intersect
select dealer from my_index where region='west' and productName like
'%iphone%'
2:
select distinct dealer from my_index where region='east' and productName
like '%iphone%' and dealer in (
select dealer from my_index where region='west' and productName like
'%iphone%'
)

solr reference implementation:
1.query parameters:
q=region:east AND productName:iphone
&fq={!join from=dealer to=dealer}(region:west AND productName:iphone)
&facet=true&facet.field=dealer&facet.mincount=1

2.query parameters:
q=region:east AND productName:iphone({!join from=dealer
to=dealer}region:west AND productName:iphone)
&facet=true&facet.field=dealer&facet.mincount=1

With a big index the query is very slow. Is there an efficient way to
improve performance?

1. Must we use the Solr join feature, or is there another approach?
2. Can multi-core shards improve performance?
/***
as the wiki said:
In a DistributedSearch environment, you can not Join across cores on
multiple nodes. 
If however you have a custom sharding approach, you could join across cores
on the same node.
***/



--
View this message in context: 
http://lucene.472066.n3.nabble.com/intersect-query-tp4126828.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Fixing corrupted index?

2014-03-25 Thread zqzuk
Thank you. 

I tried Luke with IndexReader disabled; however, it seems the index is
completely broken, as it complains "ERROR: java.lang.Exception: there is
no valid Lucene index in this directory."

Sounds like I am out of luck, is it so?






--
View this message in context: 
http://lucene.472066.n3.nabble.com/Fixing-corrupted-index-tp4126644p4126830.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Fixing corrupted index?

2014-03-25 Thread Dmitry Kan
1. Luke: if you leave the IndexReader on, does the index even open? Can you
access the CheckIndex?
2. The command line CheckIndex: what does the CheckIndex -fix do?


On Tue, Mar 25, 2014 at 10:54 AM, zqzuk  wrote:

> Thank you.
>
> I tried Luke with IndexReader disabled; however, it seems the index is
> completely broken, as it complains "ERROR: java.lang.Exception: there is
> no valid Lucene index in this directory."
>
> Sounds like I am out of luck, is it so?
>
>
>
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Fixing-corrupted-index-tp4126644p4126830.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>



-- 
Dmitry
Blog: http://dmitrykan.blogspot.com
Twitter: http://twitter.com/dmitrykan


document migrate

2014-03-25 Thread Cihat güzel
hi all,

I am testing document migration. I followed this URL:
https://cwiki.apache.org/confluence/display/solr/Collections+API#CollectionsAPI-api12Migratedocumentstoanothercollection

I am trying this on Solr 4.6.1. I have two collections (collection1 and
collection2) and two shards. My collection1 has 33 documents and collection2
has 14. I tried a request as follows:
http://localhost:8983/solr/admin/collections?action=MIGRATE&collection=collection1&split.key=key1!&target.collection=collection2

The response is as follows:

<response>
  <lst name="responseHeader">
    <int name="status">400</int>
    <int name="QTime">0</int>
  </lst>
  <lst name="error">
    <str name="msg">Unknown action: MIGRATE</str>
    <int name="code">400</int>
  </lst>
</response>
Why does the Solr response say "Unknown action: MIGRATE"? What is my mistake?


How to index only the pdf content/text

2014-03-25 Thread Croci Francesco Luigi (ID SWS)
I was looking for a way to index only the content/text part of a PDF (without
all the other fields Tika creates), and I found the "solution" with
uprefix=ignored_ and a matching ignored_* dynamic field.

The problem is that uprefix only works on fields that are not specified in the
schema. In my schema I specified two fields (id and rmDocumentTitle), and these
two fields are added to the content too (which I want to avoid).

How can I prevent these two fields from being added to the fullText?

Here are my config files (the XML tags were stripped by the mailing-list
archive; only the bare values survive):

schema.xml - surviving values: fullText and id (presumably the default search
field and the uniqueKey); the field and fieldType definitions were lost.

solrconfig.xml - surviving values for the extracting request handler: true,
false, false, true, true, ignored_ (the uprefix), link, fullText (the
fmap.content target), and the update chain name "deduplication". For the
deduplication processor: false, signatureField, true, content, 10, .2 and
solr.update.processor.TextProfileSignature (i.e. a signature update processor
configuration); elsewhere, "none" and a *:* default query.

Thank you for any help.
Francesco


Re: Fixing corrupted index?

2014-03-25 Thread zqzuk
1. No, if IndexReader is on I get the same error message from CheckIndex.
2. It doesn't do anything but give the error message I posted before, then
quit. The full print of the error trace is:



Opening index @ E:\...\zookeeper\solr\collection1\data\index

ERROR: could not read any segments file in directory
java.io.FileNotFoundException: E:\...\zookeeper\solr\collection1\data\index\segments_b5tb (The system cannot find the file specified)
        at java.io.RandomAccessFile.open(Native Method)
        at java.io.RandomAccessFile.<init>(Unknown Source)
        at org.apache.lucene.store.MMapDirectory.openInput(MMapDirectory.java:223)
        at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:285)
        at org.apache.lucene.index.SegmentInfos$1.doBody(SegmentInfos.java:347)
        at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:783)
        at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:630)
        at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:343)
        at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:383)
        at org.apache.lucene.index.CheckIndex.main(CheckIndex.java:1777)




--
View this message in context: 
http://lucene.472066.n3.nabble.com/Fixing-corrupted-index-tp4126644p4126837.html
Sent from the Solr - User mailing list archive at Nabble.com.


Indexing parts of an HTML file differently

2014-03-25 Thread Michael Clivot
Hello,

I have the following issue and need help:

One HTML file has different parts for different countries. For example, one
marked-up block contains "Address for France and Benelux" and another block
contains "Address for Switzerland" (the surrounding markup was stripped by
the archive).
Depending on a parameter, I show or hide the parts on the website.
Logically, all parts are in the index, and therefore all items are found by Solr.
My question is: how can I have only the items for the current country in my
result list?

Thanks a lot
Regards
Michael

___
cli...@netmedia.de
netmedia - the Social Workplace Experts

netmedianer GmbH, Neugrabenweg 5-7, 66123 Saarbrücken, Germany
fon: +49 681 37988-12, fax: +49 681 37988-99, mobil: +49 151 54775197
Geschäftsführer: Boris Brenner, Tim Mik?a | HRB Saarbrücken 13975

https://twitter.com/netmedianer, https://www.facebook.com/netmedianer


Re: Indexing parts of an HTML file differently

2014-03-25 Thread Gora Mohanty
On 25 March 2014 15:59, Michael Clivot  wrote:
> Hello,
>
> I have the following issue and need help:
>
> One HTML file has different parts for different countries.
> For example, one marked-up block contains "Address for France and Benelux"
> and another block contains "Address for Switzerland" (the surrounding
> markup was stripped by the archive).
>
>
> Depending on a parameter, I show or hide the parts on the website.
> Logically, all parts are in the index, and therefore all items are found by
> Solr.
> My question is: how can I have only the items for the current country in my
> result list?

How are you fetching the HTML content, and indexing it into Solr?
It is probably best to handle this requirement at that point. Haven't
used Nutch ( http://nutch.apache.org/ ) recently, but you might be
able to use it for this.

Regards,
Gora


Re: document migrate

2014-03-25 Thread Jan Høydahl
Migrate is new in Solr 4.7.

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com

On 25 Mar 2014, at 10:51, Cihat güzel wrote:

> hi all,
> 
> I am testing document migration. I followed this URL:
> https://cwiki.apache.org/confluence/display/solr/Collections+API#CollectionsAPI-api12Migratedocumentstoanothercollection
> 
> I am trying this on Solr 4.6.1. I have two collections (collection1 and
> collection2) and two shards. My collection1 has 33 documents and collection2
> has 14. I tried a request as follows:
> http://localhost:8983/solr/admin/collections?action=MIGRATE&collection=collection1&split.key=key1!&target.collection=collection2
> 
> The response is as follows:
>
> <response>
>   <lst name="responseHeader">
>     <int name="status">400</int>
>     <int name="QTime">0</int>
>   </lst>
>   <lst name="error">
>     <str name="msg">Unknown action: MIGRATE</str>
>     <int name="code">400</int>
>   </lst>
> </response>
> Why does the Solr response say "Unknown action: MIGRATE"? What is my mistake?



Re: document migrate

2014-03-25 Thread Furkan KAMACI
Hi;

I think that we should note which version introduces which parameters on the
Collections API wiki page. The new 'migrate' collection API, which splits all
documents with a route key into another collection, was introduced in Solr
4.7.0.

Thanks;
Furkan KAMACI


2014-03-25 11:51 GMT+02:00 Cihat güzel :

> hi all,
>
> I am testing document migration. I followed this URL:
>
> https://cwiki.apache.org/confluence/display/solr/Collections+API#CollectionsAPI-api12Migratedocumentstoanothercollection
>
> I am trying this on Solr 4.6.1. I have two collections (collection1 and
> collection2) and two shards. My collection1 has 33 documents and collection2
> has 14. I tried a request as follows:
>
> http://localhost:8983/solr/admin/collections?action=MIGRATE&collection=collection1&split.key=key1!&target.collection=collection2
>
> The response is as follows:
>
> <response>
>   <lst name="responseHeader">
>     <int name="status">400</int>
>     <int name="QTime">0</int>
>   </lst>
>   <lst name="error">
>     <str name="msg">Unknown action: MIGRATE</str>
>     <int name="code">400</int>
>   </lst>
> </response>
> Why does the Solr response say "Unknown action: MIGRATE"? What is my mistake?
>


Re: solr 4.x reindexing issues

2014-03-25 Thread Jan Høydahl
Hi,

Seems you try to reindex from one server to the other.

Be aware that it could be easier for you to simply copy the whole index folder 
over to your 4.6.1 server and start Solr as it will be able to read your 3.x 
index. This is unless you also want to do major upgrades of your schema or 
update processors so that you'll need a re-index anyway.

If you believe you really need a re-index, then please try to batch index 
without triggering commits every few seconds - this is really heavy on the 
system and completely unnecessary. You won't get the benefit of SoftCommit if 
you're not running SolrCloud, so no need to configure that.

I would change your <autoCommit> to maxDocs=10000 and maxTime=120000 (every
2 min).
Further please index without 1s commitWithin, i.e. instead of
>server.add(iDoc, 1000);
use
>server.add(iDoc);

This will make sure the server gets room to breathe and not constantly 
generating new indices.

Finally, it's probably not a good idea to use recursion here. You really don't 
need to, filling up your stack. You can instead refactor the method to do the 
whole indexing. And a hint is that it is generally better to ask for ALL 
documents in one go and stream to the end rather than increasing offsets with 
new queries all the time - because high offsets/start can be time consuming, 
especially with multiple shards. If you increase the timeout enough you should 
be able to retrieve all documents in one go!
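
To make that concrete, a rough single-pass sketch with SolrJ 4.x (untested;
the URL, timeout and rows values are placeholders, and note the whole result
set is buffered in client memory):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.client.solrj.util.ClientUtils;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrInputDocument;

public class OnePassReindexer {
    public static void main(String[] args) throws Exception {
        HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");
        server.setSoTimeout(3600000);       // generous read timeout for one big request

        SolrQuery query = new SolrQuery("*:*");
        query.addFilterQuery("-allfields:[* TO *]");
        query.setRows(2000000);             // larger than numFound: everything in one go

        QueryResponse resp = server.query(query);
        for (SolrDocument doc : resp.getResults()) {
            SolrInputDocument iDoc = ClientUtils.toSolrInputDocument(doc);
            iDoc.removeField("_version_");  // let Solr assign a fresh version
            server.add(iDoc);               // no commitWithin; autoCommit does the rest
        }
        server.commit();                    // one explicit commit at the very end
    }
}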

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com

On 24 Mar 2014, at 22:36, Ravi Solr wrote:

> Hello,
> >We are trying to reindex as part of our move from 3.6.2 to 4.6.1
> > and have faced various issues reindexing 1.5 million docs. We don't use
> > SolrCloud; it's still a master/slave config. For testing this I am using a
> > single test server, reading from it and putting docs back into the same index.
> 
> We send docs in batches of 100 but only 10/100 are getting indexed. Is this
> related to the maxBufferedAddsPerServer setting that is hard-coded? Also,
> I tried to play with autocommit and softcommit settings, but in vain.
> 
> <autoCommit>
>   <maxDocs>5</maxDocs>
>   <maxTime>5000</maxTime>
>   <openSearcher>true</openSearcher>
> </autoCommit>
>
> <autoSoftCommit>
>   <maxTime>1000</maxTime>
> </autoSoftCommit>
> 
> I use these on the test system just to check if docs are being indexed, but
> even with a batch of 5 my SolrJ client code runs faster than the indexing,
> causing some docs to not get indexed. The function doing the indexing is a
> recursive method call (shown below) which fails after some time with a stack
> overflow (I did not have this issue with 3.6.2 with the same code).
> 
>private static void processDocs(HttpSolrServer server, Integer start,
> Integer rows) throws Exception {
>SolrQuery query = new SolrQuery();
>query.setQuery("*:*");
>query.addFilterQuery("-allfields:[* TO *]");
>QueryResponse resp = server.query(query);
>SolrDocumentList list =  resp.getResults();
>Long total = list.getNumFound();
> 
>if(list != null && !list.isEmpty()) {
>for(SolrDocument doc : list) {
>SolrInputDocument iDoc =
> ClientUtils.toSolrInputDocument(doc);
>//To index full doc again
>iDoc.removeField("_version_");
>server.add(iDoc, 1000);
>}
> 
>System.out.println("Indexed " + (start+rows) + "/" + total);
>if (total >= (start + rows)) {
>processDocs(server, (start + rows), rows);
>}
>}
>}
> 
> I also tried turning on the updateLog but that was filling up so fast to
> the point where it is useless.
> 
> How do we do bulk updates in a Solr 4.x environment? Is there any setting
> that I am missing?
> 
> Thanks
> 
> Ravi Kiran Bhaskar
> Technical Architect
> The Washington Post



Re: document migrate

2014-03-25 Thread Jan Høydahl
> I think that we should note which version introduces which parameters on the
> Collections API wiki page. The new 'migrate' collection API, which splits all
> documents with a route key into another collection, was introduced in Solr
> 4.7.0.

Should not be necessary, since the top of every Confluence page reads "This 
Unreleased Guide Will Cover Apache Solr 4.8". What we perhaps could do is add 
some JS magic which opens a popup for new users (based on a cookie) with an 
informative text 

"This Solr reference guide targets the next yet unreleased version of Solr. To 
find the reference guide for a released version of Solr, please follow this 
link ."

The old wiki was littered with version numbers all over the place. If people
want to know which versions support a given feature, they can refer to CHANGES.

Jan

Re: Indexing parts of an HTML file differently

2014-03-25 Thread Jack Krupansky
There is no Solr feature that would break up your HTML file - you will have 
to do that yourself, either before you send the file to Solr or by 
developing a custom update processor that extracts the sections and directs 
each to a specific field for the language. The former is probably easier 
since any generic processor that extracts text from an HTML file will strip 
out all HTML comments.
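
A bare-bones sketch of that custom-update-processor route (the "html" source
field, the country markers and the extract() logic are all assumptions to be
adapted to the real markup; the matching UpdateRequestProcessorFactory and
solrconfig.xml wiring are omitted):

import java.io.IOException;
import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.update.AddUpdateCommand;
import org.apache.solr.update.processor.UpdateRequestProcessor;

public class CountrySplitProcessor extends UpdateRequestProcessor {
    public CountrySplitProcessor(UpdateRequestProcessor next) {
        super(next);
    }

    @Override
    public void processAdd(AddUpdateCommand cmd) throws IOException {
        SolrInputDocument doc = cmd.getSolrInputDocument();
        Object html = doc.getFieldValue("html"); // hypothetical source field
        if (html != null) {
            String s = html.toString();
            // hypothetical markers; replace with whatever delimits the country blocks
            doc.addField("content_fr", extract(s, "country-fr"));
            doc.addField("content_ch", extract(s, "country-ch"));
        }
        super.processAdd(cmd); // hand the doc on down the chain
    }

    private String extract(String html, String marker) {
        int i = html.indexOf(marker);
        return i < 0 ? "" : html.substring(i); // placeholder logic only
    }
}

With per-country fields in place, the website can then restrict each query to
the field for the current country.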


-- Jack Krupansky

-Original Message- 
From: Michael Clivot

Sent: Tuesday, March 25, 2014 6:29 AM
To: solr-user@lucene.apache.org
Subject: Indexing parts of an HTML file differently

Hello,

I have the following issue and need help:

One HTML file has different parts for different countries. For example, one
marked-up block contains "Address for France and Benelux" and another block
contains "Address for Switzerland" (the surrounding markup was stripped by
the archive).

Depending on a parameter, I show or hide the parts on the website.
Logically, all parts are in the index, and therefore all items are found by
Solr.
My question is: how can I have only the items for the current country in my
result list?


Thanks a lot
Regards
Michael

___
cli...@netmedia.de
netmedia - the Social Workplace Experts

netmedianer GmbH, Neugrabenweg 5-7, 66123 Saarbrücken, Germany
fon: +49 681 37988-12, fax: +49 681 37988-99, mobil: +49 151 54775197
Geschäftsführer: Boris Brenner, Tim Mik?a | HRB Saarbrücken 13975

https://twitter.com/netmedianer, https://www.facebook.com/netmedianer 



Re: Multilingual indexing, search results, edismax and stopwords

2014-03-25 Thread Jan Høydahl
If using stopwords with edismax, please make sure that ALL fields referred in 
"qf" have stopwords defined in the fieldType and also that the stopword 
dictionary is the SAME for all these. This way you will not encounter the 
infamous edismax+stopwords bug mentioned in 
https://issues.apache.org/jira/browse/SOLR-3085

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com

On 23 Mar 2014, at 19:37, Jack Krupansky wrote:

> Setting the default query operator to AND is the preferred approach: q.op=AND.
> 
> That said, I'm not sure that counting ignored and empty terms towards the mm 
> % makes sense. IOW, if a term transforms to nothing, either because it is a 
> stop word or empty synonym replacement or pure punctuation, I don't think it 
> should count as a term. I think this is worth a Jira.
> 
> -- Jack Krupansky
> 
> -Original Message- From: kastania44
> Sent: Thursday, March 20, 2014 11:00 AM
> To: solr-user@lucene.apache.org
> Subject: Multilingual indexing, search results, edismax and stopwords
> 
> On our drupal multilingual system we use apache Solr 3.5.
> The problem is well known on different blogs, sites I read.
> The search results are not the one we want.
> On our code in hook apachesolr_query_alter we override the defaultOperator:
> $query->replaceParam('mm', '90%');
> The requirement is, when I search for: biological analyses, I want to fetch
> only the results which have both of the words.
> When I search for: biological and chemical analyses, I want it to fetch only
> the results which have biological , chemical, analyses. The and is not
> indexed due to stopwords.
> 
> If I set mm to 100% and my query has stopwords it will not fetch any result.
> If I set mm to 100% and my query does not have stopwords it will fetch the
> desired results.
> If I set mm to anything between 50%-99% it fetches unwanted results, such as
> results that contain only one of the searched keywords, or words similar to
> the searched keywords, like analyse (even if I searched for analyses).
> 
> If I search using + before the words that are mandatory it works OK, but it
> is not user-friendly to ask the user to type + before each word except the
> stopwords.
> 
> Do I make any sense?
> 
> Below are some of our configuration details:
> 
> All the indexed fields are of type text_language (the schema.xml excerpts
> were stripped by the archive; the surviving attributes show
> termVectors="true" and omitNorms="true" on each field).
> All the text fieldtypes have the same configuration except for the
> protected-words, stopwords and dictionary parameters, which are language
> specific. From the surviving attributes, the index and query analyzers both
> use: a mapping char filter (mapping-ISOLatin1Accent_en.txt), a stop filter
> (stopwords_en.txt, enablePositionIncrements="true"), a word-delimiter filter
> (generateWordParts="1" generateNumberParts="1" catenateWords="1"
> catenateNumbers="1" catenateAll="1" splitOnCaseChange="1"
> preserveOriginal="1" splitOnNumerics="1" stemEnglishPossessive="1"), a
> solr.DictionaryCompoundWordTokenFilterFactory
> (dictionary="compoundwords_en.txt" minWordSize="5" minSubwordSize="4"
> maxSubwordSize="15" onlyLongestMatch="true") and a filter protected by
> protwords_en.txt; the query analyzer additionally has a synonym filter
> (ignoreCase="true" expand="true").
>
> From solrconfig.xml, the default request handler (likewise stripped) uses
> edismax; the surviving values are explicit, true, 0.01,
> ${solr.pinkPony.timeAllowed:-1}, *:*, false, true, false, 1, and a
> spellcheck last-component.
> 
> 
> ANY ideas are appreciated!
> 
> 
> 
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Multilingual-indexing-search-results-edismax-and-stopwords-tp4125746.html
> Sent from the Solr - User mailing list archive at Nabble.com. 



Re: How to secure Solr admin page?

2014-03-25 Thread Jan Høydahl
Hi,

First of all, the wiki page you refer to is *not* the official ref-guide. The 
official one can be found here 
https://cwiki.apache.org/confluence/display/solr/Apache+Solr+Reference+Guide

The wiki you found is a community-edited wiki, and may talk about ideas or 
patches.

The autentication you try to do is not part of Solr, but something you have to 
setup in your servlet container. Please refer to the documentation or community 
for your servlet container for help on this. 

The path to Admin is http://your.host:8983/solr/#/ but I have no idea on how to 
instruct your servlet container to only require auth for this path and not for 
e.g. http://your.host:8983/solr/collection1/select

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com

On 19 Mar 2014, at 09:47, Tony Xue wrote:

> Hi all,
> 
> I was following the instructions in the official wiki:
> https://wiki.apache.org/solr/SolrSecurity
> 
> But I don't have any idea what directory I should put between the
> <url-pattern> tags to secure only the admin page.
> 
> I tried to put /admin/* but it didn't work.
> 
> 
> Tony



Re: intersect query

2014-03-25 Thread Ahmet Arslan
Hi Ares,

How about using field collapsing?  https://wiki.apache.org/solr/FieldCollapsing

&q=+region:(east OR west) +productName:iPhone
&group=true
&group.field=dealer


If the number of distinct groups is high, CollapsingQueryParser could be used 
too.

https://cwiki.apache.org/confluence/display/solr/Other+Parsers#OtherParsers-CollapsingQueryParser
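
From SolrJ the grouped query might look like this (a sketch; host, core and
field names are taken from the example above - note that you still have to
check each dealer's group for documents from both regions):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class DealerIntersect {
    public static void main(String[] args) throws Exception {
        HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/my_index");
        SolrQuery q = new SolrQuery("+region:(east OR west) +productName:iphone");
        q.set("group", true);
        q.set("group.field", "dealer");
        q.set("group.limit", 10); // enough docs per dealer to see both regions
        QueryResponse resp = server.query(q);
        // a dealer whose group has docs from both east and west is in the intersection
        System.out.println(resp.getGroupResponse().getValues());
    }
}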

Ahmet


On Tuesday, March 25, 2014 10:50 AM, cmd.ares  wrote:
my_index(one core):
id,dealer,productName,amount,region 
1,A1,iphone4,400,east
2,A1,iphone4s,450,east
3,A2,iphone5s,550,east
..
4,A1,iphone4,400,west
5,A1,iphone4s,450,west
6,A3,iphone5s,550,west
..


I'd like to find which dealers sell the 'iphone' in both the 'east' and the
'west'.
PL/SQL reference implementation:
1:
select dealer from my_index where region='east' and productName like
'%iphone%'
intersect
select dealer from my_index where region='west' and productName like
'%iphone%'
2:
select distinct dealer from my_index where region='east' and productName
like '%iphone%' and dealer in (
select dealer from my_index where region='west' and productName like
'%iphone%'
)

solr reference implementation:
1.query parameters:
q=region:east AND productName:iphone
&fq={!join from=dealer to=dealer}(region:west AND productName:iphone)
&facet=true&facet.field=dealer&facet.mincount=1

2.query parameters:
q=region:east AND productName:iphone({!join from=dealer
to=dealer}region:west AND productName:iphone)
&facet=true&facet.field=dealer&facet.mincount=1

With a big index the query is very slow. Is there an efficient way to
improve performance?

1. Must we use the Solr join feature, or is there another approach?
2. Can multi-core shards improve performance?
/***
as the wiki said:
In a DistributedSearch environment, you can not Join across cores on
multiple nodes. 
If however you have a custom sharding approach, you could join across cores
on the same node.
***/



--
View this message in context: 
http://lucene.472066.n3.nabble.com/intersect-query-tp4126828.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Fixing corrupted index?

2014-03-25 Thread Dmitry Kan
Right. If you have .cfs files in the index directory, there is a thread
discussing a method of regenerating the segments file:

http://www.gossamer-threads.com/lists/lucene/java-user/39744

Back up before making changes!

source on SO:
http://stackoverflow.com/questions/9935177/how-to-repair-corrupted-lucene-index


On Tue, Mar 25, 2014 at 11:57 AM, zqzuk  wrote:

> 1. No, if IndexReader is on I get the same error message from CheckIndex.
> 2. It doesn't do anything but give the error message I posted before, then
> quit. The full print of the error trace is:
>
>
>
> Opening index @ E:\...\zookeeper\solr\collection1\data\index
>
> ERROR: could not read any segments file in directory
> java.io.FileNotFoundException: E:\...\zookeeper\solr\collection1\data\index\segments_b5tb (The system cannot find the file specified)
>         at java.io.RandomAccessFile.open(Native Method)
>         at java.io.RandomAccessFile.<init>(Unknown Source)
>         at org.apache.lucene.store.MMapDirectory.openInput(MMapDirectory.java:223)
>         at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:285)
>         at org.apache.lucene.index.SegmentInfos$1.doBody(SegmentInfos.java:347)
>         at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:783)
>         at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:630)
>         at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:343)
>         at org.apache.lucene.index.CheckIndex.checkIndex(CheckIndex.java:383)
>         at org.apache.lucene.index.CheckIndex.main(CheckIndex.java:1777)
>
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Fixing-corrupted-index-tp4126644p4126837.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>



-- 
Dmitry
Blog: http://dmitrykan.blogspot.com
Twitter: http://twitter.com/dmitrykan


alternate address for solr-user list, subscription confirmation

2014-03-25 Thread Philip Durbin
Thanks for Solr! It's a great product. I've been hanging out in
#lucene-dev for a while but I thought I'd join the mailing list.

ezmlm seems to pick up an alternate email address of mine in the
"Return-Path" header so I tried to override the default subscription
address by emailing
solr-user-subscribe-philip_durbin=harvard@lucene.apache.org

I didn't receive a confirmation that this worked but I suspect I am subscribed.

Anyway, sorry for the noise. I just want to make sure this goes
through. Assuming it does, I'll send my real question soon. :)

Phil

p.s. I guess I would suggest that ezmlm send a confirmation email that
subscribers are now on the list, if it's easy to do. I'm used to this
behavior from mailman.

-- 
Philip Durbin
Software Developer for http://thedata.org
http://www.iq.harvard.edu/people/philip-durbin


Re: Can the solr dataimporthandler consume an atom feed?

2014-03-25 Thread eShard
Gora! It works now! 
You are amazing! thank you so much!
I dropped the atom: from the xpath and everything is working.
I did have a typo that might have been causing issues too.
thanks again!



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Can-the-solr-dataimporthandler-consume-an-atom-feed-tp4126134p4126887.html
Sent from the Solr - User mailing list archive at Nabble.com.


search WITH or WITHOUT accents (selection at runtime) + highlights

2014-03-25 Thread elfu
Hello,

I have the following problem to solve using Solr:

search WITH or WITHOUT accents (selected at runtime) + highlights

How can I configure the schema to achieve this?

For example, with the input string "aaa près bbb pres" (brackets below mark
the highlighted terms):

A) accent sensitive

  1. search for près -> highlight = "aaa [près] bbb pres"
  2. search for pres -> highlight = "aaa près bbb [pres]"

B) accent insensitive

  1. search for près -> highlight = "aaa [près] bbb [pres]"
  2. search for pres -> highlight = "aaa [près] bbb [pres]"

I tried with 3 fields: 1 for storing the input string and highlighting, and 2
for the indexes:

field_S (stored, !indexed, used for highlighting = accent insensitive)
field_A (!stored, indexed, keeps accents = accent sensitive)
field_B (!stored, indexed, removes accents = accent insensitive)

When I search without accents on field_B the highlighting is OK (field_S
query/index behaves like field_B), but when I search with accents on field_A
the highlighting is not OK anymore.

thx for help 
regards,
e

Can the highlight engine work with the indexAnalyzer and/or queryAnalyzer of
other schema fields?
(If I want to highlight the result of an accent-sensitive search, can the
content be parsed with the indexAnalyzer while hl.q uses the queryAnalyzer
from field_A?)




Re: SolrCloud from "Stopping recovery for" warnings to crash

2014-03-25 Thread Lukas Mikuckis
Last night the problem occurred again, and I have more data. This time the
problem happened on only one Solr server, and it successfully recovered.

solr which had all the leaders:

[06:38:58.205 - 06:38:58.222] RecoveryStrategy: Stopping recovery for
zkNodeName=core_node2, core=** (for all collections)

[06:38:59.322 - 06:38:59.995] ElectionContext: cancelElection did not find
election node to remove (many times)

[06:39:02.403] PeerSync: no frame of reference to tell if we've missed
updates (Solr recovered)


one of the zookeepers (the one on the same server as the Solr which got
the warnings):

[06:38:58.099, 06:38:58.113, 06:38:58.114]
org.apache.zookeeper.server.NIOServerCnxn: [ERROR] Unexpected Exception:
java.nio.channels.CancelledKeyException (3 times)


The other solr and other zookeepers haven't got any errors / warnings.


Some monitoring data:

Garbage Collectors Summary:
https://apps.sematext.com/spm-reports/s/RYZxbcHXzu

Pool Size:
https://apps.sematext.com/spm-reports/s/N5c8QFc86d

Pool Utilization:
https://apps.sematext.com/spm-reports/s/B487KaWGXP

Load:
https://apps.sematext.com/spm-reports/s/ytfFzqYBl2



2014-03-24 17:39 GMT+02:00 Lukas Mikuckis :

> We tried to set ZK timeout to 1s and did load testing (both indexing and
> search) and this issue didn't happen.
>
>
> 2014-03-24 17:00 GMT+02:00 Lukas Mikuckis :
>
> Garbage Collectors Summary:
>> https://apps.sematext.com/spm-reports/s/rgRnwuShgI
>>
>> Pool Size:
>> https://apps.sematext.com/spm-reports/s/H16ndqichM
>>
>> First Stopping recovery warning: 4:00, OOM error: 6:30.
>>
>>
>> 2014-03-24 16:35 GMT+02:00 Shalin Shekhar Mangar 
>> :
>>
>> I am guessing that it is all related to memory issues. I guess that as
>>> the used heap increases, full GC cycles increase causing ZK timeouts
>>> which in turn cause more recoveries to be initiated. In the end,
>>> everything blows up with the out of memory errors. Do you log GC
>>> activity on your servers?
>>>
>>> I suggest that you rollback to 4.6.1 for now and upgrade to 4.7.1 when
>>> it releases next week.
>>>
>>> On Mon, Mar 24, 2014 at 7:51 PM, Lukas Mikuckis 
>>> wrote:
>>> > Yes, we upgraded solr from 4.6.1 to 4.7 3 weeks ago (2 weeks before
>>> solr
>>> > started crashing).
>>> > When we were upgrading, we just upgraded solr and changed versions in
>>> > collections configs.
>>> >
>>> > When solr crashes we get OOM but only 2h after first Stopping recovery
>>> > warnings.
>>> >
>>> > Maybe you have any ideas when Stopping recovery warnings are thrown?
>>> > Because now we have no idea what could cause this issue.
>>> >
>>> > Mon, 24 Mar 2014 04:03:17 GMT Shalin Shekhar Mangar <
>>> shalinman...@gmail.com
>>> >>:
>>> >>
>>> >> Did you upgrade recently to Solr 4.7? 4.7 has a bad bug which can
>>> >> cause out of memory issues. Can you check your logs for out of memory
>>> >> errors?
>>> >>
>>> >> On Sun, Mar 23, 2014 at 9:07 PM, Lukas Mikuckis <
>>> lukasmikuc...@gmail.com>
>>> > wrote:
>>> >> > Solr version: 4.7
>>> >> >
>>> >> > Architecture:
>>> >> > 2 solrs (1 shard, leader + replica)
>>> >> > 3 zookeepers
>>> >> >
>>> >> > Servers:
>>> >> > * zookeeper + solr (heap 4gb) - RAM 8gb, 2 cpu cores
>>> >> > * zookeeper + solr  (heap 4gb) - RAM 8gb, 2 cpu cores
>>> >> > * zookeeper
>>> >> >
>>> >> > Solr data:
>>> >> > * 21 collections
>>> >> > * Many fields, small docs, docs count per collection from 1k to 500k
>>> >> >
>>> >> > About a week ago solr started crashing. It crashes every day, 3-4
>>> times
>>> > a
>>> >> > day. Usually at nigh. I can't tell anything what could it be
>>> related to
>>> >> > because at that time we haven't done any configuration changes. Load
>>> >> > haven't changed too.
>>> >> >
>>> >> >
>>> >> > Everything starts with Stopping recovery for .. warnings (every
>>> > warnings is
>>> >> > repeated several times):
>>> >> >
>>> >> > WARN  org.apache.solr.cloud.RecoveryStrategy; Stopping recovery for
>>> >> > zkNodeName=core_node1core=**
>>> >> >
>>> >> > WARN  org.apache.solr.cloud.ElectionContext; cancelElection did not
>>> find
>>> >> > election node to remove
>>> >> >
>>> >> > WARN  org.apache.solr.update.PeerSync; no frame of reference to
>>> tell if
>>> >> > we've missed updates
>>> >> >
>>> >> > WARN  - 2014-03-23 04:00:26.286; org.apache.solr.update.PeerSync; no
>>> > frame
>>> >> > of reference to tell if we've missed updates
>>> >> >
>>> >> > WARN  - 2014-03-23 04:00:30.728; org.apache.solr.handler.SnapPuller;
>>> > File
>>> >> > _f9m_Lucene41_0.doc expected to be 6218278 while it is 7759879
>>> >> >
>>> >> > WARN  - 2014-03-23 04:00:54.126;
>>> >> > org.apache.solr.

Re: creating shards on the fly in a single Solr instance ("shards" query parameter)

2014-03-25 Thread Shalin Shekhar Mangar
Hi Philip,

Comments inline:

On Tue, Mar 25, 2014 at 8:11 PM, Philip Durbin
 wrote:
> I'm new to Solr and am exploring the idea of creating shards on the
> fly. Once the shards have been created and populated, I am hoping to
> use the "shards" query parameter to combine results from multiple
> shards into a single results set.
>
> By following the "Testing Index Sharding on Two Local Servers"
> instructions[1] in the wiki I'm able to target two different shards
> individually. It works great, I get three results from one shard and
> one result from another shard. When I select them both with the
> following query (straight from the wiki) I get all four results:
>
> curl 
> 'http://localhost:8983/solr/select?shards=localhost:8983/solr,localhost:7574/solr&indent=true&q=ipod+solr'
>
> My immediate problem is trying to figure out how to convert this
> example from two instances of Solr running on different ports (8983
> and 7574) that each have a single shard to one instance of Solr that
> has two shards.
>
> Various posts[2] suggest this is possible but refer to older versions
> of Solr and the instructions don't seem to work for Solr 4.7.0.

That post[2] refers to a SolrCloud installation but I guess you are
trying a non-ZK solr cluster. In the old way it is possible via
multiple cores on the same instance so you could do
shards=localhost:8983/solr/core1,localhost:8983/solr/core2 and so on.

That being said, I think you should use SolrCloud for your own sanity's sake :)

>
> When I try the CREATESHARD API call from the wiki[3] I get "Solr
> instance is not running in SolrCloud mode" which isn't a huge surprise
> because it's documented under the Collections API under SolrCloud.
>

Yes, CREATESHARD is a Collection API which will work only with a
SolrCloud installation. Plus it only works for what we call custom
sharding i.e. where the user controls which documents goes where.

> I don't know anything about SolrCloud. Not yet, anyway. My experience
> with Solr so far involves running `java -jar start.jar` from the
> "example" directory. I barely know what Zookeeper is.

ZooKeeper is a distributed co-ordination service. In simple words,
think of it as a mediator between different Solr nodes which helps
them to reach a consensus on topics. All Solr instances must know the
host:port of your ZooKeeper instances. Solr ships with an embedded
ZooKeeper which can be used to get started but for production consider
running your own instances of ZooKeeper (at least 3).

>
> My goal for now is to use CREATESHARD to create a new shard in a
> single Solr instance and then verify it was created with the STATUS[4]
> command.
>
> Can anyone please explain how I would accomplish this?

I'm assuming you have a binary distribution of Solr 4.6.1 or later on
a linux distribution.

Here's the high level: We will create a shard for "public" and shards
for each unique "user". We will do this by adding a field to each doc
called "user" whose value (public, user1, user2 etc) will be used to
route documents to the correct shard.

1. Create a Solr schema which has a field called "user" - its value
will be mapped to a shard.
2. Let's start by installing ZooKeeper:
http://zookeeper.apache.org/doc/r3.4.6/zookeeperStarted.html
3. Upload configuration to ZooKeeper:
cd example; ./cloud-scripts/zkcli.sh -cmd upconfig -zkhost
localhost:2181 -confdir solr/collection1/conf -confname conf1
4. Start Solr: java -DzkHost=localhost:2181 -jar start.jar
5. Create a collection with custom sharding enabled:
http://localhost:8983/solr/admin/collections?action=CREATE&name=mydata&collection.configName=conf1&maxShardsPerNode=10&router.name=implicit&router.field=user&shards=public,user1,user2
6. Whenever a new user is added to the system, you can use CREATESHARD:
http://localhost:8983/solr/admin/collections?action=CREATESHARD&collection=mydata&shard=user3

While indexing, you can send data to the right shard by adding a
parameter called _route_=<route-key>, e.g. _route_=public or
_route_=user1.
While searching, you can search both the public shard as well as user1's
shard by adding _route_=public,user1
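
In SolrJ terms the _route_ parameter is attached like this (a sketch against
the hypothetical "mydata" collection from step 5):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.request.UpdateRequest;
import org.apache.solr.common.SolrInputDocument;

public class RouteExample {
    public static void main(String[] args) throws Exception {
        HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/mydata");

        // index a document into user1's shard
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "doc1");
        doc.addField("user", "user1");       // router.field value
        UpdateRequest update = new UpdateRequest();
        update.add(doc);
        update.setParam("_route_", "user1"); // explicit route for the implicit router
        update.process(server);

        // search the public shard plus user1's shard only
        SolrQuery q = new SolrQuery("*:*");
        q.set("_route_", "public,user1");
        System.out.println(server.query(q).getResults().getNumFound());
    }
}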

You can verify that the shards are added by going to the Collections
page on the UI which will show each shard and the Solr instance on
which it is running.

>
> The thought is to have one shard for public data and a shard per user,
> which is why I'm asking about creating the shards on the fly. (Logged
> in users would see a mix of public and private data.) For now I'd like
> to keep using a single Solr instance for simplicity. For more
> background on where I'm coming from, please see
> http://colabti.org/irclogger/irclogger_log/lucene-dev?date=2014-02-06#l99
> and 
> https://trello.com/c/5z5PpR4r/50-design-solr-document-level-security-filter-solution
>
> Thanks,
>
> Phil
>
> 1. 
> https://cwiki.apache.org/confluence/display/solr/Distributed+Search+with+Index+Sharding
>
> 2. 
> http://solr.pl/en/2013/01/07/solr-4-1-solrcloud-multiple-shards-on-the-same-solr-node/
>
> 3. i.e

Re: Question on highlighting edgegrams

2014-03-25 Thread Software Dev
Bump

On Mon, Mar 24, 2014 at 3:00 PM, Software Dev  wrote:
> In 3.5.0 we have the following.
>
> (the fieldType XML was stripped by the archive; it had
> positionIncrementGap="100" and an index-time edge-ngram filter with
> maxGramSize="30")
>
> If we searched for "c" with highlighting enabled we would get back
> results such as:
>
> cdat
> crocdile
> cool beans
>
> But in the latest Solr (4.7) we get the full words highlighted back.
> Did something change from these versions with regards to highlighting?
>
> Thanks


Replication (Solr Cloud)

2014-03-25 Thread Software Dev
I see that by default in SolrCloud my collections are replicating. Should
this be disabled in SolrCloud, as this is already handled by it?

>From the documentation:

"The Replication screen shows you the current replication state for
the named core you have specified. In Solr, replication is for the
index only. SolrCloud has supplanted much of this functionality, but
if you are still using index replication, you can use this screen to
see the replication state:"

I just want to make sure before I disable it that if we send an update
to one server that the document will be correctly replicated across
all nodes. Thanks


Multiple search analyzers question

2014-03-25 Thread ku3ia
Hi all!
Now I have a default search field, defined with a fieldType whose XML was
stripped by the archive; its index and query analyzer chains include a KStem
filter.

In the future, I will need to search using my current field (with the KStem
filter) and also need an alternative search without the KStem filter. The
easiest way is to add a copyField and declare a new field type (without
KStem):


(the new fieldType XML was likewise stripped; it is the same analyzer chain
without the KStem filter)

and then re-index all my data.
Is there an alternative way?

Thanks.




--
View this message in context: 
http://lucene.472066.n3.nabble.com/Multiple-search-analyzers-question-tp4126955.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Replication (Solr Cloud)

2014-03-25 Thread Shawn Heisey

On 3/25/2014 10:42 AM, Software Dev wrote:

I see that by default in SolrCloud my collections are replicating. Should
this be disabled in SolrCloud, as this is already handled by it?

 From the documentation:

"The Replication screen shows you the current replication state for
the named core you have specified. In Solr, replication is for the
index only. SolrCloud has supplanted much of this functionality, but
if you are still using index replication, you can use this screen to
see the replication state:"

I just want to make sure before I disable it that if we send an update
to one server that the document will be correctly replicated across
all nodes. Thanks


The replication handler must be configured for SolrCloud to operate 
properly ... but not in the way that you might think. This is a source 
of major confusion for those who are new to SolrCloud, especially if 
they already understand master/slave replication.


During normal operation, SolrCloud does NOT use replication.  
Replication is ONLY used to recover indexes.  When everything is working 
well, recovery only happens when a Solr instance starts up.


Every Solr instance will be a master.  If that Solr instance has *EVER* 
(since the last instance start) replicated its index from a shard 
leader, it will *also* say that it is a slave. These are NOT indications 
that a replication is occurring, they are just the current configuration 
state of the replication handler.


You can ignore everything you see on the replication tab if you are 
running SolrCloud.  It only has meaning at the moment a replication is 
happening, and that is completely automated by SolrCloud.


Thanks,
Shawn



Re: solr 4.x reindexing issues

2014-03-25 Thread Ravi Solr
Thank you very much for responding, Mr. Høydahl. I removed the recursion,
which eliminated the stack overflow exception. However, I am still
encountering my main problem of docs not getting indexed in Solr 4.x,
as I mentioned in my original email. The reason I am reindexing is that
with Solr 4.x the EnglishPorterFilterFactory has been removed, and I also
wanted to add another copyField of all field values into the destination
"allfields".

As per your suggestion I removed softCommit and set autoCommit to
maxDocs=100 and maxTime=120000. I was printing out the indexing calls... You
can clearly see it still indexes only around 10 at a time (testing code and
results shown below). Again, my code finished fully, and for good measure I
committed manually after 10 minutes; still, when I query I only see that
"13513" docs got indexed.

There must be something else I am missing.

<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">1</int>
    <lst name="params">
      <str name="q">allfields:[* TO *]</str>
      <str name="wt">xml</str>
      <str name="rows">0</str>
    </lst>
  </lst>
  <result name="response" numFound="13513" start="0"/>
</response>

TEST INDEXER CODE
 ---
Long total = null;
Integer start = 0;
Integer rows = 100;
while(total == null || total >= (start+rows)) {
SolrQuery query = new SolrQuery();
query.setQuery("*:*");
query.setSort("displaydatetime", ORDER.desc);
query.addFilterQuery("-allfields:[* TO *]");
QueryResponse resp = server.query(query);
SolrDocumentList list =  resp.getResults();
total = list.getNumFound();

if(list != null && !list.isEmpty()) {
for(SolrDocument doc : list) {
SolrInputDocument iDoc =
ClientUtils.toSolrInputDocument(doc);
//To index full doc again
iDoc.removeField("_version_");
server.add(iDoc);
}

System.out.println("Indexed " + (start+rows) + "/" + total);
start = (start+rows);
}
}

   System.out.println("COMPLETELY DONE");

System.out output
-
Indexed 1252100/1256575
Indexed 1252200/1256575
Indexed 1252300/1256575
Indexed 1252400/1256575
Indexed 1252500/1256575
Indexed 1252600/1256575
Indexed 1252700/1256575
Indexed 1252800/1256575
Indexed 1252900/1256575
Indexed 1253000/1256575
Indexed 1253100/1256566
Indexed 1253200/1256566
Indexed 1253300/1256566
Indexed 1253400/1256566
Indexed 1253500/1256566
Indexed 1253600/1256566
Indexed 1253700/1256566
Indexed 1253800/1256566
Indexed 1253900/1256566
Indexed 1254000/1256566
Indexed 1254100/1256566
Indexed 1254200/1256566
Indexed 1254300/1256566
Indexed 1254400/1256566
Indexed 1254500/1256566
Indexed 1254600/1256566
Indexed 1254700/1256566
Indexed 1254800/1256566
Indexed 1254900/1256566
Indexed 1255000/1256566
Indexed 1255100/1256566
Indexed 1255200/1256566
Indexed 1255300/1256566
Indexed 1255400/1256566
Indexed 1255500/1256566
Indexed 1255600/1256566
Indexed 1255700/1256557
Indexed 1255800/1256557
Indexed 1255900/1256557
Indexed 1256000/1256557
Indexed 1256100/1256557
Indexed 1256200/1256557
Indexed 1256300/1256557
Indexed 1256400/1256557
Indexed 1256500/1256557
COMPLETELY DONE


Thanks,
Ravi Kiran Bhaskar



On Tue, Mar 25, 2014 at 7:13 AM, Jan Høydahl  wrote:

> Hi,
>
> Seems you try to reindex from one server to the other.
>
> Be aware that it could be easier for you to simply copy the whole index
> folder over to your 4.6.1 server and start Solr as it will be able to read
> your 3.x index. This is unless you also want to do major upgrades of your
> schema or update processors so that you'll need a re-index anyway.
>
> If you believe you really need a re-index, then please try to batch index
> without triggering commits every few seconds - this is really heavy on the
> system and completely unnecessary. You won't get the benefit of SoftCommit
> if you're not running SolrCloud, so no need to configure that.
>
> I would change your <autoCommit> to maxDocs=10000 and maxTime=120000
> (every 2 min).
> Further please index without 1s commitWithin, i.e. instead of
> >server.add(iDoc, 1000);
> use
> >server.add(iDoc);
>
> This will make sure the server gets room to breathe and not constantly
> generating new indices.
>
> Finally, it's probably not a good idea to use recursion here. You really
> don't need to, filling up your stack. You can instead refactor the method
> to do the whole indexing. And a hint is that it is generally better to ask
> for ALL documents in one go and stream to the end rather than increasing
> offsets with new queries all the time - because high offsets/start can be
> time consuming, especially with multiple shards. If you increase the
> timeout enough you should be able to retrieve all documents in one go!
>
> --
> Jan Høydahl, search solution architect
> Cominvent AS - www.cominvent.com
>
> On 24 Mar 2014, at 22:36, Ravi Solr wrote:
>
> > Hello,
> >We are trying to reindex as part of our move from 3.6.2 to 4.6.1
> 

creating shards on the fly in a single Solr instance ("shards" query parameter)

2014-03-25 Thread Philip Durbin
I'm new to Solr and am exploring the idea of creating shards on the
fly. Once the shards have been created and populated, I am hoping to
use the "shards" query parameter to combine results from multiple
shards into a single results set.

By following the "Testing Index Sharding on Two Local Servers"
instructions[1] in the wiki I'm able to target two different shards
individually. It works great, I get three results from one shard and
one result from another shard. When I select them both with the
following query (straight from the wiki) I get all four results:

curl 
'http://localhost:8983/solr/select?shards=localhost:8983/solr,localhost:7574/solr&indent=true&q=ipod+solr'

My immediate problem is trying to figure out how to convert this
example from two instances of Solr running on different ports (8983
and 7574) that each have a single shard to one instance of Solr that
has two shards.

Various posts[2] suggest this is possible but refer to older versions
of Solr and the instructions don't seem to work for Solr 4.7.0.

When I try the CREATESHARD API call from the wiki[3] I get "Solr
instance is not running in SolrCloud mode" which isn't a huge surprise
because it's documented under the Collections API under SolrCloud.

I don't know anything about SolrCloud. Not yet, anyway. My experience
with Solr so far involves running `java -jar start.jar` from the
"example" directory. I barely know what Zookeeper is.

My goal for now is to use CREATESHARD to create a new shard in a
single Solr instance and then verify it was created with the STATUS[4]
command.

Can anyone please explain how I would accomplish this?

The thought is to have one shard for public data and a shard per user,
which is why I'm asking about creating the shards on the fly. (Logged
in users would see a mix of public and private data.) For now I'd like
to keep using a single Solr instance for simplicity. For more
background on where I'm coming from, please see
http://colabti.org/irclogger/irclogger_log/lucene-dev?date=2014-02-06#l99
and 
https://trello.com/c/5z5PpR4r/50-design-solr-document-level-security-filter-solution

Thanks,

Phil

1. 
https://cwiki.apache.org/confluence/display/solr/Distributed+Search+with+Index+Sharding

2. 
http://solr.pl/en/2013/01/07/solr-4-1-solrcloud-multiple-shards-on-the-same-solr-node/

3. i.e. curl 
'http://localhost:8983/solr/admin/collections?action=CREATESHARD&shard=shardName&collection=collection1'
from  
https://cwiki.apache.org/confluence/display/solr/Collections+API#CollectionsAPI-CreateaShard

4. i.e. http://localhost:8983/solr/admin/cores?action=STATUS via
https://wiki.apache.org/solr/CoreAdmin#STATUS (or whatever the right
command would be to list shards)

-- 
Philip Durbin
Software Developer for http://thedata.org
http://www.iq.harvard.edu/people/philip-durbin


Re: Re-index Parent-Child Schema

2014-03-25 Thread Vijay Kokatnur
Hello Mikhail,

Thanks for the suggestions.  It took some time to get to this -

1. Field collapsing cannot be done on multi-valued fields -
https://wiki.apache.org/solr/FieldCollapsing

2. Join acts on documents, how can I use it to join multi-value fields in
the same document?

3. Block-join requires you to index parent and child documents together as a
block, using the IndexWriter.addDocuments API (see the sketch after this list)

4.  Concatenation requires me to index with those columns concatenated.
 This is not possible as I have around 20 multivalue fields.
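
For reference, the block-join option (3) would look roughly like this with
SolrJ 4.5+ - a sketch only, with a made-up "type" discriminator field and
IDs, and it does require reindexing orders and lines together as blocks:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class BlockJoinSketch {
    public static void main(String[] args) throws Exception {
        HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");

        // one order block: the parent plus its order lines as child documents
        SolrInputDocument order = new SolrInputDocument();
        order.addField("id", "123");
        order.addField("type", "order");     // hypothetical parent marker

        SolrInputDocument line = new SolrInputDocument();
        line.addField("id", "123-3");
        line.addField("BookingRecordId", "234");
        line.addField("OrderLineType", "13");
        order.addChildDocument(line);        // SolrJ >= 4.5

        server.add(order);
        server.commit();

        // parents where the SAME child line matches both conditions
        SolrQuery q = new SolrQuery(
            "{!parent which=\"type:order\"}+BookingRecordId:234 +OrderLineType:11");
        System.out.println(server.query(q).getResults().getNumFound());
    }
}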

Is there a way to solve this without changing how it's indexed?

Best,
-Vijay

On Thu, Mar 13, 2014 at 1:39 AM, Mikhail Khludnev <
mkhlud...@griddynamics.com> wrote:

> Hello Vijay,
> You can try FieldCollepsing, Join, Block-join, or just concatenate both
> field and search for concatenation.
>
>
> On Thu, Mar 13, 2014 at 7:16 AM, Vijay Kokatnur wrote:
>
> > Hi,
> >
> > I've inherited an Solr application with a Schema that contains
> parent-child
> > relationship.  All child elements are maintained in multi-value fields.
> > So an Order with 3 Order lines will result in an array of size 3 in Solr,
> >
> > This worked fine as long as clients queried only on Order, but with new
> > requirements it is serving inaccurate results.
> >
> > Consider some orders, for example -
> >
> >
> >  {
> > OrderId:123
> > BookingRecordId : ["145", "987", "*234*"]
> > OrderLineType : ["11", "12", "*13*"]
> > .
> > }
> >  {
> > OrderId:345
> > BookingRecordId : ["945", "882", "*234*"]
> > OrderLineType : ["1", "12", "*11*"]
> > .
> > }
> >  {
> > OrderId:678
> > BookingRecordId : ["444"]
> > OrderLineType : ["11"]
> > .
> > }
> >
> >
> > If you look up for an Order with BookingRecordId: 234 And
> OrderLineType:11.
> >  You will get two orders : 123 and 345, which is correct per Solr.   You
> > have two arrays in both the orders that satisfy this condition.
> >
> > However, for OrderId:123, the value at 3rd index of OrderLineType array
> is
> > 13 and not 11( this is for BookingRecordId:145) this should be excluded.
> >
> > Per this blog :
> >
> >
> http://blog.griddynamics.com/2011/06/solr-experience-search-parent-child.html
> >
> > I can't use span queries as I have tons of child elements to query and I
> > want to keep any changes to client queries to minimum.
> >
> > So is creating multiple indexes is the only way? We have 3 Physical boxes
> > with SolrCloud and at some point we would like to shard.
> >
> > Appreciate any inputs.
> >
> >
> > Best,
> >
> > -Vijay
> >
>
>
>
> --
> Sincerely yours
> Mikhail Khludnev
> Principal Engineer,
> Grid Dynamics
>
>


Re: Replication (Solr Cloud)

2014-03-25 Thread Michael Della Bitta
No, don't disable replication!

The way shards ordinarily keep up with updates is by sending every document
to each member of the shard. However, if a shard goes offline for a period
of time and comes back, replication is used to "catch up" that shard. So
you really need it on.

If you created your collection with the collections API and the required
bits are in schema.xml and solrconfig.xml, you should be good to go. See
https://wiki.apache.org/solr/SolrCloud#Required_Config

Michael Della Bitta

Applications Developer

o: +1 646 532 3062

appinions inc.

"The Science of Influence Marketing"

18 East 41st Street

New York, NY 10017

t: @appinions  | g+:
plus.google.com/appinions
w: appinions.com 


On Tue, Mar 25, 2014 at 12:42 PM, Software Dev wrote:

> I see that by default in SolrCloud my collections are replicating. Should
> this be disabled in SolrCloud, as this is already handled by it?
>
> From the documentation:
>
> "The Replication screen shows you the current replication state for
> the named core you have specified. In Solr, replication is for the
> index only. SolrCloud has supplanted much of this functionality, but
> if you are still using index replication, you can use this screen to
> see the replication state:"
>
> I just want to make sure before I disable it that if we send an update
> to one server that the document will be correctly replicated across
> all nodes. Thanks
>


Re: Replication (Solr Cloud)

2014-03-25 Thread Software Dev
Thanks for the reply. Ill make sure NOT to disable it.


Re: Solr Cloud collection keep going down?

2014-03-25 Thread Software Dev
Can anyone else chime in? Thanks

On Mon, Mar 24, 2014 at 10:10 AM, Software Dev
 wrote:
> Shawn,
>
> Thanks for pointing me in the right direction. After consulting the
> above document I *think* that the problem may be too large of a heap
> and which may be affecting GC collection and hence causing ZK
> timeouts.
>
> We have around 20G of memory on these machines, with heap min/max
> at 6 and 10 respectively (-Xms6G -Xmx10G). The rest was set aside for disk
> cache. Why did we choose 6-10? No other reason than that we wanted to allot
> enough for disk cache, and everything else was thrown at Solr. Does this
> sound about right?
>
> I took some screenshots for VisualVM and our NewRelic reporting as
> well as some relevant portions of our SolrConfig.xml. Any
> thoughts/comments would be greatly appreciated.
>
> http://postimg.org/gallery/4t73sdks/1fc10f9c/
>
> Thanks
>
>
>
>
> On Sat, Mar 22, 2014 at 2:26 PM, Shawn Heisey  wrote:
>> On 3/22/2014 1:23 PM, Software Dev wrote:
>>> We have 2 collections with 1 shard each replicated over 5 servers in the
>>> cluster. We see a lot of flapping (down or recovering) on one of the
>>> collections. When this happens the other collection hosted on the same
>>> machine is still marked as active. When this happens it takes a fairly long
>>> time (~30 minutes) for the collection to come back online, if at all. I
>>> find that its usually more reliable to completely shutdown solr on the
>>> affected machine and bring it back up with its core disabled. We then
>>> re-enable the core when its marked as active.
>>>
>>> A few questions:
>>>
>>> 1) What is the healthcheck in Solr-Cloud? Put another way, what is failing
>>> that marks one collection as down but the other on the same machine as up?
>>>
>>> 2) Why does recovery take forever when a node goes down.. even if its only
>>> down for 30 seconds. Our index is only 7-8G and we are running on SSD's.
>>>
>>> 3) What can be done to diagnose and fix this problem?
>>
>> Unless you are actually using the ping request handler, the healthcheck
>> config will not matter.  Or were you referring to something else?
>>
>> Referencing the logs you included in your reply:  The EofException
>> errors happen because your client code times out and disconnects before
>> the request it made has completed.  That is most likely just a symptom
>> that has nothing at all to do with the problem.
>>
>> Read the following wiki page.  What I'm going to say below will
>> reference information you can find there:
>>
>> http://wiki.apache.org/solr/SolrPerformanceProblems
>>
>> Relevant side note: The default zookeeper client timeout is 15 seconds.
>>  A typical zookeeper config defines tickTime as 2 seconds, and the
>> timeout cannot be configured to be more than 20 times the tickTime,
>> which means it cannot go beyond 40 seconds.  The default timeout value
>> 15 seconds is usually more than enough, unless you are having
>> performance problems.
>>
>> If you are not actually taking Solr instances down, then the fact that
>> you are seeing the log replay messages indicates to me that something is
>> taking so much time that the connection to Zookeeper times out.  When it
>> finally responds, it will attempt to recover the index, which means
>> first it will replay the transaction log and then it might replicate the
>> index from the shard leader.
>>
>> Replaying the transaction log is likely the reason it takes so long to
>> recover.  The wiki page I linked above has a "slow startup" section that
>> explains how to fix this.
>>
>> There is some kind of underlying problem that is causing the zookeeper
>> connection to timeout.  It is most likely garbage collection pauses or
>> insufficient RAM to cache the index, possibly both.
>>
>> You did not indicate how much total RAM you have or how big your Java
>> heap is.  As the wiki page mentions in the SSD section, SSD is not a
>> substitute for having enough RAM to cache a significant percentage of
>> your index.
>>
>> Thanks,
>> Shawn
>>
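
For reference, the tickTime arithmetic described above looks like this in
the configs (a sketch; 15000 ms is the Solr default zkClientTimeout, and
40000 ms is the ceiling implied by a 2-second tickTime):

    # zoo.cfg (ZooKeeper side)
    tickTime=2000
    # session timeouts are capped at 20 * tickTime = 40000 ms

    <!-- solr.xml (Solr side); older attribute-style solr.xml puts
         zkClientTimeout="15000" on the <cores> element instead -->
    <int name="zkClientTimeout">15000</int>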


Re: Solr Cloud collection keep going down?

2014-03-25 Thread Michael Della Bitta
What kind of load are the machines under when this happens? A lot of
writes? A lot of http connections?

Do your zookeeper logs mention anything about losing clients?

Have you tried turning on GC logging or profiling GC?

Have you tried running with a smaller max heap size, or
setting -XX:CMSInitiatingOccupancyFraction ?

Just a shot in the dark, since I'm not familiar with Jetty's logging
statements, but that looks like plain old dropped HTTP sockets to me.
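
For anyone trying those last two suggestions, the startup flags would look
roughly like this (a sketch assuming the CMS collector; the heap size and
occupancy fraction are illustrative values, not recommendations):

    java -Xms6g -Xmx6g \
         -XX:+UseConcMarkSweepGC \
         -XX:CMSInitiatingOccupancyFraction=70 \
         -XX:+UseCMSInitiatingOccupancyOnly \
         -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps \
         -Xloggc:gc.log \
         -jar start.jar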

Michael Della Bitta

Applications Developer

o: +1 646 532 3062

appinions inc.

"The Science of Influence Marketing"

18 East 41st Street

New York, NY 10017

t: @appinions  | g+:
plus.google.com/appinions
w: appinions.com 


On Tue, Mar 25, 2014 at 1:13 PM, Software Dev wrote:

> Can anyone else chime in? Thanks
>
> On Mon, Mar 24, 2014 at 10:10 AM, Software Dev
>  wrote:
> > Shawn,
> >
> > Thanks for pointing me in the right direction. After consulting the
> > above document I *think* that the problem may be too large of a heap
> > and which may be affecting GC collection and hence causing ZK
> > timeouts.
> >
> > We have around 20G of memory on these machines, with heap min/max at 6
> > and 10 respectively (-Xms6G -Xmx10G). The rest was set aside for disk
> > cache. Why did we choose 6-10? No other reason than that we wanted to
> > allot enough for disk cache, and everything else was thrown at Solr.
> > Does this sound about right?
> >
> > I took some screenshots for VisualVM and our NewRelic reporting as
> > well as some relevant portions of our SolrConfig.xml. Any
> > thoughts/comments would be greatly appreciated.
> >
> > http://postimg.org/gallery/4t73sdks/1fc10f9c/
> >
> > Thanks
> >
> >
> >
> >
> > On Sat, Mar 22, 2014 at 2:26 PM, Shawn Heisey  wrote:
> >> On 3/22/2014 1:23 PM, Software Dev wrote:
> >>> We have 2 collections with 1 shard each replicated over 5 servers in
> the
> >>> cluster. We see a lot of flapping (down or recovering) on one of the
> >>> collections. When this happens the other collection hosted on the same
> >>> machine is still marked as active. When this happens it takes a fairly
> long
> >>> time (~30 minutes) for the collection to come back online, if at all. I
> >>> find that its usually more reliable to completely shutdown solr on the
> >>> affected machine and bring it back up with its core disabled. We then
> >>> re-enable the core when its marked as active.
> >>>
> >>> A few questions:
> >>>
> >>> 1) What is the healthcheck in Solr-Cloud? Put another way, what is
> failing
> >>> that marks one collection as down but the other on the same machine as
> up?
> >>>
> >>> 2) Why does recovery take forever when a node goes down.. even if its
> only
> >>> down for 30 seconds. Our index is only 7-8G and we are running on
> SSD's.
> >>>
> >>> 3) What can be done to diagnose and fix this problem?
> >>
> >> Unless you are actually using the ping request handler, the healthcheck
> >> config will not matter.  Or were you referring to something else?
> >>
> >> Referencing the logs you included in your reply:  The EofException
> >> errors happen because your client code times out and disconnects before
> >> the request it made has completed.  That is most likely just a symptom
> >> that has nothing at all to do with the problem.
> >>
> >> Read the following wiki page.  What I'm going to say below will
> >> reference information you can find there:
> >>
> >> http://wiki.apache.org/solr/SolrPerformanceProblems
> >>
> >> Relevant side note: The default zookeeper client timeout is 15 seconds.
> >>  A typical zookeeper config defines tickTime as 2 seconds, and the
> >> timeout cannot be configured to be more than 20 times the tickTime,
> >> which means it cannot go beyond 40 seconds.  The default timeout value
> >> 15 seconds is usually more than enough, unless you are having
> >> performance problems.
> >>
> >> If you are not actually taking Solr instances down, then the fact that
> >> you are seeing the log replay messages indicates to me that something is
> >> taking so much time that the connection to Zookeeper times out.  When it
> >> finally responds, it will attempt to recover the index, which means
> >> first it will replay the transaction log and then it might replicate the
> >> index from the shard leader.
> >>
> >> Replaying the transaction log is likely the reason it takes so long to
> >> recover.  The wiki page I linked above has a "slow startup" section that
> >> explains how to fix this.
> >>
> >> There is some kind of underlying problem that is causing the zookeeper
> >> connection to timeout.  It is most likely garbage collection pauses or
> >> insufficient RAM to cache the index, possibly both.
> >>
> >> You did not indicate how much total RAM you have or how big your Java
> >> heap is.  As the wiki page mentions in the SSD section, SSD is not a
> >> substitute for having enough RAM to cache a significant percentage of
> >> your index.
> >>
> >> Thanks,
> >> Shawn

AND not as a boolean operator in Phrase

2014-03-25 Thread abhishek jain
hi friends,

when i search for "A and B" it gives me result for A , B , i am not sure
why?

Please guide how can i exact match when it is within phrase/quotes.

-- 
Thanks and kind Regards,
Abhishek jain


Re: AND not as a boolean operator in Phrase

2014-03-25 Thread Jack Krupansky

What does your field type analyzer look like?

I suspect that you have a stop filter which cause "and" to be removed.

-- Jack Krupansky
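
If that is the culprit, it looks like this in the field type definition
(a sketch; whether "and" is dropped depends on it being listed in
stopwords.txt):

    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt"/>

Removing the filter (or removing "and" from stopwords.txt) and reindexing
would let the phrase query see the term again.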

-Original Message- 
From: abhishek jain 
Sent: Tuesday, March 25, 2014 1:29 PM 
To: solr-user@lucene.apache.org 
Subject: AND not as a boolean operator in Phrase 


hi friends,

when i search for "A and B" it gives me result for A , B , i am not sure
why?

Please guide how can i exact match when it is within phrase/quotes.

--
Thanks and kind Regards,
Abhishek jain


Re: Replication (Solr Cloud)

2014-03-25 Thread Software Dev
One other question. If I optimize a collection on one node, does this
get replicated to all others when finished?

On Tue, Mar 25, 2014 at 10:13 AM, Software Dev
 wrote:
> Thanks for the reply. Ill make sure NOT to disable it.


Re: solr 4.x reindexing issues

2014-03-25 Thread Ravi Solr
I am also seeing the following in the log. Is it really committing??? Now I
am totally confused about how solr 4.x indexes. My relevant update config
is shown below:

  
<updateHandler class="solr.DirectUpdateHandler2">
    <autoCommit>
       <maxDocs>100</maxDocs>
       <maxTime>12</maxTime>
       <openSearcher>false</openSearcher>
    </autoCommit>
  </updateHandler>

[#|2014-03-25T13:44:03.765-0400|INFO|glassfish3.1.2|javax.enterprise.system.std.com.sun.enterprise.server.logging|_ThreadID=86;_ThreadName=commitScheduler-6-thread-1;|820509
[commitScheduler-6-thread-1] INFO  org.apache.solr.update.UpdateHandler  -
start
commit{,optimize=false,openSearcher=false,waitSearcher=true,expungeDeletes=false,softCommit=false,prepareCommit=false}
|#]

[#|2014-03-25T13:44:03.766-0400|INFO|glassfish3.1.2|javax.enterprise.system.std.com.sun.enterprise.server.logging|_ThreadID=83;_ThreadName=http-thread-pool-8080(4);|820510
[http-thread-pool-8080(4)] INFO
org.apache.solr.update.processor.LogUpdateProcessor  - [sitesearchcore]
webapp=/solr-admin path=/update params={wt=javabin&version=2}
{add=[09f693e6-9a6f-11e3-9900-dd917233cf9c]} 0 13
|#]

[#|2014-03-25T13:44:03.898-0400|INFO|glassfish3.1.2|javax.enterprise.system.std.com.sun.enterprise.server.logging|_ThreadID=86;_ThreadName=commitScheduler-6-thread-1;|820642
[commitScheduler-6-thread-1] INFO  org.apache.solr.core.SolrCore  -
SolrDeletionPolicy.onCommit: commits: num=3

commit{dir=/data/solr/core/sitesearch-data/index,segFN=segments_9y68,generation=464192}

commit{dir=/data/solr/core/sitesearch-data/index,segFN=segments_9yjf,generation=464667}

commit{dir=/data/solr/core/sitesearch-data/index,segFN=segments_9yjg,generation=464668}
|#]

[#|2014-03-25T13:44:03.898-0400|INFO|glassfish3.1.2|javax.enterprise.system.std.com.sun.enterprise.server.logging|_ThreadID=86;_ThreadName=commitScheduler-6-thread-1;|820642
[commitScheduler-6-thread-1] INFO  org.apache.solr.core.SolrCore  - newest
commit generation = 464668
|#]

[#|2014-03-25T13:44:03.908-0400|INFO|glassfish3.1.2|javax.enterprise.system.std.com.sun.enterprise.server.logging|_ThreadID=86;_ThreadName=commitScheduler-6-thread-1;|820652
[commitScheduler-6-thread-1] INFO
org.apache.solr.search.SolrIndexSearcher  - Opening
Searcher@1e2ca86e[sitesearchcore]
realtime
|#]

[#|2014-03-25T13:44:03.909-0400|INFO|glassfish3.1.2|javax.enterprise.system.std.com.sun.enterprise.server.logging|_ThreadID=86;_ThreadName=commitScheduler-6-thread-1;|820653
[commitScheduler-6-thread-1] INFO  org.apache.solr.update.UpdateHandler  -
end_commit_flush


Thanks

Ravi Kiran Bhaskar


On Tue, Mar 25, 2014 at 1:10 PM, Ravi Solr  wrote:

> Thank you very much for responding Mr. Høydahl. I removed the recursion
> which eliminated the stack overflow exception. However, I am still
> encountering my main problem with the docs not getting indexed in solr 4.x
> as I mentioned in my original email. The reason I am reindexing is that
> with solr 4.x EnglishPorterFilterFactory has been removed and also I wanted
> to add another copyField of all field values into destination "allfields"
>
> As per your suggestion I removed softCommit and set autoCommit to maxDocs
> 100 and maxTime to 12. I was printing out the indexing calls... You can
> clearly see it still indexes only around 10 at a time (test code and
> results shown below). Again, my code finished fully, and for good
> measure I committed manually after 10 minutes; still, when I query I only
> see 13513 docs indexed.
>
> There must be something else I am missing
>
> <response>
>   <lst name="responseHeader">
>     <int name="status">0</int>
>     <int name="QTime">1</int>
>     <lst name="params">
>       <str name="q">allfields:[* TO *]</str>
>       <str name="wt">xml</str>
>       <str name="rows">0</str>
>     </lst>
>   </lst>
>   <result name="response" numFound="13513" start="0"/>
> </response>
>
> TEST INDEXER CODE
>  ---
> Long total = null;
> Integer start = 0;
> Integer rows = 100;
> while(total == null || total >= (start+rows)) {
>
> SolrQuery query = new SolrQuery();
> query.setQuery("*:*");
> query.setSort("displaydatetime", ORDER.desc);
>
> query.addFilterQuery("-allfields:[* TO *]");
> QueryResponse resp = server.query(query);
> SolrDocumentList list =  resp.getResults();
> total = list.getNumFound();
>
> if(list != null && !list.isEmpty()) {
> for(SolrDocument doc : list) {
> SolrInputDocument iDoc =
> ClientUtils.toSolrInputDocument(doc);
> //To index full doc again
> iDoc.removeField("_version_");
> server.add(iDoc);
>
> }
>
> System.out.println("Indexed " + (start+rows) + "/" +
> total);
> start = (start+rows);
> }
> }
>
>System.out.println("COMPLETELY DONE");
>
> System.out output
> -
> Indexed 1252100/1256575
> Indexed 1252200/1256575
> Indexed 1252300/1256575
> Indexed 1252400/1256575
> Indexed 1252500/1256575
> Indexed 1252600/1256575
> Indexed 1252700/1256575
> Indexed 1252800/1256575
> Indexed 1252900/1256575
> Indexed 1253000/1256575
> Indexed 1253100/1256566
> Indexed 1253

Re: Replication (Solr Cloud)

2014-03-25 Thread Software Dev
Ehh.. found out the hard way. I optimized the collection on 1 machine
and when it was completed it replicated to the others and took my
cluster down. Shitty

On Tue, Mar 25, 2014 at 10:46 AM, Software Dev
 wrote:
> One other question. If I optimize a collection on one node, does this
> get replicated to all others when finished?
>
> On Tue, Mar 25, 2014 at 10:13 AM, Software Dev
>  wrote:
>> Thanks for the reply. Ill make sure NOT to disable it.


Re: Replication (Solr Cloud)

2014-03-25 Thread Shawn Heisey

On 3/25/2014 11:59 AM, Software Dev wrote:

Ehh.. found out the hard way. I optimized the collection on 1 machine
and when it was completed it replicated to the others and took my
cluster down. Shitty


It doesn't get replicated -- each core in the collection will be 
optimized.  In older versions it might have done them all at once, but I 
believe that newer versions only do one core at a time.


Doing an optimize on a Solr core results in a LOT of I/O. If your Solr 
install is having performance issues, that will push it over the edge.  
When SolrCloud ends up with a performance problem in one place, they 
tend to multiply and cause MORE problems.  It can get bad enough that 
the whole cluster goes down because it's trying to do a recovery on 
every node.  For that reason, it's extremely important that you have 
enough system resources available across your cloud (RAM in particular) 
to avoid performance issues.


Thanks,
Shawn



Re: Replication (Solr Cloud)

2014-03-25 Thread Software Dev
So it's generally a bad idea to optimize, I gather?

- In older versions it might have done them all at once, but I believe
that newer versions only do one core at a time.

On Tue, Mar 25, 2014 at 11:16 AM, Shawn Heisey  wrote:
> On 3/25/2014 11:59 AM, Software Dev wrote:
>>
>> Ehh.. found out the hard way. I optimized the collection on 1 machine
>> and when it was completed it replicated to the others and took my
>> cluster down. Shitty
>
>
> It doesn't get replicated -- each core in the collection will be optimized.
> In older versions it might have done them all at once, but I believe that
> newer versions only do one core at a time.
>
> Doing an optimize on a Solr core results in a LOT of I/O. If your Solr
> install is having performance issues, that will push it over the edge.  When
> SolrCloud ends up with a performance problem in one place, they tend to
> multiply and cause MORE problems.  It can get bad enough that the whole
> cluster goes down because it's trying to do a recovery on every node.  For
> that reason, it's extremely important that you have enough system resources
> available across your cloud (RAM in particular) to avoid performance issues.
>
> Thanks,
> Shawn
>


Re: Replication (Solr Cloud)

2014-03-25 Thread Software Dev
"In older versions it might have done them all at once, but I believe
that newer versions only do one core at a time."

It looks like it did them all at once, and I'm on the latest (4.7).

On Tue, Mar 25, 2014 at 11:27 AM, Software Dev
 wrote:
> So it's generally a bad idea to optimize, I gather?
>
> - In older versions it might have done them all at once, but I believe
> that newer versions only do one core at a time.
>
> On Tue, Mar 25, 2014 at 11:16 AM, Shawn Heisey  wrote:
>> On 3/25/2014 11:59 AM, Software Dev wrote:
>>>
>>> Ehh.. found out the hard way. I optimized the collection on 1 machine
>>> and when it was completed it replicated to the others and took my
>>> cluster down. Shitty
>>
>>
>> It doesn't get replicated -- each core in the collection will be optimized.
>> In older versions it might have done them all at once, but I believe that
>> newer versions only do one core at a time.
>>
>> Doing an optimize on a Solr core results in a LOT of I/O. If your Solr
>> install is having performance issues, that will push it over the edge.  When
>> SolrCloud ends up with a performance problem in one place, they tend to
>> multiply and cause MORE problems.  It can get bad enough that the whole
>> cluster goes down because it's trying to do a recovery on every node.  For
>> that reason, it's extremely important that you have enough system resources
>> available across your cloud (RAM in particular) to avoid performance issues.
>>
>> Thanks,
>> Shawn
>>


Re: solr 4.x reindexing issues

2014-03-25 Thread Lan
Ravi,

It looks like you are re-indexing data by pulling data from your solr server
and then indexing it back to the same server. I can think of many things
that could go wrong with this setup. For example, are all your fields stored?
Since you are iterating through all documents on the solr server and at the
same time modifying the index, the sort order could change.

To make it easier to identify any bugs in your process, you should index
into a second solr server that is *EMPTY* so you can identify any problems.

Generally when people re-index data, they don't pull the data from Solr but
from a system of record such as a DB.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/solr-4-x-reindexing-issues-tp4126695p4126986.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Replication (Solr Cloud)

2014-03-25 Thread Walter Underwood
Yes, it is generally a bad idea to optimize.

The system continually does merges as needed. You generally do not need to 
force a full merge.

wunder

On Mar 25, 2014, at 11:27 AM, Software Dev  wrote:

> So it's generally a bad idea to optimize, I gather?
> 
> - In older versions it might have done them all at once, but I believe
> that newer versions only do one core at a time.
> 
> On Tue, Mar 25, 2014 at 11:16 AM, Shawn Heisey  wrote:
>> On 3/25/2014 11:59 AM, Software Dev wrote:
>>> 
>>> Ehh.. found out the hard way. I optimized the collection on 1 machine
>>> and when it was completed it replicated to the others and took my
>>> cluster down. Shitty
>> 
>> 
>> It doesn't get replicated -- each core in the collection will be optimized.
>> In older versions it might have done them all at once, but I believe that
>> newer versions only do one core at a time.
>> 
>> Doing an optimize on a Solr core results in a LOT of I/O. If your Solr
>> install is having performance issues, that will push it over the edge.  When
>> SolrCloud ends up with a performance problem in one place, they tend to
>> multiply and cause MORE problems.  It can get bad enough that the whole
>> cluster goes down because it's trying to do a recovery on every node.  For
>> that reason, it's extremely important that you have enough system resources
>> available across your cloud (RAM in particular) to avoid performance issues.
>> 
>> Thanks,
>> Shawn
>> 

--
Walter Underwood
wun...@wunderwood.org





document level security filter solution for Solr

2014-03-25 Thread Philip Durbin
I'm new to Solr and I'm looking for a document level security filter
solution. Anonymous users searching my application should be able to
find public data. Logged in users should be able to find public data
and private data they have access to.

Earlier today I wrote about shards as a possible solution. I got a
great reply from Shalin Shekhar Mangar of LucidWorks explaining how to
achieve something technical but I'd like to back up a minute and
consider other solutions.

For one thing, I'm concerned about the potential misuse of shards.
Judging from this wiki page, shards seem to be used primarily for
scalability rather than security (access control): "When an index
becomes too large to fit on a single system..." -
https://cwiki.apache.org/confluence/display/solr/Distributed+Search+with+Index+Sharding

For consistency with a longer writeup of mine on this topic [1], I'm
going to refer to the sharding solution as Option 4. Here's the full
list of options I'm aware of for document level security filtering:

1. Manifold CF (Connector Framework)

http://manifoldcf.apache.org

2. ACL PostFilter (ACLs in each document)

Specifically, I mean this wonderful writeup by Erik Hatcher from
LucidWorks: http://java.dzone.com/articles/custom-security-filtering-solr

3. Pass a (often long) list of IDs in query

Representative question:
http://lucene.472066.n3.nabble.com/Solr-large-boolean-filter-td4070747.html

4. Sharding (public shard, private shards per user)

My post from earlier today:
http://lucene.472066.n3.nabble.com/creating-shards-on-the-fly-in-a-single-Solr-instance-quot-shards-quot-query-parameter-td4126909.html

I'm happy to hear opinions on any of these solutions or others I
haven't even considered!

Thanks!

Phil

1. My longer writeup of this topic:
https://trello.com/c/5z5PpR4r/50-design-solr-document-level-security-filter-solution

-- 
Philip Durbin
Software Developer for http://thedata.org
http://www.iq.harvard.edu/people/philip-durbin


Re: document level security filter solution for Solr

2014-03-25 Thread Yonik Seeley
Depending on requirements, another option for simple security is to
store the security info in the index and utilize a join.  This really
only works when you have a single shard since joins aren't
distributed.

# the documents, with permissions
id:doc1, perms:public,...
id:doc2, perms:group1 group2 joe, ...
id:doc3, perms:group3, ...

# documents modeling users and what groups they belong to
id:joe, groups:joe public  group3
id:mark, groups:mark public group1 group2

And then if joe does a query, you add a filter query like the following
fq={!join from=groups to=perms v=id:joe}

The user documents can either be in the same collection, or in a
separate "core" as long as it's co-located in the same JVM (core
container), and you can do a cross-core join.

-Yonik
http://heliosearch.org - solve Solr GC pauses with off-heap filters
and fieldcache
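
A minimal SolrJ sketch of that filter (assuming SolrJ 4.x, a single-shard
core at http://localhost:8983/solr/collection1, and the perms/groups
modeling shown above):

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class AclJoinExample {
        public static void main(String[] args) throws Exception {
            HttpSolrServer server =
                new HttpSolrServer("http://localhost:8983/solr/collection1");

            SolrQuery query = new SolrQuery("some user query");
            // Keep only documents whose perms field matches one of the
            // groups on the user document with id:joe.
            query.addFilterQuery("{!join from=groups to=perms v=id:joe}");

            QueryResponse resp = server.query(query);
            System.out.println("matches: " + resp.getResults().getNumFound());
        }
    }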


On Tue, Mar 25, 2014 at 3:06 PM, Philip Durbin
 wrote:
> I'm new to Solr and I'm looking for a document level security filter
> solution. Anonymous users searching my application should be able to
> find public data. Logged in users should be able to find public data
> and private data they have access to.
>
> Earlier today I wrote about shards as a possible solution. I got a
> great reply from Shalin Shekhar Mangar of LucidWorks explaining how to
> achieve something technical but I'd like to back up a minute and
> consider other solutions.
>
> For one thing, I'm concerned about the potential misuse of shards.
> Judging from this wiki page, shards seem to be used primarily for
> scalability rather than security (access control): "When an index
> becomes too large to fit on a single system..." -
> https://cwiki.apache.org/confluence/display/solr/Distributed+Search+with+Index+Sharding
>
> For consistency with a longer writeup of mine on this topic [1], I'm
> going to refer to the sharding solution as Option 4. Here's the full
> list of options I'm aware of for document level security filtering:
>
> 1. Manifold CF (Connector Framework)
>
> http://manifoldcf.apache.org
>
> 2. ACL PostFilter (ACLs in each document)
>
> Specifically, I mean this wonderful writeup by Erik Hatcher from
> LucidWorks: http://java.dzone.com/articles/custom-security-filtering-solr
>
> 3. Pass a (often long) list of IDs in query
>
> Representative question:
> http://lucene.472066.n3.nabble.com/Solr-large-boolean-filter-td4070747.html
>
> 4. Sharding (public shard, private shards per user)
>
> My post from earlier today:
> http://lucene.472066.n3.nabble.com/creating-shards-on-the-fly-in-a-single-Solr-instance-quot-shards-quot-query-parameter-td4126909.html
>
> I'm happy to hear opinions on any of these solutions or others I
> haven't even considered!
>
> Thanks!
>
> Phil
>
> 1. My longer writeup of this topic:
> https://trello.com/c/5z5PpR4r/50-design-solr-document-level-security-filter-solution
>
> --
> Philip Durbin
> Software Developer for http://thedata.org
> http://www.iq.harvard.edu/people/philip-durbin


Re: solr 4.x reindexing issues

2014-03-25 Thread Ravi Solr
I even tried reading from core A and indexing it into core B, and the
same issue persists.


On Tue, Mar 25, 2014 at 2:49 PM, Lan  wrote:

> Ravi,
>
> It looks like you are re-indexing data by pulling data from your solr
> server
> and then indexing it back to the same server. I can think of many things
> that could go wrong with this setup. For example, are all your fields
> stored?
> Since you are iterating through all documents on the solr server and at the
> same time modifying the index, the sort order could change.
>
> To make it easier to identify any bugs in your process, you should index
> into a second solr server that is *EMPTY* so you can identify any problems.
>
> Generally when people re-index data, they don't pull the data from Solr but
> from a system of record such as a DB.
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/solr-4-x-reindexing-issues-tp4126695p4126986.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: solr 4.x reindexing issues

2014-03-25 Thread Ravi Solr
Sorry guys, I really apologize for wasting your time... bone-headed coding
on my part. I did not set rows and start to the correct values for proper
pagination, so it was getting the same 10 docs every single time.

Thanks
Ravi Kiran Bhaskar
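
For the archives, the fix amounts to the two setters missing from the loop
posted earlier (a sketch; variable names and imports as in that code):

    SolrQuery query = new SolrQuery("*:*");
    query.setSort("displaydatetime", ORDER.desc);
    query.addFilterQuery("-allfields:[* TO *]");
    query.setStart(start);  // was never set, so every page began at 0
    query.setRows(rows);    // default rows is 10, hence the same 10 docs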


On Tue, Mar 25, 2014 at 3:50 PM, Ravi Solr  wrote:

> I even tried reading from core A and indexing it into core B, and the
> same issue persists.
>
>
> On Tue, Mar 25, 2014 at 2:49 PM, Lan  wrote:
>
>> Ravi,
>>
>> It looks like you are re-indexing data by pulling data from your solr
>> server
>> and then indexing it back to the same server. I can think of many things
>> that could go wrong with this setup. For example, are all your fields
>> stored?
>> Since you are iterating through all documents on the solr server and at
>> the
>> same time modifying the index, the sort order could change.
>>
>> To make it easier to identify any bugs in your process, you should index
>> into a second solr server that is *EMPTY* so you can identify any
>> problems.
>>
>> Generally when people re-index data, they don't pull the data from Solr but
>> from a system of record such as a DB.
>>
>>
>>
>> --
>> View this message in context:
>> http://lucene.472066.n3.nabble.com/solr-4-x-reindexing-issues-tp4126695p4126986.html
>> Sent from the Solr - User mailing list archive at Nabble.com.
>>
>
>


Re: Question on highlighting edgegrams

2014-03-25 Thread Software Dev
Same problem here:
http://lucene.472066.n3.nabble.com/Solr-4-x-EdgeNGramFilterFactory-and-highlighting-td4114748.html

On Tue, Mar 25, 2014 at 9:39 AM, Software Dev  wrote:
> Bump
>
> On Mon, Mar 24, 2014 at 3:00 PM, Software Dev  
> wrote:
>> In 3.5.0 we have the following.
>>
>> <fieldType name="..." class="solr.TextField"
>>     positionIncrementGap="100">
>>   <analyzer type="index">
>>     <tokenizer class="..."/>
>>     <filter class="solr.EdgeNGramFilterFactory" minGramSize="1"
>>         maxGramSize="30"/>
>>   </analyzer>
>>   <analyzer type="query">
>>     <tokenizer class="..."/>
>>     <filter class="..."/>
>>   </analyzer>
>> </fieldType>
>>
>> If we searched for "c" with highlighting enabled we would get back
>> results such as:
>>
>> cdat
>> crocdile
>> cool beans
>>
>> But in the latest Solr (4.7) we get the full words highlighted back.
>> Did something change between these versions with regard to highlighting?
>>
>> Thanks


Re: Multiple search analyzers question

2014-03-25 Thread Gora Mohanty
On Mar 25, 2014 10:37 PM, "ku3ia"  wrote:
>
> Hi all!
> Now I have a default search field, defined as
>
> <fieldType name="..." class="solr.TextField" ...
>     autoGeneratePhraseQueries="true">
>   <analyzer type="index">
>     <tokenizer class="..."/>
>     <filter class="..."/>
>     <filter class="..."/>
>     <filter class="solr.KStemFilterFactory"/>
>     <filter class="..." words="..." ignoreCase="true"/>
>     <filter class="..."/>
>   </analyzer>
>   <analyzer type="query">
>     <tokenizer class="..."/>
>     <filter class="..."/>
>     <filter class="..."/>
>     <filter class="solr.KStemFilterFactory"/>
>     <filter class="..." words="..." ignoreCase="true"/>
>     <filter class="..."/>
>   </analyzer>
> </fieldType>
>
> In the future, I will need to search using my current field (with KStem
> filter) and also need an alternative search without the KStem filter. The
> easiest way is to add a copy field and declare a new field type (w/o KStem):
>
> <fieldType name="..." class="solr.TextField" ...
>     autoGeneratePhraseQueries="true">
>   <analyzer type="index">
>     <tokenizer class="..."/>
>     <filter class="..."/>
>     <filter class="..."/>
>     <filter class="..." words="..." ignoreCase="true"/>
>     <filter class="..."/>
>   </analyzer>
>   <analyzer type="query">
>     <tokenizer class="..."/>
>     <filter class="..."/>
>     <filter class="..."/>
>     <filter class="..." words="..." ignoreCase="true"/>
>     <filter class="..."/>
>   </analyzer>
> </fieldType>
>
> and to re-index all my data.
> Is there any alternative way?
[...]

No. If your analysers change, and/or you add new fields, you will need to
reindex.

Regards,
Gora
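
For reference, the copy-field wiring being discussed would look roughly
like this in schema.xml (field and type names here are illustrative):

    <field name="text_kstem"   type="text_with_kstem" indexed="true" stored="false"/>
    <field name="text_nokstem" type="text_no_kstem"   indexed="true" stored="false"/>
    <copyField source="text_kstem" dest="text_nokstem"/>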


DIH dataimport.properties Zulu time

2014-03-25 Thread Kiran J
Hi

Is it possible to set up the data import handler so that it keeps track of
the last imported time in Zulu time and not local time ?

It's not very clear from the documentation how to do this, or if it is even
possible.

Ref:

http://wiki.apache.org/solr/DataImportHandler#Configuring_The_Property_Writer


Thanks


Memory Problems + java.lang.ref.Finalizer

2014-03-25 Thread Harish Agarwal
In reference to my prior thread:

http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201403.mbox/%3ccac-cpvrzbhizomcdhkrhygqizguerntkwtkxwwx3j1rqcxe...@mail.gmail.com%3E

I followed the advice to set unmap=false on my indexes with promising
results.  Without performing any index updates I am seeing very stable
memory usage and GC performance.  However, I'm still sporadically seeing
memory problems after replicating index updates.  After performing another
heap dump, I noticed that:

java.lang.ref.Finalizer

is occupying an inordinate amount of space. jmap reports it as an
'unreachable object'; however, GC was obviously not cleaning it up.

I'm really not sure how to hunt down this problem, any help would be
appreciated.

-Harish
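
The kind of inspection described above can be reproduced with jmap (a
sketch; <pid> stands for the Solr JVM's process id):

    jmap -dump:live,format=b,file=solr-heap.hprof <pid>   # dump for a profiler
    jmap -histo <pid> | head -20    # quick class histogram; Finalizer shows up here

Note that -dump:live forces a full GC first, so truly unreachable objects
will not appear in that dump.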


Re: leaks in solr

2014-03-25 Thread harish.agarwal
I'm having a very similar issue to this currently on 4.6.0 (large
java.lang.ref.Finalizer usage, many open file handles to long-gone files) --
were you able to make any progress diagnosing this issue?



--
View this message in context: 
http://lucene.472066.n3.nabble.com/leaks-in-solr-tp3992047p4127015.html
Sent from the Solr - User mailing list archive at Nabble.com.


What contributes to disk IO?

2014-03-25 Thread Software Dev
What are the main contributing factors for Solr Cloud generating a lot
of disk IO?

A lot of reads? Writes? Insufficient RAM?

I would think that if there were enough disk cache available for the whole
index, there would be little to no disk IO.


RE: intersect query

2014-03-25 Thread Susheel Kumar
How big is your index? #documents, #size?

Thanks,
Susheel

-Original Message-
From: cmd.ares [mailto:cmd.a...@gmail.com] 
Sent: Tuesday, March 25, 2014 4:50 AM
To: solr-user@lucene.apache.org
Subject: intersect query

my_index(one core):
id,dealer,productName,amount,region 
1,A1,iphone4,400,east
2,A1,iphone4s,450,east
3,A2,iphone5s,550,east
..
4,A1,iphone4,400,west
5,A1,iphone4s,450,west
6,A3,iphone5s,550,west
..


- I'd like to find which dealers sell the 'iphone' in both the 'east' and
'west' regions.
pl/sql reference implementation:
1:
select dealer from my_index where region='east' and productName like
'%iphone%'
intersect
select dealer from my_index where region='west' and productName like
'%iphone%'
2:
select distinct dealer from my_index where region='east' and productName
like '%iphone%' and sales in (
select dealer from my_index where region='west' and productName like
'%iphone%'
)

solr reference implementation:
1.query parameters:
q=region:east AND productName:iphone
&fq={!join from=dealer to=dealer}(region:west AND productName:iphone)
&facet=true&facet.filed=dealer&facet.mincount=1

2.query parameters:
q=region:east AND productName:iphone({!join from=dealer
to=dealer}region:west AND productName:iphone)
&facet=true&facet.filed=dealer&facet.mincount=1

With a big index, the query is very slow. Is there an efficient way to
improve performance?

1. Must I use the Solr join feature, or is there another approach?
2. Can multi-core shards improve performance?
/***
as the wiki said:
In a DistributedSearch environment, you can not Join across cores on
multiple nodes. 
If however you have a custom sharding approach, you could join across cores
on the same node.
***/



--
View this message in context: 
http://lucene.472066.n3.nabble.com/intersect-query-tp4126828.html
Sent from the Solr - User mailing list archive at Nabble.com.




Re: AND not as a boolean operator in Phrase

2014-03-25 Thread Koji Sekiguchi

(2014/03/26 2:29), abhishek jain wrote:

hi friends,

when i search for "A and B" it gives me result for A , B , i am not sure
why?

Please guide how can i exact match when it is within phrase/quotes.


Generally speaking (w/ LuceneQParser), if you want phrase match results,
use quotes, i.e. q="A B". If you want results which contain both terms A
and B, do not use quotes but boolean operator AND, i.e. q=A AND B.

koji
--
http://soleami.com/blog/comparing-document-classification-functions-of-lucene-and-mahout.html
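
To make the two intents concrete (LuceneQParser syntax, assuming the
default OR operator):

    q="A B"      phrase match: A immediately followed by B
    q=A AND B    both terms required, anywhere in the document
    q=A B        either term suffices (with the default operator)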


Re: AND not as a boolean operator in Phrase

2014-03-25 Thread François Schiettecatte
Better to use '+A +B' rather than AND/OR; see:

http://searchhub.org/2011/12/28/why-not-and-or-and-not/

François

On Mar 25, 2014, at 10:21 PM, Koji Sekiguchi  wrote:

> (2014/03/26 2:29), abhishek jain wrote:
>> hi friends,
>> 
>> when i search for "A and B" it gives me result for A , B , i am not sure
>> why?
>> 
>> Please guide how can i exact match when it is within phrase/quotes.
> 
> Generally speaking (w/ LuceneQParser), if you want phrase match results,
> use quotes, i.e. q="A B". If you want results which contain both terms A
> and B, do not use quotes but boolean operator AND, i.e. q=A AND B.
> 
> koji
> -- 
> http://soleami.com/blog/comparing-document-classification-functions-of-lucene-and-mahout.html





Re: DIH dataimport.properties Zulu time

2014-03-25 Thread Gora Mohanty
On 26 March 2014 02:44, Kiran J  wrote:
>
> Hi
>
> Is it possible to set up the data import handler so that it keeps track of
> the last imported time in Zulu time and not local time ?
[...]

Start your JVM with the desired timezone, e.g.,
java -Duser.timezone=UTC -jar start.jar

Regards,
Gora


Issue with passing local params 4.7

2014-03-25 Thread William Bell
&q_score=cancer


http://hgsolr2testsl:8983/solr/autosuggest/select?omitHeader=false&q=cancer&pt1=39.740009,-104.992264&qt=joinautopraccond2&wt=json&rows=100&echoParams=all&fl=user_query,$p_score,$s_score,q_score:query({!dismax%20qf=%22user_query_edge^1%20user_query^0.5%20user_query_fuzzy%22%20v=$q_score}),priority,synonym_rank


That does not work. We need to use v=cancer or v='cancer' and it works. Why?


http://hgsolr2testsl:8983/solr/autosuggest/select?omitHeader=false&q=cancer&pt1=39.740009,-104.992264&qt=joinautopraccond2&wt=json&rows=100&echoParams=all&fl=user_query,$p_score,$s_score,q_score:query({!dismax%20qf=%22user_query_edge^1%20user_query^0.5%20user_query_fuzzy%22%20v=cancer}),priority,synonym_rank&sort=rint(product(sum($p_score,$s_score,query({!dismax%20qf=%22user_query_edge^1%20user_query^0.5%20user_query_fuzzy%22%20v=cancer})),100))%20desc,s_query%20asc



-- 
Bill Bell
billnb...@gmail.com
cell 720-256-8076


Re: Issue with passing local params 4.7

2014-03-25 Thread William Bell
"sort": "rint(product(sum($p_score,$s_score,$q_score),100)) desc,s_query asc
","tie": "1","q1": "$q","q_score": "query({!dismax qf=\"user_query_edge^1
user_query^0.5 user_query_fuzzy\" v=$q1})",
I also tried q1=cancer... It does not work unless I set v='cancer'


On Tue, Mar 25, 2014 at 9:12 PM, William Bell  wrote:

> &q_score=cancer
>
>
>
> http://hgsolr2testsl:8983/solr/autosuggest/select?omitHeader=false&q=cancer&pt1=39.740009,-104.992264&qt=joinautopraccond2&wt=json&rows=100&echoParams=all&fl=user_query,$p_score,$s_score,q_score:query({!dismax%20qf=%22user_query_edge^1%20user_query^0.5%20user_query_fuzzy%22%20v=$q_score}),priority,synonym_rank
>
>
> That does not work. We need to use v=cancer or v='cancer' and it works.
> Why?
>
>
>
> http://hgsolr2testsl:8983/solr/autosuggest/select?omitHeader=false&q=cancer&pt1=39.740009,-104.992264&qt=joinautopraccond2&wt=json&rows=100&echoParams=all&fl=user_query,$p_score,$s_score,q_score:query({!dismax%20qf=%22user_query_edge^1%20user_query^0.5%20user_query_fuzzy%22%20v=cancer}),priority,synonym_rank&sort=rint(product(sum($p_score,$s_score,query({!dismax%20qf=%22user_query_edge^1%20user_query^0.5%20user_query_fuzzy%22%20v=cancer})),100))%20desc,s_query%20asc
>
>
>
> --
> Bill Bell
> billnb...@gmail.com
> cell 720-256-8076
>



-- 
Bill Bell
billnb...@gmail.com
cell 720-256-8076


Re: search WITH or WITHOUT accents (selection at runtime) + highlights

2014-03-25 Thread Alexandre Rafalovitch
Why can't you use just two fields? Both stored; one has an accent-folding
filter, the other does not. Then just choose the one you want to search.
Are your fields so big that you are worried about content duplication?
That could be premature optimization.
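
A sketch of that two-field setup (names are made up; the folded variant
just adds an accent-folding filter such as ASCIIFoldingFilterFactory):

    <field name="body"        type="text_accents" indexed="true" stored="true"/>
    <field name="body_folded" type="text_folded"  indexed="true" stored="true"/>
    <copyField source="body" dest="body_folded"/>

    <fieldType name="text_folded" class="solr.TextField">
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.ASCIIFoldingFilterFactory"/>
      </analyzer>
    </fieldType>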


As to OK/not-OK: have you by any chance changed the field definition and
not reindexed? That could do it. Otherwise, the treatment should be the
same (although I would actually expect both !stored fields to fail with
highlighting).

Regards,
   Alex.
Personal website: http://www.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all
at once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD
book)


On Tue, Mar 25, 2014 at 10:10 PM, elfu  wrote:
> Hello,
>
> I have the following problem to resolve using solr:
>
> search WITH or WITHOUT accents (selection at runtime) + highlights
>
> how can I configure the schema to achieve this?
>
> for example:
>
> inputString  "aaa près bbb pres"
>
> A) accent sensitive
>
>   1.search for *près*   highlight  ="aaa près bbb pres"
>   2.search for *pres*   highlight  ="aaa près bbb pres"
>
> B) accent insensitive
>
>   1. search for *près*highlight = "aaa près bbb pres"
>   2. search for *pres*highlight = "aaa près bbb pres"
>
> I tried with 3 fields: 1 for inputString storage and highlighting, and 2 for indexes
>
> field_S (stored, !indexed, used for highlight = accent insensitive)
> field_A (!stored, indexed, keep accents  = accent sensitive)
> field_B (!stored, indexed, remove accents = accent insensitive)
>
> when I search without accents on field_B the highlight is OK (field_S
> query/index op like field_B),
> but when I search with accents on field_A the highlight is not OK anymore...
>
> thx for help
> regards,
> e
>
> Can the highlight engine work with the indexAnalyzer and/or queryAnalyzer
> from other schema fields?
> (If I want to highlight the result of an accent-sensitive search, can the
> content be parsed with the indexAnalyzer while
> hl.q uses the queryAnalyzer from field_A?)
>
>