Re: inconsistent results when faceting on multivalued field

2011-10-21 Thread pravesh
Could you clarify the below:
>>When I make a search on facet.qua_code=1234567 ??

Are you trying to say that when you fire a fresh search for a facet item, like
q=qua_code:1234567?

This would fetch documents where the qua_code field contains either the term
1234567 alone OR both terms (1234567 and 9384738, among other terms).
That is because it is a multivalued field, and hence if you look at the facet,
it is shown for both terms.

>>If I reword the query as 'facet.query=qua_code:1234567 TO 1234567', I only
get the expected counts

You will get facet counts only for documents that have the term 1234567
(facet.query applies to the facets, determining which facet is picked/shown).

Regds
Pravesh



--
View this message in context: 
http://lucene.472066.n3.nabble.com/inconsistent-results-when-faceting-on-multivalued-field-tp3438991p3440128.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Painfully slow indexing

2011-10-21 Thread pravesh
Are you posting through HTTP/SOLRJ?

Your script time 'T' includes the time from sending the POST request to
receiving a successful response, right?

Try sending in small batches, like 10-20.  BTW, how many documents are you
indexing?
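To illustrate the batching idea (a minimal sketch, independent of the HTTP or SolrJ client actually used), a helper that splits a document list into fixed-size batches before posting:

```python
def batches(docs, size=20):
    """Yield successive fixed-size batches from a list of documents."""
    for start in range(0, len(docs), size):
        yield docs[start:start + size]

# Each batch would then be posted to Solr in one request, ideally with a
# single commit at the end rather than one commit per batch.
docs = [{"id": i} for i in range(45)]
groups = list(batches(docs, 20))
print([len(g) for g in groups])  # → [20, 20, 5]
```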

Regds
Pravesh

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Painfully-slow-indexing-tp3434399p3440175.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: hierarchical synonym

2011-10-21 Thread Lukáš Vlček
Hi,

I think what you are looking for are synonym rules like this:

dog, cat, bird => animal

I think the following link can be interesting to you as well:
http://wisdombase.net/wiki/index.php?title=Hiearchy_synonym_search_solution_in_solr

I am not a Solr expert, but regarding Lucene synonyms: it is also possible to
use synonyms in WordNet format, and hierarchical synonyms can be described in
WordNet, but unfortunately this feature is not currently supported by the
Lucene WordNet code (as far as I understand).

Regards,
Lukas


On Fri, Oct 21, 2011 at 7:13 AM, cmd  wrote:

> if solr support hierarchical synonym
> for example:
> animal->
>  dog
>  cat
>  bird
>
> use animal as query term and the result set should be contains
> animal,dog,cat,bird
> use dog as query term and the result set should be only contains dog rather
> than other words
>
> thanks.
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/hierarchical-synonym-tp344p344.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: inconsistent results when faceting on multivalued field

2011-10-21 Thread Alain Rogister
Pravesh,

Not exactly. Here is the search I do, in more detail (different field name,
but same issue).

I want to get a count for a specific value of the sou_codeMetier field,
which is multivalued. I expressed this by including an fq clause:

/select/?q=*:*&facet=true&facet.field=sou_codeMetier&fq=sou_codeMetier:1213206&rows=0

The response (excerpt only):



1281
476
285
260
208
171
152
...

As you see, I get back both the expected results and extra results I would
expect to be filtered out by the fq clause.

I can eliminate the extra results with a
'f.sou_codeMetier.facet.prefix=1213206' clause.

But I wonder if Solr's behavior is correct and how the fq filtering works
exactly.

If I replace the facet.field clause with a facet.query clause, like this:

/select/?q=*:*&facet=true&facet.query=sou_codeMetier:[1213206 TO
1213206]&rows=0

The results contain a single item:


1281


The 'fq=sou_codeMetier:1213206' clause isn't necessary here and does not
affect the results.

Thanks,

Alain

On Fri, Oct 21, 2011 at 9:18 AM, pravesh  wrote:

> Could u clarify on below:
> >>When I make a search on facet.qua_code=1234567 ??
>
> Are u trying to say, when u fire a fresh search for a facet item, like;
> q=qua_code:1234567??
>
> This this would fetch for documents where qua_code fields contains either
> the terms 1234567 OR both terms (1234567 & 9384738.and others terms).
> This would be since its a multivalued field and hence if you see the facet,
> then its shown for both the terms.
>
> >>If I reword the query as 'facet.query=qua_code:1234567 TO 1234567', I
> only
> get the expected counts
>
> You will get facet for documents which have term 1234567 only (facet.query
> would apply to the facets,so as to which facet to be picked/shown)
>
> Regds
> Pravesh
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/inconsistent-results-when-faceting-on-multivalued-field-tp3438991p3440128.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: inconsistent results when faceting on multivalued field

2011-10-21 Thread Darren Govoni

My interpretation of your results is that your fq matched 1281 documents
with the value 1213206 in the sou_codeMetier field. Of those results, 476 also
had 1212104 as a value, and so on. Since ALL the results have
the fq value in that field, I would expect the "other" values to
occur equally often or less often within the result set, which they appear to do.
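That behavior can be simulated outside Solr (a hypothetical sketch with made-up documents and values, not Solr's implementation): apply the fq as a document filter, then count every value of the multivalued field across the surviving documents. The fq value tops the list, and every other value appears with an equal or smaller count:

```python
from collections import Counter

# Hypothetical documents, each with a multivalued sou_codeMetier-style field.
docs = [
    {"id": 1, "codes": ["1213206", "1212104"]},
    {"id": 2, "codes": ["1213206"]},
    {"id": 3, "codes": ["1213206", "1212104", "9999999"]},
    {"id": 4, "codes": ["7777777"]},  # excluded by the fq
]

fq_value = "1213206"
matched = [d for d in docs if fq_value in d["codes"]]

# facet.field counts every value present in the matched documents,
# not just the fq value itself -- hence the "extra" facet entries.
facets = Counter(v for d in matched for v in d["codes"])
print(facets.most_common())
# → [('1213206', 3), ('1212104', 2), ('9999999', 1)]
```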



On 10/21/2011 03:55 AM, Alain Rogister wrote:

Pravesh,

Not exactly. Here is the search I do, in more details (different field name,
but same issue).

I want to get a count for a specific value of the sou_codeMetier field,
which is multivalued. I expressed this by including a fq clause :

/select/?q=*:*&facet=true&facet.field=sou_codeMetier&fq=sou_codeMetier:1213206&rows=0

The response (excerpt only):



1281
476
285
260
208
171
152
...

As you see, I get back both the expected results and extra results I would
expect to be filtered out by the fq clause.

I can eliminate the extra results with a
'f.sou_codeMetier.facet.prefix=1213206' clause.

But I wonder if Solr's behavior is correct and how the fq filtering works
exactly.

If I replace the facet.field clause with a facet.query clause, like this:

/select/?q=*:*&facet=true&facet.query=sou_codeMetier:[1213206 TO
1213206]&rows=0

The results contain a single item:


1281


The 'fq=sou_codeMetier:1213206' clause isn't necessary here and does not
affect the results.

Thanks,

Alain

On Fri, Oct 21, 2011 at 9:18 AM, pravesh  wrote:


Could u clarify on below:

When I make a search on facet.qua_code=1234567 ??

Are u trying to say, when u fire a fresh search for a facet item, like;
q=qua_code:1234567??

This this would fetch for documents where qua_code fields contains either
the terms 1234567 OR both terms (1234567&  9384738.and others terms).
This would be since its a multivalued field and hence if you see the facet,
then its shown for both the terms.


If I reword the query as 'facet.query=qua_code:1234567 TO 1234567', I

only
get the expected counts

You will get facet for documents which have term 1234567 only (facet.query
would apply to the facets,so as to which facet to be picked/shown)

Regds
Pravesh



--
View this message in context:
http://lucene.472066.n3.nabble.com/inconsistent-results-when-faceting-on-multivalued-field-tp3438991p3440128.html
Sent from the Solr - User mailing list archive at Nabble.com.





Getting single documents by fq on unique field, performance

2011-10-21 Thread Robert Brown

Hi,

We do regular searches against documents, with highlighting on.  To
then view a document in more detail, we re-run the search using
fq=id:12345 to return the single document of interest; we still want
highlighting on, so we send the q param back again.


Is there anything you would recommend doing to increase performance 
(it's not currently a problem, more a curiosity)?  I had the following 
in mind, but wanted to gauge whether they'd actually be worthwhile...


1. Using a different request handler with no boosts, etc.
2. Setting rows=1 since we know there's only 1 doc coming back.

Thanks,
Rob

--

IntelCompute
Web Design & Local Online Marketing

http://www.intelcompute.com



Re: Painfully slow indexing

2011-10-21 Thread Alain Rogister
As an alternative, I can suggest this approach, which worked great for me:

- generate the ready-for-indexing XML documents on a file system
- use curl to feed them into Solr

I am not dealing with huge volumes, but was surprised at how *fast* Solr was
indexing my documents using this simple approach. Also, the workflow is easy
to manage. And the XML contents can easily be provisioned to multiple
systems e.g. for setting up test environments.

Regards,

Alain

On Fri, Oct 21, 2011 at 9:46 AM, pravesh  wrote:

> Are you posting through HTTP/SOLRJ?
>
> Your script time 'T' includes time between sending POST request -to- the
> response fetched after successful response right??
>
> Try sending in small batches like 10-20.  BTW how many documents are u
> indexing???
>
> Regds
> Pravesh
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Painfully-slow-indexing-tp3434399p3440175.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: Highlighting misses some characters

2011-10-21 Thread Dirceu Vieira
Hi,

Are you using any kind of NGram tokenizer?
At first I'd have said it is caused by stemming, but it's not as if the
stem and its derived word are both being highlighted; it's more like parts of
the word are...

If you use NGram or EdgeNGram, this will generate tokens for each part of the
word (the size of each token is configurable).

If you're not using that, my second guess is that the term is being truncated
somehow.

If you could provide some more info about this case, that would help.
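As a rough illustration of the EdgeNGram hypothesis (a sketch, not Solr's actual filter implementation): edge n-grams of an indexed word are its prefixes, so a prefix token such as "appl" can be the term that actually matches and gets wrapped in highlight tags:

```python
def edge_ngrams(word, min_gram=1, max_gram=15):
    """Generate front-anchored n-grams, roughly what an EdgeNGram
    filter would emit for one input token."""
    top = min(max_gram, len(word))
    return [word[:n] for n in range(min_gram, top + 1)]

print(edge_ngrams("apple", min_gram=3))  # → ['app', 'appl', 'apple']
# A query for "apple" can match the stored gram "appl", so only that
# prefix span of the original text ends up highlighted.
```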

On Fri, Oct 21, 2011 at 4:49 AM, docmattman  wrote:

> I have highlighting on in query.  If I do a search for "Apple", it will
> highlight "Appl".  If I do a search for "deleted" it will highlight
> "delet",
> "agreed" will highlight "agre".  How can I get it to highlight the full
> term
> that I'm searching for and not leave off certain letters?
>
> I'm pretty new to Solr, so please let me know if there is any additional
> information needed to assist me with this problem.
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Highlighting-misses-some-characters-tp3439778p3439778.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>



-- 
Dirceu Vieira Júnior
---
+47 9753 2473
dirceuvjr.blogspot.com
twitter.com/dirceuvjr


Re: Getting single documents by fq on unique field, performance

2011-10-21 Thread pravesh
This approach seems fine. You might benchmark it with a load test, etc.

Regds
Pravesh

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Getting-single-documents-by-fq-on-unique-field-performance-tp3440229p3440351.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Can Solr handle large text files?

2011-10-21 Thread karsten-solr
Hi Peter,

highlighting in large text files cannot be fast without dividing the original
text into small pieces.
So take a look in
http://xtf.cdlib.org/documentation/under-the-hood/#Chunking
and in
http://www.lucidimagination.com/blog/2010/09/16/2446/

Which means that you should divide your files and use
Result Grouping / Field Collapsing
to list only one hit per original document.

(xtf also would solve your problem "out of the box" but xtf does not use solr).
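A minimal sketch of the splitting step (assumptions: plain-text input, a fixed chunk size with a small overlap, and a shared parent id field used later for Result Grouping / Field Collapsing):

```python
def chunk_document(doc_id, text, chunk_chars=5000, overlap=200):
    """Split one large document into overlapping chunks that share a
    parent id, so search results can be collapsed back to one hit
    per original file."""
    chunks, start, n = [], 0, 0
    while start < len(text):
        piece = text[start:start + chunk_chars]
        chunks.append({
            "id": f"{doc_id}-{n}",   # unique per chunk
            "parent_id": doc_id,     # group/collapse on this field
            "body": piece,
        })
        n += 1
        start += chunk_chars - overlap
    return chunks

parts = chunk_document("syslog-42", "x" * 12000, chunk_chars=5000, overlap=200)
print(len(parts), parts[0]["parent_id"])  # → 3 syslog-42
```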

Best regards
  Karsten

 Original-Nachricht 
> Datum: Thu, 20 Oct 2011 17:59:04 -0700
> Von: Peter Spam 
> An: solr-user@lucene.apache.org
> Betreff: Can Solr handle large text files?

> I have about 20k text files, some very small, but some up to 300MB, and
> would like to do text searching with highlighting.
> 
> Imagine the text is the contents of your syslog.
> 
> I would like to type in some terms, such as "error" and "mail", and have
> Solr return the syslog lines with those terms PLUS two lines of context. 
> Pretty much just like Google's highlighting.
> 
> 1) Can Solr handle this?  I had extremely long query times when I tried
> this with Solr 1.4.1 (yes I was using TermVectors, etc.).  I tried breaking
> the files into 1MB pieces, but searching would be wonky => return the wrong
> number of documents (ie. if one file had a term 5 times, and that was the
> only file that had the term, I want 1 result, not 5 results).  
> 
> 2) What sort of tokenizer would be best?  Here's what I'm using:
> 
> multiValued="false" termVectors="true" termPositions="true" 
> termOffsets="true" />
> 
> 
>   
> 
> 
>  generateWordParts="0" generateNumberParts="0" catenateWords="0" 
> catenateNumbers="0"
> catenateAll="0" splitOnCaseChange="0"/>
>   
> 
> 
> 
> Thanks!
> Pete


arbitrary results

2011-10-21 Thread Peter A. Kirk
Hi

is it possible to set up Solr so that a search for a particular term results in 
some arbitrary (selected) documents?
For example, I want a search for "elephant" to return documents with id's 17, 
18 and 36. Even though these documents would not normally occur in a result for 
a search for "elephant".

Thanks,
Peter


Re: arbitrary results

2011-10-21 Thread Dirceu Vieira
Hi Peter,

You might wanna check out
http://wiki.apache.org/solr/QueryElevationComponent.

Regards,

Dirceu

On Fri, Oct 21, 2011 at 11:44 AM, Peter A. Kirk wrote:

> Hi
>
> is it possible to set up Solr so that a search for a particular term
> results in some arbitrary (selected) documents?
> For example, I want a search for "elephant" to return documents with id's
> 17, 18 and 36. Even though these documents would not normally occur in a
> result for a search for "elephant".
>
> Thanks,
> Peter
>



-- 
Dirceu Vieira Júnior
---
+47 9753 2473
dirceuvjr.blogspot.com
twitter.com/dirceuvjr


Re: Painfully slow indexing

2011-10-21 Thread Simon Willnauer
On Wed, Oct 19, 2011 at 3:58 PM, Pranav Prakash  wrote:
> Hi guys,
>
> I have set up a Solr instance and upon attempting to index document, the
> whole process is painfully slow. I will try to put as much info as I can in
> this mail. Pl. feel free to ask me anything else that might be required.
>
> I am sending documents in batches not exceeding 2,000. The size of each of
> them depends but usually is around 10-15MiB. My indexing script tells me
> that Solr took T seconds to add N documents of size S. For the same data,
> the Solr Log add QTime is QT. Some of the sample data are:
>
>   N                     S                T               QT
> -
>  390 docs  |   3,478,804 Bytes   | 14.5s    |  2297
>  852 docs  |   6,039,535 Bytes   | 25.3s    |  4237
> 1345 docs | 11,147,512 Bytes   |  47s      |  8543
> 1147 docs |   9,457,717 Bytes   |  44s      |  2297
> 1096 docs | 13,058,204 Bytes   |  54.3s   |   8782
>
> The time T includes the time of converting an array of Hash objects into
> XML, POSTing it to Solr and response acknowledged from Solr. Clearly, there
> is a huge difference between both the time T and QT. After a lot of efforts,
> I have no clue why these times do not match.
>
> The Server has 16 cores, 48GiB RAM. JVM options are -Xms5000M -Xmx5000M
> -XX:+UseParNewGC
>
> I believe my Indexing is getting slow. Relevant portion from my schema file
> are as follows. On a related note, every document has one dynamic field.
> Based on this rate, it takes me ~30hrs to do a full index of my database.
> I would really appreciate kindness of community in order to get this
> indexing faster.
>
> 
>
> false
>
> 
>
> 10
>
> 10
>
>  
>
> 2048
>
> 2147483647
>
> 300
>
> 1000
>
> 5
>
> 256
>
> 10
>
> false
>
> 
>
> 
>
> 
>
> true
>
> true
>
> 
>
>  1
>
> 0
>
> 
>
> false
>
> 
>
> 
>
> 
>
>  10
>
> 
>
> 
>
>
> *Pranav Prakash*
>
> "temet nosce"
>
> Twitter  | Blog  |
> Google 
>

hey,

are you calling commit after your batches or do an optimize by any chance?

I would suggest you stream your documents to Solr and commit
only when you really need to. Set your RAM buffer to something between
256 and 320 MB and remove the maxBufferedDocs setting completely. You
can also experiment with your merge settings a little; 10 merge
threads seems like a lot. I know you have lots of CPU, but IO will be
the bottleneck here.

simon


Re: Want to support "did you mean xxx" but is Chinese

2011-10-21 Thread Li Li
We have implemented "did you mean" and prefix suggestion for Chinese.
But we based our work on Solr 1.4 and made many modifications, so it
will take time to integrate it into current Solr/Lucene.

 Here is our solution; glad to hear any advice.

 1. Offline word and phrase discovery.
   We discover new words and new phrases by mining query logs.

 2. Online matching algorithm.
   For each word, e.g. 贝多芬, we convert it to pinyin, bei duo fen, then
index it using n-grams, i.e. gram3:bei gram3:eid ...
   To get the "did you mean" result, we convert the query 背朵分 into n-grams;
it's a boolean OR query, so there are many results (words whose pinyin is
similar to the query are ranked top).
   Then we rerank the top 500 results with a fine-grained algorithm:
we use edit distance to align query and result, and we also take the
characters themselves into consideration. E.g. for query 十度, the matches
十渡 and 是度 have exactly the same pinyin, but 十渡 is better than 是度
because 十 occurs in both query and match.
   You also need to consider the hotness (popularity) of different
words/phrases, which can be learned from query logs.

   Another issue is converting Chinese into pinyin, because some
characters have more than one pinyin reading, e.g. 长沙 vs. 长大: 长's
pinyin is chang in 长沙. You should segment the query and
words/phrases first; word segmentation is a basic problem in Chinese IR.
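A toy version of the matching stage (with a tiny hand-written pinyin table; a real system would derive pronunciations and popularity from dictionaries and query logs): candidates are recalled by pinyin trigram overlap, then reranked by edit distance over the pinyin strings:

```python
def trigrams(s):
    """Character trigrams of a pinyin string, used for recall."""
    return {s[i:i + 3] for i in range(len(s) - 2)}

def edit_distance(a, b):
    """Standard Levenshtein distance for the fine-grained rerank step."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

# Hypothetical pinyin renderings (tones dropped) of indexed words.
index = {"贝多芬": "beiduofen", "北斗": "beidou"}
query_pinyin = "beiduofen"   # pinyin of the mistyped query 背朵分

# Recall: any indexed word sharing a pinyin trigram with the query.
candidates = [w for w, p in index.items()
              if trigrams(p) & trigrams(query_pinyin)]
# Rerank: smallest pinyin edit distance first.
candidates.sort(key=lambda w: edit_distance(index[w], query_pinyin))
print(candidates[0])  # → 贝多芬
```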


2011/10/21 Floyd Wu 

> Does anybody know how to implement this idea in SOLR. Please kindly
> point me a direction.
>
> For example, when user enter a keyword in Chinese "貝多芬" (this is
> Beethoven in Chinese)
> but key in a wrong combination of characters  "背多分" (this is
> pronouncation the same with previous keyword "貝多芬").
>
> There in solr index exist token "貝多芬" actually. How to hit documents
> where "貝多芬" exist when "背多分" is enter.
>
> This is basic function of commercial search engine especially in
> Chinese processing. I wonder how to implements in SOLR and where is
> the start point.
>
> Floyd
>


RE: Can Solr handle large text files?

2011-10-21 Thread Anand.Nigam
Hi,

I was also facing the issue of highlighting large text files. I applied the
solution proposed here and it worked. But I am getting the following error:


Basically, 'hitGrouped.vm' is not found. I am using solr-3.4.0. Where can I get
this file from? Its reference is present in browse.vm:


  #if($response.response.get('grouped'))
#foreach($grouping in $response.response.get('grouped'))
  #parse("hitGrouped.vm")
#end
  #else
#foreach($doc in $response.results)
  #parse("hit.vm")
#end
  #end



HTTP Status 500 - Can't find resource 'hitGrouped.vm' in classpath or 
'C:\caprice\workspace\caprice\dist\DEV\solr\.\conf/', 
cwd=C:\glassfish3\glassfish\domains\domain1\config java.lang.RuntimeException: 
Can't find resource 'hitGrouped.vm' in classpath or 
'C:\caprice\workspace\caprice\dist\DEV\solr\.\conf/', 
cwd=C:\glassfish3\glassfish\domains\domain1\config at 
org.apache.solr.core.SolrResourceLoader.openResource(SolrResourceLoader.java:268)
 at 
org.apache.solr.response.SolrVelocityResourceLoader.getResourceStream(SolrVelocityResourceLoader.java:42)
 at org.apache.velocity.Template.process(Template.java:98) at 
org.apache.velocity.runtime.resource.ResourceManagerImpl.loadResource(ResourceManagerImpl.java:446)
 at 

Thanks & Regards,
Anand
Anand Nigam
RBS Global Banking & Markets
Office: +91 124 492 5506   


-Original Message-
From: karsten-s...@gmx.de [mailto:karsten-s...@gmx.de] 
Sent: 21 October 2011 14:58
To: solr-user@lucene.apache.org
Subject: Re: Can Solr handle large text files?

Hi Peter,

highlighting in large text files can not be fast without dividing the original 
text in small piece.
So take a look in
http://xtf.cdlib.org/documentation/under-the-hood/#Chunking
and in
http://www.lucidimagination.com/blog/2010/09/16/2446/

Which means that you should divide your files and use Result Grouping / Field 
Collapsing to list only one hit per original document.

(xtf also would solve your problem "out of the box" but xtf does not use solr).

Best regards
  Karsten

 Original-Nachricht 
> Datum: Thu, 20 Oct 2011 17:59:04 -0700
> Von: Peter Spam 
> An: solr-user@lucene.apache.org
> Betreff: Can Solr handle large text files?

> I have about 20k text files, some very small, but some up to 300MB, 
> and would like to do text searching with highlighting.
> 
> Imagine the text is the contents of your syslog.
> 
> I would like to type in some terms, such as "error" and "mail", and 
> have Solr return the syslog lines with those terms PLUS two lines of context.
> Pretty much just like Google's highlighting.
> 
> 1) Can Solr handle this?  I had extremely long query times when I 
> tried this with Solr 1.4.1 (yes I was using TermVectors, etc.).  I 
> tried breaking the files into 1MB pieces, but searching would be wonky 
> => return the wrong number of documents (ie. if one file had a term 5 
> times, and that was the only file that had the term, I want 1 result, not 5 
> results).
> 
> 2) What sort of tokenizer would be best?  Here's what I'm using:
> 
> multiValued="false" termVectors="true" termPositions="true" 
> termOffsets="true" />
> 
> 
>   
> 
> 
>  generateWordParts="0" generateNumberParts="0" catenateWords="0" 
> catenateNumbers="0"
> catenateAll="0" splitOnCaseChange="0"/>
>   
> 
> 
> 
> Thanks!
> Pete


Re: How to make UnInvertedField faster?

2011-10-21 Thread Simon Willnauer
In trunk we have a feature called IndexDocValues which basically
creates the uninverted structure at index time. You can then simply
suck that into memory or even access it on disk directly
(RandomAccess). Even if that can't help you right now, this is certainly
going to help you here. There is no need to uninvert at all anymore in
Lucene 4.0.

simon

On Wed, Oct 19, 2011 at 8:05 PM, Michael Ryan  wrote:
> I was wondering if anyone has any ideas for making UnInvertedField.uninvert()
> faster, or other alternatives for generating facets quickly.
>
> The vast majority of the CPU time for our Solr instances is spent generating
> UnInvertedFields after each commit. Here's an example of one of our slower 
> fields:
>
> [2011-10-19 17:46:01,055] INFO125974[pool-1-thread-1] - (SolrCore:440) -
> UnInverted multi-valued field 
> {field=authorCS,memSize=38063628,tindexSize=422652,
> time=15610,phase1=15584,nTerms=1558514,bigTerms=0,termInstances=4510674,uses=0}
>
> That is from an index with approximately 8 million documents. After each 
> commit,
> it takes on average about 90 seconds to uninvert all the fields that we facet 
> on.
>
> Any ideas at all would be greatly appreciated.
>
> -Michael
>


SOLRNET combine LocalParams with SolrMultipleCriteriaQuery?

2011-10-21 Thread Grüger , Joscha
Hello,

does anybody know how to combine SolrMultipleCriteriaQuery and LocalParams (in 
SOLRnet)?

I've tried things like that (don't worry about bad the code, it's just to test)

 var test = solr.Query(BuildQuery(parameters), new QueryOptions
{
FilterQueries = bq(),
Facet = new FacetParameters
{
Queries = new[] { 
new SolrFacetFieldQuery(new LocalParams {{"ex", "dt"}} + 
"ju_success") , new SolrFacetFieldQuery(new LocalParams {{"ex", "dt"}} + 
"dr_success") 
}
}
});
...

 public ICollection<ISolrQuery> bq()
 {
     List<ISolrQuery> i = new List<ISolrQuery>();
     i.Add(new LocalParams { { "tag", "dt" } } +
           Query.Field("dr_success").Is("simple"));
     List<ISolrQuery> MultiListItems = new List<ISolrQuery>();
     var t = new SolrMultipleCriteriaQuery(i, "OR");
     MultiListItems.Add(t);
     return MultiListItems;
 }

 

What I try to do are multi-select-facets with a "OR" operator.

Thanks for all the help!

Grüger



Re: LUCENE-2208 (SOLR-1883) Bug with HTMLStripCharFilter, given patch in next nightly build?

2011-10-21 Thread Vadim Kisselmann
UPDATE:
I checked out the latest trunk version and patched it with the patch from
LUCENE-2208.
The patch does not seem to work, or I did something wrong.

My old log snippets:

Http - 500 Internal Server Error
Error: Carrot2 clustering failed

And this was caused by:
Http - 500 Internal Server Error
Error: org.apache.lucene.search.highlight.InvalidTokenOffsetsException:
Token the exceeds length of provided text sized 41

Best Regards
Vadim





2011/10/20 Vadim Kisselmann 

> Hello folks,
>
> i have big problems with InvalidTokenOffsetExceptions with highlighting.
> Looks like a bug in HTMLStripCharFilter.
>
> H.Wang added a patch in LUCENE-2208, but nobody have time to look at this.
> Could someone of the committers please take a look at this patch and commit
> it or is this problem more complicated as i think? :)
> Thanks guys...
>
> Best Regards
> Vadim
>
>
>


Re: How to make UnInvertedField faster?

2011-10-21 Thread Jason Rutherglen
Sweet + Very cool!

On Fri, Oct 21, 2011 at 7:50 AM, Simon Willnauer <
simon.willna...@googlemail.com> wrote:

> In trunk we have a feature called IndexDocValues which basically
> creates the uninverted structure at index time. You can then simply
> suck that into memory or even access it on disk directly
> (RandomAccess). Even if I can't help you right now this is certainly
> going to help you here. There is no need to uninvert at all anymore in
> lucene 4.0
>
> simon
>
> On Wed, Oct 19, 2011 at 8:05 PM, Michael Ryan  wrote:
> > I was wondering if anyone has any ideas for making
> UnInvertedField.uninvert()
> > faster, or other alternatives for generating facets quickly.
> >
> > The vast majority of the CPU time for our Solr instances is spent
> generating
> > UnInvertedFields after each commit. Here's an example of one of our
> slower fields:
> >
> > [2011-10-19 17:46:01,055] INFO125974[pool-1-thread-1] - (SolrCore:440) -
> > UnInverted multi-valued field
> {field=authorCS,memSize=38063628,tindexSize=422652,
> >
> time=15610,phase1=15584,nTerms=1558514,bigTerms=0,termInstances=4510674,uses=0}
> >
> > That is from an index with approximately 8 million documents. After each
> commit,
> > it takes on average about 90 seconds to uninvert all the fields that we
> facet on.
> >
> > Any ideas at all would be greatly appreciated.
> >
> > -Michael
> >
>


Re: How to make UnInvertedField faster?

2011-10-21 Thread Michael McCandless
Well... the limitation of DocValues is that it cannot handle more than
one value per document (which UnInvertedField can).

Hopefully we can fix that at some point :)

Mike McCandless

http://blog.mikemccandless.com

On Fri, Oct 21, 2011 at 7:50 AM, Simon Willnauer
 wrote:
> In trunk we have a feature called IndexDocValues which basically
> creates the uninverted structure at index time. You can then simply
> suck that into memory or even access it on disk directly
> (RandomAccess). Even if I can't help you right now this is certainly
> going to help you here. There is no need to uninvert at all anymore in
> lucene 4.0
>
> simon
>
> On Wed, Oct 19, 2011 at 8:05 PM, Michael Ryan  wrote:
>> I was wondering if anyone has any ideas for making UnInvertedField.uninvert()
>> faster, or other alternatives for generating facets quickly.
>>
>> The vast majority of the CPU time for our Solr instances is spent generating
>> UnInvertedFields after each commit. Here's an example of one of our slower 
>> fields:
>>
>> [2011-10-19 17:46:01,055] INFO125974[pool-1-thread-1] - (SolrCore:440) -
>> UnInverted multi-valued field 
>> {field=authorCS,memSize=38063628,tindexSize=422652,
>> time=15610,phase1=15584,nTerms=1558514,bigTerms=0,termInstances=4510674,uses=0}
>>
>> That is from an index with approximately 8 million documents. After each 
>> commit,
>> it takes on average about 90 seconds to uninvert all the fields that we 
>> facet on.
>>
>> Any ideas at all would be greatly appreciated.
>>
>> -Michael
>>
>


Re: Highlighting misses some characters

2011-10-21 Thread docmattman
Yea, I'm using EdgeNGramFilterFactory, should I remove that?  I actually
inherited this index from another person who used to be part of the project,
so there may be a few things that need to be changed.  Here is my field type
from the schema:



  







  
  







  



I'm not sure what all of these do, but like I said, someone else built the
system and now I'm in charge of getting it running correctly.

--
View this message in context: 
http://lucene.472066.n3.nabble.com/Highlighting-misses-some-characters-tp3439778p3440995.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Want to support "did you mean xxx" but is Chinese

2011-10-21 Thread Ken Krugler
Hi Floyd,

Typically you'd do this by creating a custom analyzer that

 - segments Chinese text into words
 - Converts from words to pinyin or zhuyin

Your index would have both the actual Hanzi characters, plus (via copyfield) 
this phonetic version.

During search, you'd use dismax to search against both fields, with a higher 
weighting to the Hanzi field.

But segmentation can be error prone, and requires embedding specialized code 
that you typically license (for high quality results) from a commercial vendor.

So my first cut approach would be to use the current synonym support to map 
each Hanzi to all possible pronunciations. There are numerous open source 
datasets that contain this information. Note that there might be performance 
issues with having such a huge set of synonyms.

Then, by weighting phrase matches sufficiently high (again using dismax) I 
think you could get reasonable results.
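The synonym-style mapping can be sketched as follows (with a tiny hypothetical character-to-pinyin table; real data would come from one of the open source pronunciation datasets mentioned above):

```python
# Hypothetical single-character pronunciation table (tones dropped).
PINYIN = {"貝": ["bei"], "背": ["bei"], "多": ["duo"],
          "芬": ["fen"], "分": ["fen"]}

def phonetic_key(text):
    """Map each Hanzi to a pronunciation; here we just take the first
    reading, ignoring the multi-reading problem noted above."""
    return " ".join(PINYIN.get(ch, [ch])[0] for ch in text)

# The intended word and the mistyped homophone collapse to the same key,
# so a phonetic copyfield would let one match the other.
print(phonetic_key("貝多芬"))  # → bei duo fen
print(phonetic_key("背多分"))  # → bei duo fen
```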

-- Ken
 
On Oct 21, 2011, at 7:33am, Floyd Wu wrote:

> Does anybody know how to implement this idea in SOLR. Please kindly
> point me a direction.
> 
> For example, when user enter a keyword in Chinese "貝多芬" (this is
> Beethoven in Chinese)
> but key in a wrong combination of characters  "背多分" (this is
> pronouncation the same with previous keyword "貝多芬").
> 
> There in solr index exist token "貝多芬" actually. How to hit documents
> where "貝多芬" exist when "背多分" is enter.
> 
> This is basic function of commercial search engine especially in
> Chinese processing. I wonder how to implements in SOLR and where is
> the start point.
> 
> Floyd

--
Ken Krugler
+1 530-210-6378
http://bixolabs.com
custom big data solutions & training
Hadoop, Cascading, Mahout & Solr





Re: Highlighting misses some characters

2011-10-21 Thread Dirceu Vieira
Whether to remove the filter or not really depends on how it is used in the
search and what result is expected from it.
Have a look at
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.EdgeNGramFilterFactory

I'd say you should find out exactly what the requirements for the search
are, and read a bit about the tokenizer factories and token filters; then you
can decide exactly what to do.

My wild guess is that you should remove the EdgeNGramFilter from the
query analyzer.
I believe that once you do, your query will return "apple" when searching
for "appl", but will not return "appl" when searching for "apple".

Regards,

Dirceu

On Fri, Oct 21, 2011 at 4:39 PM, docmattman  wrote:

> Yea, I'm using EdgeNGramFilterFactory, should I remove that?  I actually
> inherited this index from another person who used to be part of the
> project,
> so there may be a few things that need to be changed.  Here is my field
> type
> from the schema:
>
>
> 
>  
>
> ignoreCase="true" expand="true"/>
> words="stopwords.txt" enablePositionIncrements="true" />
> generateWordParts="1" generateNumberParts="1" catenateWords="1"
> catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"
> preserveOriginal="1"/>
>
> protected="protwords.txt"/>
> maxGramSize="15"/>
>  
>  
>
> ignoreCase="true" expand="true"/>
> words="stopwords.txt" enablePositionIncrements="true" />
> generateWordParts="1" generateNumberParts="1" catenateWords="0"
> catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
>
> protected="protwords.txt"/>
> maxGramSize="15"/>
>  
>
>
>
> I'm not sure what all of these do, but like I said, someone else built the
> system and now I'm in charge of getting it running correctly.
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Highlighting-misses-some-characters-tp3439778p3440995.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>



-- 
Dirceu Vieira Júnior
---
+47 9753 2473
dirceuvjr.blogspot.com
twitter.com/dirceuvjr


success with indexing Wikipedia - lessons learned

2011-10-21 Thread Fred Zimmerman
http://business.zimzaz.com/wordpress/2011/10/how-to-clone-wikipedia-mirror-and-index-wikipedia-with-solr/


data import in 4.0

2011-10-21 Thread Adeel Qureshi
Hi I am trying to setup the data import handler with solr 4.0 and having
some unexpected problems. I have a multi-core setup and only one core needed
the dataimport handler so I have added the request handler to it and added
the lib imports in config file
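For reference, the <lib> directives being described normally take this shape in solrconfig.xml (the dir path below is illustrative; point it at wherever the dist jars actually live):

```xml
<lib dir="../../dist/" regex="apache-solr-dataimporthandler-\d.*\.jar" />
<lib dir="../../dist/" regex="apache-solr-dataimporthandler-extras-\d.*\.jar" />
```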




for some reason this doesn't work .. it still keeps giving me a
ClassNotFound error message, so I moved the jar files to the shared lib
folder and then at least I was able to see the admin screen with the
dataimport plugin loaded. But when I try to do the import it throws this
error message:

INFO: Starting Full Import
Oct 21, 2011 11:35:41 AM org.apache.solr.core.SolrCore execute
INFO: [DW] webapp=/solr path=/select params={command=status&qt=/dataimport}
status=0 QTime=0
Oct 21, 2011 11:35:41 AM org.apache.solr.handler.dataimport.SolrWriter
readIndexerProperties
WARNING: Unable to read: dataimport.properties
Oct 21, 2011 11:35:41 AM org.apache.solr.handler.dataimport.DataImporter
doFullImport
SEVERE: Full Import failed
java.lang.NoSuchMethodError: org.apache.solr.update.DeleteUpdateCommand:
method ()V not found
at
org.apache.solr.handler.dataimport.SolrWriter.doDeleteAll(SolrWriter.java:193)
at
org.apache.solr.handler.dataimport.DocBuilder.cleanByQuery(DocBuilder.java:1012)
at
org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:183)
at
org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:335)
at
org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:393)
at
org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:374)
Oct 21, 2011 11:35:41 AM org.apache.solr.handler.dataimport.SolrWriter
rollback
SEVERE: Exception while solr rollback.
java.lang.NoSuchMethodError: org.apache.solr.update.RollbackUpdateCommand:
method ()V not found
at
org.apache.solr.handler.dataimport.SolrWriter.rollback(SolrWriter.java:184)
at
org.apache.solr.handler.dataimport.DocBuilder.rollback(DocBuilder.java:249)
at
org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:340)
at
org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:393)
at
org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:374)
Oct 21, 2011 11:35:43 AM org.apache.solr.core.SolrCore execute
INFO: [DW] webapp=/solr path=/select params={command=status&qt=/dataimport}
status=0 QTime=0

Any ideas what's going on? It's complaining about a missing method in the
dataimport classes, which doesn't make sense. Is this some kind of version
mismatch, or what is going on?

I would appreciate any comments.
Thanks
Adeel


Re: Can Solr handle large text files?

2011-10-21 Thread Peter Spam
Thanks for the response, Karsten.

1) What's the recommended maximum chunk size?
2) Does my tokenizer look reasonable?


Thanks!
Pete

On Oct 21, 2011, at 2:28 AM, karsten-s...@gmx.de wrote:

> Hi Peter,
> 
> highlighting in large text files can not be fast without dividing the 
> original text in small piece.
> So take a look in
> http://xtf.cdlib.org/documentation/under-the-hood/#Chunking
> and in
> http://www.lucidimagination.com/blog/2010/09/16/2446/
> 
> Which means that you should divide your files and use
> Result Grouping / Field Collapsing
> to list only one hit per original document.
> 
> (xtf also would solve your problem "out of the box" but xtf does not use 
> solr).
> 
> Best regards
>  Karsten
> 
>  Original-Nachricht 
>> Datum: Thu, 20 Oct 2011 17:59:04 -0700
>> Von: Peter Spam 
>> An: solr-user@lucene.apache.org
>> Betreff: Can Solr handle large text files?
> 
>> I have about 20k text files, some very small, but some up to 300MB, and
>> would like to do text searching with highlighting.
>> 
>> Imagine the text is the contents of your syslog.
>> 
>> I would like to type in some terms, such as "error" and "mail", and have
>> Solr return the syslog lines with those terms PLUS two lines of context. 
>> Pretty much just like Google's highlighting.
>> 
>> 1) Can Solr handle this?  I had extremely long query times when I tried
>> this with Solr 1.4.1 (yes I was using TermVectors, etc.).  I tried breaking
>> the files into 1MB pieces, but searching would be wonky => return the wrong
>> number of documents (ie. if one file had a term 5 times, and that was the
>> only file that had the term, I want 1 result, not 5 results).  
>> 
>> 2) What sort of tokenizer would be best?  Here's what I'm using:
>> 
>>   > multiValued="false" termVectors="true" termPositions="true" 
>> termOffsets="true" />
>> 
>>
>>  
>>
>>
>>> generateWordParts="0" generateNumberParts="0" catenateWords="0" 
>> catenateNumbers="0"
>> catenateAll="0" splitOnCaseChange="0"/>
>>  
>>
>> 
>> 
>> Thanks!
>> Pete



Re: Can Solr handle large text files?

2011-10-21 Thread Peter Spam
Thanks for your note, Anand.  What was the maximum chunk size for you?  Could 
you post the relevant portions of your configuration file?


Thanks!
Pete

On Oct 21, 2011, at 4:20 AM, anand.ni...@rbs.com wrote:

> Hi,
> 
> I was also facing the issue of highlighting the large text files. I applied 
> the solution proposed here and it worked, but I am getting the following error:
> 
> 
> Basically 'hitGrouped.vm' is not found. I am using solr-3.4.0. Where can I
> get this file from? Its reference is present in browse.vm:
> 
> 
>  #if($response.response.get('grouped'))
>#foreach($grouping in $response.response.get('grouped'))
>  #parse("hitGrouped.vm")
>#end
>  #else
>#foreach($doc in $response.results)
>  #parse("hit.vm")
>#end
>  #end
> 
> 
> 
> HTTP Status 500 - Can't find resource 'hitGrouped.vm' in classpath or 
> 'C:\caprice\workspace\caprice\dist\DEV\solr\.\conf/', 
> cwd=C:\glassfish3\glassfish\domains\domain1\config 
> java.lang.RuntimeException: Can't find resource 'hitGrouped.vm' in classpath 
> or 'C:\caprice\workspace\caprice\dist\DEV\solr\.\conf/', 
> cwd=C:\glassfish3\glassfish\domains\domain1\config at 
> org.apache.solr.core.SolrResourceLoader.openResource(SolrResourceLoader.java:268)
>  at 
> org.apache.solr.response.SolrVelocityResourceLoader.getResourceStream(SolrVelocityResourceLoader.java:42)
>  at org.apache.velocity.Template.process(Template.java:98) at 
> org.apache.velocity.runtime.resource.ResourceManagerImpl.loadResource(ResourceManagerImpl.java:446)
>  at 
> 
> Thanks & Regards,
> Anand
> Anand Nigam
> RBS Global Banking & Markets
> Office: +91 124 492 5506   
> 
> 
> -Original Message-
> From: karsten-s...@gmx.de [mailto:karsten-s...@gmx.de] 
> Sent: 21 October 2011 14:58
> To: solr-user@lucene.apache.org
> Subject: Re: Can Solr handle large text files?
> 
> Hi Peter,
> 
> highlighting in large text files can not be fast without dividing the 
> original text in small piece.
> So take a look in
> http://xtf.cdlib.org/documentation/under-the-hood/#Chunking
> and in
> http://www.lucidimagination.com/blog/2010/09/16/2446/
> 
> Which means that you should divide your files and use Result Grouping / Field 
> Collapsing to list only one hit per original document.
> 
> (xtf also would solve your problem "out of the box" but xtf does not use 
> solr).
> 
> Best regards
>  Karsten
> 
>  Original-Nachricht 
>> Datum: Thu, 20 Oct 2011 17:59:04 -0700
>> Von: Peter Spam 
>> An: solr-user@lucene.apache.org
>> Betreff: Can Solr handle large text files?
> 
>> I have about 20k text files, some very small, but some up to 300MB, 
>> and would like to do text searching with highlighting.
>> 
>> Imagine the text is the contents of your syslog.
>> 
>> I would like to type in some terms, such as "error" and "mail", and 
>> have Solr return the syslog lines with those terms PLUS two lines of context.
>> Pretty much just like Google's highlighting.
>> 
>> 1) Can Solr handle this?  I had extremely long query times when I 
>> tried this with Solr 1.4.1 (yes I was using TermVectors, etc.).  I 
>> tried breaking the files into 1MB pieces, but searching would be wonky 
>> => return the wrong number of documents (ie. if one file had a term 5 
>> times, and that was the only file that had the term, I want 1 result, not 5 
>> results).
>> 
>> 2) What sort of tokenizer would be best?  Here's what I'm using:
>> 
>>   > multiValued="false" termVectors="true" termPositions="true" 
>> termOffsets="true" />
>> 
>>
>>  
>>
>>
>>> generateWordParts="0" generateNumberParts="0" catenateWords="0" 
>> catenateNumbers="0"
>> catenateAll="0" splitOnCaseChange="0"/>
>>  
>>
>> 
>> 
>> Thanks!
>> Pete
> 

can solr follow and index hyperlinks embedded in rich text documents (pdf, doc, etc)?

2011-10-21 Thread Tod
I have a feeling the answer is "no" since you wouldn't want to start 
indexing a large volume of office documents containing hyperlinks that 
could lead all over the internet.  But, since there might be a use case 
like "a customer just asked me if it could be done?", I thought I would 
make sure.



Thanks - Tod


Re: data import in 4.0

2011-10-21 Thread Alireza Salimi
Hi,

How do you start Solr: through start.jar, or do you deploy it to a web
container?
Sometimes problems like this are because of different class loaders.
I hope my answer would help you.

Regards


On Fri, Oct 21, 2011 at 12:47 PM, Adeel Qureshi wrote:

> Hi I am trying to setup the data import handler with solr 4.0 and having
> some unexpected problems. I have a multi-core setup and only one core
> needed
> the dataimport handler so I have added the request handler to it and added
> the lib imports in config file
>
> 
>  regex="apache-solr-dataimporthandler-extras-\d.*\.jar" />
>
> for some reason this doesnt works .. it still keeps giving me ClassNoFound
> error message so I moved the jars files to the shared lib folder and then
> atleast I was able to see the admin screen with the dataimport plugin
> loaded. But when I try to do the import its throwing this error message
>
> INFO: Starting Full Import
> Oct 21, 2011 11:35:41 AM org.apache.solr.core.SolrCore execute
> INFO: [DW] webapp=/solr path=/select params={command=status&qt=/dataimport}
> status=0 QTime=0
> Oct 21, 2011 11:35:41 AM org.apache.solr.handler.dataimport.SolrWriter
> readIndexerProperties
> WARNING: Unable to read: dataimport.properties
> Oct 21, 2011 11:35:41 AM org.apache.solr.handler.dataimport.DataImporter
> doFullImport
> SEVERE: Full Import failed
> java.lang.NoSuchMethodError: org.apache.solr.update.DeleteUpdateCommand:
> method ()V not found
>at
>
> org.apache.solr.handler.dataimport.SolrWriter.doDeleteAll(SolrWriter.java:193)
>at
>
> org.apache.solr.handler.dataimport.DocBuilder.cleanByQuery(DocBuilder.java:1012)
>at
> org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:183)
>at
>
> org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:335)
>at
>
> org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:393)
>at
>
> org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:374)
> Oct 21, 2011 11:35:41 AM org.apache.solr.handler.dataimport.SolrWriter
> rollback
> SEVERE: Exception while solr rollback.
> java.lang.NoSuchMethodError: org.apache.solr.update.RollbackUpdateCommand:
> method ()V not found
>at
> org.apache.solr.handler.dataimport.SolrWriter.rollback(SolrWriter.java:184)
>at
> org.apache.solr.handler.dataimport.DocBuilder.rollback(DocBuilder.java:249)
>at
>
> org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:340)
>at
>
> org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:393)
>at
>
> org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:374)
> Oct 21, 2011 11:35:43 AM org.apache.solr.core.SolrCore execute
> INFO: [DW] webapp=/solr path=/select params={command=status&qt=/dataimport}
> status=0 QTime=0
>
> Any ideas whats going on .. its complaining about a missing method in
> dataimport classes which doesnt makes sense. Is this some kind of version
> mismatch or what is going on.
>
> I would appreciate any comments.
> Thanks
> Adeel
>



-- 
Alireza Salimi
Java EE Developer


Re: java.lang.NoSuchMethodError: org.slf4j.spi.LocationAwareLogger.log

2011-10-21 Thread Tod

On 10/19/2011 2:58 PM, wrote:

Hi Tod,

I had similar issue with slf4j, but it was NoClassDefFound. Do you
have some other dependencies in your application that use some other
version of slf4j? You can use mvn dependency:tree to get all
dependencies in your application. Or maybe there's some other version
already in your tomcat or application server.

/Tim


I had to start over from scratch but I believe that's exactly what it 
was.  Things are working now.


Thanks.


Re: Error Finding solrconfig.xml

2011-10-21 Thread rocco2004
Hi Hoss,

It ended up being a permissions issue. I moved the example folder to
/usr/local/jakarta/tomcat/webapps/solr/WEB-INF/ and Solr was able to find it.
Java seems to report "file not found" even when it doesn't have permission to
access the file.

I'm wondering, in such cases, which user Tomcat runs as, and what the best
practice is for granting access to an external folder like
"/home/datadmin/public_html/apache-solr/example"?


--
View this message in context: 
http://lucene.472066.n3.nabble.com/Error-Finding-solrconfig-xml-tp3408411p3441554.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: data import in 4.0

2011-10-21 Thread Adeel Qureshi
It's deployed on a Tomcat server.

On Fri, Oct 21, 2011 at 12:49 PM, Alireza Salimi
wrote:

> Hi,
>
> How do you start Solr, through start.jar or you deploy it to a web
> container?
> Sometimes problems like this are because of different class loaders.
> I hope my answer would help you.
>
> Regards
>
>
> On Fri, Oct 21, 2011 at 12:47 PM, Adeel Qureshi  >wrote:
>
> > Hi I am trying to setup the data import handler with solr 4.0 and having
> > some unexpected problems. I have a multi-core setup and only one core
> > needed
> > the dataimport handler so I have added the request handler to it and
> added
> > the lib imports in config file
> >
> > 
> >  > regex="apache-solr-dataimporthandler-extras-\d.*\.jar" />
> >
> > for some reason this doesnt works .. it still keeps giving me
> ClassNoFound
> > error message so I moved the jars files to the shared lib folder and then
> > atleast I was able to see the admin screen with the dataimport plugin
> > loaded. But when I try to do the import its throwing this error message
> >
> > INFO: Starting Full Import
> > Oct 21, 2011 11:35:41 AM org.apache.solr.core.SolrCore execute
> > INFO: [DW] webapp=/solr path=/select
> params={command=status&qt=/dataimport}
> > status=0 QTime=0
> > Oct 21, 2011 11:35:41 AM org.apache.solr.handler.dataimport.SolrWriter
> > readIndexerProperties
> > WARNING: Unable to read: dataimport.properties
> > Oct 21, 2011 11:35:41 AM org.apache.solr.handler.dataimport.DataImporter
> > doFullImport
> > SEVERE: Full Import failed
> > java.lang.NoSuchMethodError: org.apache.solr.update.DeleteUpdateCommand:
> > method ()V not found
> >at
> >
> >
> org.apache.solr.handler.dataimport.SolrWriter.doDeleteAll(SolrWriter.java:193)
> >at
> >
> >
> org.apache.solr.handler.dataimport.DocBuilder.cleanByQuery(DocBuilder.java:1012)
> >at
> >
> org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:183)
> >at
> >
> >
> org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:335)
> >at
> >
> >
> org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:393)
> >at
> >
> >
> org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:374)
> > Oct 21, 2011 11:35:41 AM org.apache.solr.handler.dataimport.SolrWriter
> > rollback
> > SEVERE: Exception while solr rollback.
> > java.lang.NoSuchMethodError:
> org.apache.solr.update.RollbackUpdateCommand:
> > method ()V not found
> >at
> >
> org.apache.solr.handler.dataimport.SolrWriter.rollback(SolrWriter.java:184)
> >at
> >
> org.apache.solr.handler.dataimport.DocBuilder.rollback(DocBuilder.java:249)
> >at
> >
> >
> org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:340)
> >at
> >
> >
> org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:393)
> >at
> >
> >
> org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:374)
> > Oct 21, 2011 11:35:43 AM org.apache.solr.core.SolrCore execute
> > INFO: [DW] webapp=/solr path=/select
> params={command=status&qt=/dataimport}
> > status=0 QTime=0
> >
> > Any ideas whats going on .. its complaining about a missing method in
> > dataimport classes which doesnt makes sense. Is this some kind of version
> > mismatch or what is going on.
> >
> > I would appreciate any comments.
> > Thanks
> > Adeel
> >
>
>
>
> --
> Alireza Salimi
> Java EE Developer
>


Sorting fields with letters?

2011-10-21 Thread Peter Spam
Hi everyone,

I have a field that has a letter in it (for example, 1A1, 2A1, 11C15, etc.).  
Sorting it seems to work most of the time, except for a few cases: 10A1 sorts 
lower than 8A100, and 10A100 sorts lower than 10A99.  Any ideas?  I bet if my 
data had leading zeros (i.e. 10A099), it would behave better?  (But I can't 
really change my data now, as it would take a few days to re-inject - which is 
possible but a hassle).


Thanks!
Pete


Re: data import in 4.0

2011-10-21 Thread Alireza Salimi
That makes classloader conflicts even more likely.
I haven't worked with Solr 4.0, so I don't know whether its set of JAR files
is the same as Solr 3.4's. Anyway, make sure that there is only
ONE instance of apache-solr-dataimporthandler-***.jar in your
whole tomcat+webapp.

Maybe you have this jar file in the CATALINA_HOME\lib folder.

On Fri, Oct 21, 2011 at 3:06 PM, Adeel Qureshi wrote:

> its deployed on a tomcat server ..
>
> On Fri, Oct 21, 2011 at 12:49 PM, Alireza Salimi
> wrote:
>
> > Hi,
> >
> > How do you start Solr, through start.jar or you deploy it to a web
> > container?
> > Sometimes problems like this are because of different class loaders.
> > I hope my answer would help you.
> >
> > Regards
> >
> >
> > On Fri, Oct 21, 2011 at 12:47 PM, Adeel Qureshi  > >wrote:
> >
> > > Hi I am trying to setup the data import handler with solr 4.0 and
> having
> > > some unexpected problems. I have a multi-core setup and only one core
> > > needed
> > > the dataimport handler so I have added the request handler to it and
> > added
> > > the lib imports in config file
> > >
> > >  />
> > >  > > regex="apache-solr-dataimporthandler-extras-\d.*\.jar" />
> > >
> > > for some reason this doesnt works .. it still keeps giving me
> > ClassNoFound
> > > error message so I moved the jars files to the shared lib folder and
> then
> > > atleast I was able to see the admin screen with the dataimport plugin
> > > loaded. But when I try to do the import its throwing this error message
> > >
> > > INFO: Starting Full Import
> > > Oct 21, 2011 11:35:41 AM org.apache.solr.core.SolrCore execute
> > > INFO: [DW] webapp=/solr path=/select
> > params={command=status&qt=/dataimport}
> > > status=0 QTime=0
> > > Oct 21, 2011 11:35:41 AM org.apache.solr.handler.dataimport.SolrWriter
> > > readIndexerProperties
> > > WARNING: Unable to read: dataimport.properties
> > > Oct 21, 2011 11:35:41 AM
> org.apache.solr.handler.dataimport.DataImporter
> > > doFullImport
> > > SEVERE: Full Import failed
> > > java.lang.NoSuchMethodError:
> org.apache.solr.update.DeleteUpdateCommand:
> > > method ()V not found
> > >at
> > >
> > >
> >
> org.apache.solr.handler.dataimport.SolrWriter.doDeleteAll(SolrWriter.java:193)
> > >at
> > >
> > >
> >
> org.apache.solr.handler.dataimport.DocBuilder.cleanByQuery(DocBuilder.java:1012)
> > >at
> > >
> >
> org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:183)
> > >at
> > >
> > >
> >
> org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:335)
> > >at
> > >
> > >
> >
> org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:393)
> > >at
> > >
> > >
> >
> org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:374)
> > > Oct 21, 2011 11:35:41 AM org.apache.solr.handler.dataimport.SolrWriter
> > > rollback
> > > SEVERE: Exception while solr rollback.
> > > java.lang.NoSuchMethodError:
> > org.apache.solr.update.RollbackUpdateCommand:
> > > method ()V not found
> > >at
> > >
> >
> org.apache.solr.handler.dataimport.SolrWriter.rollback(SolrWriter.java:184)
> > >at
> > >
> >
> org.apache.solr.handler.dataimport.DocBuilder.rollback(DocBuilder.java:249)
> > >at
> > >
> > >
> >
> org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:340)
> > >at
> > >
> > >
> >
> org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:393)
> > >at
> > >
> > >
> >
> org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:374)
> > > Oct 21, 2011 11:35:43 AM org.apache.solr.core.SolrCore execute
> > > INFO: [DW] webapp=/solr path=/select
> > params={command=status&qt=/dataimport}
> > > status=0 QTime=0
> > >
> > > Any ideas whats going on .. its complaining about a missing method in
> > > dataimport classes which doesnt makes sense. Is this some kind of
> version
> > > mismatch or what is going on.
> > >
> > > I would appreciate any comments.
> > > Thanks
> > > Adeel
> > >
> >
> >
> >
> > --
> > Alireza Salimi
> > Java EE Developer
> >
>



-- 
Alireza Salimi
Java EE Developer


Re: SOLR CLOUD IN TWO DIFFERENT HOSTS

2011-10-21 Thread Mark Miller
On a quick pass this looks okay - especially if it works on the same host. 
Seems odd you would get a 404 with the zk link - without more info I don't know 
what is up with that, but perhaps when Solr tries to determine the local 
machine's address, it's not finding what you want? We use localHost = "http://" 
+ InetAddress.getLocalHost().getHostName(); by default.

You can override it by setting host= in solr.xml on solr/cores. If you want to 
use a system property instead, use host=${nameOfSysProp}
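A sketch of that override in solr.xml, assuming the attribute names from the SolrCloud development branch (the host value is only an example):

```xml
<solr persistent="true">
  <!-- host can be hard-coded, or pulled from a system property
       via ${nameOfSysProp} syntax -->
  <cores adminPath="/admin/cores" host="megatron.example.com" hostPort="${jetty.port:8983}">
    <core name="collection1" instanceDir="." />
  </cores>
</solr>
```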

- Mark

On Oct 16, 2011, at 8:27 PM, prakash wrote:

> start master node in host-1 (dot)
> dot> java -Dbootstrap_confdir=./solr/conf -Dcollection.configName=myconf
> -DzkRun -jar start.jar
> 
> start another node in host-2 (megatron)
> megatron> java -Djetty.port=8000 -DhostPort=8000 -DzkHost=dot:8983 -jar
> start.jar
> 
> added ipod_video.xml file to dot
> dot> java -Durl=http://localhost:8983/solr/collection1/update -jar post.jar
> ipod_video.xml
> 
> added mem.xml file to megatron
> megatron> java -Durl=http://localhost:8000/solr/collection1/update -jar
> post.jar mem.xml
> 
> Now when I search for content in dot, I'm not getting content from
> megatron (and vice versa).
> 
> But the same kind of setup worked well within the same host with different
> ports (e.g. dot:8983 and dot:8000).
> 
> Am I missing something in this setup?
> 
> and
> 
> How to check the status of nodes ?? (zoo-keeper link says : HTTP ERROR 404) 
> 
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/SOLR-CLOUD-IN-TWO-DIFFERENT-HOSTS-tp3426396p3426396.html
> Sent from the Solr - User mailing list archive at Nabble.com.

- Mark Miller
lucidimagination.com
2011.lucene-eurocon.org | Oct 17-20 | Barcelona


Re: NRT and replication

2011-10-21 Thread Mark Miller
Yeah - a distributed update processor like the one Yonik wrote will do fine in 
simple situations.

On Oct 17, 2011, at 7:33 PM, Esteban Donato wrote:

> thanks Yonik.  Any idea of when this should be completed?  In the
> meantime I think I will have to add docs to every replica, possibly
> implementing an update processor.  Something similar to SOLR-2355?
> 
> On Fri, Oct 14, 2011 at 7:31 PM, Yonik Seeley
>  wrote:
>> On Fri, Oct 14, 2011 at 5:49 PM, Esteban Donato
>>  wrote:
>>>  I found soft commits very useful for NRT search requirements.
>>> However I couldn't figure out how replication works with this feature.
>>>  I mean, if I have N replicas of an index for load balancing purposes,
>>> when I soft commit a doc in one of this nodes, is there any way that
>>> those "in-memory" docs get replicated to the rest of replicas?
>> 
>> Nope.  Index replication isn't really that compatible with NRT.
>> But the new distributed indexing features we're working on will be!
>> The parent issue for this effort is SOLR-2358.
>> 
>> -Yonik
>> http://www.lucene-eurocon.com - The Lucene/Solr User Conference
>> 

- Mark Miller
lucidimagination.com
2011.lucene-eurocon.org | Oct 17-20 | Barcelona


Re: Sorting fields with letters?

2011-10-21 Thread Tomás Fernández Löbbe
Well, yes. You probably have a string field for that content, right? So the
content is being compared as strings, not as numbers; that's why something
like 1000 sorts lower than 2. Leading zeros would be one option. Another option
is to split the field into separate fields and sort by those (this last
option is only recommended if your data always looks similar).
Something like 11C15 would become field1: 11, field2: C, field3: 15. Then use
"sort=field1 asc,field2 asc,field3 asc".

Anyway, both of these options require reindexing.
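A sketch of the split-field approach (field names and types are illustrative; tint is assumed to be the trie-int type from the example schema):

```xml
<!-- 11C15 would be indexed as field1=11, field2=C, field3=15 -->
<field name="field1" type="tint"   indexed="true" stored="false"/>
<field name="field2" type="string" indexed="true" stored="false"/>
<field name="field3" type="tint"   indexed="true" stored="false"/>
```

Queries would then sort with sort=field1 asc,field2 asc,field3 asc (Solr wants an explicit asc/desc for each sort field).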

Regards,

Tomás

On Fri, Oct 21, 2011 at 4:57 PM, Peter Spam  wrote:

> Hi everyone,
>
> I have a field that has a letter in it (for example, 1A1, 2A1, 11C15,
> etc.).  Sorting it seems to work most of the time, except for a few things,
> like 10A1 is lower than 8A100, and 10A100 is lower than 10A99.  Any ideas?
>  I bet if my data had leading zeros (ie 10A099), it would behave better?
>  (But I can't really change my data now, as it would take a few days to
> re-inject - which is possible but a hassle).
>
>
> Thanks!
> Pete
>


Re: can solr follow and index hyperlinks embedded in rich text documents (pdf, doc, etc)?

2011-10-21 Thread Tomás Fernández Löbbe
Hi Tod, Solr doesn't actually crawl. If you need to feed Solr that kind
of information, you'll have to use a crawling tool or implement the crawling
yourself.

Regards,

Tomás

On Fri, Oct 21, 2011 at 2:48 PM, Tod  wrote:

> I have a feeling the answer is "no" since you wouldn't want to start
> indexing a large volume of office documents containing hyperlinks that could
> lead all over the internet.  But, since there might be a use case like "a
> customer just asked me if it could be done?", I thought I would make sure.
>
>
> Thanks - Tod
>


Re: NRT and replication

2011-10-21 Thread Tomás Fernández Löbbe
I was thinking about this: would it make sense to keep the master / slave
architecture, adding documents to both the master and the slaves, doing soft
commits (only) on the slaves and hard commits on the master? That way you
wouldn't be doing any merges on the slaves. Would that make sense?

On Fri, Oct 21, 2011 at 5:43 PM, Mark Miller  wrote:

> Yeah - a distributed update processor like the one Yonik wrote will do fine
> in simple situations.
>
> On Oct 17, 2011, at 7:33 PM, Esteban Donato wrote:
>
> > thanks Yonik.  Any idea of when this should be completed?  In the
> > meantime I think I will have to add docs to every replica, possibly
> > implementing an update processor.  Something similar to SOLR-2355?
> >
> > On Fri, Oct 14, 2011 at 7:31 PM, Yonik Seeley
> >  wrote:
> >> On Fri, Oct 14, 2011 at 5:49 PM, Esteban Donato
> >>  wrote:
> >>>  I found soft commits very useful for NRT search requirements.
> >>> However I couldn't figure out how replication works with this feature.
> >>>  I mean, if I have N replicas of an index for load balancing purposes,
> >>> when I soft commit a doc in one of this nodes, is there any way that
> >>> those "in-memory" docs get replicated to the rest of replicas?
> >>
> >> Nope.  Index replication isn't really that compatible with NRT.
> >> But the new distributed indexing features we're working on will be!
> >> The parent issue for this effort is SOLR-2358.
> >>
> >> -Yonik
> >> http://www.lucene-eurocon.com - The Lucene/Solr User Conference
> >>
>
> - Mark Miller
> lucidimagination.com
> 2011.lucene-eurocon.org | Oct 17-20 | Barcelona
>


Re: Sorting fields with letters?

2011-10-21 Thread Peter Spam
Is there a way to use a custom sorter, to avoid re-indexing?


Thanks!
Pete

On Oct 21, 2011, at 2:13 PM, Tomás Fernández Löbbe wrote:

> Well, yes. You probably have a string field for that content, right? so the
> content is being compared as strings, not as numbers, that why something
> like 1000 is lower than 2. Leading zeros would be an option. Another option
> is to separate the field into numeric fields and sort by those (this last
> option is only recommended if your data always look similar).
> Something like 11C15 to field1: 11, field2:C field3: 15. Then use
> "sort=field1,field2,field3".
> 
> Anyway, both this options require reindexing.
> 
> Regards,
> 
> Tomás
> 
> On Fri, Oct 21, 2011 at 4:57 PM, Peter Spam  wrote:
> 
>> Hi everyone,
>> 
>> I have a field that has a letter in it (for example, 1A1, 2A1, 11C15,
>> etc.).  Sorting it seems to work most of the time, except for a few things,
>> like 10A1 is lower than 8A100, and 10A100 is lower than 10A99.  Any ideas?
>> I bet if my data had leading zeros (ie 10A099), it would behave better?
>> (But I can't really change my data now, as it would take a few days to
>> re-inject - which is possible but a hassle).
>> 
>> 
>> Thanks!
>> Pete
>> 



Re: Sorting fields with letters?

2011-10-21 Thread Tomás Fernández Löbbe
I don't know if you'll find exactly what you need, but you can sort by any
field or FunctionQuery. See http://wiki.apache.org/solr/FunctionQuery

On Fri, Oct 21, 2011 at 7:03 PM, Peter Spam  wrote:

> Is there a way to use a custom sorter, to avoid re-indexing?
>
>
> Thanks!
> Pete
>
> On Oct 21, 2011, at 2:13 PM, Tomás Fernández Löbbe wrote:
>
> > Well, yes. You probably have a string field for that content, right? So
> > the content is being compared as strings, not as numbers; that's why
> > something like 1000 sorts lower than 2. Leading zeros would be an option.
> > Another option is to split the field into separate numeric fields and sort
> > by those (this last option is only recommended if your data always looks
> > similar). Something like 11C15 would map to field1:11, field2:C, field3:15.
> > Then use "sort=field1,field2,field3".
> >
> > Anyway, both this options require reindexing.
> >
> > Regards,
> >
> > Tomás
> >
> > On Fri, Oct 21, 2011 at 4:57 PM, Peter Spam  wrote:
> >
> >> Hi everyone,
> >>
> >> I have a field that has a letter in it (for example, 1A1, 2A1, 11C15,
> >> etc.).  Sorting it seems to work most of the time, except for a few
> things,
> >> like 10A1 is lower than 8A100, and 10A100 is lower than 10A99.  Any
> ideas?
> >> I bet if my data had leading zeros (i.e., 10A099), it would behave better?
> >> (But I can't really change my data now, as it would take a few days to
> >> re-inject - which is possible but a hassle).
> >>
> >>
> >> Thanks!
> >> Pete
> >>
>
>
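For what it's worth, the leading-zeros idea can be demonstrated outside Solr to show why it fixes the ordering: pad every run of digits to a fixed width and a plain string comparison matches the intended numeric order. A minimal Python sketch (the codes are the ones from this thread; the padding width of 6 is an arbitrary assumption):

```python
import re

def natural_key(code, width=6):
    """Zero-pad every digit run so lexicographic order matches numeric order.
    E.g. "10A99" -> "000010A000099" (width=6 is an arbitrary choice)."""
    return re.sub(r"\d+", lambda m: m.group().zfill(width), code)

codes = ["1A1", "2A1", "11C15", "8A100", "10A1", "10A99", "10A100"]

# A plain string sort shows the problem: 10A1 lands before 8A100.
assert sorted(codes)[0] == "10A1"

# Sorting by the padded key restores the intended order:
fixed = sorted(codes, key=natural_key)
print(fixed)  # ['1A1', '2A1', '8A100', '10A1', '10A99', '10A100', '11C15']
```

Re-injecting with keys padded this way (as Tomás suggests) makes an ordinary string sort in Solr behave the same way.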


Date boosting with dismax question

2011-10-21 Thread Craig Stadler

Solr Specification Version: 1.4.0
Solr Implementation Version: 1.4.0 833479 - grantingersoll - 2009-11-06 
12:33:40

Lucene Specification Version: 2.9.1
Lucene Implementation Version: 2.9.1 832363 - 2009-11-03 04:37:25

<fieldType name="tdate" class="solr.TrieDateField" omitNorms="true" precisionStep="6" positionIncrementGap="0"/>


<field name="created" type="tdate" indexed="true" stored="false" omitNorms="true" required="false" omitTermFreqAndPositions="true" />


I am using 'created' as the name of the date field.

My dates are being populated as such :
1980-01-01T00:00:00Z

Search handler (solrconfig) :



<requestHandler name="/dismax" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">dismax</str>
    <str name="echoParams">explicit</str>
    <float name="tie">0.1</float>
    <str name="qf">name0^2 other^1</str>
    <str name="pf">name0^2 other^1</str>
    <str name="ps">3</str>
    <str name="qs">3</str>
    <str name="q.alt">*:*</str>
  </lst>
</requestHandler>



--

Query :

/solr/ftf/dismax/?q=libya
&debugQuery=off
&hl=true
&start=
&rows=10
--

I am trying to factor 'created' into the SCORE (boost). I have tried a 
million ways to do this with no success. I know the dates are populated 
correctly because I can sort by them. Can anyone help me implement date 
boosting with dismax in this scenario?


-Craig
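For reference, the usual suggestion for this is a boost function. The Solr wiki's FunctionQuery page shows a recency boost of the form `recip(ms(NOW,datefield),3.16e-11,1,1)`; added to the dismax handler defaults above, it might look like the sketch below. The constant 3.16e-11 is roughly 1/(one year in milliseconds), so a year-old document gets about half the boost of a brand-new one — it is the wiki's illustration, not a value tuned for this index, and `ms()` requires a Trie-based date field such as 'created'.

```xml
<!-- Sketch only: a boost function ("bf") added to the dismax defaults.
     recip(ms(NOW,created),3.16e-11,1,1) decays smoothly with document age. -->
<str name="bf">recip(ms(NOW,created),3.16e-11,1,1)</str>
```

The same function can also be passed per-request, e.g. `&bf=recip(ms(NOW,created),3.16e-11,1,1)`, which is handy for experimenting before committing it to solrconfig.xml.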