Re: PDF extraction leads to reversed words

2010-03-09 Thread Dominique Bejean

Hi,

The problem comes from PDFBox 
(http://brutus.apache.org/jira/browse/PDFBOX-377) and is fixed now. 
However, Tika doesn't yet use the fixed version of PDFBox.

So for PDF text extraction, I don't use Tika but pdftotext.

Dominique


On 09/03/10 06:00, Robert Muir wrote:

it is an optional dependency of PDFBox. If ICU is available, then it
is capable of processing Arabic PDF files.

The problem is that Arabic "text" in PDF files is really glyphs
(encoded in visual order) and needs to be 'unshaped' with some stuff
that isn't in the JDK.

If the size of the default ICU jar file is the issue here, we can
consider an alternative: The default ICU jar is very large as it
includes everything, yet it can be customized to only include what is
needed: http://apps.icu-project.org/datacustom/

We did this in lucene for the collation contrib, to shrink the jar
about 2MB: http://issues.apache.org/jira/browse/LUCENE-1867

For this use-case, it could be even smaller, as most of the huge size
of ICU comes from large CJK collation tables (needed for collation,
but not for this Arabic PDF extraction).

In reality I don't really like doing this as it might confuse users
(e.g. people that want collation, too), and ICU is useful for other
things, but if that's what we have to do, we should do it so that
Arabic PDF files will work.

On Mon, Mar 8, 2010 at 11:53 PM, Lance Norskog  wrote:
   

Is this a mistake in the Tika library collection in the Solr trunk?

On Mon, Mar 8, 2010 at 5:15 PM, Robert Muir  wrote:
 

I think the problem is that Solr does not include the ICU4J jar, so it
won't work with Arabic PDF files.

Try putting ICU4J 3.8 (http://site.icu-project.org/download) in your classpath.

On Mon, Mar 8, 2010 at 6:30 PM, Abdelhamid  ABID  wrote:
   

Hi,
Posting Arabic PDF files to Solr using a web form (to solr/update/extract),
I get extracted texts with each word displayed in reverse direction (instead
of right to left).
When I perform searches against these texts with -always- reversed keywords I
get results, but reversed.
This problem doesn't occur when posting MS Word documents.
I think the problem comes from Tika!

Any clue ?

--
elsadek
Software Engineer- J2EE / WEB / ESB MULE

 



--
Robert Muir
rcm...@gmail.com

   



--
Lance Norskog
goks...@gmail.com

 



   


Re: Tomcat save my Index temp ...

2010-03-09 Thread stocki

Okay, I installed my Solr as the wiki said and made a new try. Here is one of
my two XML files:

/var/lib/conf/Catalina/localhost/suggest.xml


   



Should I set name="solr/home" to name="$SOLR_HOME"?

I did not find the reason.

Solr home is set by:
export JAVA_OPTS="$JAVA_OPTS
-Dsolr.solr.home=/home/sites/path/to/home/cores"

In my cores/ home folder is the serv.xml from my post above.
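For comparison, a context fragment of the kind the SolrTomcat wiki describes might look like this (a sketch only; the paths, war filename, and core name here are assumptions based on this thread, not the poster's actual file):

```xml
<?xml version="1.0" encoding="utf-8"?>
<!-- e.g. /var/lib/conf/Catalina/localhost/suggest.xml (path assumed) -->
<Context docBase="/home/sites/path/to/apache-solr-1.3.0.war"
         debug="0" crossContext="true">
  <!-- the name must stay literally "solr/home"; only the value is your path -->
  <Environment name="solr/home" type="java.lang.String"
               value="/home/sites/path/to/home/cores" override="true"/>
</Context>
```

So no, name="solr/home" should not be changed to "$SOLR_HOME": it is a fixed JNDI name. Solr checks the JNDI entry solr/home first and falls back to the -Dsolr.solr.home system property shown above.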





Jens Kapitza-2 wrote:
> 
> On 08.03.2010 15:08, stocki wrote:
>> Hello.
>>
>> I use 2 cores for Solr.
>>
>> When I restart my Tomcat on Debian, Tomcat deletes my index.
>>
> you should check your tomcat-setup.
>> I set data.dir to
>> ${solr.data.dir:./suggest/data}
>> and
>> ${solr.data.dir:./search/data}
>>
>>
> use an absolute path [you have not set the solr.home path]; this is 
> the working/tmp dir of Tomcat by default.
>> 
>>  > dataDir="/search/data/index"/>
>>  > dataDir="/suggest/data/index"/>
>> 
>>
>>
> is OK, but this is relative to solr.home.
>> so. why is my index only temp ?
>>
>>
> try to setup solr again.
> http://wiki.apache.org/solr/SolrTomcat
> 
> try to setup with Context fragment.
> 
> Create a Tomcat Context fragment to point /docBase/ to the 
> /$SOLR_HOME/apache-solr-1.3.0.war/ file and /solr/home/ to /$SOLR_HOME/:
> 
> 
> and avoid storing the data in .../tmp/
> 
> 
> 



-- 
View this message in context: 
http://old.nabble.com/Tomcat-save-my-Index-temp-...-tp27819967p27833705.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: PDF extraction leads to reversed words

2010-03-09 Thread Robert Muir
Sorry for the link to the wrong JIRA issue; I was looking at another issue.

It's here: https://issues.apache.org/jira/browse/SOLR-1813

Again, you will need to apply it to trunk, I think, as that's the only
place I have tested it.


-- 
Robert Muir
rcm...@gmail.com


indexing key/value field type

2010-03-09 Thread muneeb

Hi,

I have built an index of several million documents with all primitive-type
fields, either String, text or int. I now have another multivalued field to
index for each document, which is a list of tags as a hashmap:

tags, where the key is a String and the value is an Int.
The key is a given tag and the value is a count of how many users used this
tag for a given document.

How can I index and store a key/value type of field, such that one can
search on the values as well as the keys of this field?

I have looked at the FAQs, where one mailing-list post suggests using a
dynamic field, such as:



but how would we search on the dynamic field names?
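For what it's worth, the dynamic-field approach the FAQ hints at might look like this (a sketch; the field name pattern and type are assumptions, not from the original post):

```xml
<!-- schema.xml: one integer field per tag key, e.g. tag_java, tag_solr -->
<dynamicField name="tag_*" type="sint" indexed="true" stored="true"/>
```

A document then carries concrete fields such as tag_java=12 and tag_solr=5, and a query like tag_java:[1 TO *] finds documents tagged "java". The catch, as the question notes, is that you cannot easily search across *all* tag names without enumerating them (or copyField-ing the keys into a catch-all field).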

I guess the final option I have is to make my own custom field type and
analyzer to deal with this... Any suggestions on how I should go about this?

Thanks very much in advance.

-Ali
-- 
View this message in context: 
http://old.nabble.com/indexing-key-value-field-type-tp27836710p27836710.html
Sent from the Solr - User mailing list archive at Nabble.com.



Confused by Solr Ranking

2010-03-09 Thread abhishes

I am indexing a column in a database. I have chosen field type of text for
this column (this type was defined in the sample schema file which comes in
the Solr Example).

When I search for the word "impress" and look at the top 3 results, I get
these 3 documents:

bare desire pronounce villainy draught beasts blockish
impression acquit 
bare impression villainy pronounce beasts desire blockish
draught acquit 
beasts desire villainy pronounce bare acquit impression
draught blockish 

But here the text doesn't really contain the word "impress"; it contains the
word "impression".

Now the database does contain a few rows where the word "impress" is present,
but those rows do not come in the top 3 results.

So my question is: why did the rows containing the word "impression" get
ranked higher than the rows containing the word "impress" when I searched
for "impress"?

My field type "text" is defined as follows in the schema:

[the fieldType XML was stripped by the mailing list archive]
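For readers without the example schema at hand: the stock "text" type in the Solr example is roughly the following (reproduced as a sketch from the example schema, so details may differ from the poster's copy). The EnglishPorterFilterFactory near the end is what stems "impression" and "impress" to the same token:

```xml
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
            generateNumberParts="1" catenateWords="1" catenateNumbers="1"
            catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- the stemmer: maps "impression" and "impress" to one stem -->
    <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <!-- the query analyzer is the same chain plus a SynonymFilterFactory -->
</fieldType>
```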

-- 
View this message in context: 
http://old.nabble.com/Confused-by-Solr-Ranking-tp27834227p27834227.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: PDF extraction leads to reversed words

2010-03-09 Thread Abdelhamid ABID
Nor does the 3.8 version change anything!

On 3/9/10, Robert Muir  wrote:
>
> I think the problem is that Solr does not include the ICU4J jar, so it
> won't work with Arabic PDF files.
>
> Try putting ICU4J 3.8 (http://site.icu-project.org/download) in your
> classpath.
>
>
> On Mon, Mar 8, 2010 at 6:30 PM, Abdelhamid  ABID 
> wrote:
> > Hi,
> > Posting arabic pdf files to Solr using a web form (to
> solr/update/extract)
> > get extracted texts and each words displayed in reverse direction(instead
> of
> > right to left).
> > When perform search against these texts with -always- reversed key-words
> I
> > get results but reversed.
> > This problem doesn't occur when posting MsWord document.
> > I think the problem come from Tika !
> >
> > Any clue ?
> >
> > --
> > elsadek
> > Software Engineer- J2EE / WEB / ESB MULE
> >
>
>
>
>
> --
> Robert Muir
> rcm...@gmail.com
>



-- 
Abdelhamid ABID
Software Engineer- J2EE / WEB / ESB MULE


Re: PDF extraction leads to reversed words

2010-03-09 Thread Abdelhamid ABID
I'm using version 1.4 of Solr.

On 3/9/10, Robert Muir  wrote:
>
> On Tue, Mar 9, 2010 at 9:44 AM, Abdelhamid  ABID 
> wrote:
> > I put ICU4J 4.2 in the lib of Solr, nothing changed, I'm trying now with
> > ICU4J 3.8
> >
>
>
> Hello, what version of Solr are you using? I think you will need to
> use the trunk version.
>
> I created a patch for this issue that you can apply to trunk (with all
> necessary resources)
> here: https://issues.apache.org/jira/browse/SOLR-1657
>
> The included testcase fails without adding icu4j to the lib directory
> (as the arabic text
> is reversed), and passes with it.
>
>
> --
>
> Robert Muir
> rcm...@gmail.com
>



-- 
Abdelhamid ABID
Software Engineer- J2EE / WEB / ESB MULE


Re: Wildcard question -- case issue

2010-03-09 Thread cjkadakia

Understood. My solution was to convert any search terms with an asterisk to
lowercase prior to submitting to solr and it seems to be working correctly
now. Thanks for your help.
-- 
View this message in context: 
http://old.nabble.com/Wildcard-questioncase-issue-tp27823332p27836740.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Confused by Solr Ranking

2010-03-09 Thread Avi Rosenschein
>
>
> > I kind of suspected stemming to be the reason behind this.
> > But I consider stemming to be a good feature.
>
> This is the side effect of stemming. Stemming increases recall while
> harming precision.
>

This is a side effect of stemming, the way it is currently implemented in
Lucene. Stemming could theoretically increase recall without hurting
precision or relevancy. One way to do this would be to always store the
original token, along with the stemmed token. Then, at scoring time, give a
boost to matches which are closer to the original form.
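A schema-level sketch of this idea (field and type names here are made up; Erick suggests essentially the same copy-field approach elsewhere in this thread): keep an unstemmed copy of the field and give exact-form matches a higher boost at query time:

```xml
<!-- "text_exact" would be a copy of the "text" type without its stemming filter -->
<field name="body" type="text" indexed="true" stored="true"/>
<field name="body_exact" type="text_exact" indexed="true" stored="false"/>
<copyField source="body" dest="body_exact"/>
```

A query such as body_exact:impress^2 OR body:impress then ranks documents containing the literal form above purely stemmed matches.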

-- Avi


Re: PDF extraction leads to reversed words

2010-03-09 Thread Robert Muir
This depends on what version of Solr you are using; the trunk version
has a version of Tika that supports this. See SOLR-1813.

On Tue, Mar 9, 2010 at 3:59 AM, Dominique Bejean
 wrote:
> Hi,
>
> The problem comes from PDFBox
> (http://brutus.apache.org/jira/browse/PDFBOX-377) and is fixed now. However,
> Tika doesn't yet use this version of PDFBox.
> So for PDF text extraction, I don't use Tika but pdftotext.
>
> Dominique
>
>
> On 09/03/10 06:00, Robert Muir wrote:
>>
>> it is an optional dependency of PDFBox. If ICU is available, then it
>> is capable of processing Arabic PDF files.
>>
>> The problem is that Arabic "text" in PDF files is really glyphs
>> (encoded in visual order) and needs to be 'unshaped' with some stuff
>> that isn't in the JDK.
>>
>> If the size of the default ICU jar file is the issue here, we can
>> consider an alternative: The default ICU jar is very large as it
>> includes everything, yet it can be customized to only include what is
>> needed: http://apps.icu-project.org/datacustom/
>>
>> We did this in lucene for the collation contrib, to shrink the jar
>> about 2MB: http://issues.apache.org/jira/browse/LUCENE-1867
>>
>> For this use-case, it could be even smaller, as most of the huge size
>> of ICU comes from large CJK collation tables (needed for collation,
>> but not for this Arabic PDF extraction).
>>
>> In reality I don't really like doing this as it might confuse users
>> (e.g. people that want collation, too), and ICU is useful for other
>> things, but if thats what we have to do, we should do it so that
>> Arabic PDF files will work.
>>
>> On Mon, Mar 8, 2010 at 11:53 PM, Lance Norskog  wrote:
>>
>>>
>>> Is this a mistake in the Tika library collection in the Solr trunk?
>>>
>>> On Mon, Mar 8, 2010 at 5:15 PM, Robert Muir  wrote:
>>>

 I think the problem is that Solr does not include the ICU4J jar, so it
 won't work with Arabic PDF files.

 Try putting ICU4J 3.8 (http://site.icu-project.org/download) in your
 classpath.

 On Mon, Mar 8, 2010 at 6:30 PM, Abdelhamid  ABID
  wrote:

>
> Hi,
> Posting arabic pdf files to Solr using a web form (to
> solr/update/extract)
> get extracted texts and each words displayed in reverse
> direction(instead of
> right to left).
> When perform search against these texts with -always- reversed
> key-words I
> get results but reversed.
> This problem doesn't occur when posting MsWord document.
> I think the problem come from Tika !
>
> Any clue ?
>
> --
> elsadek
> Software Engineer- J2EE / WEB / ESB MULE
>
>


 --
 Robert Muir
 rcm...@gmail.com


>>>
>>>
>>> --
>>> Lance Norskog
>>> goks...@gmail.com
>>>
>>>
>>
>>
>>
>



-- 
Robert Muir
rcm...@gmail.com


Re: [ANN] Zoie Solr Plugin - Zoie Solr Plugin enables real-time update functionality for Apache Solr 1.4+

2010-03-09 Thread Shalin Shekhar Mangar
I think Don is talking about Zoie - it requires a long uniqueKey.

On Tue, Mar 9, 2010 at 10:18 AM, Lance Norskog  wrote:

> Solr unique ids can be any type. The QueryElevateComponent complains
> if the unique id is not a string, but you can comment out the QEC.  I
> have one benchmark test with 2 billion documents with an integer id.
> Works great.
>
> On Mon, Mar 8, 2010 at 5:06 PM, Don Werve  wrote:
> > Too bad it requires integer (long) primary keys... :/
> >
> > 2010/3/8 Ian Holsman 
> >
> >>
> >> I just saw this on twitter, and thought you guys would be interested.. I
> >> haven't tried it, but it looks interesting.
> >>
> >> http://snaprojects.jira.com/wiki/display/ZOIE/Zoie+Solr+Plugin
> >>
> >> Thanks for the RT Shalin!
> >>
> >
>
>
>
> --
> Lance Norskog
> goks...@gmail.com
>



-- 
Regards,
Shalin Shekhar Mangar.


RE: HTML encode extracted docs - Problems with solr.HTMLStripCharFilter

2010-03-09 Thread Mark Roberts
Sounds like "solr.HTMLStripCharFilter" may work... except I'm getting a couple 
of problems:

1) HTML still seems to be getting into my content field.

All I did was add the HTMLStripCharFilter charFilter to the 
index analyzer for my "text" fieldType.


2) Somehow it seems to have broken my highlighting; I get this error:

'org.apache.lucene.search.highlight.InvalidTokenOffsetsException: Token wrong 
exceeds length of provided text sized 3862'



Any ideas how I can fix this?
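For reference, the wiring described in (1) typically looks like the following (a sketch; the poster's actual XML was stripped by the archive, so the surrounding analyzer chain is an assumption):

```xml
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <!-- must come before the tokenizer: strips tags from the raw character stream -->
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Note that char filters only affect the *indexed* tokens; the *stored* value keeps the raw HTML, which can explain symptom (1) if you are looking at stored content or highlighted snippets.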





-Original Message-
From: Lance Norskog [mailto:goks...@gmail.com] 
Sent: 09 March 2010 04:36
To: solr-user@lucene.apache.org
Subject: Re: HTML encode extracted docs

A Tika integration with the DataImportHandler is in the Solr trunk.
With this, you can copy the raw HTML into different fields and process
one copy with Tika.

If it's just straight HTML, would the HTMLStripCharFilter be good enough?

http://www.lucidimagination.com/search/document/CDRG_ch05_5.7.2

On Mon, Mar 8, 2010 at 5:50 AM, Mark Roberts  wrote:
> I'm uploading .htm files to be extracted - some of these files are "include" 
> files that have snippets of HTML rather than fully formed html documents.
>
> solr-cell stores the raw HTML for these items, rather than extracting the 
> text. Is there any way I can get solr to encode this content prior to storing 
> it?
>
> At the moment, I have the problem that when the highlighted snippets are
> retrieved via search, I need to parse the snippet and HTML-encode the bits of
> HTML that were indexed, whilst *not* encoding the bits that were added by
> the highlighter, which is messy and time consuming.
>
> Thanks! Mark,
>



-- 
Lance Norskog
goks...@gmail.com


Re: Tomcat save my Index temp ...

2010-03-09 Thread stocki

Okay, I got it... I am stupid. XD I set my dataDir to /var/data/solr/... and
gave it the correct rights; now it runs.
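In solrconfig.xml terms, the fix amounts to an absolute dataDir per core (the exact path is an assumption based on the message):

```xml
<!-- an absolute path survives Tomcat restarts; the earlier relative
     ./suggest/data resolved against Tomcat's working directory -->
<dataDir>${solr.data.dir:/var/data/solr/suggest/data}</dataDir>
```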



Jens Kapitza-2 wrote:
> 
> On 08.03.2010 15:08, stocki wrote:
>> Hello.
>>
>> I use 2 cores for Solr.
>>
>> When I restart my Tomcat on Debian, Tomcat deletes my index.
>>
> you should check your tomcat-setup.
>> I set data.dir to
>> ${solr.data.dir:./suggest/data}
>> and
>> ${solr.data.dir:./search/data}
>>
>>
> use an absolute path [you have not set the solr.home path]; this is 
> the working/tmp dir of Tomcat by default.
>> 
>>  > dataDir="/search/data/index"/>
>>  > dataDir="/suggest/data/index"/>
>> 
>>
>>
> is OK, but this is relative to solr.home.
>> so. why is my index only temp ?
>>
>>
> try to setup solr again.
> http://wiki.apache.org/solr/SolrTomcat
> 
> try to setup with Context fragment.
> 
> Create a Tomcat Context fragment to point /docBase/ to the 
> /$SOLR_HOME/apache-solr-1.3.0.war/ file and /solr/home/ to /$SOLR_HOME/:
> 
> 
> and avoid storing the data in .../tmp/
> 
> 
> 





-- 
View this message in context: 
http://old.nabble.com/Tomcat-save-my-Index-temp-...-tp27819967p27834924.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Warning : no lockType configured for...

2010-03-09 Thread Mani EZZAT

Ok, I think I know where the problem is:

@Deprecated
public SolrIndexWriter(String name, String path,
    DirectoryFactory dirFactory, boolean create, IndexSchema schema,
    SolrIndexConfig config) throws IOException {
  super(getDirectory(path, dirFactory, null),
      config.luceneAutoCommit, schema.getAnalyzer(), create);
  init(name, schema, config);
}

It's the constructor used by SolrCore in r772051.
As you can see, the getDirectory method takes a null parameter instead of
config, so the lockType parameter is never set.


But in the current trunk, another constructor is used, so everything is
fine (the config instance is passed as a parameter instead of null).


Just keeping you updated :)

Thanks for your help

PS: should I file some kind of bug report even if everything is OK now?
(I'm asking because I didn't see anything related to this problem in
JIRA, so maybe you want to keep a trace...)


Mani EZZAT wrote:

Should I file a bug?

Mani EZZAT wrote:
  
I tired using the default solrconfig and schema (from the example in 1.3 
release) and I still get the same warnings


When I look at the log, the solrconfig seems correcly loaded, but 
something is strange :

newSearcher warming query from solrconfig.xml}]}
2010-03-04 10:35:32,545 DEBUG [Config] solrconfig.xml missing optional 
mainIndex/deletionPolicy/@class
2010-03-04 10:35:32,556 DEBUG [Config] solrconfig.xml 
mainIndex/unlockOnStartup=false
2010-03-04 10:35:32,563 WARN  [SolrCore] [core] Solr index directory 
'./solr/data/index' doesn't exist. Creating new index...
2010-03-04 10:35:32,589 WARN  [SolrIndexWriter] No lockType configured 
for ./solr/data/index/ assuming 'simple'


Here I can see solr checking the properties in the Config (or maybe 
SolrConfig, not sure about the class) and the lockType property isn't 
here... and here comes the warning..


I'm not sure what it means.  The information is lost somwhere maybe, but 
everything seems fine to me when I look the source code


Also, It happens for the first core I create (and every cores after), so 
I don't think its related to the fact that I create dynamically several 
cores. Even If i create only 1 core, I'll get the warning since I get it 
for the first one anyway


Mani EZZAT wrote:
  

I don't know, I didn't try because I have the need to create a different 
core each time.


I'll do some tests with the default config and will report back to all 
of you

Thank you for your time

Tom Hill wrote:
  

  

Hi Mani,

Mani EZZAT wrote:
  

  

I'm dynamically creating cores with a new index, using the same schema 
and solrconfig.xml

  

  

Does the problem occur if you use the same configuration in a single, static
core?

Tom





Re: Confused by Solr Ranking

2010-03-09 Thread Michael Lackhoff
On 09.03.2010 16:01 Ahmet Arslan wrote:

> 
>> I kind of suspected stemming to be the reason behind this.
>> But I consider stemming to be a good feature.
> 
> This is the side effect of stemming. Stemming increases recall while harming 
> precision.

But most people want the best possible combination of both, something like:
(raw_field:word OR stemmed_field:word^0.5)
and it is nice that Solr allows such arrangements but it would be even
nicer to have some sort of automatic "take this field, transform the
contents in a couple of ways and do some boosting in the order given".
At least this would be my wish for the recent question about the one
feature I would like to see.
Or even better, allow not only a hierarchy of transformations but also a
hierarchy of fields (like in dismax, but with the full power of the
standard request handler)

-Michael



Dummy boost question

2010-03-09 Thread Mark Roberts
Hi, 

I have indexed some documents that have title, content and keyword 
(multi-value).

I want to *search* on title and content, and then, within these results *boost* 
by keyword.

I have set up my qf as such:

  <str name="qf">content^0.5 title^1.0</str>

And my bq as such:

keyword:(*.*)^1.0

But I'm fairly sure that this is boosting on all keywords (not just ones
matching my search term).

Does anyone know how to achieve what I want? (I'm using the DisMax query
request handler, btw.)


Thanks!
Mark 


Re: Confused by Solr Ranking

2010-03-09 Thread Erick Erickson
Well, that's a matter of opinion, isn't it? If *your* application
requires this, you could always copy the field to a non-stemmed
field and apply boosts...

Erick

On Tue, Mar 9, 2010 at 9:21 AM, abhishes  wrote:

>
> I kind of suspected stemming to be the reason behind this. But I consider
> stemming to be a good feature.
>
> The point is that if an exact match exists, then solr should report that
> first and then stemmed results should be reported.
>
> disabling stemming altogether would be a step in the wrong direction.
>
>
>
> Shalin Shekhar Mangar wrote:
> >
> > On Tue, Mar 9, 2010 at 4:38 PM, abhishes  wrote:
> >
> >>
> >> I am indexing a column in a database. I have chosen field type of text
> >> for
> >> this column (this type was defined in the sample schema file which comes
> >> in
> >> the Solr Example).
> >>
> >> When I search for the word "impress" and top 3 results. I get these 3
> >> documents
> >>
> >> bare desire pronounce villainy draught beasts blockish
> >> impression acquit
> >> bare impression villainy pronounce beasts desire
> >> blockish
> >> draught acquit
> >> beasts desire villainy pronounce bare acquit impression
> >> draught blockish
> >>
> >> But here the TEXT doesn't really contain the word "impress" it contains
> >> the
> >> word "impression"
> >>
> >> Now the database does contain a few rows where the word "impress" is
> >> there,
> >> but those rows do not come in top 3 results.
> >>
> >> So my question is that why did the rows containing the word "impression"
> >> got
> >> ranked higher than the rows containing the word "impress" when I
> searched
> >> for "impress"?
> >>
> >>
> > The "text" type is configured to do stemming on the input. So I'm
> guessing
> > that "impression" and "impress" both stem to the same form. You can
> remove
> > the EnglishPorterFilterFactory from the text type if you don't need
> > stemming.
> >
> > --
> > Regards,
> > Shalin Shekhar Mangar.
> >
> >
>
> --
> View this message in context:
> http://old.nabble.com/Confused-by-Solr-Ranking-tp27834227p27836299.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>
>


Re: PDF extraction leads to reversed words

2010-03-09 Thread Abdelhamid ABID
I put ICU4J 4.2 in the lib of Solr, nothing changed, I'm trying now with
ICU4J 3.8

On 3/9/10, Robert Muir  wrote:
>
> I think the problem is that Solr does not include the ICU4J jar, so it
> won't work with Arabic PDF files.
>
> Try putting ICU4J 3.8 (http://site.icu-project.org/download) in your
> classpath.
>
>
> On Mon, Mar 8, 2010 at 6:30 PM, Abdelhamid  ABID 
> wrote:
> > Hi,
> > Posting arabic pdf files to Solr using a web form (to
> solr/update/extract)
> > get extracted texts and each words displayed in reverse direction(instead
> of
> > right to left).
> > When perform search against these texts with -always- reversed key-words
> I
> > get results but reversed.
> > This problem doesn't occur when posting MsWord document.
> > I think the problem come from Tika !
> >
> > Any clue ?
> >
> > --
> > elsadek
> > Software Engineer- J2EE / WEB / ESB MULE
> >
>
>
>
>
> --
> Robert Muir
> rcm...@gmail.com
>



-- 
Abdelhamid ABID
Software Engineer- J2EE / WEB / ESB MULE


Re: PDF extraction leads to reversed words

2010-03-09 Thread Abdelhamid ABID
I tried a couple of times to get this patch, but the download fails; a
filesize mismatch or some such error popped up.
Is there another link?

On 3/9/10, Dominique Bejean  wrote:
>
> Hi,
>
> The problem comes from PDFBox (
> http://brutus.apache.org/jira/browse/PDFBOX-377) and is fixed now. However,
> Tika doesn't yet use this version of PDFBox.
> So for PDF text extraction, I don't use Tika but pdftotext.
>
> Dominique
>
>
> On 09/03/10 06:00, Robert Muir wrote:
>
>  it is an optional dependency of PDFBox. If ICU is available, then it
>> is capable of processing Arabic PDF files.
>>
>> The problem is that Arabic "text" in PDF files is really glyphs
>> (encoded in visual order) and needs to be 'unshaped' with some stuff
>> that isn't in the JDK.
>>
>> If the size of the default ICU jar file is the issue here, we can
>> consider an alternative: The default ICU jar is very large as it
>> includes everything, yet it can be customized to only include what is
>> needed: http://apps.icu-project.org/datacustom/
>>
>> We did this in lucene for the collation contrib, to shrink the jar
>> about 2MB: http://issues.apache.org/jira/browse/LUCENE-1867
>>
>> For this use-case, it could be even smaller, as most of the huge size
>> of ICU comes from large CJK collation tables (needed for collation,
>> but not for this Arabic PDF extraction).
>>
>> In reality I don't really like doing this as it might confuse users
>> (e.g. people that want collation, too), and ICU is useful for other
>> things, but if thats what we have to do, we should do it so that
>> Arabic PDF files will work.
>>
>> On Mon, Mar 8, 2010 at 11:53 PM, Lance Norskog  wrote:
>>
>>
>>> Is this a mistake in the Tika library collection in the Solr trunk?
>>>
>>> On Mon, Mar 8, 2010 at 5:15 PM, Robert Muir  wrote:
>>>
>>>
 I think the problem is that Solr does not include the ICU4J jar, so it
 won't work with Arabic PDF files.

 Try putting ICU4J 3.8 (http://site.icu-project.org/download) in your
 classpath.

 On Mon, Mar 8, 2010 at 6:30 PM, Abdelhamid  ABID
  wrote:


> Hi,
> Posting arabic pdf files to Solr using a web form (to
> solr/update/extract)
> get extracted texts and each words displayed in reverse
> direction(instead of
> right to left).
> When perform search against these texts with -always- reversed
> key-words I
> get results but reversed.
> This problem doesn't occur when posting MsWord document.
> I think the problem come from Tika !
>
> Any clue ?
>
> --
> elsadek
> Software Engineer- J2EE / WEB / ESB MULE
>
>
>


 --
 Robert Muir
 rcm...@gmail.com



>>>
>>>
>>> --
>>> Lance Norskog
>>> goks...@gmail.com
>>>
>>>
>>>
>>
>>
>>
>>
>


-- 
Abdelhamid ABID
Software Engineer- J2EE / WEB / ESB MULE


Re: Confused by Solr Ranking

2010-03-09 Thread abhishes

I kind of suspected stemming to be the reason behind this. But I consider
stemming to be a good feature.

The point is that if an exact match exists, then Solr should report that
first, and then the stemmed results should be reported.

Disabling stemming altogether would be a step in the wrong direction.



Shalin Shekhar Mangar wrote:
> 
> On Tue, Mar 9, 2010 at 4:38 PM, abhishes  wrote:
> 
>>
>> I am indexing a column in a database. I have chosen field type of text
>> for
>> this column (this type was defined in the sample schema file which comes
>> in
>> the Solr Example).
>>
>> When I search for the word "impress" and top 3 results. I get these 3
>> documents
>>
>> bare desire pronounce villainy draught beasts blockish
>> impression acquit
>> bare impression villainy pronounce beasts desire
>> blockish
>> draught acquit
>> beasts desire villainy pronounce bare acquit impression
>> draught blockish
>>
>> But here the TEXT doesn't really contain the word "impress" it contains
>> the
>> word "impression"
>>
>> Now the database does contain a few rows where the word "impress" is
>> there,
>> but those rows do not come in top 3 results.
>>
>> So my question is that why did the rows containing the word "impression"
>> got
>> ranked higher than the rows containing the word "impress" when I searched
>> for "impress"?
>>
>>
> The "text" type is configured to do stemming on the input. So I'm guessing
> that "impression" and "impress" both stem to the same form. You can remove
> the EnglishPorterFilterFactory from the text type if you don't need
> stemming.
> 
> -- 
> Regards,
> Shalin Shekhar Mangar.
> 
> 

-- 
View this message in context: 
http://old.nabble.com/Confused-by-Solr-Ranking-tp27834227p27836299.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: PDF extraction leads to reversed words

2010-03-09 Thread Robert Muir
On Tue, Mar 9, 2010 at 9:44 AM, Abdelhamid  ABID  wrote:
> I put ICU4J 4.2 in the lib of Solr, nothing changed, I'm trying now with
> ICU4J 3.8
>

Hello, what version of Solr are you using? I think you will need to
use the trunk version.

I created a patch for this issue that you can apply to trunk (with all
necessary resources)
here: https://issues.apache.org/jira/browse/SOLR-1657

The included testcase fails without adding icu4j to the lib directory
(as the arabic text
is reversed), and passes with it.

-- 
Robert Muir
rcm...@gmail.com


Re: Confused by Solr Ranking

2010-03-09 Thread Ahmet Arslan


> I kind of suspected stemming to be the reason behind this.
> But I consider stemming to be a good feature.

This is the side effect of stemming. Stemming increases recall while harming 
precision.


  


Re: PDF extraction leads to reversed words

2010-03-09 Thread Abdelhamid ABID
I don't know about pdftotext; is it pluggable with Solr, or do we need to
hard-code the extraction step before Solr's turn?

On 3/9/10, Dominique Bejean  wrote:
>
> Hi,
>
> The problem comes from PDFBox (
> http://brutus.apache.org/jira/browse/PDFBOX-377) and is fixed now. However,
> Tika doesn't yet use this version of PDFBox.
> So for PDF text extraction, I don't use Tika but pdftotext.
>
> Dominique
>
>
> On 09/03/10 06:00, Robert Muir wrote:
>
>  it is an optional dependency of PDFBox. If ICU is available, then it
>> is capable of processing Arabic PDF files.
>>
>> The problem is that Arabic "text" in PDF files is really glyphs
>> (encoded in visual order) and needs to be 'unshaped' with some stuff
>> that isn't in the JDK.
>>
>> If the size of the default ICU jar file is the issue here, we can
>> consider an alternative: The default ICU jar is very large as it
>> includes everything, yet it can be customized to only include what is
>> needed: http://apps.icu-project.org/datacustom/
>>
>> We did this in lucene for the collation contrib, to shrink the jar
>> about 2MB: http://issues.apache.org/jira/browse/LUCENE-1867
>>
>> For this use-case, it could be even smaller, as most of the huge size
>> of ICU comes from large CJK collation tables (needed for collation,
>> but not for this Arabic PDF extraction).
>>
>> In reality I don't really like doing this as it might confuse users
>> (e.g. people that want collation, too), and ICU is useful for other
>> things, but if thats what we have to do, we should do it so that
>> Arabic PDF files will work.
>>
>> On Mon, Mar 8, 2010 at 11:53 PM, Lance Norskog  wrote:
>>
>>
>>> Is this a mistake in the Tika library collection in the Solr trunk?
>>>
>>> On Mon, Mar 8, 2010 at 5:15 PM, Robert Muir  wrote:
>>>
>>>
 I think the problem is that Solr does not include the ICU4J jar, so it
 won't work with Arabic PDF files.

 Try putting ICU4J 3.8 (http://site.icu-project.org/download) in your
 classpath.

 On Mon, Mar 8, 2010 at 6:30 PM, Abdelhamid  ABID
  wrote:


> Hi,
> Posting arabic pdf files to Solr using a web form (to
> solr/update/extract)
> get extracted texts and each words displayed in reverse
> direction(instead of
> right to left).
> When perform search against these texts with -always- reversed
> key-words I
> get results but reversed.
> This problem doesn't occur when posting MsWord document.
> I think the problem come from Tika !
>
> Any clue ?
>
> --
> elsadek
> Software Engineer- J2EE / WEB / ESB MULE
>
>
>


 --
 Robert Muir
 rcm...@gmail.com



>>>
>>>
>>> --
>>> Lance Norskog
>>> goks...@gmail.com
>>>
>>>
>>>
>>
>>
>>
>>
>


-- 
Abdelhamid ABID
Software Engineer- J2EE / WEB / ESB MULE


Re: PDF extraction leads to reversed words

2010-03-09 Thread Robert Muir
On Tue, Mar 9, 2010 at 10:10 AM, Abdelhamid  ABID  wrote:
> Version 3.8 doesn't change anything either!
>

the patch (https://issues.apache.org/jira/browse/SOLR-1813) can only
work on Solr trunk. It will not work with Solr 1.4.


Solr 1.4 uses pdfbox-0.7.3.jar, which does not support Arabic.
Solr trunk uses pdfbox-0.8.0-incubating.jar, which does support
Arabic, if you also put ICU in the classpath.

-- 
Robert Muir
rcm...@gmail.com
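Robert's point about glyphs stored in visual order can be illustrated in miniature (a toy sketch, not PDFBox or ICU code; the word is a Latin placeholder for the Arabic case):

```java
public class VisualOrderDemo {
    // A glyph stream in visual (display) order is the logical word reversed;
    // indexing it as-is means only reversed query terms will match.
    static String asVisualOrder(String logicalWord) {
        return new StringBuilder(logicalWord).reverse().toString();
    }

    public static void main(String[] args) {
        // Latin placeholder word; the real case involves Arabic glyphs.
        System.out.println(asVisualOrder("salaam")); // prints "maalas"
    }
}
```

This is why searching with reversed keywords "works": both the indexed terms and the query terms are in the same (wrong) order.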


Re: PDF extraction leads to reversed words

2010-03-09 Thread Abdelhamid ABID
Version 3.8 doesn't change anything either!

On 3/9/10, Robert Muir  wrote:
>
> I think the problem is that Solr does not include the ICU4J jar, so it
> won't work with Arabic PDF files.
>
> Try putting ICU4J 3.8 (http://site.icu-project.org/download) in your
> classpath.
>
>
> On Mon, Mar 8, 2010 at 6:30 PM, Abdelhamid  ABID 
> wrote:
> > Hi,
> > Posting Arabic PDF files to Solr using a web form (to
> > solr/update/extract) gets the extracted text with every word displayed
> > in reverse direction (instead of right to left).
> > When I search these texts with the -always- reversed keywords, I get
> > results, but reversed.
> > This problem doesn't occur when posting MS Word documents.
> > I think the problem comes from Tika!
> >
> > Any clue ?
> >
> > --
> > elsadek
> > Software Engineer- J2EE / WEB / ESB MULE
> >
>
>
>
>
> --
> Robert Muir
> rcm...@gmail.com
>



-- 
Abdelhamid ABID
Software Engineer- J2EE / WEB / ESB MULE


Re: Search on dynamic fields which contains spaces /special characters

2010-03-09 Thread Erick Erickson
Please repost as a separate thread..

From:
http://people.apache.org/~hossman/#threadhijack

When starting a new discussion on a mailing list, please do not reply to
an existing message, instead start a fresh email.  Even if you change the
subject line of your email, other mail headers still track which thread
you replied to and your question is "hidden" in that thread and gets less
attention.   It makes following discussions in the mailing list archives
particularly difficult.
See Also:  http://en.wikipedia.org/wiki/User:DonDiego/Thread_hijacking


On Mon, Mar 8, 2010 at 11:09 PM, Dennis Gearon wrote:

> I'm starting to learn Solr/Lucene. I'm working on a shared server and have
> to use a stand-alone Java install. Can anyone tell me how to install OpenJDK
> for a shared server account?
>
>
> Dennis Gearon
>
> Signature Warning
> 
> EARTH has a Right To Life,
>  otherwise we all die.
>
> Read 'Hot, Flat, and Crowded'
> Laugh at http://www.yert.com/film.php
>
>
> --- On Mon, 3/8/10, Israel Ekpo  wrote:
>
> > From: Israel Ekpo 
> > Subject: Re: Search on dynamic fields which contains spaces /special
>  characters
> > To: solr-user@lucene.apache.org
> > Date: Monday, March 8, 2010, 12:44 PM
> > I do not believe the SOLR or LUCENE
> > syntax allows this
> >
> > You need to get rid of all the spaces in the field name
> >
> > If not, then you will be searching for "short" in the
> > default field and then
> > "name1" in the "name" field.
> >
> > http://wiki.apache.org/solr/SolrQuerySyntax
> >
> > http://lucene.apache.org/java/2_9_2/queryparsersyntax.html
> >
> >
> > On Mon, Mar 8, 2010 at 2:17 PM, JavaGuy84 
> > wrote:
> >
> > >
> > > Hi,
> > >
> > > We have some dynamic fields getting indexed using SOLR. Some of the
> > > dynamic fields contain spaces / special characters (something like:
> > > short name, Full Name, etc.). Is there a way to search on these fields
> > > (which contain the spaces, etc.)? Can someone let me know the filter I
> > > need to pass to do this type of search?
> > >
> > > I tried with short name:name1 --> this didn't work.
> > >
> > > Thanks,
> > > Barani
> > > --
> > > View this message in context:
> > >
> http://old.nabble.com/Search-on-dynamic-fields-which-contains-spaces--special-characters-tp27826147p27826147.html
> > > Sent from the Solr - User mailing list archive at
> > Nabble.com.
> > >
> > >
> >
> >
> > --
> > "Good Enough" is not good enough.
> > To give anything less than your best is to sacrifice the
> > gift.
> > Quality First. Measure Twice. Cut Once.
> > http://www.israelekpo.com/
> >
>
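Israel's advice to get rid of the spaces can be applied at index time with a small sanitizer before the field name is sent to Solr; the underscore convention and method name here are my own assumption, not anything Solr provides:

```java
public class FieldNameSanitizer {
    // Replace whitespace runs with underscores so names like "short name"
    // become legal, queryable field names like "short_name".
    static String sanitize(String fieldName) {
        return fieldName.trim().replaceAll("\\s+", "_");
    }

    public static void main(String[] args) {
        System.out.println(sanitize("short name")); // prints "short_name"
        System.out.println(sanitize("Full Name"));  // prints "Full_Name"
    }
}
```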


Re: Child entities in document not loading

2010-03-09 Thread John Ament
So right now I'm thinking that solr just doesn't like me.

I just noticed that the following document config doesn't work for me:

[entity/field XML configuration stripped from the archive]

e.g. solr ignores the <entity> elements.

On Mon, Mar 8, 2010, John Ament wrote:

> All
>
> It seems like my issue is simply with the concept of child entities.
>
> I had to add a second table to my query to pull pricing info.  At first, I
> was putting it in a separate entity.  Didn't work, even though I added the
> fields.
>
> When I rewrote my query as
>
> [entity query stripped from the archive]
>
> It loaded.
>
> I'm wondering if there's something I have to activate to make child
> entities work?
>
> Thanks,
>
> John
>
>
> On Mon, Mar 8, 2010 at 12:17 PM, John Ament  wrote:
>
>> Ok - downloaded the binary off of google code and it's loading.  The 3
>> child entities do not appear as I had suspected.
>>
>> Thanks,
>>
>> John
>>
>>
>> On Mon, Mar 8, 2010 at 12:12 PM, John Ament  wrote:
>>
>>> The issue's not about indexing, the issue's about storage.  It seems like
>>> the fields (sections, colors, sizes) are all not being stored, even though
>>> store=true.
>>>
>>> I could not get Luke to work, no.  The webstart just hangs at downloading
>>> 0%.
>>>
>>> Thanks,
>>>
>>> John
>>>
>>>
>>> On Mon, Mar 8, 2010 at 12:06 PM, Erick Erickson >> > wrote:
>>>
 Sorry, won't be able to really look till tonight. Did you try Luke? What did it show?

 One thing I did notice though...

 <field name="sections" type="string" indexed="true" stored="true"
 multiValued="true"/>

 string types are not analyzed, so the entire input is indexed as
 a single token. You might want "text" here

 Erick

 On Mon, Mar 8, 2010 at 11:37 AM, John Ament 
 wrote:

 > Erick,
 >
 > I'm sorry, but it's not helping much.  I don't see anything on the
 admin
 > screen that allows me to browse my index.  Even using Luke, my
 assumption
 > is
 > that it's not loading correctly in the index.  What parameters can I
 change
 > in the logs to make it print out more information? I want to see what
 the
 > query is returning I guess.
 >
 > Thanks,
 >
 > John
 >
 > On Mon, Mar 8, 2010 at 11:23 AM, Erick Erickson <
 erickerick...@gmail.com
 > >wrote:
 >
> > Try http://<your host>/solr/admin. You'll see a
 bunch
 > > of links that'll allow you to examine many aspects of your
 installation.
 > >
 > > Additionally, get a copy of Luke (Google Lucene Luke) and point it
 at
 > > your index for a detailed look at the index.
 > >
 > > Finally, the SOLR log file might give you some clues...
 > >
 > > HTH
 > > Erick
 > >
 > > On Mon, Mar 8, 2010 at 10:49 AM, John Ament 
 > wrote:
 > >
 > > > Where would I see this? I do believe the fields are not ending up
 in
 > the
 > > > index.
 > > >
 > > > Thanks
 > > >
 > > > John
 > > >
 > > > On Mon, Mar 8, 2010 at 10:34 AM, Erick Erickson <
 > erickerick...@gmail.com
 > > > >wrote:
 > > >
 > > > > What does the solr admin page show you is actually in your
 index?
 > > > >
 > > > > Luke will also help.
 > > > >
 > > > > Erick
 > > > >
 > > > > On Mon, Mar 8, 2010 at 10:06 AM, John Ament <
 my.repr...@gmail.com>
 > > > wrote:
 > > > >
 > > > > > All,
 > > > > >
 > > > > > So I think I have my first issue figured out, need to add
 terms to
 > > the
 > > > > > default search.  That's fine.
 > > > > >
 > > > > > New issue is that I'm trying to load child entities in with my
 > > entity.
 > > > > >
 > > > > > I added the appropriate fields to solrconfig.xml
 > > > > >
> > > > > <field name="sizes" type="string" indexed="true" stored="true"
> > > > > multiValued="true"/>
> > > > > <field name="colors" type="string" indexed="true" stored="true"
> > > > > multiValued="true"/>
> > > > > <field name="sections" type="string" indexed="true" stored="true"
> > > > > multiValued="true"/>
 > > > > >
 > > > > > And I updated my document to match
 > > > > >
> > > > > [document XML stripped from the archive]
 > > > > > So my expectation is that there will be 3 new fields
 associated
 > with
 > > it
 > > > > > that
 > > > > > are multivalued: sizes, colors, and sections.
 > > > > >
 > > > > > The full-import seems to work correctly.  I get the
 appropriate
 > > number
 > > > of
 > > > > > documents in my searche
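For reference, a child entity in DataImportHandler is declared by nesting <entity> elements inside the parent, with the parent's columns available as variables. The table and column names below are hypothetical, since the actual config in this thread was stripped from the archive:

```xml
<document>
  <entity name="item" query="select id, name from item">
    <field column="id" name="id" />
    <field column="name" name="name" />
    <!-- child entity: one query per parent row, feeding a multiValued field -->
    <entity name="item_size"
            query="select size from item_size where item_id = '${item.id}'">
      <field column="size" name="sizes" />
    </entity>
  </entity>
</document>
```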

Re: Confused by Solr Ranking

2010-03-09 Thread Shalin Shekhar Mangar
On Tue, Mar 9, 2010 at 4:38 PM, abhishes  wrote:

>
> I am indexing a column in a database. I have chosen field type of text for
> this column (this type was defined in the sample schema file which comes in
> the Solr Example).
>
> When I search for the word "impress" and take the top 3 results, I get
> these 3 documents:
>
> bare desire pronounce villainy draught beasts blockish
> impression acquit
> bare impression villainy pronounce beasts desire blockish
> draught acquit
> beasts desire villainy pronounce bare acquit impression
> draught blockish
>
> But here the TEXT doesn't really contain the word "impress" it contains the
> word "impression"
>
> Now the database does contain a few rows where the word "impress" is there,
> but those rows do not come in top 3 results.
>
> So my question is: why were the rows containing the word "impression"
> ranked higher than the rows containing the word "impress" when I searched
> for "impress"?
>
>
The "text" type is configured to do stemming on the input. So I'm guessing
that "impression" and "impress" both stem to the same form. You can remove
the EnglishPorterFilterFactory from the text type if you don't need
stemming.

-- 
Regards,
Shalin Shekhar Mangar.
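A toy illustration of why the two terms collide in the index (this is crude suffix stripping, not the actual Porter algorithm, though the result for this pair is the same):

```java
public class ToyStemmer {
    // Crude suffix stripping; the real EnglishPorterFilterFactory applies
    // the full Porter stemming algorithm.
    static String stem(String word) {
        if (word.endsWith("ion")) return word.substring(0, word.length() - 3);
        return word;
    }

    public static void main(String[] args) {
        // Both the indexed term and the query term reduce to the same token,
        // so a search for "impress" matches documents containing "impression".
        System.out.println(stem("impression")); // prints "impress"
        System.out.println(stem("impress"));    // prints "impress"
    }
}
```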


Re: Can't delete from curl

2010-03-09 Thread Paul Tomblin
On Mon, Mar 8, 2010 at 9:39 PM, Lance Norskog  wrote:

> ... curl http://xen1.xcski.com:8080/solrChunk/nutch/select
>
> that should be /update, not /select


Ah, that seems to have fixed it.  Thanks.



-- 
http://www.linkedin.com/in/paultomblin
http://careers.stackoverflow.com/ptomblin


Embedded solr - SLF4J exception

2010-03-09 Thread John Ament
While attempting to work around my other issue, I'm trying to use an
embedded Solr server to programmatically load data into Solr.

It seems though that I can't deploy my app, as a result of this exception:

: java.lang.IllegalAccessError: tried to access field
org.slf4j.impl.StaticLoggerBinder.SINGLETON from class
org.slf4j.LoggerFactory

I'm attempting to deploy to GlassFish v3, which uses Weld for CDI, and I can
only assume it's a conflict between SLF4J versions.  Is it safe to upgrade to
a newer SLF4J version?

Thanks,

John


Re: Solr Startup CPU Spike

2010-03-09 Thread John Williams
Yonik,

I got YourKit set up to profile the Tomcat instance, and as you will see in the
graph below, all of the HTTP threads are blocked (red) until around 4:40. This
is the point where the instance becomes responsive and CPU usage drops. I have
also ruled out GC being the issue by using the GC monitoring in YourKit. Let me
know your thoughts and if you have any questions.

Thanks for your assistance.

Thanks,
John

--
John Williams
System Administrator
37signals

[graph attachment stripped from the archive]
On Mar 8, 2010, at 5:28 PM, Yonik Seeley wrote:

> On Mon, Mar 8, 2010 at 6:07 PM, John Williams  wrote:
>> Yonik,
>> 
>> In all cases our "autowarmCount" is set to 0. Also, here is a link to our 
>> config. http://pastebin.com/iUgruqPd
> 
> Weird... on a quick glance, I don't see anything in your config that
> would cause work to be done on a commit.
> I expected something like autowarming, or rebuilding a spellcheck
> index, etc.  I assume this is happening even w/o any requests hitting
> the server?
> 
> Could it be GC?  You could use -verbose:gc or jconsole to check if
> this corresponds to a big GC (which could naturally hit on an index
> change).  5 minutes is really excessive though, and I wouldn't expect
> it on startup.
> 
> If it's not GC, perhaps the next step is to get some stack traces
> during the spike (or use a profiler) to figure out where the time is
> being spent.  And verify that the solrconfig.xml shown actually still
> matches the one you provided.
> 
> -Yonik
> http://www.lucidimagination.com
> 
> 
> 
>> Thanks,
>> John
>> 
>> --
>> John Williams
>> System Administrator
>> 37signals
>> 
>> On Mar 8, 2010, at 4:44 PM, Yonik Seeley wrote:
>> 
>>> Is this just autowarming?
>>> Check your autowarmCount parameters in solrconfig.xml
>>> 
>>> -Yonik
>>> http://www.lucidimagination.com
>>> 
>>> On Mon, Mar 8, 2010 at 5:37 PM, John Williams  wrote:
 Good afternoon.
 
 We have been experiencing an odd issue with one of our Solr nodes. Upon 
 startup or when bringing in a new index we get a CPU spike for 5 minutes 
 or so. I have attached a graph of this spike. During this time simple 
 queries return without a problem but more complex queries do not return. 
 Here are some more details about the instance:
 
 Index Size: ~16G
 Max Heap: 6144M
 GC Option: -XX:+UseConcMarkSweepGC
 System Memory: 16G
 
 We have a very similar instance to this one but with a much larger index 
 that we are not seeing this sort of issue.
 
 Your help is greatly appreciated. Let me know if you need any additional 
 information.
 
 Thanks,
 John
 
 --
 John Williams
 System Administrator
 37signals
 
 
 
>> 
>> 



smime.p7s
Description: S/MIME cryptographic signature


tmp

2010-03-09 Thread Dino Di Cola
tmp


Filter to cut out all zeros?

2010-03-09 Thread Sebastian F
Hey there,

I'm trying to figure out the best way to cut out all zeros of an input string 
like "01.10." or "022.300"...
Is there such a filter in Solr or anything similar that I can adapt to do the 
task?

Thanks for any help
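Taking the question literally (delete every '0' character), a plain regex replace does it; inside Solr, a solr.PatternReplaceFilterFactory in the field type's analyzer chain could apply the same pattern at index and query time. A minimal sketch:

```java
public class ZeroStripper {
    // Remove every '0' character; adjust the pattern if only
    // leading/trailing zeros should go.
    static String stripZeros(String input) {
        return input.replaceAll("0", "");
    }

    public static void main(String[] args) {
        System.out.println(stripZeros("01.10."));  // prints "1.1."
        System.out.println(stripZeros("022.300")); // prints "22.3"
    }
}
```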



  

Re: QueryElevationComponent blues

2010-03-09 Thread Ryan Grange
I'd read that too, but in the debug data queryBoosting is showing 
matches on our int typed identifiers (though it does show it as 
123456).  Is the problem that it can match against an 
integer, but it can't reorder them in the results?  This seems unlikely 
as using a standard query and elevation does cause otherwise lower 
results to jump to the top of the results.


I've looked at the source and noticed the check for a string type in 
there.  I'm not sure why my Solr instance seems okay with an int for a 
unique identifier.


Tried forceElevation=true with qt=dismax and still no effect on placement.

We don't want to give up field, phrase, and formula boosting when using 
the standard request handler just to have elevation work.


Ryan T. Grange, IT Manager
DollarDays International, Inc.
rgra...@dollardays.com (480)922-8155 x106


On 3/8/2010 11:13 PM, Jon Baer wrote:

Maybe some things to try:

* make sure your uniqueKey is string field type (ie if using int it will not 
work)
* forceElevation to true (if sorting)

- Jon
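For reference, the QueryElevationComponent reads its mappings from elevate.xml, keyed on the exact query text; the query text and doc ids below are hypothetical, and as Jon notes the uniqueKey the ids refer to should be a string field type:

```xml
<elevate>
  <!-- the text attribute must match the user's query -->
  <query text="foo bar">
    <doc id="123456" />
    <doc id="123457" />
  </query>
</elevate>
```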

On Mar 9, 2010, at 12:34 AM, Ryan Grange wrote:

   

Using Solr 1.4.
Was using the standard query handler, but needed the boost by field 
functionality of qf from dismax.
So we altered the query to boost certain phrases against a given field.
We were using QueryElevationComponent ("elevator" from solrconfig.xml) for one 
particular entry we wanted at the top, but because we aren't using a pure q value, 
elevator never finds a match to boost.  We didn't realize it at the time because the 
record we were elevating eventually became the top response anyway.
Recently added a _val_:formula to the q value to juice records based on a value 
in the record.
Now we have need to push a few other records to the top, but we've lost the 
ability to use elevate.xml to do it.

Tried switching to dismax using qf, pf, qs, ps, and bf with a "pure" q value, 
and debug showed queryBoost with a match and records, but they weren't moved to the top 
of the result set.

What would really help is if there was something for elevator akin to 
spellcheck.q like elevation.q so I could pass in the actual user phrase while 
still performing all the other field score boosts in the q parameter. 
Alternatively, if anyone can explain why I'm running into problems getting 
QueryElevationComponent to move the results in a dismax query, I'd be very 
thankful.

--
Ryan T. Grange

 



   


master/slave

2010-03-09 Thread Dino Di Cola
Dear all, I am trying to setup a master/slave index replication
with two slaves embedded in a tomcat cluster and a master kept in a separate
machine.
I would like to know if it is possible to configure the slaves with a
ReplicationHandler able to access the master
by starting an embedded server instead of using HTTP communication.

I understand that HTTP is the preferred way to work with Solr,
but for some annoying reasons I cannot start up another HTTP server. Thus, I
wonder if (and possibly how)
this approach can be technically 'feasible', while aware that it may
not be definitively 'reasonable'... :)

Many thanks for the support,
Dino.
--


Re: Solr Startup CPU Spike

2010-03-09 Thread John Williams
Yonik,

I have provided an image below that gives details on what is causing the blocked
HTTP thread. Is there any way to resolve this issue?

Thanks,
John

--
John Williams
System Administrator
37signals

[graph attachment stripped from the archive]
On Mar 9, 2010, at 10:41 AM, John Williams wrote:

> Yonik,
> 
> I got yourkit setup to profile the Tomcat instance and as you will see in the 
> graph below all of the   http threads are blocked (red) until around 4:40. 
> This is the point where the instance becomes responsive and CPU usage drops. 
> I have also ruled out GC being the issue by using the GC monitoring in 
> yourkit. Let me know your thoughts and if you have any questions.
> 
> Thanks for your assistance.
> 
> Thanks,
> John
> 
> --
> John Williams
> System Administrator
> 37signals
> 
> 
> On Mar 8, 2010, at 5:28 PM, Yonik Seeley wrote:
> 
>> On Mon, Mar 8, 2010 at 6:07 PM, John Williams  wrote:
>>> Yonik,
>>> 
>>> In all cases our "autowarmCount" is set to 0. Also, here is a link to our 
>>> config. http://pastebin.com/iUgruqPd
>> 
>> Weird... on a quick glance, I don't see anything in your config that
>> would cause work to be done on a commit.
>> I expected something like autowarming, or rebuilding a spellcheck
>> index, etc.  I assume this is happening even w/o any requests hitting
>> the server?
>> 
>> Could it be GC?  You could use -verbose:gc or jconsole to check if
>> this corresponds to a big GC (which could naturally hit on an index
>> change).  5 minutes is really excessive though, and I wouldn't expect
>> it on startup.
>> 
>> If it's not GC, perhaps the next step is to get some stack traces
>> during the spike (or use a profiler) to figure out where the time is
>> being spent.  And verify that the solrconfig.xml shown actually still
>> matches the one you provided.
>> 
>> -Yonik
>> http://www.lucidimagination.com
>> 
>> 
>> 
>>> Thanks,
>>> John
>>> 
>>> --
>>> John Williams
>>> System Administrator
>>> 37signals
>>> 
>>> On Mar 8, 2010, at 4:44 PM, Yonik Seeley wrote:
>>> 
 Is this just autowarming?
 Check your autowarmCount parameters in solrconfig.xml
 
 -Yonik
 http://www.lucidimagination.com
 
 On Mon, Mar 8, 2010 at 5:37 PM, John Williams  wrote:
> Good afternoon.
> 
> We have been experiencing an odd issue with one of our Solr nodes. Upon 
> startup or when bringing in a new index we get a CPU spike for 5 minutes 
> or so. I have attached a graph of this spike. During this time simple 
> queries return without a problem but more complex queries do not return. 
> Here are some more details about the instance:
> 
> Index Size: ~16G
> Max Heap: 6144M
> GC Option: -XX:+UseConcMarkSweepGC
> System Memory: 16G
> 
> We have a very similar instance to this one but with a much larger index 
> that we are not seeing this sort of issue.
> 
> Your help is greatly appreciated. Let me know if you need any additional 
> information.
> 
> Thanks,
> John
> 
> --
> John Williams
> System Administrator
> 37signals
> 
> 
> 
>>> 
>>> 
> 





Re: Solr Startup CPU Spike

2010-03-09 Thread Mark Miller
Ah - loading the fieldcache - do you have a *lot* of unique terms in the 
fields you are sorting/faceting on?


localhost:8983/solr/admin/luke is helpful for checking this.


--
- Mark

http://www.lucidimagination.com



On 03/09/2010 12:33 PM, John Williams wrote:

Yonik,

I have provided an image below gives details on what is causing the blocked 
http thread. Is there any way to resolve this issue.

Thanks,
John

--
John Williams
System Administrator
37signals

   




On Mar 9, 2010, at 10:41 AM, John Williams wrote:

   

Yonik,

I got yourkit setup to profile the Tomcat instance and as you will see in the 
graph below all of the   http threads are blocked (red) until around 4:40. This 
is the point where the instance becomes responsive and CPU usage drops. I have 
also ruled out GC being the issue by using the GC monitoring in yourkit. Let me 
know your thoughts and if you have any questions.

Thanks for your assistance.

Thanks,
John

--
John Williams
System Administrator
37signals


On Mar 8, 2010, at 5:28 PM, Yonik Seeley wrote:

 

On Mon, Mar 8, 2010 at 6:07 PM, John Williams  wrote:
   

Yonik,

In all cases our "autowarmCount" is set to 0. Also, here is a link to our 
config. http://pastebin.com/iUgruqPd
 

Weird... on a quick glance, I don't see anything in your config that
would cause work to be done on a commit.
I expected something like autowarming, or rebuilding a spellcheck
index, etc.  I assume this is happening even w/o any requests hitting
the server?

Could it be GC?  You could use -verbose:gc or jconsole to check if
this corresponds to a big GC (which could naturally hit on an index
change).  5 minutes is really excessive though, and I wouldn't expect
it on startup.

If it's not GC, perhaps the next step is to get some stack traces
during the spike (or use a profiler) to figure out where the time is
being spent.  And verify that the solrconfig.xml shown actually still
matches the one you provided.

-Yonik
http://www.lucidimagination.com



   

Thanks,
John

--
John Williams
System Administrator
37signals

On Mar 8, 2010, at 4:44 PM, Yonik Seeley wrote:

 

Is this just autowarming?
Check your autowarmCount parameters in solrconfig.xml

-Yonik
http://www.lucidimagination.com

On Mon, Mar 8, 2010 at 5:37 PM, John Williams  wrote:
   

Good afternoon.

We have been experiencing an odd issue with one of our Solr nodes. Upon startup 
or when bringing in a new index we get a CPU spike for 5 minutes or so. I have 
attached a graph of this spike. During this time simple queries return without 
a problem but more complex queries do not return. Here are some more details 
about the instance:

Index Size: ~16G
Max Heap: 6144M
GC Option: -XX:+UseConcMarkSweepGC
System Memory: 16G

We have a very similar instance to this one but with a much larger index that 
we are not seeing this sort of issue.

Your help is greatly appreciated. Let me know if you need any additional 
information.

Thanks,
John

--
John Williams
System Administrator
37signals



 


 
 
   






Re: Solr Startup CPU Spike

2010-03-09 Thread Yonik Seeley
Ahhh, FieldCache loading... what version of Solr are you using?
It's interesting it would take that long to load too (and maxing out
one CPU - doesn't look particularly IO bound).  How many documents are
in this index?

-Yonik


On Tue, Mar 9, 2010 at 12:33 PM, John Williams  wrote:
> Yonik,
>
> I have provided an image below gives details on what is causing the blocked 
> http thread. Is there any way to resolve this issue.
>
> Thanks,
> John
>
> --
> John Williams
> System Administrator
> 37signals
>
>
>
> On Mar 9, 2010, at 10:41 AM, John Williams wrote:
>
>> Yonik,
>>
>> I got yourkit setup to profile the Tomcat instance and as you will see in 
>> the graph below all of the   http threads are blocked (red) until around 
>> 4:40. This is the point where the instance becomes responsive and CPU usage 
>> drops. I have also ruled out GC being the issue by using the GC monitoring 
>> in yourkit. Let me know your thoughts and if you have any questions.
>>
>> Thanks for your assistance.
>>
>> Thanks,
>> John
>>
>> --
>> John Williams
>> System Administrator
>> 37signals
>>
>> 
>> On Mar 8, 2010, at 5:28 PM, Yonik Seeley wrote:
>>
>>> On Mon, Mar 8, 2010 at 6:07 PM, John Williams  wrote:
 Yonik,

 In all cases our "autowarmCount" is set to 0. Also, here is a link to our 
 config. http://pastebin.com/iUgruqPd
>>>
>>> Weird... on a quick glance, I don't see anything in your config that
>>> would cause work to be done on a commit.
>>> I expected something like autowarming, or rebuilding a spellcheck
>>> index, etc.  I assume this is happening even w/o any requests hitting
>>> the server?
>>>
>>> Could it be GC?  You could use -verbose:gc or jconsole to check if
>>> this corresponds to a big GC (which could naturally hit on an index
>>> change).  5 minutes is really excessive though, and I wouldn't expect
>>> it on startup.
>>>
>>> If it's not GC, perhaps the next step is to get some stack traces
>>> during the spike (or use a profiler) to figure out where the time is
>>> being spent.  And verify that the solrconfig.xml shown actually still
>>> matches the one you provided.
>>>
>>> -Yonik
>>> http://www.lucidimagination.com
>>>
>>>
>>>
 Thanks,
 John

 --
 John Williams
 System Administrator
 37signals

 On Mar 8, 2010, at 4:44 PM, Yonik Seeley wrote:

> Is this just autowarming?
> Check your autowarmCount parameters in solrconfig.xml
>
> -Yonik
> http://www.lucidimagination.com
>
> On Mon, Mar 8, 2010 at 5:37 PM, John Williams  wrote:
>> Good afternoon.
>>
>> We have been experiencing an odd issue with one of our Solr nodes. Upon 
>> startup or when bringing in a new index we get a CPU spike for 5 minutes 
>> or so. I have attached a graph of this spike. During this time simple 
>> queries return without a problem but more complex queries do not return. 
>> Here are some more details about the instance:
>>
>> Index Size: ~16G
>> Max Heap: 6144M
>> GC Option: -XX:+UseConcMarkSweepGC
>> System Memory: 16G
>>
>> We have a very similar instance to this one but with a much larger index 
>> that we are not seeing this sort of issue.
>>
>> Your help is greatly appreciated. Let me know if you need any additional 
>> information.
>>
>> Thanks,
>> John
>>
>> --
>> John Williams
>> System Administrator
>> 37signals
>>
>>
>>


>>
>
>
>


Re: master/slave

2010-03-09 Thread Peter Sturge
The embedded Solr server doesn't have any HTTP endpoint, so you can't use the
HTTP replication.
You can use the script-based replication if you're on Unix. See:
http://wiki.apache.org/solr/CollectionDistribution

It would be worth looking at running Solr in a Jetty container and using the
HTTP replication; it is really awesome.



On Tue, Mar 9, 2010 at 5:27 PM, Dino Di Cola  wrote:

> Dear all, I am trying to setup a master/slave index replication
> with two slaves embedded in a tomcat cluster and a master kept in a
> separate
> machine.
> I would like to know if is it possible to configure slaves with a
> ReplicationHandler able to access master
> by starting an embedded server instead of using http communication.
>
> I understand that HTTP is the preferred way to work with solr,
> but for some annoying reasons I cannot startup another http server. Thus, I
> wonder to know if (and possibly how)
> this approach can be technically 'feasible', already conscious that it
> cannot be definitively 'reasonable'... :)
>
> Many thanks for the support,
> Dino.
> --
>


Re: Solr Startup CPU Spike

2010-03-09 Thread John Williams
Mark,

I am trying to load that URL but it's taking quite a while. I will let
you know if/when it loads.

-John

--
John Williams
System Administrator
37signals

On Mar 9, 2010, at 11:38 AM, Mark Miller wrote:

> Ah - loading the fieldcache - do you have a *lot* of unique terms in the 
> fields you are sorting/faceting on?
> 
> localhost:8983/solr/admin/luke is helpful for checking this.
> 
> 
> -- 
> - Mark
> 
> http://www.lucidimagination.com
> 
> 
> 
> On 03/09/2010 12:33 PM, John Williams wrote:
>> Yonik,
>> 
>> I have provided an image below gives details on what is causing the blocked 
>> http thread. Is there any way to resolve this issue.
>> 
>> Thanks,
>> John
>> 
>> --
>> John Williams
>> System Administrator
>> 37signals
>> 
>>   
>> 
>> 
>> On Mar 9, 2010, at 10:41 AM, John Williams wrote:
>> 
>>   
>>> Yonik,
>>> 
>>> I got yourkit setup to profile the Tomcat instance and as you will see in 
>>> the graph below all of the   http threads are blocked (red) until around 
>>> 4:40. This is the point where the instance becomes responsive and CPU usage 
>>> drops. I have also ruled out GC being the issue by using the GC monitoring 
>>> in yourkit. Let me know your thoughts and if you have any questions.
>>> 
>>> Thanks for your assistance.
>>> 
>>> Thanks,
>>> John
>>> 
>>> --
>>> John Williams
>>> System Administrator
>>> 37signals
>>> 
>>> 
>>> On Mar 8, 2010, at 5:28 PM, Yonik Seeley wrote:
>>> 
>>> 
 On Mon, Mar 8, 2010 at 6:07 PM, John Williams  wrote:
   
> Yonik,
> 
> In all cases our "autowarmCount" is set to 0. Also, here is a link to our 
> config. http://pastebin.com/iUgruqPd
> 
 Weird... on a quick glance, I don't see anything in your config that
 would cause work to be done on a commit.
 I expected something like autowarming, or rebuilding a spellcheck
 index, etc.  I assume this is happening even w/o any requests hitting
 the server?
 
 Could it be GC?  You could use -verbose:gc or jconsole to check if
 this corresponds to a big GC (which could naturally hit on an index
 change).  5 minutes is really excessive though, and I wouldn't expect
 it on startup.
 
 If it's not GC, perhaps the next step is to get some stack traces
 during the spike (or use a profiler) to figure out where the time is
 being spent.  And verify that the solrconfig.xml shown actually still
 matches the one you provided.
 
 -Yonik
 http://www.lucidimagination.com
 
 
 
   
> Thanks,
> John
> 
> --
> John Williams
> System Administrator
> 37signals
> 
> On Mar 8, 2010, at 4:44 PM, Yonik Seeley wrote:
> 
> 
>> Is this just autowarming?
>> Check your autowarmCount parameters in solrconfig.xml
>> 
>> -Yonik
>> http://www.lucidimagination.com
>> 
>> On Mon, Mar 8, 2010 at 5:37 PM, John Williams  wrote:
>>   
>>> Good afternoon.
>>> 
>>> We have been experiencing an odd issue with one of our Solr nodes. Upon 
>>> startup or when bringing in a new index we get a CPU spike for 5 
>>> minutes or so. I have attached a graph of this spike. During this time 
>>> simple queries return without a problem but more complex queries do not 
>>> return. Here are some more details about the instance:
>>> 
>>> Index Size: ~16G
>>> Max Heap: 6144M
>>> GC Option: -XX:+UseConcMarkSweepGC
>>> System Memory: 16G
>>> 
>>> We have a very similar instance to this one but with a much larger 
>>> index that we are not seeing this sort of issue.
>>> 
>>> Your help is greatly appreciated. Let me know if you need any 
>>> additional information.
>>> 
>>> Thanks,
>>> John
>>> 
>>> --
>>> John Williams
>>> System Administrator
>>> 37signals
>>> 
>>> 
>>> 
>>> 
> 
> 
>>> 
>>   
> 
> 
> 





Re: Solr Startup CPU Spike

2010-03-09 Thread John Williams
Yonik,

We are on Solr 1.3. The total number of documents is 54,173,459. Let me
know if you need any additional info.

Thanks,
John

--
John Williams
System Administrator
37signals
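A back-of-the-envelope estimate using that document count, assuming the Lucene-era FieldCache keeps at least one 32-bit ord per document per sorted field (the string values themselves add more on top):

```java
public class FieldCacheEstimate {
    // One int ord per document per sorted/faceted field (rough lower bound).
    static long ordBytes(long numDocs) {
        return numDocs * 4L;
    }

    public static void main(String[] args) {
        long docs = 54_173_459L; // document count reported in this thread
        long mb = ordBytes(docs) / (1024L * 1024L);
        // Roughly 206 MB per sorted field, before the term values themselves.
        System.out.println(mb + " MB");
    }
}
```

With several sorted or faceted fields, loading these structures on the first request after startup or commit can plausibly account for a sustained single-CPU spike like the one described.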

On Mar 9, 2010, at 11:39 AM, Yonik Seeley wrote:

> Ahhh, FieldCache loading... what version of Solr are you using?
> It's interesting it would take that long to load too (and maxing out
> one CPU - doesn't look particularly IO bound).  How many documents are
> in this index?
> 
> -Yonik
> 
> 
> On Tue, Mar 9, 2010 at 12:33 PM, John Williams  wrote:
>> Yonik,
>> 
>> I have provided an image below gives details on what is causing the blocked 
>> http thread. Is there any way to resolve this issue.
>> 
>> Thanks,
>> John
>> 
>> --
>> John Williams
>> System Administrator
>> 37signals
>> 
>> 
>> 
>> On Mar 9, 2010, at 10:41 AM, John Williams wrote:
>> 
>>> Yonik,
>>> 
>>> I got yourkit setup to profile the Tomcat instance and as you will see in 
>>> the graph below all of the http threads are blocked (red) until around 
>>> 4:40. This is the point where the instance becomes responsive and CPU usage 
>>> drops. I have also ruled out GC being the issue by using the GC monitoring 
>>> in yourkit. Let me know your thoughts and if you have any questions.
>>> 
>>> Thanks for your assistance.
>>> 
>>> Thanks,
>>> John
>>> 
>>> --
>>> John Williams
>>> System Administrator
>>> 37signals
>>> 
>>> 
>>> On Mar 8, 2010, at 5:28 PM, Yonik Seeley wrote:
>>> 
 On Mon, Mar 8, 2010 at 6:07 PM, John Williams  wrote:
> Yonik,
> 
> In all cases our "autowarmCount" is set to 0. Also, here is a link to our 
> config. http://pastebin.com/iUgruqPd
 
 Weird... on a quick glance, I don't see anything in your config that
 would cause work to be done on a commit.
 I expected something like autowarming, or rebuilding a spellcheck
 index, etc.  I assume this is happening even w/o any requests hitting
 the server?
 
 Could it be GC?  You could use -verbose:gc or jconsole to check if
 this corresponds to a big GC (which could naturally hit on an index
 change).  5 minutes is really excessive though, and I wouldn't expect
 it on startup.
 
 If it's not GC, perhaps the next step is to get some stack traces
 during the spike (or use a profiler) to figure out where the time is
 being spent.  And verify that the solrconfig.xml shown actually still
 matches the one you provided.
 
 -Yonik
 http://www.lucidimagination.com
 
 
 
> Thanks,
> John
> 
> --
> John Williams
> System Administrator
> 37signals
> 
> On Mar 8, 2010, at 4:44 PM, Yonik Seeley wrote:
> 
>> Is this just autowarming?
>> Check your autowarmCount parameters in solrconfig.xml
>> 
>> -Yonik
>> http://www.lucidimagination.com
>> 
>> On Mon, Mar 8, 2010 at 5:37 PM, John Williams  wrote:
>>> Good afternoon.
>>> 
>>> We have been experiencing an odd issue with one of our Solr nodes. Upon 
>>> startup or when bringing in a new index we get a CPU spike for 5 
>>> minutes or so. I have attached a graph of this spike. During this time 
>>> simple queries return without a problem but more complex queries do not 
>>> return. Here are some more details about the instance:
>>> 
>>> Index Size: ~16G
>>> Max Heap: 6144M
>>> GC Option: -XX:+UseConcMarkSweepGC
>>> System Memory: 16G
>>> 
>>> We have a very similar instance to this one, but with a much larger 
>>> index, where we are not seeing this sort of issue.
>>> 
>>> Your help is greatly appreciated. Let me know if you need any 
>>> additional information.
>>> 
>>> Thanks,
>>> John
>>> 
>>> --
>>> John Williams
>>> System Administrator
>>> 37signals
>>> 
>>> 
>>> 
> 
> 
>>> 
>> 
>> 
>> 





Re: master/slave

2010-03-09 Thread Dino Di Cola
Ok Peter for script-based replication; I forgot to mention I already
verified that mechanism.

When I configure the slave as follows

<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="slave">
    <str name="masterUrl">http://localhost:8983/solr/admin/replication</str>
    <str name="pollInterval">00:00:20</str>
    ...
  </lst>
</requestHandler>
SOLR uses the org.apache.solr.handler.ReplicationHandler to access every 20s
the masterUrl via http.
My question is: is it possible to use another ReplicationHandler that polls
the master by pointing directly to its SOLR home?
I mean, is it technically feasible?

Please correct me if I am not clear.
Thanks.
--
2010/3/9 Peter Sturge 

> The SolrEmbededServer doesn't have any http, and so you can't use the http
> replication.
> You can use the script-based replication if you're on UNIX. See:
>http://wiki.apache.org/solr/CollectionDistribution
>
> It would be worth looking at using Solr in a Jetty container and using the
> http replication, it is really awesome.
>
>
>
> On Tue, Mar 9, 2010 at 5:27 PM, Dino Di Cola  wrote:
>
> > Dear all, I am trying to setup a master/slave index replication
> > with two slaves embedded in a tomcat cluster and a master kept in a
> > separate
> > machine.
> > I would like to know if it is possible to configure slaves with a
> > ReplicationHandler able to access master
> > by starting an embedded server instead of using http communication.
> >
> > I understand that HTTP is the preferred way to work with solr,
> > but for some annoying reasons I cannot start up another http server. Thus,
> > I wonder if (and possibly how)
> > this approach can be technically 'feasible', already aware that it
> > may not be definitively 'reasonable'... :)
> >
> > Many thanks for the support,
> > Dino.
> > --
> >
>


Re: Filter to cut out all zeors?

2010-03-09 Thread Ahmet Arslan
> I'm trying to figure out the best way to cut out all zeros
> of an input string like "01.10." or "022.300"...
> Is there such a filter in Solr or anything similar that I
> can adapt to do the task?

With solr.MappingCharFilterFactory[1] you can replace all zeros with "" before 
the tokenizer:

<charFilter class="solr.MappingCharFilterFactory" mapping="mapping.txt"/>

SolrHome/conf/mapping.txt file will contain this line:

"0" => ""

So that "01.10." will become "1.1." and "022.300" will become "22.3". Is that 
what you want?

[1]http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.MappingCharFilterFactory
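The effect of that character mapping can be sketched outside Solr in a few lines of plain Python (this only illustrates the replacement step, not the Solr implementation itself):

```python
def apply_mapping(text, mapping):
    # Apply simple character mappings, as MappingCharFilterFactory would
    # do before the tokenizer sees the text.
    for src, dst in mapping.items():
        text = text.replace(src, dst)
    return text

mapping = {"0": ""}  # mirrors the "0" => "" line in mapping.txt
print(apply_mapping("01.10.", mapping))   # 1.1.
print(apply_mapping("022.300", mapping))  # 22.3
```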
 


  


Re: SolrConfig - constructing the object

2010-03-09 Thread Kimberly Kantola

Thank you Mark for your help.
Now a few days later I am thinking, I need access to the SolrConfig object
in multiple classes.  Maybe I should not be reloading it over and over?   I
see that there is a getSolrConfig() method in the SolrCore class that will
return the SolrConfig object.  
Should I maybe just take all my classes and have them implement the
SolrCoreAware interface so that they can call core.getSolrConfig(); ??

Thanks  
Kim 



markrmiller wrote:
> 
> On 03/05/2010 10:29 AM, Kimberly Kantola wrote:
>> Hi All,
>>I am new to using the Solr classes in development.  I am trying to
>> determine how to create  a SolrConfig object.
>>Is it just a matter of calling new SolrConfig with the location of the
>> solrconfig.xml file ?
>>
>> SolrConfig config = new SolrConfig("/path/to/solrconfig.xml");
>>
>> Thanks for any help!
>> Kim
>>
> 
> Sure, that's one way that will work if you are happy with all of the 
> other defaults that will occur.
> 
> -- 
> - Mark
> 
> http://www.lucidimagination.com
> 
> 
> 
> 
> 

-- 
View this message in context: 
http://old.nabble.com/SolrConfig---constructing-the-object-tp27795339p27839895.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: SolrConfig - constructing the object

2010-03-09 Thread Mark Miller
Yes - I think you should if you can. If you can make them SolrAware that 
is - only certain plugin classes have the ability to do so (due to a 
runtime check against a list of approved classes)


- Mark

On 03/09/2010 01:28 PM, Kimberly Kantola wrote:

Thank you Mark for your help.
Now a few days later I am thinking, I need access to the SolrConfig object
in multiple classes.  Maybe I should not be reloading it over and over?   I
see that there is a getSolrConfig() method in the SolrCore class that will
return the SolrConfig object.
Should I maybe just take all my classes and have them implement the
SolrCoreAware interface so that they can call core.getSolrConfig(); ??

Thanks
Kim



markrmiller wrote:
   

On 03/05/2010 10:29 AM, Kimberly Kantola wrote:
 

Hi All,
I am new to using the Solr classes in development.  I am trying to
determine how to create  a SolrConfig object.
Is it just a matter of calling new SolrConfig with the location of the
solrconfig.xml file ?

SolrConfig config = new SolrConfig("/path/to/solrconfig.xml");

Thanks for any help!
Kim

   

Sure, that's one way that will work if you are happy with all of the
other defaults that will occur.

--
- Mark

http://www.lucidimagination.com





 
   



--
- Mark

http://www.lucidimagination.com





Re: Store input text after analyzers and token filters

2010-03-09 Thread JCodina

Otis,
I've been thinking about it, and trying to figure out the different solutions:
- Try to solve it by doing a bridge between solr and clustering.
- Try to solve it before/during indexing.

The second option, of course, is better for performance, but how to do it?

I think a good option may be to create a new type derived from the
FieldType class, like SortableIntField, which has the toInternal(String val)
function. Then the problem is how to include the result of the analysis of
another field type in the toInternal function.

So there would be a new type that can be used on copy fields, one that takes
the analysis of the source field and injects it there. It would take as a
parameter the field from which to take the analysis.

So, how can I get the result of the analysis of a given text by a given
field using internal functions?





Otis Gospodnetic wrote:
> 
> Hi Joan,
> 
> You could use the FieldAnalysisRequestHandler:
> http://www.search-lucene.com/?q=FieldAnalysisRequestHandler
> 
> Otis
> 
> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> Hadoop ecosystem search :: http://search-hadoop.com/
> 
> 
-- 
View this message in context: 
http://old.nabble.com/Store-input-text-after-analyzers-and-token-filters-tp27792550p27840488.html
Sent from the Solr - User mailing list archive at Nabble.com.



Is "UniqueKey" in schema and "pk" attribute for DataimportHandler entities still optional in solr 1.4?

2010-03-09 Thread Alexandr Savochkin
I always build the solr index from scratch, so I have neither the "pk"
attribute in the "entity" tag (dataconfig.xml file) nor "UniqueKey" in the
index schema. When I updated solr from 1.3 to 1.4 I got the following
exception during solr initialization:
--
SEVERE: Exception while loading DataImporter
java.lang.NullPointerException
at
org.apache.solr.handler.dataimport.DataImporter.identifyPk(DataImporter.java:152)
at
org.apache.solr.handler.dataimport.DataImporter.(DataImporter.java:111)
at
org.apache.solr.handler.dataimport.DataImportHandler.inform(DataImportHandler.java:113)
at
org.apache.solr.core.SolrResourceLoader.inform(SolrResourceLoader.java:486)
at org.apache.solr.core.SolrCore.(SolrCore.java:588)
at
org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:137)
at
org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:83)
at
org.apache.catalina.core.ApplicationFilterConfig.getFilter(ApplicationFilterConfig.java:275)
at
org.apache.catalina.core.ApplicationFilterConfig.setFilterDef(ApplicationFilterConfig.java:397)
at
org.apache.catalina.core.ApplicationFilterConfig.(ApplicationFilterConfig.java:108)
at
org.apache.catalina.core.StandardContext.filterStart(StandardContext.java:3709)
--

Is "UniqueKey" in the schema still optional in solr 1.4? Is the "pk" attribute
in "entity" still optional in solr 1.4 DataImportHandler entities?  As I can
see in the org.apache.solr.handler.dataimport.DataImporter class source, the
exception always occurs when UniqueKey is not specified in the index schema.


Cleaning up dirty OCR

2010-03-09 Thread Burton-West, Tom
Hello all,

We have been indexing a large collection of OCR'd text. About 5 million books 
in over 200 languages.  With 1.5 billion OCR'd pages, even a small OCR error 
rate creates a relatively large number of meaningless unique terms.  (See  
http://www.hathitrust.org/blogs/large-scale-search/too-many-words )

We would like to remove some *fraction* of these nonsense words caused by OCR 
errors prior to indexing. ( We don't want to remove "real" words, so we need 
some method with very few false positives.)

A dictionary based approach does not seem feasible given the number of 
languages and the inclusion of proper names, place names, and technical terms.  
 We are considering using some heuristics, such as looking for strings over a 
certain length or strings containing more than some number of punctuation 
characters.
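A minimal sketch of such heuristics in plain Python — the length and punctuation thresholds here are arbitrary illustrations, not tuned recommendations:

```python
import re

MAX_LEN = 20    # illustrative threshold: very long tokens are suspect
MAX_PUNCT = 2   # illustrative threshold: punctuation-heavy tokens are suspect
NON_WORD = re.compile(r"[^\w]", re.UNICODE)

def looks_like_garbage(token):
    # Flag tokens that are implausibly long or contain too many
    # punctuation/symbol characters -- the style of heuristic described above.
    return len(token) > MAX_LEN or len(NON_WORD.findall(token)) > MAX_PUNCT

tokens = ["university", "l1br/-\\ry", "x.,;q!", "Taghva"]
kept = [t for t in tokens if not looks_like_garbage(t)]
print(kept)  # ['university', 'Taghva']
```

Tuning such thresholds against a sample of known-good text is the hard part; too aggressive and proper names start to disappear.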

This paper has a few such heuristics:
Kazem Taghva, Tom Nartker, Allen Condit, and Julie Borsack. Automatic Removal 
of ``Garbage Strings'' in OCR Text: An Implementation. In The 5th World 
Multi-Conference on Systemics, Cybernetics and Informatics, Orlando, Florida, 
July 2001. http://www.isri.unlv.edu/publications/isripub/Taghva01b.pdf

Can anyone suggest any practical solutions to removing some fraction of the 
tokens containing OCR errors from our input stream?

Tom Burton-West
University of Michigan Library
www.hathitrust.org



Re: Cleaning up dirty OCR

2010-03-09 Thread Robert Muir
> Can anyone suggest any practical solutions to removing some fraction of the 
> tokens containing OCR errors from our input stream?

one approach would be to try http://issues.apache.org/jira/browse/LUCENE-1812

and filter terms that only appear once in the document.
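The idea of dropping terms that occur only once within a document can be sketched like this (plain Python, independent of the LUCENE-1812 patch itself):

```python
from collections import Counter

def drop_hapax(tokens):
    # Remove tokens whose within-document frequency is 1, on the theory
    # that one-off OCR garbage rarely repeats inside the same document.
    counts = Counter(tokens)
    return [t for t in tokens if counts[t] > 1]

doc = ["the", "cat", "sat", "the", "rn4t", "cat"]
print(drop_hapax(doc))  # ['the', 'cat', 'the', 'cat']
```

Note this also drops legitimate words that appear only once ("sat" above), which is exactly the false-positive trade-off the thread is discussing.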


-- 
Robert Muir
rcm...@gmail.com


Re: master/slave

2010-03-09 Thread Peter Sturge
Hi Dino,

I suppose you could write your own ReplicationHandler to do the replication
yourself, but I should think the effort involved would be better spent
deploying the existing Solr http replication or using a Hadoop-based
solution, or UNIX scripting.

By far, the easiest path to replication is to use Solr within a Jetty or
similar container.
I've been down the EmbeddedServer route, and the embedded server is very
good and fast, but if you want replication, your practical choices are unix
scripting or http.

Hope this helps.


On Tue, Mar 9, 2010 at 6:02 PM, Dino Di Cola  wrote:

> Ok Peter for script-based replication; I forgot to mention I already
> verified that mechanism.
>
> When I configure the slave as follows
>
> <requestHandler name="/replication" class="solr.ReplicationHandler">
>   <lst name="slave">
>     <str name="masterUrl">http://localhost:8983/solr/admin/replication</str>
>     <str name="pollInterval">00:00:20</str>
>     ...
>   </lst>
> </requestHandler>
>
> SOLR uses the org.apache.solr.handler.ReplicationHandler to access every
> 20s
> the masterUrl via http.
> My question is: is it possible to use another ReplicationHandler that polls
> the master by pointing directly to its SOLR home?
> I mean, is it technically feasible?
>
> Please correct me if I am not clear.
> Thanks.
> --
> 2010/3/9 Peter Sturge 
>
> > The SolrEmbededServer doesn't have any http, and so you can't use the
> http
> > replication.
> > You can use the script-based replication if you're on UNIX. See:
> >http://wiki.apache.org/solr/CollectionDistribution
> >
> > It would be worth looking at using Solr in a Jetty container and using
> the
> > http replication, it is really awesome.
> >
> >
> >
> > On Tue, Mar 9, 2010 at 5:27 PM, Dino Di Cola 
> wrote:
> >
> > > Dear all, I am trying to setup a master/slave index replication
> > > with two slaves embedded in a tomcat cluster and a master kept in a
> > > separate
> > > machine.
> > > I would like to know if it is possible to configure slaves with a
> > > ReplicationHandler able to access master
> > > by starting an embedded server instead of using http communication.
> > >
> > > I understand that HTTP is the preferred way to work with solr,
> > > but for some annoying reasons I cannot start up another http server. Thus,
> > > I wonder if (and possibly how)
> > > this approach can be technically 'feasible', already aware that it
> > > may not be definitively 'reasonable'... :)
> > >
> > > Many thanks for the support,
> > > Dino.
> > > --
> > >
> >
>


Highlighting

2010-03-09 Thread Lee Smith
Hey All

I have indexed a whole bunch of documents and now I want to search against them.

My search is going great all but highlighting.

I have these items set

hl=true
hl.snippets=2
hl.fl = attr_content
hl.fragsize=100
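For reference, those settings correspond to a request URL along these lines (the host, handler path, and the query itself are illustrative assumptions, not taken from the original message):

```python
from urllib.parse import urlencode

# Build a select URL carrying the highlighting parameters listed above.
params = {
    "q": "attr_content:contract",  # made-up example query
    "hl": "true",
    "hl.snippets": 2,
    "hl.fl": "attr_content",
    "hl.fragsize": 100,
}
url = "http://localhost:8983/solr/select?" + urlencode(params)
print(url)
```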

Everything works, apart from the highlighted text not being surrounded 
with a <em> tag.

Am I missing a setting ?

Lee

Re: Highlighting

2010-03-09 Thread Joe Calderon
did u enable the highlighting component in solrconfig.xml? try setting
debugQuery=true to see if the highlighting component is even being
called...

On Tue, Mar 9, 2010 at 12:23 PM, Lee Smith  wrote:
> Hey All
>
> I have indexed a whole bunch of documents and now I want to search against 
> them.
>
> My search is going great all but highlighting.
>
> I have these items set
>
> hl=true
> hl.snippets=2
> hl.fl = attr_content
> hl.fragsize=100
>
> Everything works, apart from the highlighted text not being surrounded 
> with a <em> tag.
>
> Am I missing a setting ?
>
> Lee


Distributed search fault tolerance

2010-03-09 Thread Shawn Heisey
I attended the Webinar on March 4th.  Many thanks to Yonik for putting 
that on.  That has led to some questions about the best way to bring 
fault tolerance to our distributed search.  High level question: Should 
I go with SolrCloud, or stick with 1.4 and use load balancing?  I hope 
the rest of this email isn't too disjointed for understanding.


We are using virtual machines on 8-core servers with 32GB of RAM to 
house all this.  For initial deployment, there are two of these, but we 
will have a total of four once we migrate off our current indexing 
solution.  We won't be able to bring fault tolerance into the mix until 
we have all four hosts, but I need to know what direction we are going 
before initial deployment.


One choice is to stick with version 1.4 for stability and use load 
balancing on the shards.  I had already planned to have a pair of load 
balancer VMs to handle redundancy on what I'm calling the broker 
(explained further down), so it would not be a major step to have it do 
the shards as well.


I have been looking into SolrCloud.  I tried to just swap out the .war 
file with one compiled from the cloud branch, but that didn't work.  A 
little digging showed that the cloud branch uses a core for the 
collection.  I already have cores defined so I can build indexes and 
swap them into place quickly.  A big question - can I continue to use 
this multi-core approach with SolrCloud, or does it supplant cores with 
its collection logic?


Due to the observed high CPU requirements involved in sorting results 
from multiple shards into a final result, I have so far opted to go with 
an architecture that puts an empty index into a broker core, which lives 
on its own VM host separate from the large static shards.  This core's 
solrconfig.xml has a list of all the shards that get queried.  My 
application has no idea that it's talking to anything other than a 
single SOLR instance.  Once we get the caches warmed, performance is 
quite good.


The VM host with the broker will also have another VM with the shard 
where all new data goes, a concept we call the incremental.  On a 
nightly basis, some of the documents in the incremental will be 
redistributed to the static shards and everything will get reoptimized.


How would you recommend I pursue fault tolerance?  I had already planned 
to set up a load balancer VM to handle redundancy for the broker, so it 
would not be a HUGE step to have it load balance the shards too.




Re: Highlighting

2010-03-09 Thread Lee Smith
Yes it shows when I run the debug 

<lst name="org.apache.solr.handler.component.HighlightComponent">
  <double name="time">0.0</double>
</lst>

Any other ideas ?

On 9 Mar 2010, at 21:06, Joe Calderon wrote:

> did u enable the highlighting component in solrconfig.xml? try setting
> debugQuery=true to see if the highlighting component is even being
> called...
> 
> On Tue, Mar 9, 2010 at 12:23 PM, Lee Smith  wrote:
>> Hey All
>> 
>> I have indexed a whole bunch of documents and now I want to search against 
>> them.
>> 
>> My search is going great all but highlighting.
>> 
>> I have these items set
>> 
>> hl=true
>> hl.snippets=2
>> hl.fl = attr_content
>> hl.fragsize=100
>> 
>> Everything works, apart from the highlighted text not being surrounded 
>> with a <em> tag.
>> 
>> Am I missing a setting ?
>> 
>> Lee



Re: Cleaning up dirty OCR

2010-03-09 Thread simon
On Tue, Mar 9, 2010 at 2:35 PM, Robert Muir  wrote:

> > Can anyone suggest any practical solutions to removing some fraction of
> the tokens containing OCR errors from our input stream?
>
> one approach would be to try
> http://issues.apache.org/jira/browse/LUCENE-1812
>
> and filter terms that only appear once in the document.
>

In another life (and with another search engine) I also had to find a
solution to the dirty OCR problem. Fortunately only in English,
unfortunately a corpus containing many non-American/non-English names, so we
also had to be very conservative and reduce the number of false positives.

There wasn't any completely satisfactory solution; there were a large number
of two and three letter n-grams so we were able to use a dictionary approach
to eliminate those (names tend to be longer).  We also looked for runs of
punctuation,  unlikely mixes of alpha/numeric/punctuation, and also
eliminated longer words which consisted of runs of not-ocurring-in-English
bigrams.

Hope this helps

-Simon

>
> --
>


Re: SolrConfig - constructing the object

2010-03-09 Thread Chris Hostetter

: Now a few days later I am thinking, I need access to the SolrConfig object
: in multiple classes.  Maybe I should not be reloading it over and over?   I
: see that there is a getSolrConfig() method in the SolrCore class that will
: return the SolrConfig object.  
: Should I maybe just take all my classes and have them implement the
: SolrCoreAware interface so that they can call core.getSolrConfig(); ??

Uh... what exactly is it you are doing?

http://people.apache.org/~hossman/#xyproblem
XY Problem

Your question appears to be an "XY Problem" ... that is: you are dealing
with "X", you are assuming "Y" will help you, and you are asking about "Y"
without giving more details about the "X" so that we can understand the
full issue.  Perhaps the best solution doesn't involve "Y" at all?
See Also: http://www.perlmonks.org/index.pl?node_id=542341



-Hoss



Re: SolrJ commit options

2010-03-09 Thread Chris Hostetter

: One technique to control commit times is to do automatic commits: you
: can configure a core to commit every N seconds (really milliseconds,
: but less than 5 minutes becomes difficult) and/or every N documents.
: This promotes a more fixed amount of work per commit.

...but increaseing commit frequency only really helps you if the slowdown 
you are seeing is coming from the actaul commit -- if it's coming from 
really resource intensive cache warming that chews up all the CPU then it 
can just make the problem worse -- likewise, if you don't have any 
warming, the perception of "slow/stoped queries during/after commit" can 
sometimes come from time spent initializing FieldCaches (particularly if 
there is one or two fields that almost all queries sort on )

the long and short being: performance issues can be caused by a great many 
differnet things, so you really need to figure out what exactly is going 
on during these "slow" periods in order to dtermine the best way to deal 
with it.


-Hoss



Re: Highlighting

2010-03-09 Thread Ahmet Arslan

> Yes it shows when I run the debug 
> 
> <lst name="org.apache.solr.handler.component.HighlightComponent">
>   <double name="time">0.0</double>
> </lst>
> 
> Any other ideas ?

Is the field attr_content stored?  Are you querying this field? What happens 
when you append &hl.maxAnalyzedChars=-1 to your search url?





RE: Index an entire Phrase and not it's constituent parts?

2010-03-09 Thread Christopher Ball
Unfortunately, I don't see how the KeywordTokenizerFactory could work given
the field in question is delimited text (paragraphs) and the
KeywordTokenizerFactory essentially does nothing to the inbound content.

 

Feel like I must be missing something . . . but can't figure out what.

 

Do I really need to write a custom analyzer for this?

 

From: Erick Erickson
Subject: Re: Index an entire Phrase and not it's constituent parts?
Date: Thu, 04 Mar 2010 19:55:58 GMT

Try KeywordTokenizerFactory. This page is very useful:
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters
 
HTH
Erick
 
On Thu, Mar 4, 2010 at 2:31 PM, Christopher Ball <
christopher.b...@metaheuristica.com> wrote:
 
> How can I index an entire Phrase and not its constituent parts?
> 
> 
> 
> I want to index collations as a single term in the index, and not as the
> multiple terms that comprise the phrase, for example, I want to index: "as
> much as" but not the independent parts: "as", "much", "as".
> 
> 
> 
> Any guidance appreciated,
> 
> 
> 
> Christopher
 

 



Re: Search on dynamic fields which contains spaces /special characters

2010-03-09 Thread Chris Hostetter
: I do not believe the SOLR or LUCENE syntax allows this

At the lowest level, Solr and Lucene-Java both support any arbitrary 
character you want in the field name -- it's just that several features 
use syntax that doesn't play nicely with characters like whitespace in 
field names.

when using the Lucene query parser, you can backslash escape a whitespace 
character, even in a field name...

http://localhost:8983/solr/select/?debugQuery=true&explainOther=SOLR&q=foo\+bar_t%3Asolr

...however params like "fl" and "sort" don't support this type of 
escaping, so you're really better off trying to use field names that don't 
contain whitespace (or ",", or "|", or "}", or any of the other meta 
characters that are used by various features when looking at field names 
in request parameters)


-Hoss



Re: Removing duplicate values from multivalued fields

2010-03-09 Thread Chris Hostetter

: Is there a way to remove duplicate values from the multivalued fields? I am
: using Solrj client with solr 1.4 version. 

not trivially, but you could write an UpdateProcessor to do this fairly 
easily, or implement it in the client.
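If you go the client-side route, the deduplication itself is a small helper; a sketch in plain Python (document shape and field names are illustrative):

```python
def dedupe_preserving_order(values):
    # Drop duplicate values while keeping first-seen order, as a
    # client-side alternative to writing a custom UpdateProcessor.
    seen = set()
    out = []
    for v in values:
        if v not in seen:
            seen.add(v)
            out.append(v)
    return out

doc = {"id": "42", "tags": ["solr", "lucene", "solr", "search", "lucene"]}
doc["tags"] = dedupe_preserving_order(doc["tags"])
print(doc["tags"])  # ['solr', 'lucene', 'search']
```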




-Hoss



Re: search and count ocurrences

2010-03-09 Thread Chris Hostetter

: I need to implement a search where i should count the number of times
: the string appears on the search field, 
: 
: ie: only return articles that mention the word 'HP' at least 2x.
...
: Is there a way that SOLR does this type of operation for me?

you'd have to implement it in a custom QParser -- if all you are worried 
about is simple TermQuery style matches, then this should be fairly 
trivial using SpanNearQuery.
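The matching rule itself ("term appears at least N times") is easy to sketch outside Solr; a custom QParser built on SpanNearQuery would implement the equivalent test inside Lucene. A plain-Python illustration of the rule, with a deliberately naive tokenizer:

```python
def mentions_at_least(text, term, n):
    # Does `term` occur at least `n` times in a naive, case-insensitive
    # whitespace tokenization of `text`?  (A simplification of what a
    # real analyzer would produce.)
    tokens = [t.strip(".,!?").lower() for t in text.split()]
    return tokens.count(term.lower()) >= n

article = "HP announced new servers. HP also reported earnings."
print(mentions_at_least(article, "HP", 2))       # True
print(mentions_at_least(article, "servers", 2))  # False
```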


-Hoss



Re: CoreAdminHandler question

2010-03-09 Thread Chris Hostetter

I *think* that you can use the same instanceDir for multiple cores, the 
key issue being that you need to make sure they each have distinct 
dataDirs (which as i recall can be done using property replacement with 
the core name)

: The action CREATE creates a new core based on preexisting
: instanceDir/solrconfig.xml/schema.xml, and registers it.
: That's what the documentation is stating.
: 
: Is there a way to instruct solr to create the instanceDir if does not exist?
: 
: I'm trying to create new core based on a existing schema/config to rebuild
: the index, after that swap it with the existing old core. The problem is

-Hoss



Re: Weird issue with solr and jconsole/jmx

2010-03-09 Thread Chris Hostetter

: I connected to one of my solr instances with Jconsole today and
: noticed that most of the mbeans under the solr hierarchy are missing.
: The only thing there was a Searcher, which I had no trouble seeing
: attributes for, but the rest of the statistics beans were missing.
: They all show up just fine on the stats.jsp page.
: 
: In the past this always worked fine. I did have the core reload due to
: config file changes this morning. Could that have caused this?

possibly... reloading the core actually causes a whole new SolrCore 
object (with its own registry of SolrInfoMBeans) to be created and then 
swapped in place of the previous core ... so perhaps you are still looking 
at the "stats" of the old core which is no longer in use (and hasn't been 
garbage collected because the JMX Manager still had a reference to it for 
you? ... i'm guessing at this point)

did disconnecting from jconsole and reconnecting show you the correct 
stats?


-Hoss



Re: Index an entire Phrase and not it's constituent parts?

2010-03-09 Thread Erick Erickson
I think you need to back up and tell us what you're
trying to accomplish from a higher level.
See Hossman's apache page:

Your question appears to be an "XY Problem" ... that is: you are dealing
with "X", you are assuming "Y" will help you, and you are asking about "Y"
without giving more details about the "X" so that we can understand the
full issue.  Perhaps the best solution doesn't involve "Y" at all?
See Also: http://www.perlmonks.org/index.pl?node_id=542341

Erick

On Tue, Mar 9, 2010 at 6:16 PM, Christopher Ball <
christopher.b...@metaheuristica.com> wrote:

> Unfortunately, I don't see how the KeywordTokenizerFactory could work given
> the field in question is delimited text (paragraphs) and the
> KeywordTokenizerFactory essentially does nothing to the inbound content.
>
>
>
> Feel like I must be missing something . . . but can't figure out what.
>
>
>
> Do I really need to write a custom analyzer for this?
>
>
>
>  _
>
> From: Erick Erickson
> Subject: Re: Index an entire Phrase and not it's constituent parts?
> Date: Thu, 04 Mar 2010 19:55:58 GMT
>
> Try KeywordTokenizerFactory. This page is very useful:
> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters
>
> HTH
> Erick
>
> On Thu, Mar 4, 2010 at 2:31 PM, Christopher Ball <
> christopher.b...@metaheuristica.com> wrote:
>
> > How can I index an entire Phrase and not its constituent parts?
> >
> >
> >
> > I want to index collations as a single term in the index, and not as the
> > multiple terms that comprise the phrase, for example, I want to index:
> "as
> > much as" but not the independent parts: "as", "much", "as".
> >
> >
> >
> > Any guidance appreciated,
> >
> >
> >
> > Christopher
>
>
>
>
>


Re: Warning : no lockType configured for...

2010-03-09 Thread Chris Hostetter

: Ok I think I know where the problem is
...
: It's  the constructor used by SolrCore  in r772051

Ughhh... so to be clear: you haven't been using Solr 1.4 at any point in 
this thread?

that explains why no one else could recreate the problem you were 
describing.

For future reference: if you aren't using the most recently 
released version of Solr when you post a question about a possible bug, 
please make that very clear right up at the top of your message, and if 
you think you've found a bug, please make sure to test against the most 
recently released version to see if it's already been fixed.

: PS : should I file some kind of bug report even if everything is ok now ? (I'm
: asking because I didn't see anything related to this problem in JIRA, so maybe
: if you want to keep a trace...)

If you can recreate the problem using Solr 1.3, then feel free to file a 
bug, noting that it was only a problem in 1.3 but has already been fixed 
in 1.4 ... but we don't usually bother tracking bugs against arbitrary 
unreleased points from the trunk (unless they are current).  I'm sure there 
are lots of bugs that existed only transiently as features were being 
fleshed out.


-Hoss



digest

2010-03-09 Thread Dennis Gearon
Is there a digest mode to this list? 


It's very active and helpful. I'm just not fully 'dove in' to using it yet. 
Just need to look in the digests for answers to my questions.

Dennis Gearon

Signature Warning

EARTH has a Right To Life,
  otherwise we all die.

Read 'Hot, Flat, and Crowded'
Laugh at http://www.yert.com/film.php


--- On Tue, 3/9/10, Robert Muir  wrote:

> From: Robert Muir 
> Subject: Re: PDF extraction leads to reversed words
> To: solr-user@lucene.apache.org
> Date: Tuesday, March 9, 2010, 7:13 AM
> On Tue, Mar 9, 2010 at 10:10 AM,
> Abdelhamid  ABID 
> wrote:
> > nor 3.8 version does change anythings !
> >
> 
> the patch (https://issues.apache.org/jira/browse/SOLR-1813) can
> only
> work on Solr trunk. It will not work with Solr 1.4.
> 
> 
> Solr 1.4 uses pdfbox-0.7.3.jar, which does not support
> Arabic.
> Solr trunk uses pdfbox-0.8.0-incubating.jar, which does
> support
> Arabic, if you also put ICU in the classpath.
> 
> -- 
> Robert Muir
> rcm...@gmail.com
>


Re: Documents disappearing

2010-03-09 Thread Chris Hostetter

: A quick check did show me a couple of duplicates, but if I understand
: correctly, even if two different process send the same document, the last
: one should update the previous. If I send the same documents 10 times, in
: the end, it should only be in my index once, no?

it should yes ... i didn't say i could explain your problem, i'm just 
trying to speculate about things that might give us insight into figuring 
out if/where a bug exists.

the only thing i can possibly think of that would cause a situation like 
this (where the number of documents decreases w/o any deletes happening) 
is if some of the "add" commands use overwrite="false" and some use 
overwrite="true" ... in that 
situation, you might get 10 docs added with the same uniqueKey 
value using overwrite="false" and so you'll have 10 docs in your index.  
then you might index one more doc with the same uniqueKey value, but this 
time using overwrite="true" and that one document will overwrite all 10 of 
the previous documents, causing your doc count to decrease from 10 to 1.
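
The interaction described above can be modeled with a toy in-memory index (a minimal sketch of the overwrite semantics only, not of how Solr actually stores documents):

```python
# Toy model of the overwrite="true"/"false" interaction described above.
# This only illustrates the semantics -- it is not Solr's implementation.

def add(index, doc, overwrite):
    """Add doc (a dict with an 'id' uniqueKey) to index (a list)."""
    if overwrite:
        # overwrite="true": first delete every doc sharing the uniqueKey
        index[:] = [d for d in index if d["id"] != doc["id"]]
    index.append(doc)

index = []
for _ in range(10):
    add(index, {"id": "42", "body": "dup"}, overwrite=False)
print(len(index))   # 10 -- ten docs now share the same uniqueKey

add(index, {"id": "42", "body": "final"}, overwrite=True)
print(len(index))   # 1 -- the single overwriting add replaced all ten
```

This reproduces the 10-to-1 drop in doc count without any explicit delete ever being sent.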

But nothing in your description of how you are using Solr gimplies that 
you were doing this, hence my question of what exactly your indexing code 
looks like.

My best guess is that maybe the deduplication UpdateProcessors have a bug 
in them, but w/o a reproducible test case demonstrating the problem it 
will be nearly impossible to even know where (or if that's actually the 
problem at all).



-Hoss



Re: Extracting content from mailman managed mail list archive

2010-03-09 Thread Chris Hostetter

: I just checked popular search services and it seems that neither
: lucidimagination search nor search-lucene support this:

it really depends on what you want to do ... most people i know who index 
email want to include quoted portions in the message because it's part of 
the context of the message  ... if you are looking for emails from "Jim" 
sent on day "X" about "Foo" then shouldn't a message match even if 
Jim never wrote the word "Foo" himself but it appeared in the quoted text 
he was commenting on several times?

your opinion may vary, but that's the reasoning i've heard behind why 
people index quoted portions even when they've identified them for the 
purposes of display, which is what seems to be happening in the lucid 
example you cited...

: 
http://www.lucidimagination.com/search/document/954e8589ebbc4b16/terminating_slashes_in_url_normalization

...Jukka sent that message in plain text, but the lucid system detected 
the quoted portion and converted it to an html blockquote tag.  
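
A line-prefix heuristic is enough to approximate that kind of quote detection (a rough sketch assuming quoted material is ">"-prefixed; real detectors also have to handle attribution lines, top-posting, and format=flowed re-wrapping):

```python
# Rough sketch of separating quoted from original text in an email body,
# similar in spirit to what the lucid archive seems to do before wrapping
# quoted portions in a blockquote. Real-world detection is messier.

def split_quoted(body):
    quoted, original = [], []
    for line in body.splitlines():
        # ">"-prefixed lines (at any depth, e.g. "> >") count as quoted
        if line.lstrip().startswith(">"):
            quoted.append(line)
        else:
            original.append(line)
    return original, quoted

msg = "I agree.\n> shouldn't a message match\n> even if Jim never wrote Foo?"
original, quoted = split_quoted(msg)
print(len(original), len(quoted))  # 1 2
```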


-Hoss



Re: Index an entire Phrase and not it's constituent parts?

2010-03-09 Thread Erick Erickson
P.S. although phrase queries with fields that do NOT
have stopwords removed feels kinda like what you're
hinting at.

Erick

On Tue, Mar 9, 2010 at 6:49 PM, Erick Erickson wrote:

> I think you need to back up and tell us what you're
> trying to accomplish from a higher level.
> See Hossman's apache page:
>
> Your question appears to be an "XY Problem" ... that is: you are dealing
>
> with "X", you are assuming "Y" will help you, and you are asking about "Y"
> without giving more details about the "X" so that we can understand the
> full issue.  Perhaps the best solution doesn't involve "Y" at all?
>
> See Also: http://www.perlmonks.org/index.pl?node_id=542341
>
> Erick
>
>
> On Tue, Mar 9, 2010 at 6:16 PM, Christopher Ball <
> christopher.b...@metaheuristica.com> wrote:
>
>> Unfortunately, I don't see how the KeywordTokenizerFactory could work
>> given
>> the field in question is delimited text (paragraphs) and the
>> KeywordTokenizerFactory essentially does nothing to the inbound content.
>>
>>
>>
>> Feel like I must be missing something . . . but can't figure out what.
>>
>>
>>
>> Do I really need to write a custom analyzer for this?
>>
>>
>>
>>  _
>>
>> From Erick Erickson  Subject Re: Index an entire
>> Phrase and not it's constituent parts? Date Thu, 04 Mar 2010 19:55:58 GMT
>>
>> Try KeywordTokenizerFactory. This page is very useful:
>> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters
>>
>> HTH
>> Erick
>>
>> On Thu, Mar 4, 2010 at 2:31 PM, Christopher Ball <
>> christopher.b...@metaheuristica.com> wrote:
>>
>> > How can I index an entire phrase and not its constituent parts?
>> >
>> >
>> >
>> > I want to index collations as a single term in the index, and not as the
>> > multiple terms that comprise the phrase, for example, I want to index:
>> "as
>> > much as" but not the independent parts: "as", "much", "as".
>> >
>> >
>> >
>> > Any guidance appreciated,
>> >
>> >
>> >
>> > Christopher
>>
>>
>>
>>
>>
>


Using SOLR

2010-03-09 Thread CP Hennessy
Hi,
  I'm trying to figure out if SOLR is the component I need and if so that 
I'm asking the right questions :)

I need to index a large set of multilingual documents against a project 
specific taxonomy. 

From what I've read SOLR should be perfect for this. 

However I'm not sure that my approach is correct. I've been able to run the 
example solr setup and index the given documents. 

Now I want to add my taxonomy (in English first), and this is where I'm 
stumbling (or not understanding the documentation).

To do this I understand that I need to define a field to store the result of 
the taxonomy analysis. I also need to define the analysis steps used to 
generate the values for this field ( lowercase, synonyms, stemming, etc).

In the file solr/conf/schema.xml in the  I've added :


  






  


and 

   

I am able to test my fieldType thru the /solr/admin/analysis.jsp page and it 
seems to be doing what I expect. 

When I now add a test document containing several words from the keepwords.txt 
file the result seems to indicate that it was processed correctly.  

How can I get the details of what has been indexed for my file?


Also I do not know how to perform a search based on the taxonomy ?

Any pointers would be greatly appreciated.

Thanks in advance,
CPH


Re: digest

2010-03-09 Thread Erick Erickson
Not that I know of, but you can certainly search it at:
http://old.nabble.com/Solr-f14479.html
or
http://www.lucidimagination.com/search/

and there's the Wiki at:
http://wiki.apache.org/solr/FrontPage

Erick

On Tue, Mar 9, 2010 at 7:12 PM, Dennis Gearon  wrote:

> Is there a digest mode to this list?
>
>
> It's very active and helpful. I'm just not fully 'dove in' to using it yet.
> Just need to look in the digests for answers to my questions.
>
> Dennis Gearon
>
> Signature Warning
> 
> EARTH has a Right To Life,
>  otherwise we all die.
>
> Read 'Hot, Flat, and Crowded'
> Laugh at http://www.yert.com/film.php
>
>
> --- On Tue, 3/9/10, Robert Muir  wrote:
>
> > From: Robert Muir 
> > Subject: Re: PDF extraction leads to reversed words
> > To: solr-user@lucene.apache.org
> > Date: Tuesday, March 9, 2010, 7:13 AM
> > On Tue, Mar 9, 2010 at 10:10 AM,
> > Abdelhamid  ABID 
> > wrote:
> > > neither does the 3.8 version change anything!
> > >
> >
> > the patch (https://issues.apache.org/jira/browse/SOLR-1813) can
> > only
> > work on Solr trunk. It will not work with Solr 1.4.
> >
> >
> > Solr 1.4 uses pdfbox-0.7.3.jar, which does not support
> > Arabic.
> > Solr trunk uses pdfbox-0.8.0-incubating.jar, which does
> > support
> > Arabic, if you also put ICU in the classpath.
> >
> > --
> > Robert Muir
> > rcm...@gmail.com
> >
>


Architectural help

2010-03-09 Thread blargy

I was wondering if someone could be so kind to give me some architectural
guidance.

A little about our setup. We are a RoR shop that is currently using Ferret (no
laughs please) as our search technology. Our indexing process at the moment
is quite poor as well as our search results. After some deliberation we have
decided to switch to Solr to satisfy our search requirements. 

We have about 5M records ranging in size all coming from a DB source (only 2
tables). What will be the most efficient way of indexing all of these
documents? I am looking at DIH but before I go down that road I wanted to
get some guidance. Are there any pitfalls I should be aware of before I
start? Anything I can do now that will help me down the road?

I have also been exploring the Sunspot rails plugin
(http://outoftime.github.com/sunspot/) which so far seems amazing. There is
an easy way to reindex all of your models like Model.reindex but I doubt
this is the most efficient. Has anyone had any experience using Sunspot with
their rails environment and if so should I bother with the DIH?

Please let me know of any suggestions/opinions you may have. Thanks.


-- 
View this message in context: 
http://old.nabble.com/Architectural-help-tp27844268p27844268.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Using SOLR

2010-03-09 Thread Erick Erickson
Well, the LukeRequestHandler lets you peek at the
index, see:
http://wiki.apache.org/solr/LukeRequestHandler

warning: it'll take a bit for this to make lots of sense.

You can get a copy of Luke (google Lucene Luke) for
what the above is based on, point it at your index and
have at it.

One bit of warning though. It'll be easy to confuse
what you stored (which is just a raw copy of
your input) with what you indexed (which is
what's searched on). If you're looking at either tool
and what you see looks suspiciously like
your raw data, look further to see if you can find
the terms...

To answer your question about searching, it all depends
(tm). What do you mean by Taxonomy? Different
people use that term...er...differently. Some example
inputs and how searching should behave in your
problem space would be very helpful.

HTH
Erick

On Tue, Mar 9, 2010 at 7:53 PM, CP Hennessy  wrote:

> Hi,
>  I'm trying to figure out if SOLR is the component I need and if so that
> I'm asking the right questions :)
>
> I need to index a large set of multilingual documents against a project
> specific taxonomy.
>
> From what I've read SOLR should be perfect for this.
>
> However I'm not sure that my approach is correct. I've been able to run the
> example solr setup and index the given documents.
>
> Now I want to add my taxonomy (in English first), and this is where I'm
> stumbling (or not understanding the documentation).
>
> To do this I understand that I need to define a field to store the result
> of
> the taxonomy analysis. I also need to define the analysis steps used to
> generate the values for this field ( lowercase, synonyms, stemming, etc).
>
> In the file solr/conf/schema.xml in the  I've added :
>
>
>  
>
>
>
> language="English"/>
>
> ignoreCase="true"/>
>  
>
>
> and
>
>required="true" multiValued="true"/>
>
> I am able to test my fieldType thru the /solr/admin/analysis.jsp page and
> it
> seems to be doing what I expect.
>
> When I now add a test document containing several words from the
> keepwords.txt
> file the result seems to indicate that it was processed correctly.
>
> How can I get the details of what has been indexed for my file?
>
>
> Also I do not know how to perform a search based on the taxonomy ?
>
> Any pointers would be greatly appreciated.
>
> Thanks in advance,
> CPH
>


Scaling indexes with high document count

2010-03-09 Thread Peter Sturge
Hello,

I wonder if anyone might have some insight/advice on index scaling for high
document count vs size deployments...

The nature of the incoming data is a steady stream of, on average, 4GB per
day. Importantly, the number of documents inserted during this time is
~7million (i.e. lots of small entries).
The plan is to partition shards on a per month basis, and hold 6 months of
data.

On the search side, this would mean 6 shards (as replicas), each holding
~120GB with ~210million document entries.
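
(Those per-shard numbers follow directly from the ingest rate; a quick back-of-the-envelope check, assuming a 30-day month:

```python
# Back-of-the-envelope check of the shard sizing quoted above,
# assuming a 30-day month.
GB_PER_DAY = 4
DOCS_PER_DAY = 7_000_000
DAYS_PER_MONTH = 30
MONTHS_RETAINED = 6

gb_per_shard = GB_PER_DAY * DAYS_PER_MONTH      # 120 GB per monthly shard
docs_per_shard = DOCS_PER_DAY * DAYS_PER_MONTH  # 210,000,000 docs per shard
total_docs = docs_per_shard * MONTHS_RETAINED   # 1,260,000,000 (~1.3 billion)
print(gb_per_shard, docs_per_shard, total_docs)
```

which matches the ~1.3 billion total mentioned below.)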
It is envisioned to deploy 2 indexing cores of which one is active at a
time. When the active core gets 'full' (e.g. a month has passed), the other
core kicks in for live indexing while the first completes its replication to
its searcher(s). It's then cleared, ready for the next time period. Each time
there is a 'switch', the next available replica is cleared and told to
replicate to the newly active indexing core. After 6 months, the first
replica is re-used, and so on...
This type of layout allows indexing to carry on pretty much uninterrupted,
and makes it relatively easy to manage replicas separately from the indexers
(e.g. add replicas to store, say, 9 months, backup, forward etc.).

As searching would always be performed on replicas - the indexing cores
wouldn't be tuned with much autowarming/read cache, but have loads of
'maxdocs' cache. The searchers would be the other way 'round - lots of
filter/fieldvalue cache. Please correct me if I'm wrong about these. (btw,
client searches use faceting in a big way)

The 120GB disk footprint is perfectly reasonable. Searching on potentially
1.3 billion document entries, each with up to 30-80 facets (+potentially lots
of unique values), plus date faceting and range queries, and still keep
search performance up is where I could use some advice.
Is this a case of simply throwing enough tin at the problem to handle the
caching/faceting/distributed searches?

What advice would you give to get the best performance out of such a
scenario?
Any experiences/insight etc. is greatly appreciated.

Thanks,
Peter

BTW: Many thanks, Yonik and Lucid for your excellent Mastering Solr webinar
- really useful and highly informative!


Re: digest

2010-03-09 Thread Chris Hostetter

: Mailing-List: contact solr-user-h...@lucene.apache.org; run by ezmlm
: Precedence: bulk
: List-Help: 

...if you send mail to that address it should have info about subscribing 
in digest mode.

And PS...


: Subject: digest
: In-Reply-To: <8f0ad1f31003090713h6580f413gb7759713bf3dc...@mail.gmail.com>

http://people.apache.org/~hossman/#threadhijack
Thread Hijacking on Mailing Lists

When starting a new discussion on a mailing list, please do not reply to 
an existing message, instead start a fresh email.  Even if you change the 
subject line of your email, other mail headers still track which thread 
you replied to and your question is "hidden" in that thread and gets less 
attention.   It makes following discussions in the mailing list archives 
particularly difficult.
See Also:  http://en.wikipedia.org/wiki/User:DonDiego/Thread_hijacking




-Hoss



Solr ad on stackoverflow.com

2010-03-09 Thread Mauricio Scheffer
Stackoverflow.com is serving ads for open source projects:
http://meta.stackoverflow.com/questions/31913/open-source-advertising-sidebar-1h-2010

I think it would be good publicity for Solr to have a banner there... anyone
up for designing one? (if it's ok with the Solr dev team, of course)

Cheers,
Mauricio


Re: More contextual information in anlyzers

2010-03-09 Thread Chris Hostetter

: If I write a custom analyser that accept a specific attribut in the
: constructor
: 
: public MyCustomAnalyzer(String myAttribute);
: 
: Is there a way to dynamically send a value for this attribute from Solr at
: index time in the XML Message ?
: 
: 
:   
: .

fundamentally there are two problems with trying to add functionality like 
this into Solr...

1) the XML Update syntax is just *one* of several different pathways that 
data can make it into Solr, and well before it reaches your custom 
analyzer, it's converted into what is essentially just a list of triplets 
(fieldName,fieldValue,boost).  So it would be hard to generalize out 
additional metadata attributes associated with field values in a way that 
works across all of those pathways.

2) In Solr (and in Lucene in general) you don't get a separate Analyzer 
instance per field/value pair -- one Analyzer is reused over and over for 
every field=>value in a doc (and in fact: the same analyzer is used over 
and over for every document as well)

This is why people typically encode their "attributes" in the value, and 
then write their Tokenizers in such a way that they decode that info and 
store it as a Payload on the terms -- because even if you bypassed Solr's 
pipeline for adding documents directly from some custom RequestHandler 
that knew about your extended XML syntax, there wouldn't be any way to pass 
that metadata to the (long-lived) Analyzer instance.
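
That encode-in-the-value workaround can be sketched as follows (a toy illustration: the "|" delimiter and both helper names are invented here, and a real implementation would do the decoding inside a custom Lucene Tokenizer and attach the attribute as a term Payload):

```python
# Toy illustration of smuggling per-value metadata past a shared,
# long-lived analyzer by encoding it into the field value itself.
# The "attr|text" scheme and the helpers are invented for this sketch;
# in Lucene the decoding would live in a custom Tokenizer and the
# attribute would become a per-term Payload.
DELIM = "|"

def encode(attribute, text):
    # done by the indexing client, before the value reaches the analyzer
    return attribute + DELIM + text

def tokenize(value):
    # done inside the (shared) tokenizer: peel the attribute off, then
    # emit (token, attribute) pairs
    attribute, _, text = value.partition(DELIM)
    return [(tok, attribute) for tok in text.split()]

tokens = tokenize(encode("fr", "bonjour le monde"))
print(tokens)  # [('bonjour', 'fr'), ('le', 'fr'), ('monde', 'fr')]
```

The point is that the metadata rides along inside the one argument the analyzer does receive: the field value.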



-Hoss



Re: Is "UniqueKey" in schema and "pk" attribute for DataimportHandler entities still optional in solr 1.4?

2010-03-09 Thread Chris Hostetter

: I allways build solr index from scratch, so I don't have neither "pk"
: attribute in "entity" tag (dataconfig.xml file) nor "UniqueKey" in index
: schema. When I updated solr from 1.3 to 1.4 I got the following exception
: during solr initialization:

This is in fact a bug in Solr 1.4...
https://issues.apache.org/jira/browse/SOLR-1638

...
: Is "UniqueKey" in schema still optional in solr 1.4? Is "pk" attribute in
: still optional in solr 1.4 DataImportHandler entities?  As I can see in

It is (supposed to be) optional.  Some functionality requires it 
(QueryElevationComponent) but DataImportHandler is not (supposed to be) one 
of them.


-Hoss



Re: More contextual information in anlyzers

2010-03-09 Thread dbejean

So the way I built my analyzer is the right approach. Thank you.


hossman wrote:
> 
> 
> : If I write a custom analyser that accept a specific attribut in the
> : constructor
> : 
> : public MyCustomAnalyzer(String myAttribute);
> : 
> : Is there a way to dynamically send a value for this attribute from Solr
> at
> : index time in the XML Message ?
> : 
> : 
> :   
> : .
> 
> fundamentally there are two problems with trying to add functionality like 
> this into Solr...
> 
> 1) the XML Update syntax is just *one* of several different pathways that 
> data can make it into Solr, and well before it reaches your custom 
> analyzer, it's converted into what is essentially just a list of triplets 
> (fieldName,fieldValue,boost).  So it would be hard to generalize out 
> additional metadata attributes associated with field values in a way that 
> works across all of those pathways.
> 
> 2) In Solr (and in Lucene in general) you don't get a separate Analyzer 
> instance per field/value pair -- one Analyzer is reused over and over for 
> every field=>value in a doc (and in fact: the same analyzer is used over 
> and over for every document as well)
> 
> This is why people typically encode their "attributes" in the value, and 
> then write their Tokenizers in such a way that they decode that info and 
> store it as a Payload on the terms -- because even if you bypassed Solr's 
> pipeline for adding documents directly from some custom RequestHandler 
> that knew about your extended XML syntax, there wouldn't be any way to pass 
> that metadata to the (long-lived) Analyzer instance.
> 
> 
> 
> -Hoss
> 
> 
> 

-- 
View this message in context: 
http://old.nabble.com/More-contextual-information-in-analyser-tp27819298p27845893.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: [ANN] Zoie Solr Plugin - Zoie Solr Plugin enables real-time update functionality for Apache Solr 1.4+

2010-03-09 Thread Don Werve
2010/3/9 Shalin Shekhar Mangar 

> I think Don is talking about Zoie - it requires a long uniqueKey.
>

Yep; we're using UUIDs.