Re: Is there a multi-shard optimize message?

2009-07-29 Thread Shalin Shekhar Mangar
On Wed, Jul 29, 2009 at 2:48 AM, Phillip Farber  wrote:

>
> Normally to optimize an index you POST <optimize/> to /solr/update.  Is
> there any way to POST an optimize message to one instance and have it
> propagate to all shards sort of like the select?
>
> /solr-shard-1/select?q=dog...&shards=shard-1,shard-2
>

No, you'll need to send optimize to each host separately.
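
For reference, a minimal SolrJ sketch of doing exactly that (the shard URLs
are placeholders for your own hosts/cores):

    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

    public class OptimizeAllShards {
        public static void main(String[] args) throws Exception {
            String[] shards = {
                "http://host1:8983/solr-shard-1",
                "http://host2:8983/solr-shard-2"
            };
            for (String url : shards) {
                SolrServer server = new CommonsHttpSolrServer(url);
                server.optimize(); // blocks until this shard has finished
            }
        }
    }
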
-- 
Regards,
Shalin Shekhar Mangar.


refering/alias other Solr documents

2009-07-29 Thread ravi.gidwani

Hi all:
Is there a way in Solr that will allow documents to refer to each other? In
other words, if a search for "abc" matches document 1, I should be able
to return document 2 even though document 2 does not have any fields matching "abc".
Here is the scenario with some more details:

Solr version:1.3

Scenario:
1) Solr Document 1 with say some field title="abc" and Solr Document 2 with
its own data.
2) User searches for "abc" and gets Document 1 as it matches on title field

Expected results:
When the user searches for "abc" he should also get Document 2 along with
Document 1.

I understand one way of doing this is to make sure Document 2 has all the
contents of Document 1. But this introduces an issue of keeping the two
documents (and hence their Solr index entries) in sync with each other.

I think I am looking for a mechanism like this:

Document 1 refers => document 2, Document 3. 

Hence whenever document 1 is part of the search results, document 2 and document
3 will also be returned as search results.

I may be totally off on this expectation, but I am trying to solve a "Contains"
problem where, let's say, a book (represented as Document 1 in Solr) "contains"
chapters (represented by Documents 2, 3, 4...) in Solr.

I hope this is not too confusing ;) 

TIA
~Ravi Gidwani
-- 
View this message in context: 
http://www.nabble.com/refering-alias-other-Solr-documents-tp24713855p24713855.html
Sent from the Solr - User mailing list archive at Nabble.com.



Boosting ('bq') on multi-valued fields

2009-07-29 Thread KaktuChakarabati

Hey,
I have a field defined as such:

<field name="site_id" type="string" indexed="true" stored="false" multiValued="true" />

with the string type defined as:

<fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/>

When I try using some query-time boost parameters via bq on values of
this field, it seems to behave
strangely for documents that actually have multiple values:
If I do a boost for a particular value ( "site_id:5^1.1" ), it seems like
all the cases where this field is actually
populated with multiple values ( i.e. a document with field value "5|6" ) do
not get boosted at all. I verified this using
debugQuery & explainOther=doc_id:.
Is this a known issue/bug? Any workarounds? (I'm using a nightly Solr build
from a few months back.. )

Thanks,
-Chak
-- 
View this message in context: 
http://www.nabble.com/Boosting-%28%27bq%27%29-on-multi-valued-fields-tp24713905p24713905.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: update some index documents after indexing process is done with DIH

2009-07-29 Thread Noble Paul നോബിള്‍ नोब्ळ्
On Tue, Jul 28, 2009 at 5:17 PM, Marc Sturlese wrote:
>
> That really sounds the best way to reach my goal. How could I invoque a
> listener from the newSearcher?Would be something like:
>    <listener event="newSearcher" class="solr.QuerySenderListener">
>      <arr name="queries">
>        <lst> <str name="q">solr</str> <str name="start">0</str> <str name="rows">10</str> </lst>
>        <lst> <str name="q">rocks</str> <str name="start">0</str> <str name="rows">10</str> </lst>
>        <lst> <str name="q">static newSearcher warming query from
> solrconfig.xml</str> </lst>
>      </arr>
>    </listener>
>    <listener event="newSearcher" class="MyCustomListener"/>
>
> And MyCustomListener would be the class who open the reader:
>
>        RefCounted<SolrIndexSearcher> searchHolder = null;
>        try {
>          searchHolder = dataImporter.getCore().getSearcher();
>          IndexReader reader = searchHolder.get().getReader();
>
>          // Here I iterate over the reader doing document modifications
>
>        } catch (Exception ex) {
>            LOG.info("error");
>        } finally {
>           if (searchHolder != null) searchHolder.decref();
>        }

you may not be able to access the DIH API from a newSearcher event.
But the API would give you the searcher directly as a method
parameter.
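
A minimal sketch of such a listener (against the 1.3/1.4-era API; the class
name comes from the mail above, everything else is a placeholder):

    import org.apache.lucene.index.IndexReader;
    import org.apache.solr.core.AbstractSolrEventListener;
    import org.apache.solr.core.SolrCore;
    import org.apache.solr.search.SolrIndexSearcher;

    public class MyCustomListener extends AbstractSolrEventListener {
        public MyCustomListener(SolrCore core) {
            super(core);
        }

        // Called on commit; the new searcher is handed in directly,
        // so there is no need to go through the DIH API here.
        public void newSearcher(SolrIndexSearcher newSearcher,
                                SolrIndexSearcher currentSearcher) {
            IndexReader reader = newSearcher.getReader();
            // ... iterate over the reader and collect the docs to modify ...
        }
    }
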
>
> Finally, to access documents and add fields to some of them, I have
> thought of using the SolrDocument classes. Can you please point me to where
> something similar is done in the Solr source (I mean the creation of SolrDocuments
> and their conversion to proper Lucene documents)?
>
> Does this way of reaching the goal make sense?
>
> Thanks in advance
>
>
>
> Noble Paul നോബിള്‍  नोब्ळ्-2 wrote:
>>
>> when a core is reloaded the event fired is firstSearcher. newSearcher
>> is fired when a commit happens
>>
>>
>> On Tue, Jul 28, 2009 at 4:19 PM, Marc Sturlese
>> wrote:
>>>
>>> Ok, but if I handle it in a newSearcher listener it will be executed
>>> every
>>> time I reload a core, won't it? The thing is that I want to use an
>>> IndexReader to load some doc fields of the index into a HashMap and,
>>> depending
>>> on the values of some field docs, modify other docs. It's very memory
>>> consuming (I have tested it with a simple Lucene script). That's why I
>>> wanted
>>> to do it just after the indexing process.
>>>
>>> My ideal case would be to do it in the commit function of
>>> DirectUpdateHandler2.java just before
>>> writer.optimize(cmd.maxOptimizeSegments); is executed. But I don't want
>>> to
>>> mess with that code... so I'm trying to find the best way to do this as a
>>> plugin
>>> instead of a hack, as far as possible.
>>>
>>> Thanks in advance
>>>
>>>
>>> Noble Paul നോബിള്‍  नोब्ळ्-2 wrote:

 It is best handled as a 'newSearcher' listener in solrconfig.xml.
 onImportEnd is invoked before committing

 On Tue, Jul 28, 2009 at 3:13 PM, Marc Sturlese
 wrote:
>
> Hey there,
> I would like to be able to do something like: After the indexing
> process
> is
> done with DIH I would like to open an indexreader, iterate over all
> docs,
> modify some of them depending on others and delete some others. I can
> easy
> do this directly coding with lucene but would like to know if there's a
> way
> to do it with Solr using SolrDocument or SolrInputDocument classes.
> I have thougth in using SolrJ or DIH listener onImportEnd but not sure
> if
> I
> can get an IndexReader in there.
> Any advice?
> Thanks in advance
> --
> View this message in context:
> http://www.nabble.com/update-some-index-documents-after-indexing-process-is-done-with-DIH-tp24695947p24695947.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>
>



 --
 -
 Noble Paul | Principal Engineer| AOL | http://aol.com


>>>
>>> --
>>> View this message in context:
>>> http://www.nabble.com/update-some-index-documents-after-indexing-process-is-done-with-DIH-tp24695947p24696872.html
>>> Sent from the Solr - User mailing list archive at Nabble.com.
>>>
>>>
>>
>>
>>
>> --
>> -
>> Noble Paul | Principal Engineer| AOL | http://aol.com
>>
>>
>
> --
> View this message in context: 
> http://www.nabble.com/update-some-index-documents-after-indexing-process-is-done-with-DIH-tp24695947p24697751.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>
>



-- 
-
Noble Paul | Principal Engineer| AOL | http://aol.com


Re: FieldCollapsing: Two response elements returned?

2009-07-29 Thread Licinio Fernández Maurelo
I've applied the latest collapse-field-related patch (patch-3) and it doesn't work.
Does anyone know how I can get only the collapsed response?


29-jul-2009 11:05:21 org.apache.solr.common.SolrException log
GRAVE: java.lang.ClassCastException:
org.apache.solr.handler.component.CollapseComponent cannot be cast to
org.apache.solr.request.SolrRequestHandler
at 
org.apache.solr.core.RequestHandlers.initHandlersFromConfig(RequestHandlers.java:150)
at org.apache.solr.core.SolrCore.<init>(SolrCore.java:539)
at org.apache.solr.core.CoreContainer.create(CoreContainer.java:381)
at org.apache.solr.core.CoreContainer.load(CoreContainer.java:241)
at 
org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:115)
at 
org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:83)
at 
org.apache.catalina.core.ApplicationFilterConfig.getFilter(ApplicationFilterConfig.java:275)
at 
org.apache.catalina.core.ApplicationFilterConfig.setFilterDef(ApplicationFilterConfig.java:397)
at 
org.apache.catalina.core.ApplicationFilterConfig.<init>(ApplicationFilterConfig.java:108)
at 
org.apache.catalina.core.StandardContext.filterStart(StandardContext.java:3800)
at 
org.apache.catalina.core.StandardContext.start(StandardContext.java:4450)
at 
org.apache.catalina.core.ContainerBase.addChildInternal(ContainerBase.java:791)
at 
org.apache.catalina.core.ContainerBase.addChild(ContainerBase.java:771)
at org.apache.catalina.core.StandardHost.addChild(StandardHost.java:526)
at 
org.apache.catalina.startup.HostConfig.deployDirectory(HostConfig.java:987)
at 
org.apache.catalina.startup.HostConfig.deployDirectories(HostConfig.java:909)
at 
org.apache.catalina.startup.HostConfig.deployApps(HostConfig.java:495)
at org.apache.catalina.startup.HostConfig.start(HostConfig.java:1206)
at 
org.apache.catalina.startup.HostConfig.lifecycleEvent(HostConfig.java:314)
at 
org.apache.catalina.util.LifecycleSupport.fireLifecycleEvent(LifecycleSupport.java:119)
at org.apache.catalina.core.ContainerBase.start(ContainerBase.java:1053)
at org.apache.catalina.core.StandardHost.start(StandardHost.java:722)
at org.apache.catalina.core.ContainerBase.start(ContainerBase.java:1045)
at 
org.apache.catalina.core.StandardEngine.start(StandardEngine.java:443)
at 
org.apache.catalina.core.StandardService.start(StandardService.java:516)
at 
org.apache.catalina.core.StandardServer.start(StandardServer.java:710)
at org.apache.catalina.startup.Catalina.start(Catalina.java:583)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.catalina.startup.Bootstrap.start(Bootstrap.java:288)
at org.apache.catalina.startup.Bootstrap.main(Bootstrap.java:413)

2009/7/28 Marc Sturlese :
>
> That's probably because you are using both the CollapseComponent and the
> QueryComponent. I think the last 2 or 3 patches allow full replacement of
> the QueryComponent. You should just replace:
>
> <searchComponent name="query" class="org.apache.solr.handler.component.QueryComponent" />
>
> with:
>
> <searchComponent name="query" class="org.apache.solr.handler.component.CollapseComponent" />
>
> This will sort out your problem and make response times faster.
>
>
>
> Jay Hill wrote:
>>
>> I'm doing some testing with field collapsing, and early results look good.
>> One thing seems odd to me however. I would expect to get back one block of
>> results, but I get two - the first one contains the collapsed results, the
>> second one contains the full non-collapsed results:
>>
>> <result ...> ... </result>
>> <result ...> ... </result>
>>
>> This seems somewhat confusing. Is this intended or is this a bug?
>>
>> Thanks,
>> -Jay
>>
>>
>
> --
> View this message in context: 
> http://www.nabble.com/FieldCollapsing%3A-Two-response-elements-returned--tp24690426p24693960.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>
>



-- 
Lici


solr/home in web.xml relative to web server home

2009-07-29 Thread Chantal Ackermann

Hi all,

the environment variable (env-entry) in web.xml to configure the
solr/home is relative to the web server's working directory. I find this
unusual, as all the servlet paths are relative to the web application's
directory (the webapp context, that is). So, at first, I specified solr/home
relative to the web app dir as well.


I think it makes deployment in an unknown environment, or in different
environments using a single war, more complex than it needs to be. If a
webapp-relative path inside the war file could be used, the
configuration of Solr (and cores) could be included in the war file
completely, with no outside dependency (except, of course, the data
directory if that is to go some place else).
(In my case, I want to deliver the Solr web application including a
custom entity processor, which is why I want to include the Solr war
as part of my release cycle. It is easier to deliver that to the system
administration than to provide them with partial packages they have to
install into an already installed war, imho.)


Am I the only one who has run into this?

Thanks for any input on that!
Chantal



--
Chantal Ackermann


Re: highlighting performance

2009-07-29 Thread Koji Sekiguchi

Just an FYI, Lucene 2.9 has FastVectorHighlighter:

http://hudson.zones.apache.org/hudson/job/Lucene-trunk/javadoc/all/org/apache/lucene/search/vectorhighlight/package-summary.html

Features

   * fast for large docs
   * support N-gram fields
   * support phrase-unit highlighting with slops
   * need Java 1.5
   * highlight fields need to be TermVector.WITH_POSITIONS_OFFSETS
   * take into account query boost to score fragments
   * support colored highlight tags
   * pluggable FragListBuilder
   * pluggable FragmentsBuilder
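
At the Lucene level, usage is roughly as follows (a sketch against the 2.9
javadoc above; reader, query, docId and the "content" field name are
placeholders from your own search code, and the field must be indexed
WITH_POSITIONS_OFFSETS):

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.vectorhighlight.FastVectorHighlighter;
    import org.apache.lucene.search.vectorhighlight.FieldQuery;

    FastVectorHighlighter fvh = new FastVectorHighlighter();
    FieldQuery fq = fvh.getFieldQuery(query);
    // best fragment of up to 100 chars from the "content" field of docId
    String fragment = fvh.getBestFragment(fq, reader, docId, "content", 100);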

Unfortunately, Solr hasn't incorporated it yet:

https://issues.apache.org/jira/browse/SOLR-1268

Koji


ravi.gidwani wrote:

Hey Matt:
 I have been facing the same issue. I have a text field that I
highlight along with other fields (maybe 10 other fields). But if I enable
highlighting on this text field, which contains a large number of
characters/words ( > 100,000 characters ), highlighting performance suffers.
Queries return in about 15-20 seconds with this field enabled in highlighting,
compared to less than a second WITHOUT this field enabled in highlighting.
I did try termVector=true, but I did not see any performance
gain either.


Just wondering if you were able to solve your issue OR tweak the performance
in any other way. 


BTW , I use solr 1.3.

~Ravi .

goodieboy wrote:
  

Thanks Otis. I added termVector="true" for those fields, but there isn't a
noticeable difference. So, just to be a little more clear, the dynamic
fields I'm adding... there might be hundreds. Do you see this as a
problem?

Thanks,
Matt

On Fri, May 15, 2009 at 7:48 PM, Otis Gospodnetic <
otis_gospodne...@yahoo.com> wrote:



Matt,

I believe indexing those fields that you will use for highlighting with
term vectors enabled will make things faster (and your index a bit
bigger).


Otis --
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



- Original Message 
  

From: Matt Mitchell 
To: solr-user@lucene.apache.org
Sent: Friday, May 15, 2009 5:08:23 PM
Subject: highlighting performance

Hi,

I'm experimenting with highlighting and am noticing a big drop in
performance with my setup. I have documents that use quite a few dynamic
fields (20-30). The fields are multiValued stored/indexed text fields, each
with a few paragraphs worth of text. My hl.fl param is set to *_t

What kinds of things can I tweak to make this faster? Is it because I'm
highlighting so many different fields?

Thanks,
Matt

Quoted from: 
http://www.nabble.com/highlighting-performance-tp23567323p23713406.html








Re: debugQuery=true issue

2009-07-29 Thread gwk

Hi,

Thanks for your response, I'm still developing so the schema is still in 
flux so I guess that explains it. Oh and regarding the NPE, I updated my 
checkout and recompiled and now it's gone so I guess somewhere between 
revision 787997 and 798482 it's already been fixed.


Regards,

gwk

Robert Petersen wrote:

I had something similar happen where optimize fixed an odd
sorting/scoring problem. As I understand it, the optimize will clear
out index 'lint' from old schemas/documents and thus could affect
result scores, since all the term vectors (or something similar) are
refreshed.





Re: HTTP Status 500 - java.lang.RuntimeException: Can't find resource 'solrconfig.xml'

2009-07-29 Thread Koji Sekiguchi

As Solr said in the log, Solr couldn't find solrconfig.xml in the classpath,
in solr.solr.home, or in the cwd.

My guess is that the relative path you set for solr.solr.home
was incorrect. Why don't you try:

solr.solr.home=/home/huenzhao/search/tomcat6/bin/solr

instead of:

solr.solr.home=home/huenzhao/search/tomcat6/bin/solr
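
(For Tomcat, the property is passed to the JVM at startup, for example by
adding -Dsolr.solr.home=/home/huenzhao/search/tomcat6/bin/solr to JAVA_OPTS.)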

Koji

huenzhao wrote:

Hi all,

I used ubuntu 8.10 as the solr server OS, and set the
solr.solr.home=home/huenzhao/search/tomcat6/bin/solr.

When I run Tomcat (the same Tomcat and Solr ran with no problems on Windows
XP), I get this error:

HTTP Status 500 - Severe errors in solr configuration. Check your log files
for more detailed information on what may be wrong. If you want solr to
continue after configuration errors, change: <abortOnConfigurationError>false</abortOnConfigurationError> in null
-
java.lang.RuntimeException: Can't find resource 'solrconfig.xml' in
classpath or 'home/huenzhao/search/tomcat6/bin/solr/conf/',
cwd=/home/huenzhao/search/tomcat6/bin at
org.apache.solr.core.SolrResourceLoader.openResource(SolrResourceLoader.java:194)
at
org.apache.solr.core.SolrResourceLoader.openConfig(SolrResourceLoader.java:162)
at org.apache.solr.core.Config.<init>(Config.java:100) at
org.apache.solr.core.SolrConfig.<init>(SolrConfig.java:113) at
org.apache.solr.core.SolrConfig.<init>(SolrConfig.java:70) at
org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:117)
at
org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:69)
at
org.apache.catalina.core.ApplicationFilterConfig.getFilter(ApplicationFilterConfig.java:275)
at
org.apache.catalina.core.ApplicationFilterConfig.setFilterDef(ApplicationFilterConfig.java:397)
at
org.apache.catalina.core.ApplicationFilterConfig.<init>(ApplicationFilterConfig.java:108)
at
org.apache.catalina.core.StandardContext.filterStart(StandardContext.java:3696)
at 


……

Does anybody know what to do?

enzhao...@gmail.com

  




Re: solr/home in web.xml relative to web server home

2009-07-29 Thread Shalin Shekhar Mangar
On Wed, Jul 29, 2009 at 2:42 PM, Chantal Ackermann <
chantal.ackerm...@btelligent.de> wrote:

> Hi all,
>
> the environment variable (env-entry) in web.xml to configure the solr/home
> is relative to the web server's working directory. I find this unusual as
> all the servlet paths are relative to the web applications directory (webapp
> context, that is). So, I specified solr/home relative to the web app dir, as
> well, at first.
>
> I think it makes deployment in an unknown environment, or in different
> environments using a simple war more complex than it needed to be. If a
> webapp relative path inside the war file could be used, the configuration of
> solr (and cores) could be included in the war file completely with no
> outside dependency - except, of course, of the data directory if that is to
> go some place else.
> (In my case, I want to deliver the solr web application including a custom
> entity processor, so that is why I want to include the solr war as part of
> my release cycle. It is easier to deliver that to the system administration
> than to provide them with partial packages they have to install into an
> already installed war, imho.)
>

You don't need to create a custom war for that. You can package the
EntityProcessor into a separate jar and add it to the solr_home/lib directory.

-- 
Regards,
Shalin Shekhar Mangar.


Relevant results with DisMaxRequestHandler

2009-07-29 Thread Vincent Pérès

Hello,

I noticed several strange behaviors on queries. I would like to share
an example with you, so maybe you can explain to me what is going wrong.

Using the following query :
http://localhost:8983/solr/others/select/?debugQuery=true&q=anna%20lewis&rows=20&start=0&fl=*&qt=dismax

I get back around 100 results. Here are the first two:

Person:151
Victoria Davisson


Person:37
Anna Lewis


And the related debugs :
57.998047 = (MATCH) sum of:
  0.048290744 = (MATCH) sum of:
0.024546575 = (MATCH) max plus 0.01 times others of:
  0.024546575 = (MATCH) weight(text:anna^0.5 in 64288), product of:
0.027395602 = queryWeight(text:anna^0.5), product of:
  0.5 = boost
  5.734427 = idf(docFreq=564, numDocs=30400)
  0.009554783 = queryNorm
0.8960042 = (MATCH) fieldWeight(text:anna in 64288), product of:
  1.0 = tf(termFreq(text:anna)=1)
  5.734427 = idf(docFreq=564, numDocs=30400)
  0.15625 = fieldNorm(field=text, doc=64288)
0.02374417 = (MATCH) max plus 0.01 times others of:
  0.02374417 = (MATCH) weight(text:lewi^0.5 in 64288), product of:
0.026944114 = queryWeight(text:lewi^0.5), product of:
  0.5 = boost
  5.6399217 = idf(docFreq=620, numDocs=30400)
  0.009554783 = queryNorm
0.88123775 = (MATCH) fieldWeight(text:lewi in 64288), product of:
  1.0 = tf(termFreq(text:lewi)=1)
  5.6399217 = idf(docFreq=620, numDocs=30400)
  0.15625 = fieldNorm(field=text, doc=64288)
  57.949757 = (MATCH) FunctionQuery(ord(name_s)), product of:
1213.0 = ord(name_s)=1213
5.0 = boost
0.009554783 = queryNorm

5.006892 = (MATCH) sum of:
  0.038405567 = (MATCH) sum of:
0.021955125 = (MATCH) max plus 0.01 times others of:
  0.021955125 = (MATCH) weight(text:anna^0.5 in 62632), product of:
0.027395602 = queryWeight(text:anna^0.5), product of:
  0.5 = boost
  5.734427 = idf(docFreq=564, numDocs=30400)
  0.009554783 = queryNorm
0.80141056 = (MATCH) fieldWeight(text:anna in 62632), product of:
  2.236068 = tf(termFreq(text:anna)=5)
  5.734427 = idf(docFreq=564, numDocs=30400)
  0.0625 = fieldNorm(field=text, doc=62632)
0.016450444 = (MATCH) max plus 0.01 times others of:
  0.016450444 = (MATCH) weight(text:lewi^0.5 in 62632), product of:
0.026944114 = queryWeight(text:lewi^0.5), product of:
  0.5 = boost
  5.6399217 = idf(docFreq=620, numDocs=30400)
  0.009554783 = queryNorm
0.61053944 = (MATCH) fieldWeight(text:lewi in 62632), product of:
  1.7320508 = tf(termFreq(text:lewi)=3)
  5.6399217 = idf(docFreq=620, numDocs=30400)
  0.0625 = fieldNorm(field=text, doc=62632)
  4.968487 = (MATCH) FunctionQuery(ord(name_s)), product of:
104.0 = ord(name_s)=104
5.0 = boost
0.009554783 = queryNorm

I'm using a simple boost function :

   <requestHandler name="dismax" class="solr.SearchHandler">
     <lst name="defaults">
       <str name="defType">dismax</str>
       <str name="echoParams">explicit</str>
       <float name="tie">0.01</float>
       <str name="qf">
          text^0.5 name_s^5.0
       </str>
       <str name="pf">
          name_s^5.0
       </str>
       <str name="bf">
          name_s^5.0
       </str>
     </lst>
   </requestHandler>

Can anyone explain to me why the first result is on top (the query is 'anna
lewis') with a huge weight and nothing related? (It seems the weight comes
from the name_s field...)

A second general question... is it possible to boost a document if the query
matches exactly the content of a field?

Thank you !
Vincent
-- 
View this message in context: 
http://www.nabble.com/Relevant-results-with-DisMaxRequestHandler-tp24716870p24716870.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: facet.prefix question

2009-07-29 Thread Koji Sekiguchi

Licinio Fernández Maurelo wrote:

I'm trying to do some filtering in the count list retrieved by Solr when
doing a faceting query.

I'm wondering how I can use facet.prefix to get something like this:

Query

facet.field=foo&facet.prefix=A OR B

Response

<lst name="foo">
  <int name="...">12560</int>
  <int name="...">5440</int>
  <int name="...">2357</int>
  ...
</lst>

How can I achieve this behaviour?

Best Regards

  


You cannot set a query for the facet.prefix parameter. facet.prefix should
be a prefix *string* of terms in the index, and you can set only one at a time.
So I think you need to send two requests to get what you want:

...&facet.field=foo&facet.prefix=A
...&facet.field=foo&facet.prefix=B
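
If you are on SolrJ, a small sketch of issuing the two requests and merging
client-side (server here is an already-constructed SolrServer; the merge
itself is up to you):

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.response.QueryResponse;

    SolrQuery q = new SolrQuery("*:*");
    q.setFacet(true);
    q.addFacetField("foo");

    q.set("facet.prefix", "A");
    QueryResponse countsA = server.query(q);  // terms starting with A

    q.set("facet.prefix", "B");
    QueryResponse countsB = server.query(q);  // terms starting with B

    // merge countsA.getFacetField("foo") and countsB.getFacetField("foo")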


Koji




Question about formatting the results returned from Solr

2009-07-29 Thread ahammad

Hi all,

Not sure how good my title is, but here is a (hopefully) better explanation
of what I mean.

I am indexing a set of articles from a DB. Each article has an author. The
author is saved in the DB as an author ID, which is a number.

There is another table in the DB with more relevant information about the
author. Basically it has columns like:

id, firstname, lastname, email, userid

I set up the DIH so that it returns the userid, and it works fine:


<arr name="userid">
   <str>jdoe</str>
   <str>msmith</str>
</arr>

Would it be possible to return all of the information about the author
(first name, ...) as a subset of the results above?

Here is what I mean:


   
<arr name="author">
   <lst>
      <str name="firstname">John</str>
      <str name="lastname">Doe</str>
      <str name="email">j...@doe.com</str>
   </lst>
   ...
</arr>


Something similar to that at least...

Not sure how descriptive I was, but any pointers would be highly
appreciated.

Cheers

-- 
View this message in context: 
http://www.nabble.com/Question-about-formatting-the-results-returned-from-Solr-tp24719831p24719831.html
Sent from the Solr - User mailing list archive at Nabble.com.



Getting Tika to work in Solr 1.4 nightly

2009-07-29 Thread Kevin Miller
I am working with Solr 1.4 nightly and am running it on a Windows
machine.  Solr is running using the example folder that was installed
from the zip file.  The only alteration that I have made to this default
installation is to add a simple Word document into the exampledocs
folder.

I am trying to get Tika to work in Solr.  When I run tika-0.3.jar
against a Word document, it outputs XML to the screen.  I
am not able to get Solr to run Tika and index the information in the
sample Word document.

I have looked at the following resources: 
Solr mailing list archive (although I could have missed something here);
Documentation and Getting started on the Apache Tika website;
I even found an article called Content Extraction with Tika at this
website:
http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles
/Content-Extraction-Tika This article talks about using curl.  Is curl
necessary to use or does Solr have something already configured to do
the same as curl?

I have modified the solrconfig.xml file to include the request handler
for the ExtractingRequestHandler.  I used the modification that was
commented out in the solrconfig.xml file.  Here it is for reference:



<requestHandler name="/update/extract"
    class="org.apache.solr.handler.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <str name="ext.map.Last-Modified">last_modified</str>
    <bool name="ext.ignore.und.fl">true</bool>
  </lst>
</requestHandler>

Is there some modification to this code that I need to make?

Can someone please direct me to a source that can help me get this to
work?


Kevin Miller


Re: FieldCollapsing: Two response elements returned?

2009-07-29 Thread Licinio Fernández Maurelo
My last mail is wrong. Sorry

El 29 de julio de 2009 11:10, Licinio Fernández
Maurelo escribió:
> I've applied latest collapse field related patch (patch-3) and it doesn't 
> work.
> Anyone knows how can i get only the collapse response ?
>
>
> 29-jul-2009 11:05:21 org.apache.solr.common.SolrException log
> GRAVE: java.lang.ClassCastException:
> org.apache.solr.handler.component.CollapseComponent cannot be cast to
> org.apache.solr.request.SolrRequestHandler
>        at 
> org.apache.solr.core.RequestHandlers.initHandlersFromConfig(RequestHandlers.java:150)
>        at org.apache.solr.core.SolrCore.<init>(SolrCore.java:539)
>        at org.apache.solr.core.CoreContainer.create(CoreContainer.java:381)
>        at org.apache.solr.core.CoreContainer.load(CoreContainer.java:241)
>        at 
> org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:115)
>        at 
> org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:83)
>        at 
> org.apache.catalina.core.ApplicationFilterConfig.getFilter(ApplicationFilterConfig.java:275)
>        at 
> org.apache.catalina.core.ApplicationFilterConfig.setFilterDef(ApplicationFilterConfig.java:397)
>        at 
> org.apache.catalina.core.ApplicationFilterConfig.<init>(ApplicationFilterConfig.java:108)
>        at 
> org.apache.catalina.core.StandardContext.filterStart(StandardContext.java:3800)
>        at 
> org.apache.catalina.core.StandardContext.start(StandardContext.java:4450)
>        at 
> org.apache.catalina.core.ContainerBase.addChildInternal(ContainerBase.java:791)
>        at 
> org.apache.catalina.core.ContainerBase.addChild(ContainerBase.java:771)
>        at 
> org.apache.catalina.core.StandardHost.addChild(StandardHost.java:526)
>        at 
> org.apache.catalina.startup.HostConfig.deployDirectory(HostConfig.java:987)
>        at 
> org.apache.catalina.startup.HostConfig.deployDirectories(HostConfig.java:909)
>        at 
> org.apache.catalina.startup.HostConfig.deployApps(HostConfig.java:495)
>        at org.apache.catalina.startup.HostConfig.start(HostConfig.java:1206)
>        at 
> org.apache.catalina.startup.HostConfig.lifecycleEvent(HostConfig.java:314)
>        at 
> org.apache.catalina.util.LifecycleSupport.fireLifecycleEvent(LifecycleSupport.java:119)
>        at 
> org.apache.catalina.core.ContainerBase.start(ContainerBase.java:1053)
>        at org.apache.catalina.core.StandardHost.start(StandardHost.java:722)
>        at 
> org.apache.catalina.core.ContainerBase.start(ContainerBase.java:1045)
>        at 
> org.apache.catalina.core.StandardEngine.start(StandardEngine.java:443)
>        at 
> org.apache.catalina.core.StandardService.start(StandardService.java:516)
>        at 
> org.apache.catalina.core.StandardServer.start(StandardServer.java:710)
>        at org.apache.catalina.startup.Catalina.start(Catalina.java:583)
>        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>        at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>        at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>        at java.lang.reflect.Method.invoke(Method.java:597)
>        at org.apache.catalina.startup.Bootstrap.start(Bootstrap.java:288)
>        at org.apache.catalina.startup.Bootstrap.main(Bootstrap.java:413)
>
> 2009/7/28 Marc Sturlese :
>>
>> That's probably because you are using both the CollapseComponent and the
>> QueryComponent. I think the last 2 or 3 patches allow full replacement of
>> the QueryComponent. You should just replace:
>>
>> <searchComponent name="query" class="org.apache.solr.handler.component.QueryComponent" />
>> with:
>> <searchComponent name="query" class="org.apache.solr.handler.component.CollapseComponent" />
>>
>> This will sort out your problem and make response times faster.
>>
>>
>>
>> Jay Hill wrote:
>>>
>>> I'm doing some testing with field collapsing, and early results look good.
>>> One thing seems odd to me however. I would expect to get back one block of
>>> results, but I get two - the first one contains the collapsed results, the
>>> second one contains the full non-collapsed results:
>>>
>>> <result ...> ... </result>
>>> <result ...> ... </result>
>>>
>>> This seems somewhat confusing. Is this intended or is this a bug?
>>>
>>> Thanks,
>>> -Jay
>>>
>>>
>>
>> --
>> View this message in context: 
>> http://www.nabble.com/FieldCollapsing%3A-Two-response-elements-returned--tp24690426p24693960.html
>> Sent from the Solr - User mailing list archive at Nabble.com.
>>
>>
>
>
>
> --
> Lici
>



-- 
Lici


Re: Relevant results with DisMaxRequestHandler

2009-07-29 Thread Erik Hatcher


On Jul 29, 2009, at 6:55 AM, Vincent Pérès wrote:

Using the following query :
http://localhost:8983/solr/others/select/?debugQuery=true&q=anna%20lewis&rows=20&start=0&fl=*&qt=dismax

I get back around 100 results. Here are the first two:

Person:151
Victoria Davisson


Person:37
Anna Lewis


And the related debugs :
57.998047 = (MATCH) sum of:
 0.048290744 = (MATCH) sum of:
   0.024546575 = (MATCH) max plus 0.01 times others of:
 0.024546575 = (MATCH) weight(text:anna^0.5 in 64288), product of:
   0.027395602 = queryWeight(text:anna^0.5), product of:
 0.5 = boost
 5.734427 = idf(docFreq=564, numDocs=30400)
 0.009554783 = queryNorm
   0.8960042 = (MATCH) fieldWeight(text:anna in 64288), product of:
 1.0 = tf(termFreq(text:anna)=1)
 5.734427 = idf(docFreq=564, numDocs=30400)
 0.15625 = fieldNorm(field=text, doc=64288)
   0.02374417 = (MATCH) max plus 0.01 times others of:
 0.02374417 = (MATCH) weight(text:lewi^0.5 in 64288), product of:
   0.026944114 = queryWeight(text:lewi^0.5), product of:
 0.5 = boost
 5.6399217 = idf(docFreq=620, numDocs=30400)
 0.009554783 = queryNorm
   0.88123775 = (MATCH) fieldWeight(text:lewi in 64288), product of:
 1.0 = tf(termFreq(text:lewi)=1)
 5.6399217 = idf(docFreq=620, numDocs=30400)
 0.15625 = fieldNorm(field=text, doc=64288)
 57.949757 = (MATCH) FunctionQuery(ord(name_s)), product of:
   1213.0 = ord(name_s)=1213
   5.0 = boost
   0.009554783 = queryNorm

5.006892 = (MATCH) sum of:
 0.038405567 = (MATCH) sum of:
   0.021955125 = (MATCH) max plus 0.01 times others of:
 0.021955125 = (MATCH) weight(text:anna^0.5 in 62632), product of:
   0.027395602 = queryWeight(text:anna^0.5), product of:
 0.5 = boost
 5.734427 = idf(docFreq=564, numDocs=30400)
 0.009554783 = queryNorm
   0.80141056 = (MATCH) fieldWeight(text:anna in 62632), product of:
 2.236068 = tf(termFreq(text:anna)=5)
 5.734427 = idf(docFreq=564, numDocs=30400)
 0.0625 = fieldNorm(field=text, doc=62632)
   0.016450444 = (MATCH) max plus 0.01 times others of:
 0.016450444 = (MATCH) weight(text:lewi^0.5 in 62632), product of:
   0.026944114 = queryWeight(text:lewi^0.5), product of:
 0.5 = boost
 5.6399217 = idf(docFreq=620, numDocs=30400)
 0.009554783 = queryNorm
   0.61053944 = (MATCH) fieldWeight(text:lewi in 62632), product of:
 1.7320508 = tf(termFreq(text:lewi)=3)
 5.6399217 = idf(docFreq=620, numDocs=30400)
 0.0625 = fieldNorm(field=text, doc=62632)
 4.968487 = (MATCH) FunctionQuery(ord(name_s)), product of:
   104.0 = ord(name_s)=104
   5.0 = boost
   0.009554783 = queryNorm

I'm using a simple boost function :

  <requestHandler name="dismax" class="solr.SearchHandler">
    <lst name="defaults">
      <str name="defType">dismax</str>
      <str name="echoParams">explicit</str>
      <float name="tie">0.01</float>
      <str name="qf">
         text^0.5 name_s^5.0
      </str>
      <str name="pf">
         name_s^5.0
      </str>
      <str name="bf">
         name_s^5.0
      </str>
    </lst>
  </requestHandler>

Can anyone explain to me why the first result is on top (the query is 'anna
lewis') with a huge weight and nothing related? (It seems the weight comes
from the name_s field...)


The ord function perhaps isn't doing what you want.  It is returning
the term's ordinal position, and thus it appears "Anna Lewis" is the 104th
name_s value in your index lexicographically.  And of course "Victoria
Davisson" is much further down, at the 1213th position.  Maybe you
want rord instead?   But probably not...
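
(You can read this straight off the explain output: the function score is
ord * boost * queryNorm, i.e. 1213 * 5.0 * 0.009554783 = ~57.95 for Victoria
Davisson versus 104 * 5.0 * 0.009554783 = ~4.97 for Anna Lewis, which swamps
the tiny text scores.)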


A second general question... is it possible to boost a document if the query
matches exactly the content of a field?


You can set dismax's qs (query slop) factor, which will
boost documents where the user's terms are closer together (within the
number of terms distance specified).


Erik



RE: Boosting ('bq') on multi-valued fields

2009-07-29 Thread Ensdorf Ken
> Hey,
> I have a field defined as such:
>
> <field name="site_id" type="string" indexed="true" stored="false"
> multiValued="true" />
>
> with the string type defined as:
>
> <fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/>
>
> When I try using some query-time boost parameters using the bq on
> values of
> this field it seems to behave
> strangely in case of documents actually having multiple values:
> If i'd do a boost for a particular value ( "site_id:5^1.1" ) it seems
> like
> all the cases where this field is actually
> populated with multiple ones ( i.e a document with field value "5|6" )
> do
> not get boosted at all. I verified this using
> debugQuery & explainOther=doc_id:.
> is this a known issue/bug? any work arounds? (i'm using a nightly solr
> build
> from a few months back.. )

There is no tokenization on 'string' fields, so a query for "5" does not match
a doc with a value of "5|6" for this field.  You could try using field type
'text' for this and see what you get.  You may need to customize it to use the
StandardAnalyzer or WordDelimiterFilterFactory to get the right behavior.
Using the analysis tool in the Solr admin UI to experiment will probably be
helpful.

-Ken




Re: update some index documents after indexing process is done with DIH

2009-07-29 Thread Marc Sturlese

From the newSearcher(..) of a CustomEventListener which extends
AbstractSolrEventListener, I can access the SolrIndexSearcher and all core
properties, but I can't get a SolrIndexWriter. Do you know how I can get
a SolrIndexWriter from there? This way I would be able to modify the documents (I
need to modify them depending on values of other documents, which is why I
can't do it with a DIH delta-import).
Thanks in advance


Noble Paul നോബിള്‍  नोब्ळ्-2 wrote:
> 
> On Tue, Jul 28, 2009 at 5:17 PM, Marc Sturlese
> wrote:
>>
>> That really sounds the best way to reach my goal. How could I invoque a
>> listener from the newSearcher?Would be something like:
>>    <listener event="newSearcher" class="solr.QuerySenderListener">
>>      <arr name="queries">
>>        <lst> <str name="q">solr</str> <str name="start">0</str> <str name="rows">10</str> </lst>
>>        <lst> <str name="q">rocks</str> <str name="start">0</str> <str name="rows">10</str> </lst>
>>        <lst> <str name="q">static newSearcher warming query from
>> solrconfig.xml</str> </lst>
>>      </arr>
>>    </listener>
>>    <listener event="newSearcher" class="MyCustomListener"/>
>>
>> And MyCustomListener would be the class who open the reader:
>>
>>        RefCounted<SolrIndexSearcher> searchHolder = null;
>>        try {
>>          searchHolder = dataImporter.getCore().getSearcher();
>>          IndexReader reader = searchHolder.get().getReader();
>>
>>          // Here I iterate over the reader doing document modifications
>>
>>        } catch (Exception ex) {
>>            LOG.info("error");
>>        } finally {
>>           if (searchHolder != null) searchHolder.decref();
>>        }
> 
> you may not be able to access the DIH API from a newSearcher event .
> But the API would give you the searcher directly as a method
> parameter.
>>
>> Finally, to access documents and add fields to some of them, I have
>> thought of using the SolrDocument classes. Can you please point me to where
>> something similar is done in the Solr source (I mean the creation of
>> SolrDocuments
>> and their conversion to proper Lucene documents)?
>>
>> Does this way of reaching the goal make sense?
>>
>> Thanks in advance
>>
>>
>>
>> Noble Paul നോബിള്‍  नोब्ळ्-2 wrote:
>>>
>>> when a core is reloaded the event fired is firstSearcher. newSearcher
>>> is fired when a commit happens
>>>
>>>
>>> On Tue, Jul 28, 2009 at 4:19 PM, Marc Sturlese
>>> wrote:

 Ok, but if I handle it in a newSearcher listener it will be executed
 every
 time I reload a core, won't it? The thing is that I want to use an
 IndexReader to load some doc fields of the index into a HashMap and,
 depending
 on the values of some field docs, modify other docs. It's very memory
 consuming (I have tested it with a simple Lucene script). That's why I
 wanted
 to do it just after the indexing process.

 My ideal case would be to do it in the commit function of
 DirectUpdateHandler2.java just before
 writer.optimize(cmd.maxOptimizeSegments); is executed. But I don't want
 to
 mess with that code... so I'm trying to find the best way to do this as a
 plugin
 instead of a hack, as far as possible.

 Thanks in advance


 Noble Paul നോബിള്‍  नोब्ळ्-2 wrote:
>
> It is best handled as a 'newSearcher' listener in solrconfig.xml.
> onImportEnd is invoked before committing
>
> On Tue, Jul 28, 2009 at 3:13 PM, Marc
> Sturlese
> wrote:
>>
>> Hey there,
>> I would like to be able to do something like: after the indexing
>> process
>> is
>> done with DIH, I would like to open an IndexReader, iterate over all
>> docs,
>> modify some of them depending on others, and delete some others. I can
>> easily
>> do this directly coding with Lucene, but would like to know if there's
>> a
>> way
>> to do it with Solr using the SolrDocument or SolrInputDocument classes.
>> I have thought of using SolrJ or the DIH listener onImportEnd but am not
>> sure
>> if
>> I
>> can get an IndexReader in there.
>> Any advice?
>> Thanks in advance
>> --
>> View this message in context:
>> http://www.nabble.com/update-some-index-documents-after-indexing-process-is-done-with-DIH-tp24695947p24695947.html
>> Sent from the Solr - User mailing list archive at Nabble.com.
>>
>>
>
>
>
> --
> -
> Noble Paul | Principal Engineer| AOL | http://aol.com
>
>

 --
 View this message in context:
 http://www.nabble.com/update-some-index-documents-after-indexing-process-is-done-with-DIH-tp24695947p24696872.html
 Sent from the Solr - User mailing list archive at Nabble.com.


>>>
>>>
>>>
>>> --
>>> -
>>> Noble Paul | Principal Engineer| AOL | http://aol.com
>>>
>>>
>>
>> --
>> View this message in context:
>> http://www.nabble.com/update-some-index-documents-after-indexing-process-is-done-with-DIH-tp24695947p24697751.html
>> Sent from the Solr - User mailing list archive at Nabble.com.
>>
>>
> 
> 
> 
> -- 
> -
> Noble Paul | Principal Engineer| AOL | http://aol.com
> 
> 

-- 
View this message in cont

RE: search suggest

2009-07-29 Thread Robert Petersen
To do a proper search-suggest feature, you have to index all the queries
your system gets and search them with wildcards for matches on what the
user has typed so far, for each user keystroke in the search box,
usually with some timer logic to wait for a small hesitation in their
typing.



-Original Message-
From: Jack Bates [mailto:ms...@freezone.co.uk] 
Sent: Tuesday, July 28, 2009 10:54 AM
To: solr-user@lucene.apache.org
Subject: search suggest

how can i use solr to make search suggestions? i'm thinking google-style
suggestions, which suggests more refined queries - vs. freebase-style
suggestions, which suggests top hits.

i've been looking at the query params,
http://wiki.apache.org/solr/StandardRequestHandler

- and searching for "solr suggest" - but haven't figured out how to get
search suggestions from solr


Wildcard and boosting

2009-07-29 Thread Jón Helgi Jónsson
Hey now!

I do index-time boosting for my fields and just discovered that when
searching with a trailing wildcard the boosting is ignored.

Will my boosting work with a wildcard if I do it at query time? And
if so, is there a big performance difference?

Some other method I can use to preserve my boosting? I do not need
highlighting.

Thanks,
Jon Helgi


RE: refering/alias other Solr documents

2009-07-29 Thread Steven A Rowe
Hi Ravi,

This may help:

   http://wiki.apache.org/solr/HierarchicalFaceting

Steve

> -Original Message-
> From: ravi.gidwani [mailto:ravi.gidw...@gmail.com]
> Sent: Wednesday, July 29, 2009 3:24 AM
> To: solr-user@lucene.apache.org
> Subject: refering/alias other Solr documents
> 
> 
> Hi all:
> Is there a way in Solr that will allow documents to refer to each other? In
> other words, if a search for "abc" matches document 1, I should be
> able
> to return document 2 even though document 2 does not have any fields matching
> "abc".
> Here is the scenario with some more details:
> 
> Solr version:1.3
> 
> Scenario:
> 1) Solr Document 1 with say some field title="abc" and Solr Document 2
> with
> its own data.
> 2) User searches for "abc" and gets Document 1 as it matches on title
> field
> 
> Expected results:
> When the user searches for "abc" he should also get Document 2 along with
> Document 1.
> 
> I understand one way of doing this is to make sure Document 2 has all
> the
> contents of Document 1. But this introduces an issue of keeping the two
> documents (and hence their solr index) in sync with each other.
> 
> I think I am looking for a mechanism like this:
> 
> Document 1 refers => document 2, Document 3.
> 
> Hence whenever document 1 is part of the search results, document 2 and
> document
> 3 will also be returned as search results .
> 
> I may be totally off on this expectation but am trying to solve a
> "Contains"
> problem where lets say a book (represented as Document 1 in solr)
> "contains"
> Chapters (represented by Document 2,3,4..) in solr.
> 
> I hope this is not too confusing ;)
> 
> TIA
> ~Ravi Gidwani
> --
> View this message in context: http://www.nabble.com/refering-alias-
> other-Solr-documents-tp24713855p24713855.html
> Sent from the Solr - User mailing list archive at Nabble.com.



Re: Getting Tika to work in Solr 1.4 nightly

2009-07-29 Thread Yonik Seeley
Hi Kevin,
The parameter names have changed in the latest Solr 1.4 builds... please see
http://wiki.apache.org/solr/ExtractingRequestHandler

-Yonik
http://www.lucidimagination.com
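
(To the curl question below: curl isn't required, it's just a convenient
HTTP client. A SolrJ sketch using the new-style parameter names from the
wiki page above; the URL, file name and id are placeholders:)

    import java.io.File;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.client.solrj.request.AbstractUpdateRequest;
    import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;

    SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
    ContentStreamUpdateRequest up =
        new ContentStreamUpdateRequest("/update/extract");
    up.addFile(new File("exampledocs/sample.doc")); // the Word doc to index
    up.setParam("literal.id", "sample-doc-1");      // unique key for the doc
    up.setParam("uprefix", "attr_");  // prefix for fields not in the schema
    up.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
    server.request(up);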



On Wed, Jul 29, 2009 at 10:17 AM, Kevin
Miller wrote:
> I am working with Solr 1.4 nightly and am running it on a Windows
> machine.  Solr is running using the example folder that was installed
> from the zip file.  The only alteration that I have made to this default
> installation is to add a simple Word document into the exampledocs
> folder.
>
> I am trying to get Tika to work in Solr.  When I run the tika-0.3.jar
> directed to a Word document it outputs to the screen in XML format.  I
> am not able to get Solr to run tika and index the information in the
> sample Word document.
>
> I have looked at the following resources:
> Solr mailing list archive (although I could have missed something here);
> Documentation and Getting started on the Apache Tika website;
> I even found an article called Content Extraction with Tika at this
> website:
> http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles
> /Content-Extraction-Tika This article talks about using curl.  Is curl
> necessary to use or does Solr have something already configured to do
> the same as curl?
>
> I have modified the solrconfig.xml file to include the request handler
> for the ExtractingRequestHandler.  I used the modification that was
> commented out in the solrconfig.xml file.  Here it is for reference:
>
>  class="org.apache.solr.handler.extraction.ExtractingRequestHandler">
>    
>      last_modified
>      true
>    
>  
>
> Is there some modification to this code that I need to make?
>
> Can some one please direct me to a source that can help me get this to
> work.
>
>
> Kevin Miller
>


Multi select faceting

2009-07-29 Thread Mike
Hi,

We're using Lucid Imagination's LucidWorks Solr 1.3 and we have a requirement
to implement multi-select faceting, where the facet values show up as
checkboxes and, despite checked options, all of the options continue to persist
with counts. The best example I found is the search on Lucid Imagination's
site: http://www.lucidimagination.com/search/

It appears the Solr 1.4 release has support for doing this with filter tagging
(http://wiki.apache.org/solr/SimpleFacetParameters#head-f277d409b221b407d9c5430f552bf40ee6185c4c),
 but I was wondering if there is another way to accomplish this in 1.3?
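
With the 1.4 tagging syntax, that looks roughly like this in SolrJ (the
field and tag names are made up for illustration):

    import org.apache.solr.client.solrj.SolrQuery;

    // Tag the checked filter, then exclude it when faceting on the same
    // field, so all "color" options keep their counts after one is checked.
    SolrQuery q = new SolrQuery("*:*");
    q.addFilterQuery("{!tag=colorTag}color:red");
    q.setFacet(true);
    q.addFacetField("{!ex=colorTag}color");

In 1.3, the usual workaround is one extra request per checked facet field,
with that field's filter left out, merging the counts client-side.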

Mike

query and analyzers

2009-07-29 Thread Harsch, Timothy J. (ARC-SC)[PEROT SYSTEMS]
Hi,
What analyzer, tokenizer, filter factory would I need to use to get wildcard 
matching to match where:
Value:
XYZ123
Query:
XYZ1*

I have been messing with solr.WordDelimiterFilterFactory splitOnNumerics and
preserveOriginal in both the analyzer and the query.  I also noticed it is
different when I use quotes in the query (phrase search).  Unfortunately, I'm
missing something, as I can't get it to work.

Tim




Re: query and analyzers

2009-07-29 Thread AHMET ARSLAN

> What analyzer, tokenizer, filter factory would I need to
> use to get wildcard matching to match where:
> Value:
> XYZ123
> Query:
> XYZ1*

StandardAnalyzer, WhitespaceAnalyzer.
 
> I have been messing with solr.WordDelimiterFilterFactory
> splitOnNumerics and preserveOriginal in both the analyzer
> and the query.  I also noticed it is different when I
> use quotes in the query - phrase search. 
> Unfortunately, I'm missing something as I can't get it to
> work.

But I think your problem is not the analyzer. I guess in your analyzer there is
a lowercase filter, and wildcard queries are not analyzed.
Try querying xyz1*


   


Re: query in solr lucene

2009-07-29 Thread Avlesh Singh
You may index your data using a delimiter, like $my-field-content$. While
searching, perform a phrase query with the leading and trailing "$" appended
to the query string.

Cheers
Avlesh

On Wed, Jul 29, 2009 at 12:04 PM, Sushan Rungta  wrote:

> I tried using AND, but it even provided me doc 3 which was not required.
>
> Hence my problem still persists...
>
> regards,
> Sushan
>
>
> At 06:59 AM 7/29/2009, Avlesh Singh wrote:
>
>> >
>> > No, phrase query would match docs 2 and 3. Sushan only wants doc 2 as I
>> read
>> > it.
>> >
>> Sorry, my bad. I did not read properly before replying.
>>
>> Cheers
>> Avlesh
>>
>> On Wed, Jul 29, 2009 at 3:23 AM, Erick Erickson > >wrote:
>>
>> > No, phrase query would match docs 2 and 3. Sushan only wants doc 2 as I
>> read
>> > it.
>> >
>> > You might have some joy with KeywordAnalyzer, which does
>> > not break the incoming stream up into tokens. You have to be
>> > careful, though, because it also won't fold the case, so 'Hello'
>> > would not match 'hello'.
>> >
>> > Best
>> > Erick
>> >
>> > On Tue, Jul 28, 2009 at 11:11 AM, Avlesh Singh 
>> wrote:
>> >
>> > > You should perform a PhraseQuery on the required field.
>> > > Meaning, http://your-solr-host:port:
>> > > /your-core-path/select?q=fieldName:"Hello
>> > > how are you sushan" would work for you.
>> > >
>> > > Cheers
>> > > Avlesh
>> > >
>> > > 2009/7/28 Gérard Dupont 
>> > >
>> > > > Hi Sushan,
>> > > >
>> > > > I'm not an expert of Solr, just beginner, but it appears to me that
>> you
>> > > >  may
>> > > > have default 'OR' combinaison fo keywords so that will explain this
>> > > > behavior. Try to modify the configuration for an 'AND' combinaison.
>> > > >
>> > > > cheers
>> > > >
>> > > > On Tue, Jul 28, 2009 at 16:49, Sushan Rungta 
>> > wrote:
>> > > >
>> > > > > I am extremely sorry for responding late as I was ill from past
>> few
>> > > days.
>> > > > >
>> > > > > My problem is explained below with an example:
>> > > > >
>> > > > > I am having three documents with following list:
>> > > > >
>> > > > > 1. Hello how are you
>> > > > > 2. Hello how are you sushan
>> > > > > 3. Hello how are you sushan. I am fine.
>> > > > >
>> > > > > When I search for a query "Hello how are you sushan", I should
>> only
>> > get
>> > > > > document 2 in my result.
>> > > > >
>> > > > > I hope this will give you all a better insight in my problem.
>> > > > >
>> > > > > regards,
>> > > > >
>> > > > > Sushan Rungta
>> > > > >
>> > > >
>> > > >
>> > > >
>> > > > --
>> > > > Gérard Dupont
>> > > > Information Processing Control and Cognition (IPCC) - EADS DS
>> > > > http://weblab-project.org
>> > > >
>> > > > Document & Learning team - LITIS Laboratory
>> > > >
>> > >
>> >
>>
>
>
>


Re: search suggest

2009-07-29 Thread Jason Rutherglen
Autosuggest is something that would be very useful to build into
Solr as many search projects require it.

I'd recommend indexing relevant terms/phrases into a Ternary
Search Tree which is compact and performant. Using a wildcard
query will likely not be as fast as a Ternary Tree, and I'm not
sure how phrases would be handled?

http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/analysis/compound/hyphenation/TernaryTree.html

It would be good to separate out the TernaryTree from
analysis/compound and into Lucene core, or into its own contrib.

Also see http://issues.apache.org/jira/browse/LUCENE-625 which
improves relevancy using click through rates.

I'll open an issue in Solr to get this one going.
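
Until something ships, a minimal sketch of the lookup side, using a sorted
map as a stand-in for the ternary tree (the phrases and weights would come
from your query logs; class and method names are made up):

    import java.util.SortedMap;
    import java.util.TreeMap;

    public class Autosuggest {
        // phrase -> weight (e.g. query-log frequency or click-through count)
        private final TreeMap<String, Integer> phrases =
            new TreeMap<String, Integer>();

        public void add(String phrase, int weight) {
            phrases.put(phrase.toLowerCase(), weight);
        }

        // All indexed phrases starting with the given prefix.
        public SortedMap<String, Integer> suggest(String prefix) {
            String from = prefix.toLowerCase();
            String to = from + Character.MAX_VALUE; // just past the prefix
            return phrases.subMap(from, to);
        }
    }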

On Wed, Jul 29, 2009 at 9:12 AM, Robert Petersen wrote:
> To do a proper search suggest feature you have to index all the queries
> your system gets and search it with wildcards for matches on what the
> user has typed so far for each user keystroke in the search box...
> Usually with some timer logic to wait for a small hesitation in their
> typing.
>
>
>
> -Original Message-
> From: Jack Bates [mailto:ms...@freezone.co.uk]
> Sent: Tuesday, July 28, 2009 10:54 AM
> To: solr-user@lucene.apache.org
> Subject: search suggest
>
> how can i use solr to make search suggestions? i'm thinking google-style
> suggestions, which suggests more refined queries - vs. freebase-style
> suggestions, which suggests top hits.
>
> i've been looking at the query params,
> http://wiki.apache.org/solr/StandardRequestHandler
>
> - and searching for "solr suggest" - but haven't figured out how to get
> search suggestions from solr
>


Re: search suggest

2009-07-29 Thread Jason Rutherglen
http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/analysis/compound/hyphenation/TernaryTree.html

On Wed, Jul 29, 2009 at 12:08 PM, Jason
Rutherglen wrote:
> Autosuggest is something that would be very useful to build into
> Solr as many search projects require it.
>
> I'd recommend indexing relevant terms/phrases into a Ternary
> Search Tree which is compact and performant. Using a wildcard
> query will likely not be as fast as a Ternary Tree, and I'm not
> sure how phrases would be handled?
>
> http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/analysis/compound/hyphenation/TernaryTree.html
>
> It would be good to separate out the TernaryTree from
> analysis/compound and into Lucene core, or into its own contrib.
>
> Also see http://issues.apache.org/jira/browse/LUCENE-625 which
> improves relevancy using click through rates.
>
> I'll open an issue in Solr to get this one going.
>
> On Wed, Jul 29, 2009 at 9:12 AM, Robert Petersen wrote:
>> To do a proper search suggest feature you have to index all the queries
>> your system gets and search it with wildcards for matches on what the
>> user has typed so far for each user keystroke in the search box...
>> Usually with some timer logic to wait for a small hesitation in their
>> typing.
>>
>>
>>
>> -Original Message-
>> From: Jack Bates [mailto:ms...@freezone.co.uk]
>> Sent: Tuesday, July 28, 2009 10:54 AM
>> To: solr-user@lucene.apache.org
>> Subject: search suggest
>>
>> how can i use solr to make search suggestions? i'm thinking google-style
>> suggestions, which suggests more refined queries - vs. freebase-style
>> suggestions, which suggests top hits.
>>
>> i've been looking at the query params,
>> http://wiki.apache.org/solr/StandardRequestHandler
>>
>> - and searching for "solr suggest" - but haven't figured out how to get
>> search suggestions from solr
>>
>


Visualizing Semantic Journal Space (large scale) using full-text

2009-07-29 Thread Glen Newton
I thought the Lucene and Solr communities would find this interesting:
My collaborators and I have used LuSql, Lucene and Semantic Vectors to
visualize the semantic journal space (kind of like 'Maps of
Science') of a large-scale (5.7 million article) journal collection using only the
full-text (no metadata).

For more info & howto:
http://zzzoot.blogspot.com/2009/07/project-torngat-building-large-scale.html

Glen Newton

-- 

-


RE: query and analyzers

2009-07-29 Thread Harsch, Timothy J. (ARC-SC)[PEROT SYSTEMS]
This was the definition I was last working with (I've been playing with setting
the various parameters).

<fieldType name="text_ws" class="solr.TextField" positionIncrementGap="100">
   <analyzer type="index">
     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
     <filter class="solr.WordDelimiterFilterFactory" splitOnNumerics="1" preserveOriginal="1"/>
   </analyzer>
   <analyzer type="query">
     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
     <filter class="solr.WordDelimiterFilterFactory" splitOnNumerics="1" preserveOriginal="1"/>
   </analyzer>
</fieldType>

-Original Message-
From: AHMET ARSLAN [mailto:iori...@yahoo.com] 
Sent: Wednesday, July 29, 2009 11:55 AM
To: solr-user@lucene.apache.org
Subject: Re: query and analyzers


> What analyzer, tokenizer, filter factory would I need to
> use to get wildcard matching to match where:
> Value:
> XYZ123
> Query:
> XYZ1*

StandardAnalyzer, WhitespaceAnalyzer.
 
> I have been messing with solr.WordDelimiterFilterFactory
> splitOnNumerics and oreserveOriginal in both the analyzer
> and the query.  I also noticed it is different when I
> use quotes in the query - phrase search. 
> Unfortunately, I'm missing something as I can't get it to
> work.

But i think your problem is not the analyzer. I guess in your analyzer there is 
lowercase filter and wildcard queries are not analyzed.
Try querying xyz1* 


  


RE: query and analyzers

2009-07-29 Thread AHMET ARSLAN

In order to match (query) XYZ1* to (document) XYZ123 you do not need
WordDelimiterFilterFactory. You need a tokenizer that recognizes XYZ123 as one
token, and WhitespaceTokenizer is one of them.

As I can see from the fieldType named text_ws, you want to use
WhitespaceTokenizerFactory,
and there is no LowercaseFilter in it. So there is no problem.
Just remove the WordDelimiterFilterFactory (from both query and index) and it should
work.
 
Ahmet


  


RE: query and analyzers

2009-07-29 Thread Harsch, Timothy J. (ARC-SC)[PEROT SYSTEMS]
That did it, thanks!

I thought that was how it should work, but I guess somehow I got out of sync at
some point, which led me to dive deeper into it than I needed to.

-Original Message-
From: AHMET ARSLAN [mailto:iori...@yahoo.com] 
Sent: Wednesday, July 29, 2009 12:52 PM
To: solr-user@lucene.apache.org
Subject: RE: query and analyzers


In order to match (query) XYZ1* to (document) XYZ123 you do not need
WordDelimiterFilterFactory. You need a tokenizer that recognizes XYZ123 as one
token, and WhitespaceTokenizer is one of them.

As I can see from the fieldType named text_ws, you want to use
WhitespaceTokenizerFactory,
and there is no LowercaseFilter in it. So there is no problem.
Just remove the WordDelimiterFilterFactory (from both query and index) and it should
work.
 
Ahmet


  


Re: search suggest

2009-07-29 Thread manuel aldana
Also watch out that you have a good stopwords list; otherwise the 
suggestions won't be helpful for the user.


Jack Bates wrote:

how can i use solr to make search suggestions? i'm thinking google-style
suggestions, which suggests more refined queries - vs. freebase-style
suggestions, which suggests top hits.

i've been looking at the query params,
http://wiki.apache.org/solr/StandardRequestHandler

- and searching for "solr suggest" - but haven't figured out how to get
search suggestions from solr
  



--
manuel aldana
ald...@gmx.de
software-engineering blog: http://www.aldana-online.de



RE: search suggest

2009-07-29 Thread Robert Petersen
Simple-minded autosuggest can just leave the phrases untokenized, so
the wildcards simply complete whatever the user has typed so far
including spaces.  Upon encountering a space though, autosuggest should
wait to make more suggestions until the user has typed at least a couple
of letters of the next word.  That is the way I did it last time using a
different search engine.  It'd sure be kewl if this became a core
feature of solr!
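
For example (untested, and the field name is made up): with past queries
kept in an untokenized string field, each keystroke could fire a prefix
query like

http://localhost:8983/solr/select?q=suggest_text:harry%5C%20pot*&rows=10

where %5C%20 is a backslash-escaped space, so the multi-word prefix is
parsed as a single term before the trailing wildcard is applied.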

I like the idea of the tree approach; it sounds much faster.  Is the root
the fewest letters needed to start suggestions, and are the leaves the full
phrases?

-Original Message-
From: Jason Rutherglen [mailto:jason.rutherg...@gmail.com] 
Sent: Wednesday, July 29, 2009 12:09 PM
To: solr-user@lucene.apache.org
Subject: Re: search suggest

Autosuggest is something that would be very useful to build into
Solr as many search projects require it.

I'd recommend indexing relevant terms/phrases into a Ternary
Search Tree which is compact and performant. Using a wildcard
query will likely not be as fast as a Ternary Tree, and I'm not
sure how phrases would be handled?

http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/analysi

It would be good to separate out the TernaryTree from
analysis/compound and into Lucene core, or into its own contrib.
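
To make the idea concrete, here is a rough, untested sketch of a ternary
search tree with prefix collection (illustrative only -- this is not
Lucene's TernaryTree, and ranking the collected phrases would be a
separate step):

import java.util.ArrayList;
import java.util.List;

public class TernarySuggestTree {
    private static class Node {
        char c;
        Node left, mid, right;
        boolean wordEnd; // true if a suggestion phrase ends at this node
    }

    private Node root;

    public void insert(String s) {
        if (s == null || s.length() == 0) return;
        root = insert(root, s, 0);
    }

    private Node insert(Node n, String s, int i) {
        char c = s.charAt(i);
        if (n == null) { n = new Node(); n.c = c; }
        if (c < n.c) n.left = insert(n.left, s, i);
        else if (c > n.c) n.right = insert(n.right, s, i);
        else if (i < s.length() - 1) n.mid = insert(n.mid, s, i + 1);
        else n.wordEnd = true;
        return n;
    }

    /** Collect up to max phrases that start with the given prefix. */
    public List<String> suggest(String prefix, int max) {
        List<String> out = new ArrayList<String>();
        if (prefix == null || prefix.length() == 0) return out;
        Node n = find(root, prefix, 0);
        if (n == null) return out;
        if (n.wordEnd) out.add(prefix); // the prefix itself is a phrase
        collect(n.mid, new StringBuilder(prefix), out, max);
        return out;
    }

    private Node find(Node n, String s, int i) {
        if (n == null) return null;
        char c = s.charAt(i);
        if (c < n.c) return find(n.left, s, i);
        if (c > n.c) return find(n.right, s, i);
        if (i == s.length() - 1) return n;
        return find(n.mid, s, i + 1);
    }

    private void collect(Node n, StringBuilder prefix, List<String> out, int max) {
        if (n == null || out.size() >= max) return;
        collect(n.left, prefix, out, max);
        prefix.append(n.c);
        if (n.wordEnd && out.size() < max) out.add(prefix.toString());
        collect(n.mid, prefix, out, max);
        prefix.deleteCharAt(prefix.length() - 1);
        collect(n.right, prefix, out, max);
    }
}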

Also see http://issues.apache.org/jira/browse/LUCENE-625 which
improves relevancy using click through rates.

I'll open an issue in Solr to get this one going.

On Wed, Jul 29, 2009 at 9:12 AM, Robert Petersen
wrote:
> To do a proper search suggest feature you have to index all the
queries
> your system gets and search it with wildcards for matches on what the
> user has typed so far for each user keystroke in the search box...
> Usually with some timer logic to wait for a small hesitation in their
> typing.
>
>
>
> -Original Message-
> From: Jack Bates [mailto:ms...@freezone.co.uk]
> Sent: Tuesday, July 28, 2009 10:54 AM
> To: solr-user@lucene.apache.org
> Subject: search suggest
>
> how can i use solr to make search suggestions? i'm thinking
google-style
> suggestions, which suggests more refined queries - vs. freebase-style
> suggestions, which suggests top hits.
>
> i've been looking at the query params,
> http://wiki.apache.org/solr/StandardRequestHandler
>
> - and searching for "solr suggest" - but haven't figured out how to
get
> search suggestions from solr
>


Re: Indexing TIKA extracted text. Are there some issues?

2009-07-29 Thread ashokc

Sure.

The java command I use with TIKA to extract text from a URL is:

java -jar tika-0.3-standalone.jar -t $url

I have also attached the screenshots of the web page, post documents
produced in the two different ways (Perl & Tika) for that web page, and the
screenshots of the search result for a string contained in that web page.
The index in each case contains just this one URL. To keep everything else
identical, I used the same instance for creating the index in each case.
First I posted the Tika document, checked for the results, emptied the
index, posted the Perl document, and checked the results.

Debug query for Tika:


+DisjunctionMaxQuery((urltext:高通公司展现了海量的优质多媒体内容能^2.0
| title:高通公司展现了海量的优质多媒体内容能^2.0 |
content_china:"高通 通公 公司 司展 展现 现了 了海 海量
量的 的优 优质 质多 多媒 媒体 体内 内容 容能")~0.01) ()


Debug query for Perl:


+DisjunctionMaxQuery((urltext:高通公司展现了海量的优质多媒体内容能^2.0
| title:高通公司展现了海量的优质多媒体内容能^2.0 |
content_china:"高通 通公 公司 司展 展现 现了 了海 海量
量的 的优 优质 质多 多媒 媒体 体内 内容 容能")~0.01) ()


The screenshots
http://www.nabble.com/file/p24728917/Tika%2BIssue.docx Tika+Issue.docx 

Perl extracted doc
http://www.nabble.com/file/p24728917/china.perl.xml china.perl.xml 

Tika extracted doc
http://www.nabble.com/file/p24728917/china.tika.xml china.tika.xml 


Grant Ingersoll-6 wrote:
> 
> Hmm, looks very much like an encoding problem.  Can you post a sample  
> showing it, along with the commands you invoked?
> 
> Thanks,
> Grant
> 
> On Jul 28, 2009, at 6:14 PM, ashokc wrote:
> 
>>
>> I am finding that the search results based on indexing Tika  
>> extracted text
>> are very different from results based on indexing the text extracted  
>> via
>> other means. This shows up for example with a chinese web site that  
>> I am
>> trying to index.
>>
>> I created the documents (for posting to SOLR) in two ways. The  
>> source text
>> of the web pages are full of html entities like 〹 and some  
>> english
>> characters mixed in.
>>
>> (a) Simple text extraction from the page source by a Perl script. The
>> resulting content field looks like
>>
>> Who We Are  
>> 公司历史
>> 您的成功案例
>> 领导团队 业务部门  
>> Innovation
>> 创 etc... 
>>
>> I posted these documents to a SOLR instance
>>
>> (b) Used Tika (command line). The resulting content field looks like
>>
>> Who We Are å ¬å¸à 
>> ¥ÂŽÂ†Ã¥ÂÂ²
>> 您的成功æ¡
>> ˆä¾‹ 领导团队  
>> 业务部门  Innovation à 
>> ¥Â
>> etc... 
>>
>> I posted these documents to a different instance
>>
>> When I search the first instance for a string (that I copied &  
>> pasted from
>> the web site) I find a number of hits, including the page from which I
>> copied the string from. But when I do the same on the instance with  
>> Tika
>> extracted text - I get nothing.
>>
>> Has anyone seen this? I believe it may have to do with encoding. In  
>> both
>> cases the posted documents were utf-8 compiant.
>>
>> Thanks for your insights.
>>
>> - ashok
>>
>> -- 
>> View this message in context:
>> http://www.nabble.com/Indexing-TIKA-extracted-text.-Are-there-some-issues--tp24708854p24708854.html
>> Sent from the Solr - User mailing list archive at Nabble.com.
>>
> 
> --
> Grant Ingersoll
> http://www.lucidimagination.com/
> 
> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
> using Solr/Lucene:
> http://www.lucidimagination.com/search
> 
> 
> 

-- 
View this message in context: 
http://www.nabble.com/Indexing-TIKA-extracted-text.-Are-there-some-issues--tp24708854p24728917.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: search suggest

2009-07-29 Thread Jason Rutherglen
Here's a good article on Ternary Trees: http://www.ddj.com/windows/184410528

I looked at the one in Lucene, I don't understand why the find method
only returns a char/int?

On Wed, Jul 29, 2009 at 2:33 PM, Robert Petersen wrote:
> Simple minded autosuggest can just not tokenize the phrases at all and
> so the wildcards just complete whatever the user has typed so far
> including spaces.  Upon encountering a space though, autosuggest should
> wait to make more suggestions until the user has typed at least a couple
> of letters of the next word.  That is the way I did it last time using a
> different search engine.  It'd sure be kewl if this became a core
> feature of solr!
>
> I like the idea of the tree approach, sounds much faster.  The root is
> the least letters to start suggestions and the leaves are the full
> phrases?
>
> -Original Message-
> From: Jason Rutherglen [mailto:jason.rutherg...@gmail.com]
> Sent: Wednesday, July 29, 2009 12:09 PM
> To: solr-user@lucene.apache.org
> Subject: Re: search suggest
>
> Autosuggest is something that would be very useful to build into
> Solr as many search projects require it.
>
> I'd recommend indexing relevant terms/phrases into a Ternary
> Search Tree which is compact and performant. Using a wildcard
> query will likely not be as fast as a Ternary Tree, and I'm not
> sure how phrases would be handled?
>
> http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/analysi
>
> It would be good to separate out the TernaryTree from
> analysis/compound and into Lucene core, or into it's own contrib.
>
> Also see http://issues.apache.org/jira/browse/LUCENE-625 which
> improves relevancy using click through rates.
>
> I'll open an issue in Solr to get this one going.
>
> On Wed, Jul 29, 2009 at 9:12 AM, Robert Petersen
> wrote:
>> To do a proper search suggest feature you have to index all the
> queries
>> your system gets and search it with wildcards for matches on what the
>> user has typed so far for each user keystroke in the search box...
>> Usually with some timer logic to wait for a small hesitation in their
>> typing.
>>
>>
>>
>> -Original Message-
>> From: Jack Bates [mailto:ms...@freezone.co.uk]
>> Sent: Tuesday, July 28, 2009 10:54 AM
>> To: solr-user@lucene.apache.org
>> Subject: search suggest
>>
>> how can i use solr to make search suggestions? i'm thinking
> google-style
>> suggestions, which suggests more refined queries - vs. freebase-style
>> suggestions, which suggests top hits.
>>
>> i've been looking at the query params,
>> http://wiki.apache.org/solr/StandardRequestHandler
>>
>> - and searching for "solr suggest" - but haven't figured out how to
> get
>> search suggestions from solr
>>
>


Re: Indexing TIKA extracted text. Are there some issues?

2009-07-29 Thread Robert Muir
It appears there is an encoding problem: in the screenshot I can see
the title is mangled, and if I open up the URL in IE or Firefox, both
browsers think it is iso-8859-1.

I think this is why (from w3c validator):

Character Encoding mismatch!

The character encoding specified in the HTTP header (iso-8859-1) is
different from the value in the <meta> element (utf-8). I will use the
value from the HTTP header (iso-8859-1) for this validation.
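
If the server's header can't be fixed, one workaround is to fetch the raw
bytes yourself and decode them explicitly before extraction, instead of
letting the HTTP header decide. Untested sketch (the URL is a placeholder):

import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.net.URL;

public class FetchAsUtf8 {
    public static void main(String[] args) throws Exception {
        InputStream in = new URL("http://example.com/page.html").openStream();
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        for (int n; (n = in.read(chunk)) != -1;) {
            buf.write(chunk, 0, n); // accumulate raw bytes, no decoding yet
        }
        in.close();
        // decode explicitly as UTF-8, trusting the <meta> tag over the header
        String html = new String(buf.toByteArray(), "UTF-8");
        // ... hand 'html' (or the bytes plus an explicit charset) to Tika ...
    }
}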

On Wed, Jul 29, 2009 at 6:02 PM, ashokc wrote:
>
> Sure.
>
> The java command I use with TIKA to extract text from a URL is:
>
> java -jar tika-0.3-standalone.jar -t $url
>
> I have also attached the screenshots of the web page, post documents
> produced in the two different ways (Perl & Tika) for that web page, and the
> screenshots of the search result for a string contained in that web page.
> The index in each case contains just this one URL. To keep everything else
> identical, I used the same instance for creating the index in each case.
> First I posted the Tika document, checked for the results, emptied the
> index, posted the Perl document, and checked the results.
>
> Debug query for Tika:
>
> 
> +DisjunctionMaxQuery((urltext:é«˜é€šå…¬å ¸å±•çŽ°äº†æµ·é‡ 
> 的优质多媒体内容能^2.0
> | title:é«˜é€šå…¬å ¸å±•çŽ°äº†æµ·é‡ çš„ä¼˜è´¨å¤šåª’ä½“å†…å®¹èƒ½^2.0 |
> content_china:"高通 通公 å…¬å ¸ å ¸å±• 展现 现了 了海 æµ·é‡
> é‡ çš„ 的优 优质 质多 多媒 媒体 体内 内容 容能")~0.01) ()
> 
>
> Debug query for Perl:
>
> 
> +DisjunctionMaxQuery((urltext:é«˜é€šå…¬å ¸å±•çŽ°äº†æµ·é‡ 
> 的优质多媒体内容能^2.0
> | title:é«˜é€šå…¬å ¸å±•çŽ°äº†æµ·é‡ çš„ä¼˜è´¨å¤šåª’ä½“å†…å®¹èƒ½^2.0 |
> content_china:"高通 通公 å…¬å ¸ å ¸å±• 展现 现了 了海 æµ·é‡
> é‡ çš„ 的优 优质 质多 多媒 媒体 体内 内容 容能")~0.01) ()
> 
>
> The screenshots
> http://www.nabble.com/file/p24728917/Tika%2BIssue.docx Tika+Issue.docx
>
> Perl extracted doc
> http://www.nabble.com/file/p24728917/china.perl.xml china.perl.xml
>
> Tika extracted doc
> http://www.nabble.com/file/p24728917/china.tika.xml china.tika.xml
>
>
> Grant Ingersoll-6 wrote:
>>
>> Hmm, looks very much like an encoding problem.  Can you post a sample
>> showing it, along with the commands you invoked?
>>
>> Thanks,
>> Grant
>>
>> On Jul 28, 2009, at 6:14 PM, ashokc wrote:
>>
>>>
>>> I am finding that the search results based on indexing Tika
>>> extracted text
>>> are very different from results based on indexing the text extracted
>>> via
>>> other means. This shows up for example with a chinese web site that
>>> I am
>>> trying to index.
>>>
>>> I created the documents (for posting to SOLR) in two ways. The
>>> source text
>>> of the web pages are full of html entities like 〹 and some
>>> english
>>> characters mixed in.
>>>
>>> (a) Simple text extraction from the page source by a Perl script. The
>>> resulting content field looks like
>>>
>>> Who We Are
>>> 公司历史
>>> 您的成功案例
>>> 领导团队 业务部门
>>> Innovation
>>> 创 etc...     
>>>
>>> I posted these documents to a SOLR instance
>>>
>>> (b) Used Tika (command line). The resulting content field looks like
>>>
>>> Who We Are Ã¥ ¬å ¸Ã
>>> ¥ÂŽÂ†Ã¥Â ²
>>> 您的戠功æ¡
>>> ˆä¾‹ 领导团队
>>> 业务部门 Â Innovation Ã
>>> ¥Â
>>> etc... 
>>>
>>> I posted these documents to a different instance
>>>
>>> When I search the first instance for a string (that I copied &
>>> pasted from
>>> the web site) I find a number of hits, including the page from which I
>>> copied the string from. But when I do the same on the instance with
>>> Tika
>>> extracted text - I get nothing.
>>>
>>> Has anyone seen this? I believe it may have to do with encoding. In
>>> both
>>> cases the posted documents were utf-8 compiant.
>>>
>>> Thanks for your insights.
>>>
>>> - ashok
>>>
>>> --
>>> View this message in context:
>>> http://www.nabble.com/Indexing-TIKA-extracted-text.-Are-there-some-issues--tp24708854p24708854.html
>>> Sent from the Solr - User mailing list archive at Nabble.com.
>>>
>>
>> --
>> Grant Ingersoll
>> http://www.lucidimagination.com/
>>
>> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)
>> using Solr/Lucene:
>> http://www.lucidimagination.com/search
>>
>>
>>
>
> --
> View this message in context: 
> http://www.nabble.com/Indexing-TIKA-extracted-text.-Are-there-some-issues--tp24708854p24728917.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>
>



-- 
Robert Muir
rcm...@gmail.com


Re: Indexing TIKA extracted text. Are there some issues?

2009-07-29 Thread ashokc

Could very well be... I will rectify it and try again. Thanks

- ashok



Robert Muir wrote:
> 
> it appears there is an encoding problem, in the screenshot I can see
> the title is mangled, and if i open up the URL in IE or firefox, both
> browsers think it is iso-8859-1.
> 
> I think this is why (from w3c validator):
> 
> Character Encoding mismatch!
> 
> The character encoding specified in the HTTP header (iso-8859-1) is
> different from the value in the <meta> element (utf-8). I will use the
> value from the HTTP header (iso-8859-1) for this validation.
> 
> On Wed, Jul 29, 2009 at 6:02 PM, ashokc wrote:
>>
>> Sure.
>>
>> The java command I use with TIKA to extract text from a URL is:
>>
>> java -jar tika-0.3-standalone.jar -t $url
>>
>> I have also attached the screenshots of the web page, post documents
>> produced in the two different ways (Perl & Tika) for that web page, and
>> the
>> screenshots of the search result for a string contained in that web page.
>> The index in each case contains just this one URL. To keep everything
>> else
>> identical, I used the same instance for creating the index in each case.
>> First I posted the Tika document, checked for the results, emptied the
>> index, posted the Perl document, and checked the results.
>>
>> Debug query for Tika:
>>
>> 
>> +DisjunctionMaxQuery((urltext:é«˜é€šå…¬å ¸å±•çŽ°äº†æµ·é‡
>> 的优质多媒体内容能^2.0
>> | title:é«˜é€šå…¬å ¸å±•çŽ°äº†æµ·é‡ çš„ä¼˜è´¨å¤šåª’ä½“å†…å®¹èƒ½^2.0 |
>> content_china:"高通 通公 å…¬å ¸ å ¸å±• 展现 现了 了海 æµ·é‡
>> é‡ çš„ 的优 优质 质多 多媒 媒体 体内 内容 容能")~0.01) ()
>> 
>>
>> Debug query for Perl:
>>
>> 
>> +DisjunctionMaxQuery((urltext:é«˜é€šå…¬å ¸å±•çŽ°äº†æµ·é‡
>> 的优质多媒体内容能^2.0
>> | title:é«˜é€šå…¬å ¸å±•çŽ°äº†æµ·é‡ çš„ä¼˜è´¨å¤šåª’ä½“å†…å®¹èƒ½^2.0 |
>> content_china:"高通 通公 å…¬å ¸ å ¸å±• 展现 现了 了海 æµ·é‡
>> é‡ çš„ 的优 优质 质多 多媒 媒体 体内 内容 容能")~0.01) ()
>> 
>>
>> The screenshots
>> http://www.nabble.com/file/p24728917/Tika%2BIssue.docx Tika+Issue.docx
>>
>> Perl extracted doc
>> http://www.nabble.com/file/p24728917/china.perl.xml china.perl.xml
>>
>> Tika extracted doc
>> http://www.nabble.com/file/p24728917/china.tika.xml china.tika.xml
>>
>>
>> Grant Ingersoll-6 wrote:
>>>
>>> Hmm, looks very much like an encoding problem.  Can you post a sample
>>> showing it, along with the commands you invoked?
>>>
>>> Thanks,
>>> Grant
>>>
>>> On Jul 28, 2009, at 6:14 PM, ashokc wrote:
>>>

 I am finding that the search results based on indexing Tika
 extracted text
 are very different from results based on indexing the text extracted
 via
 other means. This shows up for example with a chinese web site that
 I am
 trying to index.

 I created the documents (for posting to SOLR) in two ways. The
 source text
 of the web pages are full of html entities like 〹 and some
 english
 characters mixed in.

 (a) Simple text extraction from the page source by a Perl script. The
 resulting content field looks like

 Who We Are
 公司历史
 您的成功案例
 领导团队 业务部门
 Innovation
 创 etc...     

 I posted these documents to a SOLR instance

 (b) Used Tika (command line). The resulting content field looks like

 Who We Are Ã¥ ¬å ¸Ã
 ¥ÂŽÂ†Ã¥Â ²
 您的戠功æ¡
 ˆä¾‹ 领导团队
 业务部门 Â Innovation Ã
 ¥Â
 etc... 

 I posted these documents to a different instance

 When I search the first instance for a string (that I copied &
 pasted from
 the web site) I find a number of hits, including the page from which I
 copied the string from. But when I do the same on the instance with
 Tika
 extracted text - I get nothing.

 Has anyone seen this? I believe it may have to do with encoding. In
 both
 cases the posted documents were utf-8 compiant.

 Thanks for your insights.

 - ashok

 --
 View this message in context:
 http://www.nabble.com/Indexing-TIKA-extracted-text.-Are-there-some-issues--tp24708854p24708854.html
 Sent from the Solr - User mailing list archive at Nabble.com.

>>>
>>> --
>>> Grant Ingersoll
>>> http://www.lucidimagination.com/
>>>
>>> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)
>>> using Solr/Lucene:
>>> http://www.lucidimagination.com/search
>>>
>>>
>>>
>>
>> --
>> View this message in context:
>> http://www.nabble.com/Indexing-TIKA-extracted-text.-Are-there-some-issues--tp24708854p24728917.html
>> Sent from the Solr - User mailing list archive at Nabble.com.
>>
>>
> 
> 
> 
> -- 
> Robert Muir
> rcm...@gmail.com
> 
> 

-- 
View this message in context: 
http://www.nabble.com/Indexing-TIKA-extracted-text.-Are-there-some-issues--tp24708854p24729595.html
Sent from the Solr - User mailing list archive at Nabble.com.



deleteById always returning OK

2009-07-29 Thread Reuben Firmin
Is it expected behaviour that "deleteById" will always return OK as a
status, regardless of whether the id was matched?

I have a unit test:

  // set up the test data
engine.index(12345, s1, d1);
engine.index(54321, s2, d2);
engine.index(23453, s3, d3);

// ...

@Test
public void testRemove() throws Exception {
assertEquals(engine.size(), 3);
assertTrue(engine.remove(12345));
assertEquals(engine.size(), 2);
// XXX, it returns true
assertFalse(engine.remove(23523352));

"Engine" is my wrapper around Solr. The remove method looks like this:

private static final int RESPONSE_STATUS_OK = 0;
private SolrServer server;

public boolean remove(final Integer titleInstanceId)
throws IOException
{
try {
server.deleteById(String.valueOf(titleInstanceId));
final UpdateResponse updateResponse = server.commit(true, true);
// XXX It's always OK
return (updateResponse.getStatus() == RESPONSE_STATUS_OK);

Any ideas what's going wrong? Is there a different way to test for the id
not having been there, other than an additional search?

Thanks
Reuben


Re: search suggest

2009-07-29 Thread Jason Rutherglen
I created an issue and have added some notes
https://issues.apache.org/jira/browse/SOLR-1316

On Wed, Jul 29, 2009 at 3:15 PM, Jason
Rutherglen wrote:
> Here's a good article on Ternary Trees: http://www.ddj.com/windows/184410528
>
> I looked at the one in Lucene, I don't understand why the find method
> only returns a char/int?
>
> On Wed, Jul 29, 2009 at 2:33 PM, Robert Petersen wrote:
>> Simple minded autosuggest can just not tokenize the phrases at all and
>> so the wildcards just complete whatever the user has typed so far
>> including spaces.  Upon encountering a space though, autosuggest should
>> wait to make more suggestions until the user has typed at least a couple
>> of letters of the next word.  That is the way I did it last time using a
>> different search engine.  It'd sure be kewl if this became a core
>> feature of solr!
>>
>> I like the idea of the tree approach, sounds much faster.  The root is
>> the least letters to start suggestions and the leaves are the full
>> phrases?
>>
>> -Original Message-
>> From: Jason Rutherglen [mailto:jason.rutherg...@gmail.com]
>> Sent: Wednesday, July 29, 2009 12:09 PM
>> To: solr-user@lucene.apache.org
>> Subject: Re: search suggest
>>
>> Autosuggest is something that would be very useful to build into
>> Solr as many search projects require it.
>>
>> I'd recommend indexing relevant terms/phrases into a Ternary
>> Search Tree which is compact and performant. Using a wildcard
>> query will likely not be as fast as a Ternary Tree, and I'm not
>> sure how phrases would be handled?
>>
>> http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/analysi
>>
>> It would be good to separate out the TernaryTree from
>> analysis/compound and into Lucene core, or into it's own contrib.
>>
>> Also see http://issues.apache.org/jira/browse/LUCENE-625 which
>> improves relevancy using click through rates.
>>
>> I'll open an issue in Solr to get this one going.
>>
>> On Wed, Jul 29, 2009 at 9:12 AM, Robert Petersen
>> wrote:
>>> To do a proper search suggest feature you have to index all the
>> queries
>>> your system gets and search it with wildcards for matches on what the
>>> user has typed so far for each user keystroke in the search box...
>>> Usually with some timer logic to wait for a small hesitation in their
>>> typing.
>>>
>>>
>>>
>>> -Original Message-
>>> From: Jack Bates [mailto:ms...@freezone.co.uk]
>>> Sent: Tuesday, July 28, 2009 10:54 AM
>>> To: solr-user@lucene.apache.org
>>> Subject: search suggest
>>>
>>> how can i use solr to make search suggestions? i'm thinking
>> google-style
>>> suggestions, which suggests more refined queries - vs. freebase-style
>>> suggestions, which suggests top hits.
>>>
>>> i've been looking at the query params,
>>> http://wiki.apache.org/solr/StandardRequestHandler
>>>
>>> - and searching for "solr suggest" - but haven't figured out how to
>> get
>>> search suggestions from solr
>>>
>>
>


Re: THIS WEEK: PNW Hadoop, HBase / Apache Cloud Stack Users' Meeting, Wed Jul 29th, Seattle

2009-07-29 Thread Bradford Stephens
Don't forget this is tonight! Excited to see everyone there.

On Tue, Jul 28, 2009 at 11:25 AM, Bradford
Stephens wrote:
> Hey everyone,
>
> SLIGHT change of plans.
>
> A few people have asked me to move to a place with Air Conditioning,
> since the temperature's in the 90's this week. So, here we go:
>
> Big Time Brewing Company
> 4133 University Way NE
> Seattle, WA 98105
>
> Call me at 904-415-3009 if you have any questions.
>
>
> On Mon, Jul 27, 2009 at 12:16 PM, Bradford
> Stephens wrote:
>> Hello again!
>>
>> Yes, I know some of us are still recovering from OSCON. It's time for
>> another delicious meetup to chat about Hadoop, HBase, Solr, Lucene,
>> and more!
>>
>> UW is quite a pain for us to access until August, so we're changing
>> the venue to one pretty close:
>>
>> Piccolo's Pizza
>> 5301 Roosevelt Way NE
>> (between 53rd St & 55th St)
>>
>> 6:45pm - 8:30 (or when we get bored)!
>>
>> As usual, people are more than welcome to give talks, whether they're
>> long-format or lightning. I'd also really like to start thinking about
>> hackathons, perhaps we could have one next month?
>>
>> I'll be talking about HBase .20 and the possibility of low-latency
>> HBase Analytics. I'd be very excited to hear what people are up to!
>>
>> Contact me if there's any questions: 904-415-3009
>>
>> Cheers,
>> Bradford
>>
>> --
>> http://www.roadtofailure.com -- The Fringes of Scalability, Social
>> Media, and Computer Science
>>
>
>
>
> --
> http://www.roadtofailure.com -- The Fringes of Scalability, Social
> Media, and Computer Science
>



-- 
http://www.roadtofailure.com -- The Fringes of Scalability, Social
Media, and Computer Science


Re: Wildcard and boosting

2009-07-29 Thread Jón Helgi Jónsson
I just updated to a nightly build (I was using 1.2) and this does not
seem to be an issue anymore.

2009/7/29 Jón Helgi Jónsson :
> Hey now!
>
> I do index time boosting for my fields and just discovered that when
> searching with a trailing wild card the boosting is ignored.
>
> Will my boosting work with a wild card if I do it at query time? And
> if so is there a lot of performance difference?
>
> Some other method I can use to preserve my boosting? I do not need
> hightlighting.
>
> Thanks,
> Jon Helgi
>


Re: deleteById always returning OK

2009-07-29 Thread Koji Sekiguchi

Reuben Firmin wrote:

Is it expected behaviour that "deleteById" will always return OK as a
status, regardless of whether the id was matched?

  

It is expected behaviour, as Solr always returns 0 unless an error occurs
while processing a request (query, update, ...). So you don't need to check
the status: you'll get an exception if something goes wrong; otherwise
the request succeeded.

And you cannot know from the response whether the id was matched. The only way
you can try is to send a query "q=id:value&rows=0" and check the numFound
in the response before sending deleteById.
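
Something like this untested sketch, reusing the names from your wrapper
(needs org.apache.solr.client.solrj.SolrQuery and SolrServerException):

public boolean remove(final Integer titleInstanceId) throws IOException {
    try {
        SolrQuery query = new SolrQuery("id:" + titleInstanceId);
        query.setRows(0); // we only need numFound, not the documents
        long numFound = server.query(query).getResults().getNumFound();
        if (numFound == 0) {
            return false; // the id was never in the index
        }
        server.deleteById(String.valueOf(titleInstanceId));
        server.commit(true, true);
        return true;
    } catch (SolrServerException e) {
        throw new IOException(e.getMessage()); // adapt to your error handling
    }
}

Note the check-then-delete is not atomic, so a concurrent writer could still
slip a change in between the query and the delete.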

Koji


I have a unit test:

  // set up the test data
engine.index(12345, s1, d1);
engine.index(54321, s2, d2);
engine.index(23453, s3, d3);

// ...

@Test
public void testRemove() throws Exception {
assertEquals(engine.size(), 3);
assertTrue(engine.remove(12345));
assertEquals(engine.size(), 2);
// XXX, it returns true
assertFalse(engine.remove(23523352));

"Engine" is my wrapper around Solr. The remove method looks like this:

private static final int RESPONSE_STATUS_OK = 0;
private SolrServer server;

public boolean remove(final Integer titleInstanceId)
throws IOException
{
try {
server.deleteById(String.valueOf(titleInstanceId));
final UpdateResponse updateResponse = server.commit(true, true);
// XXX It's always OK
return (updateResponse.getStatus() == RESPONSE_STATUS_OK);

Any ideas what's going wrong? Is there a different way to test for the id
not having been there, other than an additional search?

Thanks
Reuben

  




RE: Boosting ('bq') on multi-valued fields

2009-07-29 Thread KaktuChakarabati

Hey Ken,
Thanks for your reply.
When I wrote '5|6' I meant that this is a multiValued field with two values
'5' and '6', rather than the literal string '5|6' (and any Tokenizer). Does
your reply still hold? That is, are multiValued fields dependent on the
notion of tokenization to such a degree that I can't use the str type with
them meaningfully? If so, it seems weird that I should be able to
define a str multiValued field to begin with.

-Chak


Ensdorf Ken wrote:
> 
>> Hey,
>> I have a field defined as such:
>>
>> <field name="site_id" type="string" indexed="true" stored="false"
>> multiValued="true" />
>>
>> with the string type defined as:
>>
>> <fieldType name="string" class="solr.StrField" omitNorms="true"/>
>>
>> When I try using some query-time boost parameters using the bq on
>> values of
>> this field it seems to behave
>> strangely in case of documents actually having multiple values:
>> If i'd do a boost for a particular value ( "site_id:5^1.1" ) it seems
>> like
>> all the cases where this field is actually
>> populated with multiple ones ( i.e a document with field value "5|6" )
>> do
>> not get boosted at all. I verified this using
>> debugQuery & explainOther=doc_id:.
>> is this a known issue/bug? any work arounds? (i'm using a nightly solr
>> build
>> from a few months back.. )
> 
> There is no tokenization on 'string' fields, so a query for "5" does not
> match a doc with a value of "5|6" for this field.  You could try  using
> field type 'text' for this and see what you get.  You may need to
> customize it to you the StandardAnalyzer or WordDelimiterFilterFactory to
> get the right behavior.  Using the analysis tool in the solr admin UI to
> experiment will probably be helpful.
> 
> -Ken
> 
> 
> 
> 

-- 
View this message in context: 
http://www.nabble.com/Boosting-%28%27bq%27%29-on-multi-valued-fields-tp24713905p24730981.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Is there a multi-shard optimize message?

2009-07-29 Thread Chris Hostetter

: > Normally to optimize an index you POST  to /solr/update.  Is
: > there any way to POST an optimize message to one instance and have it
: > propagate to all shards sort of like the select?
: >
: > /solr-shard-1/select?q=dog... shards=shard-1,shard2

: No, you'll need to send optimize to each host separately.

and for the record: it would be relatively straightforward to implement 
something like this (just like distributed search) ... but it has very 
little value.  clients doing "indexing" operations have to send add/delete 
commands directly to the individual shards, so they have to send the 
commit/optimize commands directly to them as well.

if/when someone writes a distributed indexing handler, making it support 
distributed optimize/commit will be fairly trivial.
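
in the meantime a client-side loop does the job ... untested SolrJ sketch, 
with made-up shard URLs:

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

public class OptimizeAllShards {
    public static void main(String[] args) throws Exception {
        String[] shards = {
            "http://host1:8983/solr-shard-1",
            "http://host2:8983/solr-shard-2"
        };
        for (String url : shards) {
            SolrServer shard = new CommonsHttpSolrServer(url);
            shard.optimize(); // blocks until this shard finishes optimizing
        }
    }
}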





-Hoss


Re: update some index documents after indexing process is done with DIH

2009-07-29 Thread Noble Paul നോബിള്‍ नोब्ळ्
If you make your EventListener implement SolrCoreAware, you can get
hold of the core in inform(). Use that to get hold of the
SolrIndexWriter.
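
A bare skeleton (untested; check the imports against your Solr version):

import org.apache.solr.common.util.NamedList;
import org.apache.solr.core.SolrCore;
import org.apache.solr.core.SolrEventListener;
import org.apache.solr.search.SolrIndexSearcher;
import org.apache.solr.util.plugin.SolrCoreAware;

public class MyCustomListener implements SolrEventListener, SolrCoreAware {

    private SolrCore core;

    public void init(NamedList args) {}

    public void inform(SolrCore core) {
        this.core = core; // called once at startup, before any events fire
    }

    public void postCommit() {}

    public void newSearcher(SolrIndexSearcher newSearcher,
                            SolrIndexSearcher currentSearcher) {
        // read via newSearcher.getReader(), then push the modified documents
        // back through core.getUpdateHandler() and trigger another commit
    }
}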

On Wed, Jul 29, 2009 at 9:20 PM, Marc Sturlese wrote:
>
> From the newSearcher(..) of a CustomEventListener which extends of
> AbstractSolrEventListener  can access to SolrIndexSearcher and all core
> properties but can't get a SolrIndexWriter. Do you now how can I get from
> there a SolrIndexWriter? This way I would be able to modify the documents (I
> need to modify them depending on values of other documents, that's why I
> can't do it with DIH delta-import).
> Thanks in advance
>
>
> Noble Paul നോബിള്‍  नोब्ळ्-2 wrote:
>>
>> On Tue, Jul 28, 2009 at 5:17 PM, Marc Sturlese
>> wrote:
>>>
>>> That really sounds the best way to reach my goal. How could I invoque a
>>> listener from the newSearcher?Would be something like:
>>>    
>>>      
>>>         solr 0 >> name="rows">10 
>>>         rocks 0 >> name="rows">10 
>>>        static newSearcher warming query from
>>> solrconfig.xml
>>>      
>>>    
>>>    
>>>
>>> And MyCustomListener would be the class who open the reader:
>>>
>>>        RefCounted searchHolder = null;
>>>        try {
>>>          searchHolder = dataImporter.getCore().getSearcher();
>>>          IndexReader reader = searchHolder.get().getReader();
>>>
>>>          //Here I iterate over the reader doing docuemnt modifications
>>>
>>>        } finally {
>>>           if (searchHolder != null) searchHolder.decref();
>>>        }
>>>        } catch (Exception ex) {
>>>            LOG.info("error");
>>>        }
>>
>> you may not be able to access the DIH API from a newSearcher event .
>> But the API would give you the searcher directly as a method
>> parameter.
>>>
>>> Finally, to access to documents and add fields to some of them, I have
>>> thought in using SolrDocument classes. Can you please point me where
>>> something similar is done in solr source (I mean creation of
>>> SolrDocuemnts
>>> and conversion of them to proper lucene docuements).
>>>
>>> Does this way for reaching the goal makes sense?
>>>
>>> Thanks in advance
>>>
>>>
>>>
>>> Noble Paul നോബിള്‍  नोब्ळ्-2 wrote:

 when a core is reloaded the event fired is firstSearcher. newSearcher
 is fired when a commit happens


 On Tue, Jul 28, 2009 at 4:19 PM, Marc Sturlese
 wrote:
>
> Ok, but if I handle it in a newSearcher listener it will be executed
> every
> time I reload a core, isn't it? The thing is that I want to use an
> IndexReader to load in a HashMap some doc fields of the index and
> depending
> of the values of some field docs modify other docs. Its very memory
> consuming (I have tested it with a simple lucene script). Thats why I
> wanted
> to do it just after the indexing process.
>
> My ideal case would be to do it in the commit function of
> DirectUpdatehandler2.java just before
> writer.optimize(cmd.maxOptimizeSegments); is executed. But I don't want
> to
> mess that code... so trying to find out the best way to do that as a
> plugin
> instead of a hack as possible.
>
> Thanks in advance
>
>
> Noble Paul നോബിള്‍  नोब्ळ्-2 wrote:
>>
>> It is best handled as a 'newSearcher' listener in solrconfig.xml.
>> onImportEnd is invoked before committing
>>
>> On Tue, Jul 28, 2009 at 3:13 PM, Marc
>> Sturlese
>> wrote:
>>>
>>> Hey there,
>>> I would like to be able to do something like: After the indexing
>>> process
>>> is
>>> done with DIH I would like to open an indexreader, iterate over all
>>> docs,
>>> modify some of them depending on others and delete some others. I can
>>> easy
>>> do this directly coding with lucene but would like to know if there's
>>> a
>>> way
>>> to do it with Solr using SolrDocument or SolrInputDocument classes.
>>> I have thougth in using SolrJ or DIH listener onImportEnd but not
>>> sure
>>> if
>>> I
>>> can get an IndexReader in there.
>>> Any advice?
>>> Thanks in advance
>>> --
>>> View this message in context:
>>> http://www.nabble.com/update-some-index-documents-after-indexing-process-is-done-with-DIH-tp24695947p24695947.html
>>> Sent from the Solr - User mailing list archive at Nabble.com.
>>>
>>>
>>
>>
>>
>> --
>> -
>> Noble Paul | Principal Engineer| AOL | http://aol.com
>>
>>
>
> --
> View this message in context:
> http://www.nabble.com/update-some-index-documents-after-indexing-process-is-done-with-DIH-tp24695947p24696872.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>
>



 --
 -
 Noble Paul | Principal Engineer| AOL | http://aol.com


>>>
>>> --
>>> View this message in context:
>

Re: issue inquiry: unterminated index lock after optimize update command

2009-07-29 Thread Chris Hostetter

: I'm using solr build 2009-06-16_08-06-14, in multicore configuration.
: When I issue the update command "optimize" to a core, the index files
: are locked and never released.  Calling the coreAdmin unload method on
: the core unload the core but does not unlock the underlying index files.
: The core has no other alias, the data path is not referenced by any
: other core when a full status is requested.  The end result is that
: optimized cores that have been unloaded cannot be deleted until jetty is
: restarted.
... 
: I have searched jira but did not find anything relevant.  Is this a bug
: that should be reported, or is this an intended behavior?

...this is certainly not intended behavior .. you shouldn't need to 
restart the server (or even reload the core) to unlock the index ... it 
should be unlocked automatically when the optimize completes.

are you sure there wasn't any sort of serious error in the logs?  like an 
OutOfMemory perhaps?

if you can reproduce this consistently a detailed bug report showing your 
exact config files, describing your OS and filesystem, and describing 
exactly what steps you take to trigger this problem would certainly be 
appreciated.


-Hoss



Re: DocList Pagination

2009-07-29 Thread Chris Hostetter

: Hi, I am try to get the next DocList "page" in my custom search component.
: Could I get a code example of this?

you just increase the "offset" value you pass to 
SolrIndexSearcher.getDocList by whatever your page size is.  (if you use 
the newer QueryCommand versions you just call setOffset with the same 
value).
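
e.g. something like this sketch (assumes you already have the Query, the 
searcher, and a page size in hand):

SolrIndexSearcher.QueryCommand cmd = new SolrIndexSearcher.QueryCommand();
cmd.setQuery(query);                // your org.apache.lucene.search.Query
cmd.setOffset(pageNum * pageSize);  // page 0 starts at offset 0
cmd.setLen(pageSize);               // number of docs per page
SolrIndexSearcher.QueryResult result = new SolrIndexSearcher.QueryResult();
searcher.search(result, cmd);
DocList page = result.getDocList(); // this page's slice of the results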





-Hoss



Re: solr indexing on same set of records with different value of unique field...not working...

2009-07-29 Thread Chris Hostetter

I'm not really understanding how you could get the situation you describe 
... which suggests that one (or both) of us don't understand exactly what 
happened.

if you can post the actual schema.xml file you used and an example of the 
input you indexed perhaps we can spot the discrepency.

FWIW: using a timestamp as a uniqueKey doesn't make much sense ...

 1) if you have heavy parallelization two docs indexed at the exact same 
time might overwrite each other.
 2) you have no way of ever replacing an existing doc (unless you roll the 
clock back) in which case there's no advantage to using a uniqueKey -- 
so you might as well leave it out of your schema (which makes indexing 
slightly faster) 

: I need to index around 10 million records with Solr.
: I have nearly 2 lakh (200,000) records, so I made a program to loop over
: them up to 10 million.
: Here, I specified 20 fields in the schema.xml file. The unique field I set
: was the currentTimeStamp field.
: So, when I run the loader program (which loads XML data into Solr), it
: creates a currentTimeStamp value... and loads the record into Solr.
: 
: For this situation,
: I stopped the loader program after 100 records were indexed into Solr.
: Then I ran the loader program again for the SAME 100 records,
: but Solr reports 100 documents rather than 200.
: 
: Because I set the currentTimeStamp field as the uniqueKey, I expected the
: result to be 200 when running the same 100 records again...
: 
: Any suggestions please...



-Hoss



Re: update some index documents after indexing process is done with DIH

2009-07-29 Thread Chris Hostetter

This thread all sounds really kludgy ... among other things the 
newSearcher listener is going to need to somehow keep track of when it 
was called as a result of a "real" commit, vs when it was called as the 
result of a commit it itself triggered to make changes.

wouldn't an easier place to implement this logic be in an UpdateProcessor?  
you'll still need the "double commit" (once so you can see the 
main changes, and once so the rest of the world can see your 
modifications) but you can execute them both directly in your 
processCommit(CommitUpdateCommand) method (so you don't have to worry 
about being able to tell them apart)
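
roughly, as an untested sketch (the actual fix-up logic is elided):

import java.io.IOException;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.request.SolrQueryResponse;
import org.apache.solr.update.CommitUpdateCommand;
import org.apache.solr.update.processor.UpdateRequestProcessor;
import org.apache.solr.update.processor.UpdateRequestProcessorFactory;

public class FixupProcessorFactory extends UpdateRequestProcessorFactory {
    public UpdateRequestProcessor getInstance(SolrQueryRequest req,
            SolrQueryResponse rsp, final UpdateRequestProcessor next) {
        return new UpdateRequestProcessor(next) {
            public void processCommit(CommitUpdateCommand cmd) throws IOException {
                super.processCommit(cmd); // first commit: the "real" changes
                // ... open a searcher, compute the dependent modifications,
                //     and add/delete the affected documents here ...
                super.processCommit(cmd); // second commit: expose the fix-ups
            }
        };
    }
}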

: Date: Thu, 30 Jul 2009 10:14:16 +0530
: From: Noble Paul നോബിള്‍ नोब्ळ्
: Reply-To: solr-user@lucene.apache.org, noble.p...@gmail.com
: To: solr-user@lucene.apache.org
: Subject: Re: update some index documents after indexing process is done with DIH
: 
: If you make your EventListener implements SolrCoreAware you can get
: hold of the core on inform. use that to get hold of the
: SolrIndexWriter
: 
: On Wed, Jul 29, 2009 at 9:20 PM, Marc Sturlese wrote:
: >
: > From the newSearcher(..) of a CustomEventListener which extends of
: > AbstractSolrEventListener  can access to SolrIndexSearcher and all core
: > properties but can't get a SolrIndexWriter. Do you now how can I get from
: > there a SolrIndexWriter? This way I would be able to modify the documents (I
: > need to modify them depending on values of other documents, that's why I
: > can't do it with DIH delta-import).
: > Thanks in advance
: >
: >
: > Noble Paul നോബിള്‍  नोब्ळ्-2 wrote:
: >>
: >> On Tue, Jul 28, 2009 at 5:17 PM, Marc Sturlese
: >> wrote:
: >>>
: >>> That really sounds the best way to reach my goal. How could I invoque a
: >>> listener from the newSearcher?Would be something like:
: >>>    
: >>>      
: >>>         solr 0 >> name="rows">10 
: >>>         rocks 0 >> name="rows">10 
: >>>        static newSearcher warming query from
: >>> solrconfig.xml
: >>>      
: >>>    
: >>>    
: >>>
: >>> And MyCustomListener would be the class who open the reader:
: >>>
: >>>        RefCounted searchHolder = null;
: >>>        try {
: >>>          searchHolder = dataImporter.getCore().getSearcher();
: >>>          IndexReader reader = searchHolder.get().getReader();
: >>>
: >>>          //Here I iterate over the reader doing docuemnt modifications
: >>>
: >>>        } finally {
: >>>           if (searchHolder != null) searchHolder.decref();
: >>>        }
: >>>        } catch (Exception ex) {
: >>>            LOG.info("error");
: >>>        }
: >>
: >> you may not be able to access the DIH API from a newSearcher event .
: >> But the API would give you the searcher directly as a method
: >> parameter.
: >>>
: >>> Finally, to access to documents and add fields to some of them, I have
: >>> thought in using SolrDocument classes. Can you please point me where
: >>> something similar is done in solr source (I mean creation of
: >>> SolrDocuemnts
: >>> and conversion of them to proper lucene docuements).
: >>>
: >>> Does this way for reaching the goal makes sense?
: >>>
: >>> Thanks in advance
: >>>
: >>>
: >>>
: >>> Noble Paul നോബിള്‍  नोब्ळ्-2 wrote:
: 
:  when a core is reloaded the event fired is firstSearcher. newSearcher
:  is fired when a commit happens
: 
: 
:  On Tue, Jul 28, 2009 at 4:19 PM, Marc Sturlese
:  wrote:
: >
: > Ok, but if I handle it in a newSearcher listener it will be executed
: > every
: > time I reload a core, isn't it? The thing is that I want to use an
: > IndexReader to load in a HashMap some doc fields of the index and
: > depending
: > of the values of some field docs modify other docs. Its very memory
: > consuming (I have tested it with a simple lucene script). Thats why I
: > wanted
: > to do it just after the indexing process.
: >
: > My ideal case would be to do it in the commit function of
: > DirectUpdatehandler2.java just before
: > writer.optimize(cmd.maxOptimizeSegments); is executed. But I don't want
: > to
: > mess that code... so trying to find out the best way to do that as a
: > plugin
: > instead of a hack as possible.
: >
: > Thanks in advance
: >
: >
: > Noble Paul നോബിള്‍  नोब्ळ्-2 wrote:
: >>
: >> It is best handled as a 'newSearcher' listener in solrconfig.xml.
: >> onImportEnd is invoked before committing
: >>
: >> On Tue, Jul 28, 2009 at 3:13 PM, Marc
: >> Sturlese
: >> wrote:
: >>>
: >>> Hey there,
: >>> I would like to be able to do something like: After the indexing
: >>> process
: >>> is
: >>> done with DIH I would like to open an indexreader, iterate over all
: >>> docs,
: >>> modify some of them depending on others and delete some others. I can
: >>> easy
: >>> do this directly coding with luce