Re: Will Solr fit our needs?

2010-03-18 Thread Moritz Maedler
Hi guys!

Thanks a lot for your suggestions and help - I really appreciate that!
As we need e.g. the price for sorting, I think it must be in the index?
Thus, I'm not sure that a key-value store is the thing we are looking for, as we
need a good search engine.
Currently we are using several indices (items, bids, etc.). I think that ZOIE looks
kind of interesting - I will check that out.

BTW, can you recommend any "newbie" literature or resources for getting a bit
more familiar w/ Solr/Lucene?

Thanks a lot again guys!


-----Original Message-----
From: Lance Norskog [mailto:goks...@gmail.com]
Sent: Thursday, March 18, 2010 00:38
To: solr-user@lucene.apache.org
Subject: Re: Will Solr fit our needs?

Another option is the ExternalFileField:

http://www.lucidimagination.com/search/document/CDRG_ch04_4.4.4?q=ExternalFileField

This lets you store the current prices for all items in a separate
file. You can only use it in a function query, though. But it does
allow you to maintain one Solr index, which is very, very worthwhile.
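A hedged sketch of that setup (the type, field, and file names here are
illustrative, not from this thread):

    <!-- schema.xml: keep current_price outside the index, keyed by id -->
    <fieldType name="pfloat" class="solr.FloatField" omitNorms="true"/>
    <fieldType name="extPrice" keyField="id" defVal="0" stored="false"
               indexed="false" class="solr.ExternalFileField" valType="pfloat"/>
    <field name="current_price" type="extPrice"/>

The values then live in a file named external_current_price in the index
data directory, one key=value line per document (e.g. item123=19.99). The
file is re-read when a new searcher opens, so prices can be refreshed on
commit without reindexing, and used in a function query such as
_val_:"current_price".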

On Wed, Mar 17, 2010 at 4:19 AM, Geert-Jan Brits  wrote:
> If you don't plan on filtering/sorting and/or faceting on fast-changing
> fields it would be better to store them outside of solr/lucene in my
> opinion.
>
> If you must: for indexing-performance reasons you will probably end up with
> maintaining separate indices (1 for slow-changing/static fields and 1 for
> fast-changing fields).
> You frequently commit the fast-changing index to incorporate the changes
> in current_price. Afterwards you have 2 options, I believe:
>
> 1. use ParallelReader to query the separate indices directly. Afaik, this is
> not (completely) integrated in Solr... I wouldn't recommend it.
> 2. after you commit the fast-changing-index, merge with the static-index.
> You're left with 1 fresh index, which you can push to your slave-servers.
> (all this in regular intervals)
>
> Disadvantages:
> - In any case, you must be very careful with maintaining multiple parallel
> indexes with the purpose of treating them as one. For instance document
> inserts must be done exactly in the same order, otherwise the indices go
> 'out-of-sync' and are unusable.
> - higher maintenance
> - there is always a time-window in which the current_price values are stale.
> If that's within reqs that's ok.
>
> The other path, which I recommend, would be to store the current_price
> outside of solr (like you're currently doing) but instead of using a
> relational db, try looking into persistent key-value stores. Many of them
> exist and a lot of progress has been made in the last couple of years. For
> simple key-lookups (what you need, as far as I can tell) they really blow
> every relational db out of the water (considering the same hardware, of
> course).
>
> We're currently using Tokyo Cabinet with the server-frontend Tokyo Tyrant
> and seeing almost a 5x increase in lookup performance compared to our
> previous kv-store, MemcacheDB, which is based on BerkeleyDB. MemcacheDB was
> already several times faster than our mysql-setup (although not optimally
> tuned).
>
> to sum things up: use the best tools for what they were meant to do.
>
> - index/search --> solr/ lucene without a doubt.
>
> - kv-lookup --> consensus is still forming, with a lot of players (offering a
> lot of different types of functionality), but if all you need is simple
> key-value lookup, I would go for Tokyo Cabinet (TC) / Tyrant at the moment.
>  Please note that TC and its competitors aren't just some code/hobby projects
> but are usually born out of a real need at huge websites / social networks -
> TC, for instance, was born from mixi (a big social network in Japan). So at
> least you're in good company..
>
> for kv-stores I would suggest to begin your research at:
> http://www.metabrew.com/article/anti-rdbms-a-list-of-distributed-key-value-stores/
> (early 2009)
> http://randomfoo.net/2009/04/20/some-notes-on-distributed-key-stores (mid
> 2009)
> and get a feel of the kv-playing field.
>
> Hope this (pretty long) post helps,
> Geert-Jan
>
>
> 2010/3/17 Krzysztof Grodzicki 
>
>> Hi Moritz,
>>
>> You can take a look at the project ZOIE -
>> http://code.google.com/p/zoie/. I think it's what you are looking
>> for.
>>
>> br
>> Krzysztof
>>
>> On Wed, Mar 17, 2010 at 9:49 AM, Moritz Mädler 
>> wrote:
>> > Hi List,
>> >
>> > we are running a marketplace which has functionality roughly comparable
>> to ebay (auctions, fixed-price items etc).
>> > The items are placed on the market by users who want to sell their goods.
>> >
>> > Currently we are using Sphinx as an indexing engine, but, as Sphinx
>> returns only document ids we have to make a
>> > database-query to fetch the data to display. This massively decreases
>> performance as we have to do two requests to
>> > display data.
>> >
>> > I heard that Solr is able to return a complete dataset and we hope a
>> switch to Solr can boost performance.
>> > A critical question is left and i was not abl

Re: Will Solr fit our needs?

2010-03-18 Thread Lukáš Vlček
On Thu, Mar 18, 2010 at 8:45 AM, Moritz Maedler wrote:

> Hi guys!
>
> Thanks a lot for your suggestions and help - I really appreciate that!
> As we need e.g. the price for sorting, I think it must be in the index?
> Thus, I'm not sure that a key-value store is the thing we are looking for
> as we
> need a good search engine.
> Currently we are using several indices (items, bids, etc.). I think that ZOIE
> looks
> kind of interesting - I will check that out.
>
> BTW, can you recommend any "newbie" literature or resources for getting a
> bit
> more familiar w/ Solr/Lucene?
>

Probably the best way to get your questions answered is participating in
the mailing lists; documentation and wiki pages as well. There are also some books:
check Lucene in Action (http://www.manning.com/hatcher3/) and Solr 1.4
Enterprise Search Server (
http://www.packtpub.com/solr-1-4-enterprise-search-server). Other books
about Lucene and Solr are WIP but unfortunately not released yet.



HTTP Status 500 - null java.lang.IllegalArgumentException at java.nio.Buffer.limit(Buffer.java:249)

2010-03-18 Thread Marc Des Garets
Hi,

 

I am doing a really simple query on my index (it's running in tomcat):

http://host:8080/solr_er_07_09/select/?q=hash_id:123456

 

I am getting the following exception:

HTTP Status 500 - null java.lang.IllegalArgumentException
    at java.nio.Buffer.limit(Buffer.java:249)
    at org.apache.lucene.store.NIOFSDirectory$NIOFSIndexInput.readInternal(NIOFSDirectory.java:123)
    at org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:157)
    at org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:38)
    at org.apache.lucene.store.IndexInput.readVInt(IndexInput.java:80)
    at org.apache.lucene.index.FieldsReader.doc(FieldsReader.java:214)
    at org.apache.lucene.index.SegmentReader.document(SegmentReader.java:948)
    at org.apache.lucene.index.DirectoryReader.document(DirectoryReader.java:506)
    at org.apache.lucene.index.IndexReader.document(IndexReader.java:947)
    at org.apache.solr.search.SolrIndexReader.document(SolrIndexReader.java:444)
    at org.apache.solr.search.SolrIndexSearcher.doc(SolrIndexSearcher.java:427)
    at org.apache.solr.util.SolrPluginUtils.optimizePreFetchDocs(SolrPluginUtils.java:267)
    at org.apache.solr.handler.component.QueryComponent.process(QueryComponent.java:269)
    at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:195)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
    at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
    at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
    at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
    at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:128)
    at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
    at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
    at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:293)
    at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:849)
    at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:583)
    at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:454)
    at java.lang.Thread.run(Thread.java:595)

 

Does anyone have an idea why this would happen?

 

I built the index on a different machine than the one I am doing the
query on, though the configuration is exactly the same. I can do the same
query using solrj (I have an app doing that) and it works fine.

 

 

Thanks.

solrj sends duplicate documents

2010-03-18 Thread Tim Terlegård
I'm using StreamingUpdateSolrServer to index a document.

StreamingUpdateSolrServer server = new
StreamingUpdateSolrServer("http://localhost:8983/solr/core0", 20, 4);
server.setRequestWriter(new BinaryRequestWriter());
SolrInputDocument doc = new SolrInputDocument();
doc.addField("id", "12121212");
doc.addField("text", "something");
server.add(doc);
server.commit();

When I use Wireshark I can see that two documents are sent. I see that
this happens:

1) POST /solr/core0/update HTTP/1.1
2)  ... 
3) POST /solr/core0/update/javabin HTTP/1.1
4) ...

In the log file I see this:
2010-mar-18 12:53:46 org.apache.solr.core.SolrCore execute
INFO: [core0] webapp=/solr path=/update params={} status=0 QTime=7
...
2010-mar-18 12:53:46 org.apache.solr.core.SolrCore execute
INFO: [core0] webapp=/solr path=/update/javabin
params={waitSearcher=true&waitFlush=true&wt=javabin&commit=true&version=1}
status=0 QTime=24

It looks like server.add() sends the document in clear text and then
server.commit() also sends it in the javabin format. I'd rather it
just sent the document once, in the javabin format. Am I using solrj
inappropriately? This is 1.4.

/Tim


Re: solrj sends duplicate documents

2010-03-18 Thread Erik Hatcher
The StreamingUpdateSolrServer does not support binary format,  
unfortunately.
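
A hedged workaround sketch (assuming SolrJ 1.4's plain CommonsHttpSolrServer,
which does honor a BinaryRequestWriter; the URL is illustrative):

    import org.apache.solr.client.solrj.impl.BinaryRequestWriter;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    // Trade streaming for the binary format: the non-streaming client
    // sends a single javabin update instead of the clear-text one above.
    CommonsHttpSolrServer server =
        new CommonsHttpSolrServer("http://localhost:8983/solr/core0");
    server.setRequestWriter(new BinaryRequestWriter());
    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", "12121212");
    doc.addField("text", "something");
    server.add(doc);
    server.commit();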


Erik




Return all Facets?

2010-03-18 Thread homerlex

I'm starting to play with Solr.  I am looking at the API and see that there
is an addFacetField on the SolrQuery object that is required to specify
which facet fields you want returned.  Is there any way to specify that we
want all facet fields without explicitly having to add them all via
addFacetField?



Re: Term Highlighting without store text in index

2010-03-18 Thread Alexey Serba
Hey Dominique,

See 
http://www.lucidimagination.com/search/document/5ea8054ed8348e6f/highlight_arbitrary_text#3799814845ebf002

Although it might not be a good solution for huge texts or wildcard/phrase queries.
http://issues.apache.org/jira/browse/SOLR-1397

On Mon, Mar 15, 2010 at 4:09 PM, dbejean  wrote:
>
> Hello,
>
> Just in order to be able to show term highlighting in my results list, I
> store all the indexed data in the Lucene index and so it is very huge
> (108Gb). Is there any possibility to do it another way? Now or in the
> future, is it possible that Solr uses a 3rd-party tool such as ehcache in
> order to store the content of the indexed documents outside of the Lucene
> index?
>
> Thank you
>
> Dominique
>
>
>
>


excluder filters and multivalued fields

2010-03-18 Thread Marc Sturlese

I don't think there's a way to do what has come to my mind but want to be
sure.
Let's say I have a doc with 2 fields, one of which is multiValued:

doc1:
name->john
year->2009;year->2010;year->2011

And I query for:
q=john&fq=-year:2010

Doc1 won't be in the matching results. Is there a way to make it appear,
given that even though it has 2010, the document also has years that don't
match the
filter query?

Thanks in advance




RE: XPath Processing Applied to Clob

2010-03-18 Thread Craig Christman
You could also do the xpath processing on the oracle end using the extract or 
extractValue functions.  Here's a good reference:  
http://www.psoug.org/reference/xml_functions.html


-Original Message-
From: Neil Chaudhuri [mailto:nchaudh...@potomacfusion.com]
Sent: Wednesday, March 17, 2010 3:24 PM
To: solr-user@lucene.apache.org
Subject: XPath Processing Applied to Clob

I am using the DataImportHandler to index 3 fields in a table: an id, a date, 
and the text of a document. This is an Oracle database, and the document is an 
XML document stored as Oracle's xmltype data type. Since this is nothing more 
than a fancy CLOB, I am using the ClobTransformer to extract the actual XML. 
However, I don't want to index/store all the XML but instead just the XML 
within a set of tags. The XPath itself is trivial, but it seems like the 
XPathEntityProcessor only works for XML file content rather than the output of 
a Transformer.

Here is what I currently have that fails:

[data-config.xml snippet stripped by the mailing list archive]

Is there an easy way to do this without writing my own custom transformer?

Thanks.
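
For reference, a hedged sketch of one way DIH can apply XPath to a clob
column: a nested entity re-parses the outer row's field through
FieldReaderDataSource. All table, column, and element names here are
hypothetical:

    <dataConfig>
      <dataSource name="db" type="JdbcDataSource" driver="oracle.jdbc.OracleDriver"
                  url="jdbc:oracle:thin:@//host:1521/db" user="user" password="pass"/>
      <dataSource name="fr" type="FieldReaderDataSource"/>
      <document>
        <entity name="doc" dataSource="db"
                query="SELECT id, doc_date, d.doc_xml.getClobVal() AS body FROM docs d">
          <!-- re-parse the clob text as XML; keep only the wanted element -->
          <entity name="bodyXml" dataSource="fr" dataField="doc.body"
                  processor="XPathEntityProcessor" forEach="/document">
            <field column="text" xpath="/document/text"/>
          </entity>
        </entity>
      </document>
    </dataConfig>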


Re: solrj sends duplicate documents

2010-03-18 Thread Tim Terlegård
It would be nice if the documentation mentioned this.  :)

/Tim

2010/3/18 Erik Hatcher :
> The StreamingUpdateSolrServer does not support binary format, unfortunately.
>
>        Erik
>


where can i get an synonym.txt and spellcheck.txt ?

2010-03-18 Thread stocki

Hello.

I am searching for a synonyms.txt and a spellcheck.txt.
Where can I find them on the laaarge internet?

Or how do you fill these two files with good names?



Re: Return all Facets?

2010-03-18 Thread Erik Hatcher
No, there isn't.  How would one know what all the facet fields are,  
though?


One trick: use the luke request handler to get the list of fields,  
then use that list to construct the facet fields request parameters.
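
A hedged SolrJ sketch of that trick (assuming SolrJ 1.4's LukeRequest and
LukeResponse; the URL is illustrative):

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.client.solrj.request.LukeRequest;
    import org.apache.solr.client.solrj.response.LukeResponse;

    CommonsHttpSolrServer server =
        new CommonsHttpSolrServer("http://localhost:8983/solr");

    // Ask the Luke request handler for field names only (no term stats).
    LukeRequest luke = new LukeRequest();
    luke.setNumTerms(0);
    LukeResponse info = luke.process(server);

    // Facet on every field Luke reported.
    SolrQuery query = new SolrQuery("*:*");
    query.setFacet(true);
    for (String field : info.getFieldInfo().keySet()) {
        query.addFacetField(field);
    }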


Erik






Re: where can i get an synonym.txt and spellcheck.txt ?

2010-03-18 Thread Erick Erickson
You probably won't find a good synonyms file. The problem is
that synonyms tend to be domain-specific, so a synonyms file
for chemistry would be of little use for psychology.


Spellcheck is generally more useful if it's derived from words
already *in* your index. It's of little use to a user to have
spellcheck/autosuggest show terms that aren't in the
index...
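
A hedged sketch of an index-derived spellchecker in solrconfig.xml (the
source field name is illustrative; buildOnCommit keeps the dictionary in
sync with the index):

    <searchComponent name="spellcheck" class="solr.SpellCheckComponent">
      <lst name="spellchecker">
        <str name="name">default</str>
        <!-- the indexed field the dictionary is built from -->
        <str name="field">name</str>
        <str name="spellcheckIndexDir">./spellchecker</str>
        <str name="buildOnCommit">true</str>
      </lst>
    </searchComponent>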

HTH
Erick

>


some snynonym clarifications

2010-03-18 Thread Mark Fletcher
Hi,

Just needed some help to understand the following synonym mappings:-

1. aaa => aaaa
   does it mean:-
  if the user queries for aaa it is replaced with aaaa and documents
matching aaaa are searched for
    or does it mean
  if the user queries for aaa, documents with aaa as well as aaaa are
looked for


2. bbb => bbbb1 bbbb2
    does it mean that if the user queries for bbb, SOLR will look for
documents that contain bbbb1 bbbb2


3. ccc => cccc1,cccc2
    does it mean that if the user queries for ccc, SOLR will look for
documents that contain cccc1 or cccc2

4.  a\=>a => b\=>b
 First of all my doubt is what does the "\" do there. Does it have
any special significance.
 Can someone help me interpret the above

5. a\,a => b\,b
   Can someone help me with this also

6. fooaaa,baraaa,bazaaa
  does this mean that if any of  fooaaa or baraaa  or bazaaa comes
as the search keyword, SOLR will look for documents that contain
 fooaaa

7. abc def rose\, atlas method, NY.GNP.PCAP.PP.CD
   does this mean a query for any of the above 3 will always be
replaced by a query for abc def rose\

Can someone pls extend some help at your earliest convenience.

Thank you.
Mark.


Re: some snynonym clarifications

2010-03-18 Thread Markus Jelsma
Hi,


Check out the wiki page on the SynonymFilterFactory. It gives a decent 
explanation of the subject. The backslash is just for escaping otherwise 
meaningful characters.


[1]:http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.SynonymFilterFactory
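
For reference, a hedged sketch of how the filter is typically wired into an
analyzer (whether entries expand or replace is controlled by the expand
flag):

    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>

With expand="true", a comma-separated line like fooaaa,baraaa,bazaaa maps
each term to all of the listed terms; with expand="false" they all map to
the first one. Lines containing => always replace the left-hand side with
the right-hand side.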


Cheers,


Markus Jelsma - Technisch Architect - Buyways BV
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350



Re: some snynonym clarifications

2010-03-18 Thread Mark Fletcher
Hi,

Thanks for the mail. I had tried the WIKI.

My doubts remaining were mainly:-

1.
If we have synonyms specified and they replace your search keyword with the
ones specified, wouldn't we face the risk of our original keyword being missed out?
What I meant is: if I have a search keyword, say "agriculture", and I
replace it with some synonyms, will I never again be able to search directly
for "agriculture"? I.e., suppose I have a document which has the term
agriculture and none of the synonyms in it. Will that document be retrieved
when I search for agriculture, as I have now mapped it to other terms?

2.
I am still a bit confused about the interpretation of:-
a\=>a => b\=>b
a\,a => b\,b
   abc def rose\, my cap ,  rose flower
   Can you pls give a one-liner explanation for the above. There are some
sample entries in the synonyms.txt.
3. If I get some help with the above 3 it will help me understand the
backslash "\" better as well.

Thanks,
Mark.




Recommended OS

2010-03-18 Thread blargy

Does anyone have any recommendations on which OS to use when setting up Solr
search server?

Any memory/disk space recommendations? 

Thanks



Opinions on Facet+Fulltext behavior?

2010-03-18 Thread Mark Bennett
Most sites allow you to search for some text, and then click on Facets (or
Tags or Taxonomy branches) to drill down into your search.

Most sites also show the search box in these search results, with the text
previously entered, so that you can edit it and resubmit.  Perhaps you want
to add a word or change the spelling of a search term, etc.

But what should happen to your previously selected Facets (or Tags, etc) ?

Before you answer
1: It seems like even "the big guys" don't agree.  Some large web sites
clear all previous tags when text is resubmitted, other sites keep these
other selections in place to filter the revised search.  So it doesn't seem
like there's universal agreement.
2: Let's assume that I do NOT want to put a "start new search" button or
checkbox in the search form.  The goal is to find the most reasonable
DEFAULT behavior, and possibly not display those options.

Examples: behavior A vs. B

Both start the same:
0: Assume a real-estate site.
1: I type in "furnished apartment", and get 2,000 matches.
2: Then I notice the "City" facet and click "Sunnyvale".  Now I have just 50
furnished apartments, in Sunnyvale
3: Then I edit my search text to add the word "garage", so that now the text
box is "furnished apartment garage", and submit the search.

Scenario A:
The engine brings back furnished apartments with a garage, in Sunnyvale, and
I get only 10 results.

Scenario B:
The engine brings back furnished apartments with a garage all over the Bay
Area, and I get 800 matches.
To limit the search to Sunnyvale, I must again click the City facet and
select it.

There are strengths and weaknesses to both scenarios, but I don't wanna bias
anybody's answer.  I'd love to hear your thoughts!

--
Mark Bennett / New Idea Engineering, Inc. / mbenn...@ideaeng.com
Direct: 408-733-0387 / Main: 866-IDEA-ENG / Cell: 408-829-6513


Re: Recommended OS

2010-03-18 Thread K Wong
http://wiki.apache.org/solr/FAQ#What_are_the_Requirements_for_running_a_Solr_server.3F

I have Solr running on CentOS 5.4. It runs fine on the OpenJDK 1.6.0
and Tomcat 5. If I were to do it again, I'd probably just stick with
Jetty.

You really will need to read the docs to get the settings right as
there is no one-size-fits-all setting. (re your mem/dsk question)

K





Re: where can i get an synonym.txt and spellcheck.txt ?

2010-03-18 Thread stocki


aha, okay thx.

And how do you get your spellcheck words from your product names?

We have sometimes very looong names. How is it possible to use the
spellchecker function or autosuggestion in the right way?







Re: [search_dev] Opinions on Facet+Fulltext behavior?

2010-03-18 Thread Mark Bennett
Hi Chris,

A cool idea, and I like that on Google too.

But while that's great for techies, it's not for other demographics.

The restriction on "no checkbox or 'start new search'" was because those
were considered too complicated / distracting / old-school for the target
users, so punctuation in the search box would cause smoke to billow out of
ears I imagine.  ;-)

--
Mark Bennett / New Idea Engineering, Inc. / mbenn...@ideaeng.com
Direct: 408-733-0387 / Main: 866-IDEA-ENG / Cell: 408-829-6513


On Thu, Mar 18, 2010 at 10:07 AM, Chris Biow wrote:

>
>
>  One option is that taken by MarkMail, which is to explicitly reflect the
> facet selection with googly field syntax in the text box (and the URL):
>
>
>
>
> http://markmail.org/search/?q=text+date%3A200703-200812+list%3Aorg.apache.lucene.solr-user
>
>
>
> The user can now change the text query in the search box (which happens to
> be the word “text”) to something else, with or without adjusting the faceted
> selections on Date and List.
>
>
>


Re: Recommended OS

2010-03-18 Thread Jean-Sebastien Vachon

On 2010-03-18, at 1:03 PM, K Wong wrote:

> http://wiki.apache.org/solr/FAQ#What_are_the_Requirements_for_running_a_Solr_server.3F
> 
> I have Solr running on CentOS 5.4. It runs fine on the OpenJDK 1.6.0
> and Tomcat 5. If I were to do it again, I'd probably just stick with
> Jetty.

Would you mind explaining why you would stick with Jetty instead of Tomcat?





Re: Recommended OS

2010-03-18 Thread blargy

Beat me to the punch with that question.

K Wong, did you happen to install the Apache APR? Wondering if it is even
worth the trouble.

I am thinking about going with RedHat Enterprise 5 unless anyone has any
objections?





Re: Recommended OS

2010-03-18 Thread K Wong
We're running Solr to provide search services to a Drupal 6
installation. The site is very low traffic (35 uniques a day) and
search doesn't get used very often. I was thinking that I could get
away with running it on the Jetty that comes with Solr. It's just one
less thing that has to be looked after (the Tomcat).

As for the APR, yes, I installed it using yum: as in "yum install apr.x86_64"

It was fairly painless, just a lot of reading.

If your installation is going to be running on a small cluster or
something, then that's a whole different scenario from what I had to
deal with.

K





Re: Solr query parser doesn't invoke analyzer for simple term query?

2010-03-18 Thread Chris Hostetter

: It seems that Solr's query parser doesn't pass a single term query

no ... the query parser always uses the analyzer for "text" regardless of 
whether it's a single term or not (it doesn't even know if it's a single 
term until the Analyzer tells it)

cases where the analyzer isn't used are things like range queries, or 
wildcards, or prefix queries.


-Hoss



Re: Solr query parser doesn't invoke analyzer for simple term query?

2010-03-18 Thread Chris Hostetter
: 
: Thank you, Marco.  I see the debug out put that looks like:
: title_jpn:2001年
: title_jpn:2001年
: PhraseQuery(title_jpn:"2001 年")
: title_jpn:"2001 年"
...
: Does this mean the standard query parser does send the
: raw query string to the Analyzer and (because the query
: yielded more than one token?) it uses phrase query?

correct.

: I guess the cause of my problem is somewhere else.

what does the debug output look like when you "quote" the input (you 
mentioned before that you got different results when using/omitting 
quotes)



-Hoss


Re: dynamic categorization & transactional data

2010-03-18 Thread caman

1) Took care of the first one with a Transformer.
2) Any input on 2, please? I need to store # of views and popularity with
each document, and that can change pretty often. Is it recommended to use a
database, or can this be updated in Solr directly? My issue with a DB is that
with every Solr search hit, we will have to do a DB hit to retrieve the
meta-data.

Any input is appreciated, please.

caman wrote:
> 
> Hello all,
> 
> Please see below. Any help much appreciated.
> 1) Extracting data out of a text field to assign a category for certain
> configured words. e.g. If the text is "Google does it again with Android" 
> and if 'Google' and 'Android' are the configured words, I want to be able
> to assign the article to tags 'Google' and 'Android' and 'Technical'. Can
> I do this with a custom filter during analysis? Similarly setting up
> categories for each article based on keywords in the text.
> 2) How about using SOLR as a transactional datastore? Need to keep track of
> rating for each document. Would 'ExternalFileField' be a good choice for
> this use-case?
> 
> Thanks in advance.
> 




Re: dynamic categorization & transactional data

2010-03-18 Thread Smiley, David W.
You'll probably want to influence your relevancy on this popularity number that 
is changing often.  ExternalFileField looks like a possibility though I haven't 
used it.  Another would be using an in-memory cache which stores all popularity 
numbers for any data that has its popularity updated since the last index 
update (say since the previous night).  On second thought, it may need to be 
absolutely all of them but these are just #s so no big deal?  You could then 
customize a "ValueSource" subclass which gets data from this fast in-memory up 
to date source.  See FileFloatSource for an example that uses a file instead of 
an in-memory structure.

~ David Smiley
Author: http://www.packtpub.com/solr-1-4-enterprise-search-server/


On Mar 18, 2010, at 2:44 PM, caman wrote:

> 2) Any input on 2 please? I need to store # of views and popularity with
> each document and that can change pretty often. Recommended to use database
> or can this be updated to SOLr directly? My issue with DB is that with every
> SOLR search hit, will have to do DB hit to retrieve meta-data. 
> 
> Any input is appreciated please




Re: dynamic categorization & transactional data

2010-03-18 Thread caman

David,

Much appreciated. This gives me enough to work with. 
I missed one important point: our data changes pretty frequently, which means
we may be running deltas every 5-10 minutes. In-memory should work.
thanks








Re: Return all Facets?

2010-03-18 Thread homerlex

Thanks for the reply.  Can someone point me to a sample on how to use the
luke request handler to get this info?


Erik Hatcher-4 wrote:
> 
> No, there isn't.  How would one know what all the facet fields are,  
> though?
> 
> One trick, use the luke request handler to get the list of fields,  
> then use that list to construct the facet fields request parameters.
> 
>   Erik
> 
> 
> 




Re: Return all Facets?

2010-03-18 Thread Smiley, David W.
Coincidentally I'm working on something like this right now.  However in my 
case, I want results filtered by the current search for the facets they use 
(which is a subset of all available), with a count.  This is a sort of 
meta-facet since it's faceting on the facetable fields.  I've implemented this 
by adding an UpdateRequestProcessor implementation that finds fields matching a 
certain pattern and simply adds their names to a field I've chosen -- which 
will at query time be faceted on.  Pretty easy.

~ David Smiley
Author: http://www.packtpub.com/solr-1-4-enterprise-search-server/








Issue with exact matching

2010-03-18 Thread Alex Thurlow
I'm trying to give a super boost to fields that match exactly, but it 
doesn't appear to be working.  I have this:


stored="true"/>







sortMissingLast="true" omitNorms="true">








The dataset has two items with title="Rude Boy", but they are coming up 
way down the list.  My query looks like this: title:rude boy^100 OR 
artist:rude boy^100 OR description:rude boy^5 OR tags:rude boy^10 OR 
title:"rude boy"~100^100 OR artist:"rude boy"~100^100 OR 
description:"rude boy"~10 OR tags:"rude boy"~1000^10 OR 
title_tight:rude boy^1000 or artist_tight:rude boy^1500 OR 
artist_title:rude boy^100


The debug output seems to say that it's not even matching that field:

0.33547562 = (MATCH) product of:
  1.2748073 = (MATCH) sum of:
    0.004359803 = (MATCH) weight(title:rude in 19218), product of:
      7.7693147E-4 = queryWeight(title:rude), product of:
        8.978507 = idf(docFreq=6, maxDocs=20423)
        8.653237E-5 = queryNorm
      5.611567 = (MATCH) fieldWeight(title:rude in 19218), product of:
        1.0 = tf(termFreq(title:rude)=1)
        8.978507 = idf(docFreq=6, maxDocs=20423)
        0.625 = fieldNorm(field=title, doc=19218)
    0.002346348 = (MATCH) weight(tags:rude in 19218), product of:
      8.0604723E-4 = queryWeight(tags:rude), product of:
        9.31498 = idf(docFreq=4, maxDocs=20423)
        8.653237E-5 = queryNorm
      2.910931 = (MATCH) fieldWeight(tags:rude in 19218), product of:
        1.0 = tf(termFreq(tags:rude)=1)
        9.31498 = idf(docFreq=4, maxDocs=20423)
        0.3125 = fieldNorm(field=tags, doc=19218)
    1.203806 = weight(title:"rude boy"~100^100.0 in 19218), product of:
      0.12910038 = queryWeight(title:"rude boy"~100^100.0), product of:
        100.0 = boost
        14.919317 = idf(title: rude=6 boy=145)
        8.653237E-5 = queryNorm
      9.3245735 = fieldWeight(title:"rude boy" in 19218), product of:
        1.0 = tf(phraseFreq=1.0)
        14.919317 = idf(title: rude=6 boy=145)
        0.625 = fieldNorm(field=title, doc=19218)
    0.060910318 = weight(tags:"rude boy"~1000^10.0 in 19218), product of:
      0.012987026 = queryWeight(tags:"rude boy"~1000^10.0), product of:
        10.0 = boost
        15.008287 = idf(tags: rude=4 boy=186)
        8.653237E-5 = queryNorm
      4.6900897 = fieldWeight(tags:"rude boy" in 19218), product of:
        1.0 = tf(phraseFreq=1.0)
        15.008287 = idf(tags: rude=4 boy=186)
        0.3125 = fieldNorm(field=tags, doc=19218)
    0.0033848688 = (MATCH) weight(artist_title:rude in 19218), product of:
      7.6537667E-4 = queryWeight(artist_title:rude), product of:
        8.844975 = idf(docFreq=7, maxDocs=20423)
        8.653237E-5 = queryNorm
      4.4224877 = (MATCH) fieldWeight(artist_title:rude in 19218), product of:
        1.0 = tf(termFreq(artist_title:rude)=1)
        8.844975 = idf(docFreq=7, maxDocs=20423)
        0.5 = fieldNorm(field=artist_title, doc=19218)
  0.2631579 = coord(5/19)



Someone else suggested I use DisMax, but I can't really get that to do 
what I want right now either.  I'm just wondering why this seems to not 
be using this field at all.
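
For reference, a hedged sketch of roughly the same boosts expressed as a
dismax request (not from this thread; note the hand-built query above also
has a lowercase "or" and unquoted multi-word clauses that dismax sidesteps):

    q=rude boy
    &defType=dismax
    &qf=title^100 artist^100 description^5 tags^10 title_tight^1000 artist_tight^1500 artist_title^100
    &pf=title^100 artist^100

qf applies per-field boosts to the individual terms; pf additionally boosts
documents where the terms appear as a phrase in the listed fields.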


-Alex



Re: dynamic categorization & transactional data

2010-03-18 Thread Grant Ingersoll

On Mar 18, 2010, at 2:44 PM, caman wrote:

> 
> 1) Took care of the first one by Transformer.

This is often also something done by a classifier that is trained to deal with 
all the statistical variations in your text.  Tools like Weka, Mahout, OpenNLP, 
etc. can be applied here.

> 2) Any input on 2 please? I need to store # of views and popularity with
> each document and that can change pretty often. Recommended to use database
> or can this be updated to SOLr directly? My issue with DB is that with every
> SOLR search hit, will have to do DB hit to retrieve meta-data. 

Define often, please.  Less than a minute or more than a minute?

> 
> Any input is appreciated please
> 

--
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem using Solr/Lucene: 
http://www.lucidimagination.com/search



Re: [ANN] Zoie Solr Plugin - Zoie Solr Plugin enables real-time update functionality for Apache Solr 1.4+

2010-03-18 Thread brad anderson
Tried following their tutorial for plugging zoie into solr:
http://snaprojects.jira.com/wiki/display/ZOIE/Zoie+Server

It appears it only allows you to search on documents after you do a commit?
Am I missing something here, or does the plugin not do anything?

Their tutorial tells you to do a commit when you index the docs:

curl http://localhost:8983/solr/update/csv?commit=true --data-binary
@books.csv -H 'Content-type:text/plain; charset=utf-8'


When I don't do the commit, I cannot search the documents I've indexed.

Thanks,
Brad

On 9 March 2010 23:34, Don Werve  wrote:

> 2010/3/9 Shalin Shekhar Mangar 
>
> > I think Don is talking about Zoie - it requires a long uniqueKey.
> >
>
> Yep; we're using UUIDs.
>


trimfilterfactory on string fieldtype?

2010-03-18 Thread Tommy Chheng

 Can the trim filter factory work on string fieldtypes?

When I define a trim filter factory on a string fieldtype, I get an 
exception:
org.apache.solr.common.SolrException: Unknown fieldtype 'string' specified on field id
    at org.apache.solr.schema.IndexSchema.readSchema(IndexSchema.java:477)
    at org.apache.solr.schema.IndexSchema.<init>(IndexSchema.java:95)
    at org.apache.solr.core.SolrCore.<init>(SolrCore.java:520)
    at org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:137)


This is how I define the field in the schema:

[The field definition was stripped by the mailing list archive; only the
fragment omitNorms="true" survives.]
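
For context, a hedged sketch of the usual way to get trim behavior:
analyzers only apply to solr.TextField, not to solr.StrField (the stock
"string" class), so a trimmed string-like type is typically declared as a
TextField with a KeywordTokenizer. The type name here is illustrative:

    <fieldType name="string_trimmed" class="solr.TextField"
               sortMissingLast="true" omitNorms="true">
      <analyzer>
        <!-- keep the whole value as a single token, then trim whitespace -->
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.TrimFilterFactory"/>
      </analyzer>
    </fieldType>
    <field name="id" type="string_trimmed" indexed="true" stored="true"/>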

--
Tommy Chheng
Programmer and UC Irvine Graduate Student
Twitter @tommychheng
http://tommy.chheng.com



good spell dictionary

2010-03-18 Thread michaelnazaruk

Can anyone tell me where I can buy or download a free spell dictionary for
solr? I don't need a simple dictionary! I need a very good American-English
spell dictionary (or American only)!



DIH questions

2010-03-18 Thread Shawn Heisey
Below is my data-config.xml file, which I am using to build an index for 
my first shard.  I have a couple of questions.


Can Solr include the hostname (short version) it's running on in the 
query?  Alternatively, is there a way to override the query with a URL 
parameter before or when doing the full import? I'd like to avoid having 
to parse and rewrite the config file.


My ultimate goal is to write a completely generic query that gets the 
values represented by 6, 0, and 229615984 in the example below from a 
small config table, but I'll take baby steps in that direction.




url="jdbc:mysql://[hostname]:3306/[database]?zeroDateTimeBehavior=convertToNull"

batchSize="-1"
user="[user]"
password="[password]"/>

  query="SELECT * FROM [table] WHERE (did % 6) = 0 AND 229615984 >= 
did">







Re: DIH questions

2010-03-18 Thread Lukas Kahwe Smith

On 18.03.2010, at 23:12, Shawn Heisey wrote:

> Below is my data-config.xml file, which I am using to build an index for my 
> first shard.  I have a couple of questions.
> 
> Can Solr include the hostname (short version) it's running on in the query?  
> Alternatively, is there a way to override the query with a URL parameter 
> before or when doing the full import? I'd like to avoid having to parse and 
> rewrite the config file.
> 
> My ultimate goal is to write a completely generic query that gets the values 
> represented by 6, 0, and 229615984 in the example below from a small config 
> table, but I'll take baby steps in that direction.
> 
> <dataSource type="JdbcDataSource"
>             driver="com.mysql.jdbc.Driver"
>             encoding="UTF-8"
>             url="jdbc:mysql://[hostname]:3306/[database]?zeroDateTimeBehavior=convertToNull"
>             batchSize="-1"
>             user="[user]"
>             password="[password]"/>
>
> <entity name="[entity]"
>         query="SELECT * FROM [table] WHERE (did % 6) = 0 AND 229615984 >= did">
> </entity>
> 
> 
> 


I recently asked about this on this list. You can use request parameters in 
your DIH xml:
http://wiki.apache.org/solr/DataImportHandler#Accessing_request_parameters

You can even define defaults for these parameters inside your 
solrconfig.xml request handler configuration.

regards,
Lukas Kahwe Smith
m...@pooteeweet.org
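
A sketch of such defaults, reusing the parameter names from Shawn's query 
(the handler registration shown is the stock DIH one; the values are 
assumptions):

<requestHandler name="/dataimport"
                class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">data-config.xml</str>
    <str name="numShards">6</str>
    <str name="modValue">0</str>
    <str name="minDid">229615984</str>
  </lst>
</requestHandler>

Anything passed on the URL then overrides these defaults at import time.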





Re: Boundary match as part of query language?

2010-03-18 Thread Chris Hostetter

: Now, I know how to work around this, by appending some unique character 
: sequence at each end of the field and then including this in my search in 
: the front end. However, I wonder if any of you have been planning a 
: patch to add a native boundary match feature to Solr that would 
: automagically add tokens (also for multi-value fields!), and expand the 
: query language to allow querying for starts-with(), ends-with() and 
: equals()

well, if you *always* want boundary rules to be applied, that can be done 
as simply as adding your boundary tokens automatically in both the index and 
query time analyzers ... then a search for q="New York" can 
automatically be translated into a PhraseQuery for "_BEGIN New York _END"

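A minimal sketch of that always-on flavor, assuming Solr 1.4's charFilter 
support and single-line field values (the type name is made up): inject the 
markers before tokenization, in an analyzer shared by index and query time.

<fieldType name="text_bounded" class="solr.TextField">
  <analyzer>
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="^(.*)$" replacement="_BEGIN $1 _END"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

Because the same chain runs on the query text, a phrase query against such 
a field picks up the markers automatically and only matches whole-field 
values.
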
If you want special QueryParser markup to specify when you want specific 
boundary conditions, that can also be done with a custom QParser, 
automatically applying the boundary tokens in your indexing analyzer (but not 
the query analyzer -- the QParser would take care of that part).  In 
general though it's hard to see how something like q=begin(New York) is 
easier syntax than q="_BEGIN New York"

The point is it's relatively easy to implement something like this when 
meeting specific needs, but I don't know of anyone working on a truly 
generalized QParser that deals with this -- largely because most people 
who care about this sort of thing either have really complicated use cases 
(ie: not just begin/end boundary markers, but also want sentence, 
paragraph, page, chapter, section, etc...) or want extremely specific 
query syntax (ie: they're trying to recreate the syntax of an existing 
system they are replacing), so a general solution doesn't work well.

The closest I've ever seen is Mark Miller's QSolr parser, which actually 
went a completely different direction, using a home-grown syntax to 
generate Span queries ... if that slacker ever gets off his butt and 
starts running his webserver again, you could download it and try it out, 
and probably find that it would be trivial to turn it into a QParser.


-Hoss



Re: Facet pagination

2010-03-18 Thread Chris Hostetter

: Is there a way to get *total count of facets* per field?

sorry, no.  you can skip ahead, but the only way to know when you're done 
is when you stop getting constraints back for that field.
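
The skip-ahead is facet.offset; a sketch, with a hypothetical category 
field:

&facet=true&facet.field=category&facet.limit=50&facet.offset=100

Walk facet.offset forward in facet.limit-sized steps and stop once a page 
comes back with fewer constraints than you asked for.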




-Hoss



Re: Generating a sitemap

2010-03-18 Thread Chris Hostetter

: Been testing nutch to crawl for solr and I was wondering if anyone had
: already worked on a system for getting the urls out of solr and generating
: an XML sitemap for Google.

it's pretty easy to just paginate through all docs in solr, so you could 
do that -- but I'd be really surprised if Nutch wasn't also logging all the 
URLs it indexed, so you could just post-process that log to build the 
sitemap as well.

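A sketch of that pagination, assuming each doc carries a stored url field:

http://localhost:8983/solr/select?q=*:*&fl=url&rows=1000&start=0

Bump start by rows on each request until start passes numFound, emitting a 
<url><loc>...</loc></url> entry per document along the way.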


-Hoss



Re: Multi valued fields

2010-03-18 Thread Chris Hostetter

: Can I build a query such as : 
: 
:   -field: A
: 
: which will return all documents that do not have "exclusive" A in the 
: their field's values. By exclusive I mean that I don't want documents 
: that only have A in their list of values. In my sample case, the query 
: would return doc A and B. Because they both have other values in field1.

the most straightforward way I know of to deal with requirements like 
this is to also have a field_count field where you record the number of 
values indexed into "field" ... an UpdateProcessor can automate creating 
this field for you, and then you can query for something like...

-(+field:A +field_count:1)
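
If an UpdateProcessor is more machinery than you want, the count can just 
as easily be computed by whatever client builds the update XML; a sketch, 
using the field names from the example above:

<add>
  <doc>
    <field name="field">A</field>
    <field name="field">B</field>
    <field name="field_count">2</field>
  </doc>
</add>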


-Hoss



Re: Filtering search results

2010-03-18 Thread Chris Hostetter

: For example, in dice.com, the visitor can search by some keyword and filter
: further by Skill, Country, Province, City, Telecommute, Travel Required
: (shown on the left pane on dice.com). We were wondering if there is some
: built-in feature/functionality that can be used from Solr to implement this.
: Can someone give us pointers to get started with this ?

"facet.field" and "fq" (filter query)

http://wiki.apache.org/solr/SolrFacetingOverview
http://wiki.apache.org/solr/SimpleFacetParameters
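
A sketch of how the two combine (field names are hypothetical stand-ins for 
the dice.com filters): facet on the filterable fields, and add one fq per 
filter the visitor clicks.

http://localhost:8983/solr/select?q=java&facet=true&facet.field=skill&facet.field=country&facet.field=city&facet.mincount=1&fq=country:Canada&fq=telecommute:true

The facet counts automatically reflect the narrowed result set, which is 
what drives the left-pane numbers.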




-Hoss



Re: Issue with exact matching

2010-03-18 Thread Erick Erickson
I only have time for a quick glance, but what jumps out is
that this part:
title:rude boy^100

probably isn't matching "boy" against your title field, it's matching "rude"
against title, but "boy" against your default field and boosting the "boy"
part.

Try parenthesizing (at least that works in Lucene).

HTH
Erick

On Thu, Mar 18, 2010 at 5:06 PM, Alex Thurlow  wrote:

> I'm trying to give a super boost to fields that match exactly, but it
> doesn't appear to be working.  I have this:
>
>  stored="true"/>
>  stored="true"/>
>
> 
> 
>
>
>  sortMissingLast="true" omitNorms="true">
> 
> 
> 
> 
> 
> 
>
> The dataset has two items with title="Rude Boy", but they are coming up way
> down the list.  My query looks like this: title:rude boy^100 OR artist:rude
> boy^100 OR description:rude boy^5 OR tags:rude boy^10 OR title:"rude
> boy"~100^100 OR artist:"rude boy"~100^100 OR description:"rude boy"~10
> OR tags:"rude boy"~1000^10 OR title_tight:rude boy^1000 or artist_tight:rude
> boy^1500 OR artist_title:rude boy^100
>
> The debug output seems to say that it's not even matching that field:
> 0.33547562 = (MATCH) product of:
>   1.2748073 = (MATCH) sum of:
>     0.004359803 = (MATCH) weight(title:rude in 19218), product of:
>       7.7693147E-4 = queryWeight(title:rude), product of:
>         8.978507 = idf(docFreq=6, maxDocs=20423)
>         8.653237E-5 = queryNorm
>       5.611567 = (MATCH) fieldWeight(title:rude in 19218), product of:
>         1.0 = tf(termFreq(title:rude)=1)
>         8.978507 = idf(docFreq=6, maxDocs=20423)
>         0.625 = fieldNorm(field=title, doc=19218)
>     0.002346348 = (MATCH) weight(tags:rude in 19218), product of:
>       8.0604723E-4 = queryWeight(tags:rude), product of:
>         9.31498 = idf(docFreq=4, maxDocs=20423)
>         8.653237E-5 = queryNorm
>       2.910931 = (MATCH) fieldWeight(tags:rude in 19218), product of:
>         1.0 = tf(termFreq(tags:rude)=1)
>         9.31498 = idf(docFreq=4, maxDocs=20423)
>         0.3125 = fieldNorm(field=tags, doc=19218)
>     1.203806 = weight(title:"rude boy"~100^100.0 in 19218), product of:
>       0.12910038 = queryWeight(title:"rude boy"~100^100.0), product of:
>         100.0 = boost
>         14.919317 = idf(title: rude=6 boy=145)
>         8.653237E-5 = queryNorm
>       9.3245735 = fieldWeight(title:"rude boy" in 19218), product of:
>         1.0 = tf(phraseFreq=1.0)
>         14.919317 = idf(title: rude=6 boy=145)
>         0.625 = fieldNorm(field=title, doc=19218)
>     0.060910318 = weight(tags:"rude boy"~1000^10.0 in 19218), product of:
>       0.012987026 = queryWeight(tags:"rude boy"~1000^10.0), product of:
>         10.0 = boost
>         15.008287 = idf(tags: rude=4 boy=186)
>         8.653237E-5 = queryNorm
>       4.6900897 = fieldWeight(tags:"rude boy" in 19218), product of:
>         1.0 = tf(phraseFreq=1.0)
>         15.008287 = idf(tags: rude=4 boy=186)
>         0.3125 = fieldNorm(field=tags, doc=19218)
>     0.0033848688 = (MATCH) weight(artist_title:rude in 19218), product of:
>       7.6537667E-4 = queryWeight(artist_title:rude), product of:
>         8.844975 = idf(docFreq=7, maxDocs=20423)
>         8.653237E-5 = queryNorm
>       4.4224877 = (MATCH) fieldWeight(artist_title:rude in 19218), product of:
>         1.0 = tf(termFreq(artist_title:rude)=1)
>         8.844975 = idf(docFreq=7, maxDocs=20423)
>         0.5 = fieldNorm(field=artist_title, doc=19218)
>   0.2631579 = coord(5/19)
>
>
> Someone else suggested I use DisMax, but I can't really get that to do what
> I want right now either.  I'm just wondering why this seems to not be using
> this field at all.
>
>-Alex
>
>


Re: DIH questions

2010-03-18 Thread Shawn Heisey

That looks very useful.  So does this mean that this will work?

URL text:
?command=full-import&numShards=6&modValue=0&minDid=229615984

XML:
query="SELECT * FROM [table] WHERE (did % 
${dataimporter.request.numShards}) = ${dataimporter.request.modValue} 
AND ${dataimporter.request.minDid} >= did"


Thanks,
Shawn

On 3/18/2010 4:15 PM, Lukas Kahwe Smith wrote:

I recently asked about this on this list. You can use request
parameters in your DIH xml:
http://wiki.apache.org/solr/DataImportHandler#Accessing_request_parameters

You can even define defaults for these parameters inside your 
solrconfig.xml request handler configuration.

regards,
Lukas Kahwe Smith
m...@pooteeweet.org








stream.url Contention

2010-03-18 Thread Giovanni Fernandez-Kincade
I recently switched from posting files (PDFs in this case) to the extract 
handler to using the stream.url parameter. I've noticed a huge amount of 
contention around opening URL connections:

http-8080-Processor36 [BLOCKED] CPU time: 0:47
sun.net.www.protocol.file.Handler.openConnection(URL)
java.net.URL.openConnection()
sun.net.www.protocol.jar.JarURLConnection.<init>(URL, Handler)
sun.net.www.protocol.jar.Handler.openConnection(URL)
java.net.URL.openConnection()
java.net.URL.openStream()
java.lang.ClassLoader.getResourceAsStream(String)
org.pdfbox.util.ResourceLoader.loadResource(String)
org.pdfbox.util.ResourceLoader.loadProperties(String)
org.pdfbox.util.PDFTextStripper.<init>()
org.apache.tika.parser.pdf.PDF2XHTML.<init>(ContentHandler, Metadata)
org.apache.tika.parser.pdf.PDF2XHTML.process(PDDocument, ContentHandler, 
Metadata)
org.apache.tika.parser.pdf.PDFParser.parse(InputStream, ContentHandler, 
Metadata)
org.apache.tika.parser.CompositeParser.parse(InputStream, ContentHandler, 
Metadata)
org.apache.tika.parser.AutoDetectParser.parse(InputStream, ContentHandler, 
Metadata)
org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(SolrQueryRequest,
 SolrQueryResponse, ContentStream)
org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(SolrQueryRequest,
 SolrQueryResponse)
org.apache.solr.handler.RequestHandlerBase.handleRequest(SolrQueryRequest, 
SolrQueryResponse)
org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(SolrQueryRequest,
 SolrQueryResponse)
org.apache.solr.core.SolrCore.execute(SolrRequestHandler, SolrQueryRequest, 
SolrQueryResponse)
org.apache.solr.servlet.SolrDispatchFilter.execute(HttpServletRequest, 
SolrRequestHandler, SolrQueryRequest, SolrQueryResponse)
org.apache.solr.servlet.SolrDispatchFilter.doFilter(ServletRequest, 
ServletResponse, FilterChain)
org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ServletRequest,
 ServletResponse)
org.apache.catalina.core.ApplicationFilterChain.doFilter(ServletRequest, 
ServletResponse)
org.apache.catalina.core.StandardWrapperValve.invoke(Request, Response)
org.apache.catalina.core.StandardContextValve.invoke(Request, Response)
org.apache.catalina.core.StandardHostValve.invoke(Request, Response)
org.apache.catalina.valves.ErrorReportValve.invoke(Request, Response)
org.apache.catalina.core.StandardEngineValve.invoke(Request, Response)
org.apache.catalina.connector.CoyoteAdapter.service(Request, Response)
org.apache.coyote.http11.Http11Processor.process(InputStream, OutputStream)
org.apache.coyote.http11.Http11BaseProtocol$Http11ConnectionHandler.processConnection(TcpConnection,
 Object[])
org.apache.tomcat.util.net.PoolTcpEndpoint.processSocket(Socket, TcpConnection, 
Object[])
org.apache.tomcat.util.net.LeaderFollowerWorkerThread.runIt(Object[])
org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run()
java.lang.Thread.run()

This seems to be a significant bottleneck, even when running only a handful of 
threads. Has anyone else run into this? Any ideas on how to reduce the blocking?

Thanks,
Gio.


Re: good spell dictionary

2010-03-18 Thread Erick Erickson
Spellcheck is generally more useful if it's derived from words
already *in* your index. It's of little use to a user to have
spellcheck/autosuggest show terms that aren't in the
index...

HTH
Erick
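
A minimal sketch of an index-derived spellchecker in solrconfig.xml, 
assuming a spell field (typically a copyField target aggregating the text 
that suggestions should be drawn from):

<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
  <lst name="spellchecker">
    <str name="name">default</str>
    <str name="field">spell</str>
    <str name="spellcheckIndexDir">./spellchecker</str>
    <str name="buildOnCommit">true</str>
  </lst>
</searchComponent>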


On Thu, Mar 18, 2010 at 6:00 PM, michaelnazaruk wrote:

>
> Can anyone tell me where I can buy or download a free spelling dictionary for
> Solr? I don't need a simple dictionary; I need a very good American-English
> spelling dictionary (or American only)!
>
>


Re: Generating a sitemap

2010-03-18 Thread Jon Baer
It's also possible to use the Velocity contrib response writer and page 
through the sitemap elements w/ it.

BTW generating a sitemap was a big reason for a switch we did from GSA to Solr, 
because (for some reason) the map took way too long to generate (even for 
simple requests).

If you page through w/ Solr (ie rows=100&wt=velocity&v.template=sitemap) it's 
fairly painless to build on cron.

- Jon
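
A sketch of what such a sitemap.vm template might look like, assuming the 
velocity contrib is wired up and each doc has a stored url field:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
#foreach($doc in $response.results)
  <url><loc>$doc.getFieldValue('url')</loc></url>
#end
</urlset>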

On Mar 18, 2010, at 6:25 PM, Chris Hostetter wrote:

> 
> : Been testing nutch to crawl for solr and I was wondering if anyone had
> : already worked on a system for getting the urls out of solr and generating
> : an XML sitemap for Google.
> 
> it's pretty easy to just paginate through all docs in solr, so you could 
> do that -- but I'd be really surprised if Nutch wasn't also logging all the 
> URLs it indexed, so you could just post-process that log to build the 
> sitemap as well.
> 
> 
> 
> -Hoss
> 



Re: DIH questions

2010-03-18 Thread Shawn Heisey
I gave this config idea a try, and it looks like it works perfectly.  I thought 
at first that it wasn't working, but as is usual with such things, my 
XML was faulty.


Many many thanks!

Shawn


On 3/18/2010 5:19 PM, Shawn Heisey wrote:

That looks very useful.  So does this mean that this will work?

URL text:
?command=full-import&numShards=6&modValue=0&minDid=229615984

XML:
query="SELECT * FROM [table] WHERE (did % 
${dataimporter.request.numShards}) = ${dataimporter.request.modValue} 
AND ${dataimporter.request.minDid} >= did"


Thanks,
Shawn

On 3/18/2010 4:15 PM, Lukas Kahwe Smith wrote:

I recently asked about this on this list. You can use request
parameters in your DIH xml:
http://wiki.apache.org/solr/DataImportHandler#Accessing_request_parameters 



You can even define defaults for these parameters inside 
your solrconfig.xml request handler configuration.


regards,
Lukas Kahwe Smith
m...@pooteeweet.org










[POLL] Users of abortOnConfigurationError ?

2010-03-18 Thread Chris Hostetter


Due to some issues with the (lack of) functionality behind the 
"abortOnConfigurationError" option in solrconfig.xml, I'd like to take a 
quick poll of the solr-user community...


 * If you have never heard of the abortOnConfigurationError
   option prior to this message, please ignore this email.

 * If you have seen abortOnConfigurationError in solrconfig.xml,
   or in error messages when using Solr, but you have never
   modified the value of this option in your configs, or changed
   it at run time, please ignore this email.

 * If you have ever set abortOnConfigurationError=false, either
   in your config files or at run time, please reply to these
   three questions...

1) What version of Solr are you using ?

2) What advantages do you perceive that you have by setting
   abortOnConfigurationError=false ?

3) What problems do you suspect you would encounter if this
   option was eliminated in future versions of Solr ?

Thank you.

(For people who are interested, the impetuses for this Poll can be found 
in SOLR-1743, SOLR-1817, SOLR-1824, and SOLR-1832)



-Hoss



Re: Return all Facets?

2010-03-18 Thread Erik Hatcher
David - sounds kinda like this one: http://issues.apache.org/jira/browse/SOLR-1280 :)


Maybe you'd be up for rounding this issue out with your enhancements 
and getting this committable?


Erik

On Mar 18, 2010, at 4:06 PM, Smiley, David W. wrote:

Coincidentally I'm working on something like this right now. However, 
in my case I want results filtered by the current search for the facets 
they use (which is a subset of all available), with a count. This is a 
sort of meta-facet, since it's faceting on the facetable fields. I've 
implemented this by adding an UpdateRequestProcessor implementation that 
finds fields matching a certain pattern and simply adds their names to a 
field I've chosen -- which will at query time be faceted on. Pretty easy.


~ David Smiley
Author: http://www.packtpub.com/solr-1-4-enterprise-search-server/

On Mar 18, 2010, at 3:58 PM, homerlex wrote:



Thanks for the reply.  Can someone point me to a sample on how to use 
the luke request handler to get this info?


Erik Hatcher-4 wrote:


No, there isn't.  How would one know what all the facet fields are,
though?

One trick, use the luke request handler to get the list of fields,
then use that list to construct the facet fields request parameters.

Erik
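
For reference, the stock example solrconfig.xml exposes the Luke handler at 
/admin/luke, so the field list is one request away; numTerms=0 skips the 
per-field top-terms computation and keeps it cheap:

http://localhost:8983/solr/admin/luke?numTerms=0

The response carries one entry per field, which can then be fed back as 
facet.field parameters.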
















Re: [ANN] Zoie Solr Plugin - Zoie Solr Plugin enables real-time update functionality for Apache Solr 1.4+

2010-03-18 Thread Erik Hatcher
"When I don't do the commit, I cannot search the documents I've  
indexed." - that's exactly how Solr without Zoie works, and it's how  
Lucene itself works.  Gotta commit to see the documents indexed.


Erik


On Mar 18, 2010, at 5:41 PM, brad anderson wrote:


Tried following their tutorial for plugging zoie into solr:
   http://snaprojects.jira.com/wiki/display/ZOIE/Zoie+Server

It appears it only allows you to search for documents after you do a 
commit.

Am I missing something here, or does the plugin not do anything?

Their tutorial tells you to do a commit when you index the docs:

curl http://localhost:8983/solr/update/csv?commit=true --data-binary
@books.csv -H 'Content-type:text/plain; charset=utf-8'


When I don't do the commit, I cannot search the documents I've  
indexed.


Thanks,
Brad

On 9 March 2010 23:34, Don Werve  wrote:


2010/3/9 Shalin Shekhar Mangar 


I think Don is talking about Zoie - it requires a long uniqueKey.



Yep; we're using UUIDs.





How many facet values are too many?

2010-03-18 Thread Andy
My understanding is that too many facet values will decrease performance.

How many is too many? Are there any rules of thumb for this?

2 related questions:

- I expect a facet field to have many values (values are user generated); is 
there anything I can do to minimize the performance impact?

- Any way to tell Solr to just return the top N values, kinda like the LIMIT 
clause in SQL?

Thanks
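
On the last question: facet.limit is exactly that LIMIT-style knob; a 
sketch, with a hypothetical tags field:

&facet=true&facet.field=tags&facet.limit=10&facet.mincount=1

Note it only trims what comes back in the response; Solr still computes 
counts across all values of the field, so it doesn't by itself reduce the 
cost of faceting a high-cardinality field.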



  

Re: Weird behaviour for certain search terms

2010-03-18 Thread Akash Sahu

I tried adding &hl.maxAnalyzedChars=-1 to my search query but it didn't
help.
Just wanted to know if there are limitations on certain search terms.
It's a bit strange that Solr is not behaving properly for certain terms
(especially when returning the excerpts in the highlighting dictionary).
The terms which i have found so far are:
1. co-ownership
2. "co ownership"
3. co-employees

I am doing some more tests on it.
thanks



Ahmet Arslan wrote:
> 
> 
>> Solr is behaving a bit weirdly for some of the search
>> terms. EG:
>> co-ownership, "co ownership".
>> It works fine with terms like quasi-delict,
>> non-interference etc.
>> 
>> The issue is, it's not returning any excerpts in the "highlighting"
>> key of the
>> result dictionary. My search query is something like this:
>> http://192.168.1.50:8080/solr/core_SFS/select?q=content:(co-ownership)+AND+permauid:(AAAE1292-rw)&hl=true&hl.fl=content&hl.requireFieldMatch=true&hl.fragsize=600&hl.usePhraseHighlighter=true&facet=true&facet.field=permauid&facet.field=info_owner&facet.sort=true&facet.mincount=1&facet.limit=-1&wt=python&sort=promulgation_date_igprs_date+asc&start=0&rows=200&fl=uid,permauid
>> 
>> but when I search for terms like quasi-delict,
>> non-interference, it gives me
>> proper excerpts.
> 
> 
> If the problem is only empty snippets (numFound > 0) then adding
> &hl.maxAnalyzedChars=-1 can help.
> 
> 
> 
>   
> 
> 




Re: PDFBox/Tika Performance Issues

2010-03-18 Thread Mattmann, Chris A (388J)
Hi Giovanni,

Let's try and isolate the problem. Can you try parsing the PDF file with 
tika-app as a standalone? Take your tika-app jar file then run java -jar 
tika-app-0.7-SNAPSHOT.jar -m /path/to/pdf/file

That should give you something like:

Content-Type: application/pdf
created: Thu Sep 06 00:41:55 PDT 2007
creator: TeX
producer: pdfeTeX-1.21a
resourceName: Dissertation.pdf

(e.g., this is what I got when I ran it on my Dissertation PDF file).

Let's start there - if that works, then there is something up with the 
integration into SolrCell, and we can start to figure that out...

Cheers,
Chris



On 3/17/10 8:06 AM, "Giovanni Fernandez-Kincade" 
 wrote:

Hmm. Unfortunately that didn't work. Same problem - Solr doesn't report an 
error, but the data doesn't get extracted. Using the same PDF with my previous 
/Lib contents works fine.

Any other ideas?

These are the jar files I have in my /Lib

apache-solr-cell-1.4-dev.jar
asm-3.1.jar
bcmail-jdk15-1.45.jar
bcprov-jdk15-1.45.jar
commons-codec-1.3.jar
commons-compress-1.0.jar
commons-io-1.4.jar
commons-lang-2.1.jar
commons-logging-1.1.1.jar
dom4j-1.6.1.jar
fontbox-1.0.0.jar
geronimo-stax-api_1.0_spec-1.0.1.jar
hamcrest-core-1.1.jar
icu4j-3.8.jar
jempbox-1.0.0.jar
junit-3.8.1.jar
log4j-1.2.14.jar
lucene-core-2.9.1-dev.jar
lucene-misc-2.9.1-dev.jar
metadata-extractor-2.4.0-beta-1.jar
mockito-core-1.7.jar
nekohtml-1.9.9.jar
objenesis-1.0.jar
ooxml-schemas-1.0.jar
pdfbox-1.0.0.jar
poi-3.6.jar
poi-ooxml-3.6.jar
poi-ooxml-schemas-3.6.jar
poi-scratchpad-3.6.jar
tagsoup-1.2.jar
tika-core-0.7-SNAPSHOT.jar
tika-parsers-0.7-SNAPSHOT.jar
xercesImpl-2.8.1.jar
xml-apis-1.0.b2.jar
xmlbeans-2.3.0.jar

-Original Message-
From: Mattmann, Chris A (388J) [mailto:chris.a.mattm...@jpl.nasa.gov]
Sent: Tuesday, March 16, 2010 11:50 PM
To: solr-user@lucene.apache.org
Subject: Re: PDFBox/Tika Performance Issues

Hi Giovanni,

Comments below:

> I'm pretty unclear on how to patch the Tika 0.7-trunk on our Solr instance.
> This is what I've tried so far (which was really just me guessing):
>
>
>
> 1. Got the latest version of the trunk code from
> http://svn.apache.org/repos/asf/lucene/tika/trunk
>
> 2. Built this using Maven (mvn install)
>

On track so far.

> 3. I took the resulting tika-app-0.7-SNAPSHOT.jar, copied it to the /Lib
> folder for my Solr Core, and renamed it to the name of the existing Tika Jar
> (tika-0.3.jar).

I don't think you need to do this (w.r.t. the renaming). I think what you
need to do is to drop:

tika-core-0.7-SNAPSHOT.jar
tika-parsers-0.7-SNAPSHOT.jar

Into your Solr core /lib folder. Also you should make sure to take the
updated PDFBox 1.0.0 jar (you can get this by running mvn dependency:copy-dependencies
in the tika-parsers project; see here:
http://maven.apache.org/plugins/maven-dependency-plugin/copy-dependencies-mojo.html),
along with the rest of the jar deps for tika-parsers and drop them
in there as well. Then, make sure to remove the existing tika-0.3.jar, as
well as any of the existing parser lib jar files and replace them with the
new deps.

A bunch of manual labor yes, but you're on the bleeding edge, so c'est la
vie, right? :) The alternative is to wait for Tika 0.7 to be released and
then for Solr to upgrade to it.

>
> 4. Then I bounced my servlet server and tried indexing a document. The
> document was successfully indexed, and there were no errors logged as a
> result, but the PDF data does not appear to have been extracted (the field I
> used for map.content had an empty-string as a value).

I think it probably has to do with the lib deps. Try what I mentioned above and
let's go from there.

Cheers,
Chris

> -Original Message-
> From: Giovanni Fernandez-Kincade [mailto:gfernandez-kinc...@capitaliq.com]
> Sent: Tuesday, March 16, 2010 5:41 PM
> To: solr-user@lucene.apache.org
> Subject: RE: PDFBox/Tika Performance Issues
>
>
>
> Thanks Chris!
>
>
>
> I'll try the patch.
>
>
>
> -Original Message-
>
> From: Mattmann, Chris A (388J) [mailto:chris.a.mattm...@jpl.nasa.gov]
>
> Sent: Tuesday, March 16, 2010 5:37 PM
>
> To: solr-user@lucene.apache.org
>
> Subject: Re: PDFBox/Tika Performance Issues
>
>
>
> Guys, I think this is an issue with PDFBOX and the version that Tika 0.6
> depends on. Tika 0.7-trunk upgraded to PDFBox 1.0.0 (see [1]), so it may
> include a fix for the problem you're seeing.
>
>
>
> See this discussion [2] on how to patch Tika to use the new PDFBox if you
> can't wait for the 0.7 release which should happen soon (hopefully next few
> weeks).
>
>
>
> Cheers,
>
> Chris
>
>
>
> [1] http://issues.apache.org/jira/browse/TIKA-380
>
> [2] http://www.mail-archive.com/tika-u...@lucene.apache.org/msg00302.html
>
>
>
>
>
> On 3/16/10 2:31 PM, "Giovanni Fernandez-Kincade"
>  wrote:
>
>
>
> Originally 16 (the number of CPUs on the machine), but even with 5 threads
> it's not looking so hot.
>
>
>
> -Original Message-
>
> From: Grant Ingersoll [mailto:gsi...@gmail.com] On Behalf Of Grant Inger

Omitting norms question

2010-03-18 Thread blargy

Should I keep norms (i.e., not omit them) on any fields that I would like to
boost via a boost query/function query?

For example, I have a created_on field on one of my documents and I would
like to add some sort of function query on this field when querying. In this
case, does this mean I need the norms?

What about sortable fields? Facetable fields?

Thanks!
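
For what it's worth: function queries read indexed field values (via the 
FieldCache), not norms, and the same is true of sorting and faceting, so 
omitNorms="true" is generally fine on fields used only for those. A sketch 
of such a recency boost, assuming created_on is a trie date field and 
Solr 1.4's {!boost} parser:

http://localhost:8983/solr/select?q={!boost b=recip(ms(NOW,created_on),3.16e-11,1,1)}ipod

Norms only come into play for regular relevance scoring on the field itself 
(length normalization and index-time boosts).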