any docs on using the GeoHashField?
looking at http://wiki.apache.org/solr/SpatialSearchDev I would think I could index a lat,lon pair into a GeoHashField (that works) and then retrieve the field value to see the computed geohash. However, that doesn't seem to work.

If I index: 21.4,33.5 the retrieved value is not a hash, but approximately the same lat,lon: 21.4001527369,33.498472631

If I try to filter on a geohash, &fq=geos_test:sezcd* that works, so I guess the hash is stored internally. What am I missing - how can I retrieve the hash?

-Peter

-- Peter M. Wolanin, Ph.D. : Momentum Specialist, Acquia. Inc. peter.wola...@acquia.com : 781-313-8322 "Get a free, hosted Drupal 7 site: http://www.drupalgardens.com"
Re: any docs on using the GeoHashField?
When I retrieve the value the lat/lon pair that comes out is not exactly the same as what I indexed, which made me think it was actually stored as the hash and then transformed back?

Anyhow - I'm trying to understand the actual use case for the field as it exists - essentially you are saying I could query with a geohash and use data in this field type to do a distance-based filter from the lat,lon point corresponding to the geohash?

-Peter

On Thu, Sep 8, 2011 at 5:34 PM, Chris Hostetter wrote: > > : I would think I could index a lat,lon pair into a GeoHashField (that > : works) and then retrieve the field value to see the computed geohash. > ... > : What am I missing - how can I retrieve the hash? > > I don't think it's designed to work that way. > > GeoHashField provides GeoHash based search support for lat/lon values > through its internal (indexed) representation -- much like TrieLongField > provides efficient range queries using trie encoding -- but the "stored" > value is still the lat/lon pair (just as a TrieLongField is still the long > value) > > If you want to store/retrieve a raw GeoHash string, I think you have to > compute it yourself (or put the logic in an UpdateProcessor). > > org.apache.lucene.spatial.geohash.GeoHashUtils should take care of all the > heavy lifting for you. > > -Hoss

-- Peter M. Wolanin, Ph.D. : Momentum Specialist, Acquia. Inc. peter.wola...@acquia.com : 781-313-8322 "Get a free, hosted Drupal 7 site: http://www.drupalgardens.com"
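For reference, computing the hash on the client side with the GeoHashUtils class Hoss points to would look roughly like this - a minimal sketch against the Lucene 3.x spatial contrib, using the values from the original message (the class name here is just for illustration):

import org.apache.lucene.spatial.geohash.GeoHashUtils;

public class GeoHashDemo {
    public static void main(String[] args) {
        // encode the lat,lon pair before (or instead of) sending it to Solr
        String hash = GeoHashUtils.encode(21.4, 33.5);
        System.out.println(hash); // should start with "sezcd", matching the fq above

        // decoding back out is lossy at the hash's fixed precision, which is
        // consistent with the retrieved value being 21.4001527369,33.498472631
        double[] latLon = GeoHashUtils.decode(hash);
        System.out.println(latLon[0] + "," + latLon[1]);
    }
}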
Re: Setting up Solr 3.4 example with Tomcat 7
I've seen a number of users fail to get Solr working correctly in combination with the Drupal client code when using the .deb installer so I have been strongly recommending against it personally. It's also a rather stale version of Solr, generally. -Peter On Sun, Oct 2, 2011 at 4:04 AM, Gora Mohanty wrote: > On Sun, Oct 2, 2011 at 12:22 PM, Stardrive Engineering > wrote: >> Thanks. Since Tomcat and Solr are running already Tomcat oriented samples to >> quickly get up to >> speed would be good to have next. > > I think that the issue is that Jetty is small, and easy to embed and get > running, which is why it is packaged along with Solr. > >> What do you think of this >> site, is it up to date and worth learning? The site seems to get cut off >> prematurely, are there more tutorials of this kind? >> >> http://synapticloop.com/tomes/solr/solr-tutorial/solr-from-whoa-to-go/ > > Just skimmed through this part, > http://synapticloop.com/tomes/solr/solr-tutorial/the-base-solr-install/ > and it looks reasonable. > > What operating system are you using? Some of them, e.g., Debian and > Ubuntu, have packages for Solr (though, probably version 1.4) > running in Tomcat. It might be easiest to look for such a package > for your OS. > > Regards, > Gora > -- Peter M. Wolanin, Ph.D. : Momentum Specialist, Acquia. Inc. peter.wola...@acquia.com : 781-313-8322 "Get a free, hosted Drupal 7 site: http://www.drupalgardens.com";
Retrieving matched tokens and their payload?
A colleague came to me with a problem that intrigued me. I can see partly how to solve it with Solr, but I'm looking for insight into solving the last step.

The problem:

1) Start from a set of text transcriptions of videos where there is a timestamp associated with each word.
2) Index into Solr with analysis including stemming, so that a user can search for videos based on keywords.
3) When the user clicks into a single video in the search result, retrieve from the corresponding doc in Solr the timestamps of all words matching the keyword(s) (including stemming).

So, obviously #1 and #2 are easy. As part of #2 it would seem one could use the DelimitedPayloadTokenFilterFactory to index the timestamp as a payload for each word. I don't want the payload to influence score, but my understanding is that by default it will not.

Ok, so now for the harder part. For #3 it would seem I need something roughly like the highlighter - to return each matching word and the payload which is the timestamp. I'm not seeing any existing request handler or component that would do this. Is there an easy way to retrieve the indexed words (or analyzed tokens) and their payload?

Thanks,

-Peter

-- Peter M. Wolanin, Ph.D. : Momentum Specialist, Acquia. Inc. peter.wola...@acquia.com : 781-313-8322 "Get a free, hosted Drupal 7 site: http://www.drupalgardens.com"
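Since no stock component does step #3, the retrieval would have to be custom code against the raw Lucene API. A hedged sketch only (Lucene 3.x, assuming the timestamps were indexed as payloads via DelimitedPayloadTokenFilterFactory with its "identity" encoder; the field name "transcript" is made up):

import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermPositions;

public class PayloadDump {
    // print every (docId, timestamp) pair recorded for one analyzed token
    static void dumpPayloads(IndexReader reader, String token) throws IOException {
        TermPositions tp = reader.termPositions(new Term("transcript", token));
        while (tp.next()) {
            for (int i = 0; i < tp.freq(); i++) {
                tp.nextPosition();
                if (tp.isPayloadAvailable()) {
                    byte[] bytes = tp.getPayload(new byte[tp.getPayloadLength()], 0);
                    // the identity encoder stores the raw bytes of the text after
                    // the delimiter, i.e. the timestamp string itself
                    System.out.println(tp.doc() + " @ " + new String(bytes, "UTF-8"));
                }
            }
        }
        tp.close();
    }
}

Wrapping something like this in a custom SearchComponent (analogous to the highlighter) would let the timestamps come back in the response.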
Re: Lucene/Solr
Assuming you are using Drupal for the website, you can have Solr set up and integrated with Drupal in < 5 minutes for local development purposes. See: https://drupal.org/node/1358710 for a pre-configured download. -Peter On Mon, Dec 5, 2011 at 11:46 AM, Achebe, Ike, JCL wrote: > Hi, > My name is Ike Achebe and I am a Developer Analyst with the Johnson County > Library. I'm actually researching better and less expensive alternatives to > "Google Appliance Search " , which is currently our search engine. > Fortunately, I have come across a variety of blogs recommending Lucene/Solr > as one of the best if not the best open source search engine. > Fortunately, I have read a few articles and documentations about Solr, > however , I'm still in awe as to how to go about installing and integrating > this search engine. > Could you in simple terms intimate me on how to go about acquiring or > subscribing to solr? > Thank You. > Sincerely, > Ike Achebe -- Peter M. Wolanin, Ph.D. : Momentum Specialist, Acquia. Inc. peter.wola...@acquia.com : 781-313-8322 "Get a free, hosted Drupal 7 site: http://www.drupalgardens.com";
Polish language support?
In IRC trying to help someone find Polish-language support for Solr. Seems lucene has nothing to offer? Found one stemmer that looks to be compatibly licensed in case someone wants to take a shot at incorporating it: http://www.getopt.org/stempel/ -Peter -- Peter M. Wolanin, Ph.D. Momentum Specialist, Acquia. Inc. peter.wola...@acquia.com
access control for spellcheck suggestions?
We have a content access control system that works well for the actual search results, but we see that the spellcheck suggestions include words that are not within the set of documents the current user is allowed to access. Does anyone have an approach to this problem for Solr 1.4.x? Anything new in Solr trunk to address this? Maybe spellcheck..key? Is there something in the Solr API that lets us control which spellcheck index a certain document goes into at index time, since one approach might be to at least obey some gross access control rules per user role by having multiple spellcheck indexes. -Peter -- Peter M. Wolanin, Ph.D. : Momentum Specialist, Acquia. Inc. peter.wola...@acquia.com : 978-296-5247 "Get a free, hosted Drupal 7 site: http://www.drupalgardens.com";
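One sketch of the multiple-index idea, assuming gross per-role filtering is acceptable: solrconfig.xml can declare several dictionaries in one SpellCheckComponent, each built from a different field, and each request picks one with spellcheck.dictionary. The role names and fields below are made up - you would still have to route each document's text into the right spell_role_* field at index time:

<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
  <lst name="spellchecker">
    <str name="name">role_editor</str>
    <str name="field">spell_role_editor</str>
    <str name="spellcheckIndexDir">./spellchecker_role_editor</str>
  </lst>
  <lst name="spellchecker">
    <str name="name">role_anonymous</str>
    <str name="field">spell_role_anonymous</str>
    <str name="spellcheckIndexDir">./spellchecker_role_anonymous</str>
  </lst>
</searchComponent>

and then query with &spellcheck=true&spellcheck.dictionary=role_anonymous for anonymous users.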
Re: access control for spellcheck suggestions?
Thanks for the info - I'll try out this patch. -Peter On Thu, Oct 7, 2010 at 10:43 AM, Dyer, James wrote: > Look at SOLR-2010 which has patches for 1.4.1 and trunk. It works with the > spellcheck "collate" functionality and ensures that collations are returned > only if they can result in hits if requeried (it tests each collation with > any "fq" you put on the original query). This would effectively prevent > users from seeing sensitive data in their spell suggestions. > > James Dyer > E-Commerce Systems > Ingram Content Group > (615) 213-4311 > > > -Original Message- > From: Peter Wolanin [mailto:peter.wola...@acquia.com] > Sent: Thursday, October 07, 2010 9:00 AM > To: solr-user@lucene.apache.org > Subject: access control for spellcheck suggestions? > > We have a content access control system that works well for the actual > search results, but we see that the spellcheck suggestions include > words that are not within the set of documents the current user is > allowed to access. Does anyone have an approach to this problem for > Solr 1.4.x? Anything new in Solr trunk to address this? Maybe > spellcheck..key? > > Is there something in the Solr API that lets us control which > spellcheck index a certain document goes into at index time, since one > approach might be to at least obey some gross access control rules per > user role by having multiple spellcheck indexes. > > -Peter > > -- > Peter M. Wolanin, Ph.D. : Momentum Specialist, Acquia. Inc. > peter.wola...@acquia.com : 978-296-5247 > > "Get a free, hosted Drupal 7 site: http://www.drupalgardens.com"; > -- Peter M. Wolanin, Ph.D. : Momentum Specialist, Acquia. Inc. peter.wola...@acquia.com : 978-296-5247 "Get a free, hosted Drupal 7 site: http://www.drupalgardens.com";
mergePolicy element format change in 3.6 vs 3.5?
Trying to maintain the Drupal integration module across multiple versions of 3.x, we've gotten a bug report suggesting that Solr 3.6 needs this change to solrconfig:

-  <mergePolicy>org.apache.lucene.index.LogByteSizeMergePolicy</mergePolicy>
+  <mergePolicy class="org.apache.lucene.index.LogByteSizeMergePolicy"/>

I don't see this mentioned in the release notes - is the second format useable with 3.5, 3.4, etc?

-- Peter M. Wolanin, Ph.D. : Momentum Specialist, Acquia. Inc. peter.wola...@acquia.com : 781-313-8322 "Get a free, hosted Drupal 7 site: http://www.drupalgardens.com"
Re: mergePolicy element format change in 3.6 vs 3.5?
Ok, thanks for the info. As long as the second one works, we can just use that. I just verified that it works for 3.5 at least.

-Peter

On Fri, Apr 13, 2012 at 1:12 PM, Michael Ryan wrote: > It looks like the first format was removed in 3.6 as part of > https://issues.apache.org/jira/browse/SOLR-1052. The second format works > in all 3.x versions. > > -Michael > > -Original Message- > From: Peter Wolanin [mailto:peter.wola...@acquia.com] > Sent: Friday, April 13, 2012 12:32 PM > To: solr-user@lucene.apache.org > Subject: mergePolicy element format change in 3.6 vs 3.5? > > Trying to maintain the Drupal integration module across multiple versions > of 3.x, we've gotten a bug report suggesting that Solr 3.6 needs this > change to solrconfig:
>
> -  <mergePolicy>org.apache.lucene.index.LogByteSizeMergePolicy</mergePolicy>
> +  <mergePolicy class="org.apache.lucene.index.LogByteSizeMergePolicy"/>
>
> I don't see this mentioned in the release notes - is the second format > useable with 3.5, 3.4, etc?

-- Peter M. Wolanin, Ph.D. : Momentum Specialist, Acquia. Inc. peter.wola...@acquia.com : 781-313-8322 "Get a free, hosted Drupal 7 site: http://www.drupalgardens.com"
Re: Highlighting words with non-ascii chars
Does your servlet container have the URI encoding set correctly, e.g. URIEncoding="UTF-8" for tomcat6? http://wiki.apache.org/solr/SolrTomcat#URI_Charset_Config

Older versions of Jetty use ISO-8859-1 as the default URI encoding, but jetty 6 should use UTF-8 as default: http://docs.codehaus.org/display/JETTY/International+Characters+and+Character+Encodings

-Peter

On Sat, Apr 30, 2011 at 6:31 AM, Pavel Kukačka wrote: > Hello, > > I've hit a (probably trivial) roadblock I don't know how to overcome > with Solr 3.1: > I have a document with common fields (title, keywords, content) and I'm > trying to use highlighting. > With queries using ASCII characters there is no problem; it works > smoothly. However, > when I search using a czech word including non-ascii chars (like "slovíčko" > for example - > http://localhost:8983/solr/select/?q=slov%C3%AD%C4%8Dko&version=2.2&start=0&rows=10&indent=on&hl=on&hl.fl=*), > the document is found, but > the response doesn't contain the highlighted snippet in the highlighting node > - there is only an > empty node - like this:
>
> .
> .
> .
> <lst name="highlighting">
>   <lst name="..."/>
> </lst>
>
> When searching for the other keyword ( > http://localhost:8983/solr/select/?q=slovo&version=2.2&start=0&rows=10&indent=on&hl=on&hl.fl=*), > the resulting response is fine - like this:
>
> <lst name="highlighting">
>   <lst name="...">
>     <arr name="...">
>       <str>... slovíčko ... <em>slovo</em> ...</str>
>     </arr>
>   </lst>
> </lst>
>
> Did anyone come across this problem? > Cheers, > Pavel > > >

-- Peter M. Wolanin, Ph.D. : Momentum Specialist, Acquia. Inc. peter.wola...@acquia.com : 978-296-5247 "Get a free, hosted Drupal 7 site: http://www.drupalgardens.com"
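For reference, the Tomcat side of that fix is a one-attribute change on the HTTP connector in server.xml (the other attributes here are just the stock ones):

<Connector port="8080" protocol="HTTP/1.1"
           connectionTimeout="20000"
           redirectPort="8443"
           URIEncoding="UTF-8"/>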
what data type for geo fields?
Looking at the example schema:

http://svn.apache.org/repos/asf/lucene/dev/branches/lucene_solr_3_3/solr/example/solr/conf/schema.xml

the solr.PointType field type uses double (is this just an example field, or used for geo search?), while the solr.LatLonType field uses tdouble and it's unclear how the geohash is translated into lat/lon values or if the geohash itself might typically be used as a copyfield and used just for matching a query on a geohash?

Is there an advantage in terms of speed to using Trie fields for solr.LatLonType? I would assume so, e.g. for bbox operations.

Thanks,

Peter

-- Peter M. Wolanin, Ph.D. : Momentum Specialist, Acquia. Inc. peter.wola...@acquia.com : 978-296-5247 "Get a free, hosted Drupal 7 site: http://www.drupalgardens.com"
Re: what data type for geo fields?
Thanks for the feedback. I'll look more at how geohash works.

Looking at the sample schema more closely, I see:

<fieldType name="double" class="solr.TrieDoubleField" precisionStep="0" omitNorms="true" positionIncrementGap="0"/>
<fieldType name="tdouble" class="solr.TrieDoubleField" precisionStep="8" omitNorms="true" positionIncrementGap="0"/>

So in fact "double" is also Trie, but just with precisionStep 0 in the example.

-Peter

On Wed, Jul 27, 2011 at 9:57 AM, Yonik Seeley wrote: > On Wed, Jul 27, 2011 at 9:01 AM, Peter Wolanin > wrote: >> Looking at the example schema: >> >> http://svn.apache.org/repos/asf/lucene/dev/branches/lucene_solr_3_3/solr/example/solr/conf/schema.xml >> >> the solr.PointType field type uses double (is this just an example >> field, or used for geo search?) > > While you could possibly use PointType for geo search, it doesn't have > good support for it (it's more of a general n-dimension point) > The LatLonType has all the geo support currently. > >>, while the solr.LatLonType field uses >> tdouble and it's unclear how the geohash is translated into lat/lon >> values or if the geohash itself might typically be used as a copyfield >> and used just for matching a query on a geohash? > > There's no geohash used in LatLonType > It is indexed as a lat and lon under the covers (using the suffix "_d") > >> Is there an advantage in terms of speed to using Trie fields for >> solr.LatLonType? > > Currently only for explicit range queries... like point:[10,10 TO 20,20] > >> I would assume so, e.g. for bbox operations. > > It's a bit of an implementation detail, but bbox doesn't currently use > range queries. > > -Yonik > http://www.lucidimagination.com >

-- Peter M. Wolanin, Ph.D. : Momentum Specialist, Acquia. Inc. peter.wola...@acquia.com : 978-296-5247 "Get a free, hosted Drupal 7 site: http://www.drupalgardens.com"
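For comparison, the stock 3.x example schema wires LatLonType up like this - the lat and lon halves land in hidden tdouble subfields via the *_coordinate dynamic field, which is the "under the covers" indexing Yonik describes:

<fieldType name="location" class="solr.LatLonType" subFieldSuffix="_coordinate"/>

<field name="store" type="location" indexed="true" stored="true"/>
<dynamicField name="*_coordinate" type="tdouble" indexed="true" stored="false"/>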
Solr 4.0 - distributed updates without zookeeper?
Looking at how we could upgrade some of our infrastructure to Solr 4.0 - I would really like to take advantage of distributed updates to get NRT, but we want to keep our fixed master and slave server roles since we use different hardware appropriate to the different roles.

Looking at the solr 4.0 distributed update code, it seems really hard-coded and bound to zookeeper. Is there a way to have a solr master distribute updates without using ZK, or a way to mock the ZK interface to provide a fixed cluster topography that will work when sending updates just to the master?

To be clear, if the master goes down I don't want a slave promoted, nor do I want most of the other SolrCloud features - we have already built out a system for managing groups of servers.

Thanks,

Peter
Re: Solr 4.0 - distributed updates without zookeeper?
Yes, basically I want to at least avoid leader election and the other dynamic behaviors. I don't have any experience with ZK, and a lot of "magic" behavior seems baked in now that I'm concerned I'd need to dig into ZK to debug or monitor what's really happening as we scale out.

We also have a somewhat non-typical use case, of lots of small cores/indexes on the same server, rather than large indexes that might need multiple shards.

We have master servers that have persistent (but sometimes slower) storage, and slaves with faster non-persistent disk.

My colleague noticed that there is a param to flag a server as eligible to be a shard leader, so I guess we could enable that for only the preferred master?

I'm also having trouble understanding config handling from the docs. Even browsing the java code I don't see if Solr is creating the instance dirs, or somehow just linking to config files? It sounds as though if I create a core using core admin, it would get associated with a collection of the same name.

-Peter

On Mon, Nov 12, 2012 at 9:41 PM, Otis Gospodnetic wrote: > Hi Peter, > > Not sure I have the answer for you, but are you looking to avoid using ZK > for some reason? > Or are you OK with ZK per se, but just don't want any leader re-election > and any other dynamic/cloudy behaviour? > > Could you not simply treat 1 node as the "master" to which you send all > your updates and let SolrCloud distribute that to the rest of the cluster? > Is your main/only worry around what happens if this 1 node that you > designated as the master goes down? What would you like to happen? You'd > like indexing to start failing, while the search functionality remains up? > > Otis > -- > Search Analytics - http://sematext.com/search-analytics/index.html > Performance Monitoring - http://sematext.com/spm/index.html > > > On Sun, Nov 11, 2012 at 7:42 PM, Peter Wolanin > wrote: > >> Looking at how we could upgrade some of our infrastructure to Solr 4.0 >> - I would really like to take advantage of distributed updates to get >> NRT, but we want to keep our fixed master and slave server roles since >> we use different hardware appropriate to the different roles. >> >> Looking at the solr 4.0 distributed update code, it seems really >> hard-coded and bound to zookeeper. Is there a way to have a solr >> master distribute updates without using ZK, or a way to mock the ZK >> interface to provide a fixed cluster topography that will work when >> sending updates just to the master? >> >> To be clear, if the master goes down I don't want a slave promoted, >> nor do I want most of the other SolrCloud features - we have already >> built out a system for managing groups of servers. >> >> Thanks, >> >> Peter >>

-- Peter M. Wolanin, Ph.D. : Momentum Specialist, Acquia. Inc. peter.wola...@acquia.com : 781-313-8322 "Get a free, hosted Drupal 7 site: http://www.drupalgardens.com"
Re: Solr 4.0 - distributed updates without zookeeper?
So, from looking at the code and talking to some of the Lucid guys today, it seems like there is no good way (currently) to control the shard leader selection, or even to "fail back" if the preferred leader server comes back up. We currently let indexing fail if the one master goes down, but adding HA there would be helpful in some cases.

-Peter

On Tue, Nov 13, 2012 at 9:12 PM, Peter Wolanin wrote: > Yes, basically I want to at least avoid leader election and the other > dynamic behaviors. I don't have any experience with ZK, and a lot of > "magic" behavior seems baked in now that I'm concerned I'd need to > dig into ZK to debug or monitor what's really happening as we scale > out. > > We also have a somewhat non-typical use case, of lots of small > cores/indexes on the same server, rather than large indexes that might > need multiple shards. > > We have master servers that have persistent (but sometimes slower) > storage, and slaves with faster non-persistent disk. > > My colleague noticed that there is a param to flag a server as > eligible to be a shard leader, so I guess we could enable that for > only the preferred master? > > I'm also having trouble understanding config handling from the docs. > Even browsing the java code I don't see if Solr is creating the > instance dirs, or somehow just linking to config files? It sounds as > though if I create a core using core admin, it would get associated > with a collection of the same name. > > -Peter > > On Mon, Nov 12, 2012 at 9:41 PM, Otis Gospodnetic > wrote: >> Hi Peter, >> >> Not sure I have the answer for you, but are you looking to avoid using ZK >> for some reason? >> Or are you OK with ZK per se, but just don't want any leader re-election >> and any other dynamic/cloudy behaviour? >> >> Could you not simply treat 1 node as the "master" to which you send all >> your updates and let SolrCloud distribute that to the rest of the cluster? >> Is your main/only worry around what happens if this 1 node that you >> designated as the master goes down? What would you like to happen? You'd >> like indexing to start failing, while the search functionality remains up? >> >> Otis >> -- >> Search Analytics - http://sematext.com/search-analytics/index.html >> Performance Monitoring - http://sematext.com/spm/index.html >> >> >> On Sun, Nov 11, 2012 at 7:42 PM, Peter Wolanin >> wrote: >> >>> Looking at how we could upgrade some of our infrastructure to Solr 4.0 >>> - I would really like to take advantage of distributed updates to get >>> NRT, but we want to keep our fixed master and slave server roles since >>> we use different hardware appropriate to the different roles. >>> >>> Looking at the solr 4.0 distributed update code, it seems really >>> hard-coded and bound to zookeeper. Is there a way to have a solr >>> master distribute updates without using ZK, or a way to mock the ZK >>> interface to provide a fixed cluster topography that will work when >>> sending updates just to the master? >>> >>> To be clear, if the master goes down I don't want a slave promoted, >>> nor do I want most of the other SolrCloud features - we have already >>> built out a system for managing groups of servers. >>> >>> Thanks, >>> >>> Peter >>> > > > > -- > Peter M. Wolanin, Ph.D. : Momentum Specialist, Acquia. Inc. > peter.wola...@acquia.com : 781-313-8322 > > "Get a free, hosted Drupal 7 site: http://www.drupalgardens.com" -- Peter M. Wolanin, Ph.D. : Momentum Specialist, Acquia. Inc.
peter.wola...@acquia.com : 781-313-8322 "Get a free, hosted Drupal 7 site: http://www.drupalgardens.com"
tika 0.4?
Sadly, I had to miss the meetup in NYC, but looking over the slides (http://files.meetup.com/1482573/YonikSeeley_NYCMeetup_solr14_features.pdf) I see:

Solr Cell: Integrates Apache Tika (v0.4) into Solr

My current checkout of solr still has tika 0.3, and I don't see a jira issue for updating to 0.4. Is this something that's going to be in Solr 1.4 for sure?

-Peter

-- Peter M. Wolanin, Ph.D. Momentum Specialist, Acquia. Inc. peter.wola...@acquia.com
Re: server won't start using configs from Drupal
Looks like we better update our schema for the Drupal module - what rev of Solr incorporates this change? -Peter On Fri, Jul 24, 2009 at 8:38 AM, Koji Sekiguchi wrote: > David, > > Try to change solr.CharStreamAwareWhitespaceTokenizerFactory to > solr.WhitespaceTokenizerFactory > in your schema.xml and reboot Solr. > > Koji > > > david wrote: >> >> >> Otis Gospodnetic wrote: >>> >>> I think the problem is CharStreamAwareWhitespaceTokenizerFactory, which >>> used to live in Solr (when Drupal schema.xml for Solr was made), but has >>> since moved to Lucene. I'm half guessing. :) >>> >>> Otis >>> -- >> >> Thanks unfortunately I have no idea about Java. Do you know when that >> change was made? >> >> regards, >> >> David. >> >> >>> Sematext is hiring -- http://sematext.com/about/jobs.html?mls >>> Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR >>> >>> >>> >>> - Original Message From: david To: solr-user@lucene.apache.org Sent: Thursday, July 23, 2009 9:59:53 PM Subject: server won't start using configs from Drupal I've downloaded solr-2009-07-21.tgz and followed the instructions at http://drupal.org/node/343467 including retrieving the solrconfig.xml and schema.xml files from the Drupal apachesolr module. The server seems to start properly with the original solrconfig.xml and schema.xml files When I try to start up the server with the Drupal supplied files, I get errors on the command line, and a 500 error from the server. solrconfig.xml http://pastebin.com/m23d14a2 schema.xml http://pastebin.com/m2e79f304 output of http://localhost:8983/solr/admin/: http://pastebin.com/m410fa74d Following looks to me like the important bits, but I'm not a java coder, so I could easily be wrong. command line extract: 22/07/2009 5:58:54 PM org.apache.solr.common.SolrException log SEVERE: org.apache.solr.common.SolrException: analyzer without class or tokenizer & filter list (plus lots of WARN messages) extract from browser at http://localhost:8983/solr/admin/ org.apache.solr.common.SolrException: Unknown fieldtype 'text' specified on field title (snip lots of stuff) org.apache.solr.common.SolrException: analyzer without class or tokenizer & filter list (snip lots of stuff) org.apache.solr.common.SolrException: Error loading class 'solr.CharStreamAwareWhitespaceTokenizerFactory' (snip lots of stuff) Caused by: java.lang.ClassNotFoundException: solr.CharStreamAwareWhitespaceTokenizerFactory Nothing in apache logs... solr logs contain this: 127.0.0.1 - - [22/07/2009:08:01:10 +] "GET /solr/admin/ HTTP/1.1" 500 10292 Any help greatly appreciated. David. >>> >> > > -- Peter M. Wolanin, Ph.D. Momentum Specialist, Acquia. Inc. peter.wola...@acquia.com
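For anyone else hitting this, the schema.xml edit Koji describes is just swapping the tokenizer class inside the affected fieldType's analyzer:

<!-- before (class no longer shipped with Solr): -->
<tokenizer class="solr.CharStreamAwareWhitespaceTokenizerFactory"/>

<!-- after: -->
<tokenizer class="solr.WhitespaceTokenizerFactory"/>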
Re: "standard" requestHandler components
I just copied this information to the wiki at http://wiki.apache.org/solr/SolrRequestHandler

-Peter

On Fri, Sep 11, 2009 at 7:43 PM, Jay Hill wrote: > RequestHandlers are configured in solrconfig.xml. If no components are explicitly declared in the request handler config then the defaults are used. They are:
> - QueryComponent
> - FacetComponent
> - MoreLikeThisComponent
> - HighlightComponent
> - StatsComponent
> - DebugComponent
>
> If you wanted to have a custom list of components (either omitting defaults or adding custom) you can specify the components for a handler directly:
>
> <arr name="components">
>   <str>query</str>
>   <str>facet</str>
>   <str>mlt</str>
>   <str>highlight</str>
>   <str>debug</str>
>   <str>someothercomponent</str>
> </arr>
>
> You can add components before or after the main ones like this:
>
> <arr name="first-components">
>   <str>mycomponent</str>
> </arr>
>
> <arr name="last-components">
>   <str>myothercomponent</str>
> </arr>
>
> and that's how the spell check component can be added:
>
> <arr name="last-components">
>   <str>spellcheck</str>
> </arr>
>
> Note that a component (except the defaults) must be configured in solrconfig.xml with the name used in the str element as well.
>
> Have a look at the solrconfig.xml in the example directory (".../example/solr/conf/") for examples on how to set up the spellcheck component, and on how the request handlers are configured.
>
> -Jay
> http://www.lucidimagination.com
>
> On Fri, Sep 11, 2009 at 3:04 PM, michael8 wrote: > >> >> Hi, >> >> I have a newbie question about the 'standard' requestHandler in >> solrconfig.xml. What I like to know is where is the config information for >> this requestHandler kept? When I go to http://localhost:8983/solr/admin, >> I >> see the following info, but am curious where are the supposedly 'chained' >> components (e.g. QueryComponent, FacetComponent, MoreLikeThisComponent) >> configured for this requestHandler. I see timing and process debug output >> from these components with "debugQuery=true", so somewhere these components >> must have been configured for this 'standard' requestHandler. >> >> name: standard >> class: org.apache.solr.handler.component.SearchHandler >> version: $Revision: 686274 $ >> description: Search using components: >> >> org.apache.solr.handler.component.QueryComponent,org.apache.solr.handler.component.FacetComponent,org.apache.solr.handler.component.MoreLikeThisComponent,org.apache.solr.handler.component.HighlightComponent,org.apache.solr.handler.component.DebugComponent, >> stats: handlerStart : 1252703405335 >> requests : 3 >> errors : 0 >> timeouts : 0 >> totalTime : 201 >> avgTimePerRequest : 67.0 >> avgRequestsPerSecond : 0.015179728 >> >> >> What I like to do from understanding this is to properly integrate >> spellcheck component into the standard requestHandler as suggested in a >> solr >> spellcheck example. >> >> Thanks for any info in advance. >> Michael >> -- >> View this message in context: >> http://www.nabble.com/%22standard%22-requestHandler-components-tp25409075p25409075.html >> Sent from the Solr - User mailing list archive at Nabble.com. >> >>

-- Peter M. Wolanin, Ph.D. Momentum Specialist, Acquia. Inc. peter.wola...@acquia.com
Re: dismax + wildcard
There are some open issues (not for 1.4 at this point) to make dismax more flexible or add wildcard handling, e.g: https://issues.apache.org/jira/browse/SOLR-756 https://issues.apache.org/jira/browse/SOLR-758 You might participate in those to try to get this in a future version and/or get a working patch for 1.4 -Peter On Wed, Nov 4, 2009 at 7:04 PM, Koji Sekiguchi wrote: > Jan Kammer wrote: >> >> Hi there, >> >> what is the best way to search all fields AND use wildcards? >> Somewhere I read that there are problems with this combination... (dismax >> + wildcard) >> > It's a feature of dismax. WildcardQuery cannot be used in dismax q > parameter. > > You can copy the "all fields" to a destination field by using > copyField, then search the destination field with wildcards > (without using dismax). > > Koji > > -- > http://www.rondhuit.com/en/ > > -- Peter M. Wolanin, Ph.D. Momentum Specialist, Acquia. Inc. peter.wola...@acquia.com
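A minimal sketch of the copyField workaround Koji suggests (the all_text field name and the source fields are illustrative): aggregate the searchable fields into one destination field, then run wildcard queries against that field with the standard handler rather than dismax:

<field name="all_text" type="text" indexed="true" stored="false" multiValued="true"/>

<copyField source="title" dest="all_text"/>
<copyField source="body" dest="all_text"/>

e.g. q=all_text:appl* via the standard request handler.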
any docs on solr.EdgeNGramFilterFactory?
This fairly recent blog post: http://www.lucidimagination.com/blog/2009/09/08/auto-suggest-from-popular-queries-using-edgengrams/ describes the use of the solr.EdgeNGramFilterFactory as the tokenizer for the index. I don't see any mention of that tokenizer on the Solr wiki - is it just waiting to be added, or is there any other documentation in addition to the blog post? In particular, there was a thread last year about using an N-gram tokenizer to enable reasonable (if not ideal) searching of CJK text, so I'd be curious to know how people are configuring their schema (with this tokenizer?) for that use case. Thanks, Peter -- Peter M. Wolanin, Ph.D. Momentum Specialist, Acquia. Inc. peter.wola...@acquia.com
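For context, the analyzer in that blog post boils down to roughly the following (paraphrased, not verbatim from the post) - the EdgeNGram piece is a filter stacked on a tokenizer, the grams are generated only at index time, and the user's prefix is matched as a whole token at query time:

<fieldType name="autocomplete" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>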
Re: any docs on solr.EdgeNGramFilterFactory?
So, this is the normal N-gram one? NGramTokenizerFactory

Digging deeper - there are actually CJK and Chinese tokenizers in the Solr codebase:

http://lucene.apache.org/solr/api/org/apache/solr/analysis/CJKTokenizerFactory.html
http://lucene.apache.org/solr/api/org/apache/solr/analysis/ChineseTokenizerFactory.html

The CJK one uses the lucene CJKTokenizer http://lucene.apache.org/java/2_9_1/api/contrib-analyzers/org/apache/lucene/analysis/cjk/CJKTokenizer.html

and there seems to be another one even that no one has wrapped into Solr: http://lucene.apache.org/java/2_9_1/api/contrib-smartcn/org/apache/lucene/analysis/cn/smart/package-summary.html

So seems like the existing options are a little better than I thought, though it would be nice to have some docs on properly configuring these.

-Peter

On Tue, Nov 10, 2009 at 6:05 PM, Otis Gospodnetic wrote: > Peter, > > For CJK and n-grams, I think you don't want the *Edge* n-grams, but just > n-grams. > Before you take the n-gram route, you may want to look at the smart Chinese > analyzer in Lucene contrib (I think it works only for Simplified Chinese) and > Sen (on java.net). I also spotted a Korean analyzer in the wild a few months > back. > > Otis > -- > Sematext is hiring -- http://sematext.com/about/jobs.html?mls > Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR > > > > - Original Message >> From: Peter Wolanin >> To: solr-user@lucene.apache.org >> Sent: Tue, November 10, 2009 4:06:52 PM >> Subject: any docs on solr.EdgeNGramFilterFactory? >> >> This fairly recent blog post: >> >> http://www.lucidimagination.com/blog/2009/09/08/auto-suggest-from-popular-queries-using-edgengrams/ >> >> describes the use of the solr.EdgeNGramFilterFactory as the tokenizer >> for the index. I don't see any mention of that tokenizer on the Solr >> wiki - is it just waiting to be added, or is there any other >> documentation in addition to the blog post? In particular, there was >> a thread last year about using an N-gram tokenizer to enable >> reasonable (if not ideal) searching of CJK text, so I'd be curious to >> know how people are configuring their schema (with this tokenizer?) >> for that use case. >> >> Thanks, >> >> Peter >> >> -- >> Peter M. Wolanin, Ph.D. >> Momentum Specialist, Acquia. Inc. >> peter.wola...@acquia.com > >

-- Peter M. Wolanin, Ph.D. Momentum Specialist, Acquia. Inc. peter.wola...@acquia.com
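Hooking the CJK tokenizer into a schema would be a one-line fieldType - an untested sketch:

<fieldType name="text_cjk" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.CJKTokenizerFactory"/>
  </analyzer>
</fieldType>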
Re: any docs on solr.EdgeNGramFilterFactory?
It looks like the CJK one actually does 2-grams plus a little separate processing on Latin text.

That's kind of interesting - in general can I build a custom tokenizer from existing tokenizers that treats different parts of the input differently based on the utf-8 range of the characters? E.g. use a Porter stemmer for stretches of Latin text and n-gram or something else for CJK?

-Peter

On Tue, Nov 10, 2009 at 9:21 PM, Otis Gospodnetic wrote: > Yes, that's the n-gram one. I believe the existing CJK one in Lucene is > really just an n-gram tokenizer, so no different than the normal n-gram > tokenizer. > > Otis > -- > Sematext is hiring -- http://sematext.com/about/jobs.html?mls > Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR > > > > - Original Message >> From: Peter Wolanin >> To: solr-user@lucene.apache.org >> Sent: Tue, November 10, 2009 7:34:37 PM >> Subject: Re: any docs on solr.EdgeNGramFilterFactory? >> >> So, this is the normal N-gram one? NGramTokenizerFactory >> >> Digging deeper - there are actually CJK and Chinese tokenizers in the >> Solr codebase: >> >> http://lucene.apache.org/solr/api/org/apache/solr/analysis/CJKTokenizerFactory.html >> http://lucene.apache.org/solr/api/org/apache/solr/analysis/ChineseTokenizerFactory.html >> >> The CJK one uses the lucene CJKTokenizer >> http://lucene.apache.org/java/2_9_1/api/contrib-analyzers/org/apache/lucene/analysis/cjk/CJKTokenizer.html >> >> and there seems to be another one even that no one has wrapped into Solr: >> http://lucene.apache.org/java/2_9_1/api/contrib-smartcn/org/apache/lucene/analysis/cn/smart/package-summary.html >> >> So seems like the existing options are a little better than I thought, >> though it would be nice to have some docs on properly configuring >> these. >> >> -Peter >> >> On Tue, Nov 10, 2009 at 6:05 PM, Otis Gospodnetic >> wrote: >> > Peter, >> > >> > For CJK and n-grams, I think you don't want the *Edge* n-grams, but just >> n-grams. >> > Before you take the n-gram route, you may want to look at the smart Chinese >> analyzer in Lucene contrib (I think it works only for Simplified Chinese) and >> Sen (on java.net). I also spotted a Korean analyzer in the wild a few months >> back. >> > >> > Otis >> > -- >> > Sematext is hiring -- http://sematext.com/about/jobs.html?mls >> > Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR >> > >> > >> > >> > - Original Message >> >> From: Peter Wolanin >> >> To: solr-user@lucene.apache.org >> >> Sent: Tue, November 10, 2009 4:06:52 PM >> >> Subject: any docs on solr.EdgeNGramFilterFactory? >> >> >> >> This fairly recent blog post: >> >> >> >> >> http://www.lucidimagination.com/blog/2009/09/08/auto-suggest-from-popular-queries-using-edgengrams/ >> >> >> >> describes the use of the solr.EdgeNGramFilterFactory as the tokenizer >> >> for the index. I don't see any mention of that tokenizer on the Solr >> >> wiki - is it just waiting to be added, or is there any other >> >> documentation in addition to the blog post? In particular, there was >> >> a thread last year about using an N-gram tokenizer to enable >> >> reasonable (if not ideal) searching of CJK text, so I'd be curious to >> >> know how people are configuring their schema (with this tokenizer?) >> >> for that use case. >> >> >> >> Thanks, >> >> >> >> Peter >> >> >> >> -- >> >> Peter M. Wolanin, Ph.D. >> >> Momentum Specialist, Acquia. Inc. >> >> peter.wola...@acquia.com >> > >> > >> >> >> >> -- >> Peter M. Wolanin, Ph.D. >> Momentum Specialist, Acquia. Inc.
>> peter.wola...@acquia.com > > -- Peter M. Wolanin, Ph.D. Momentum Specialist, Acquia. Inc. peter.wola...@acquia.com
Re: any docs on solr.EdgeNGramFilterFactory?
Thanks for the link - there doesn't seem to be a fix version specified, so I guess this will not officially ship with lucene 2.9?

-Peter

On Wed, Nov 11, 2009 at 10:36 PM, Robert Muir wrote: > Peter, here is a project that does this: > http://issues.apache.org/jira/browse/LUCENE-1488 > > >> That's kind of interesting - in general can I build a custom tokenizer >> from existing tokenizers that treats different parts of the input >> differently based on the utf-8 range of the characters? E.g. use a >> Porter stemmer for stretches of Latin text and n-gram or something >> else for CJK? >> >> -Peter >> >> On Tue, Nov 10, 2009 at 9:21 PM, Otis Gospodnetic >> wrote: >> > Yes, that's the n-gram one. I believe the existing CJK one in Lucene is >> really just an n-gram tokenizer, so no different than the normal n-gram >> tokenizer. >> > >> > Otis >> > -- >> > Sematext is hiring -- http://sematext.com/about/jobs.html?mls >> > Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR >> > >> > >> > >> > - Original Message >> >> From: Peter Wolanin >> >> To: solr-user@lucene.apache.org >> >> Sent: Tue, November 10, 2009 7:34:37 PM >> >> Subject: Re: any docs on solr.EdgeNGramFilterFactory? >> >> >> >> So, this is the normal N-gram one? NGramTokenizerFactory >> >> >> >> Digging deeper - there are actually CJK and Chinese tokenizers in the >> >> Solr codebase: >> >> >> >> >> http://lucene.apache.org/solr/api/org/apache/solr/analysis/CJKTokenizerFactory.html >> >> >> http://lucene.apache.org/solr/api/org/apache/solr/analysis/ChineseTokenizerFactory.html >> >> >> >> The CJK one uses the lucene CJKTokenizer >> >> >> http://lucene.apache.org/java/2_9_1/api/contrib-analyzers/org/apache/lucene/analysis/cjk/CJKTokenizer.html >> >> >> >> and there seems to be another one even that no one has wrapped into >> Solr: >> >> >> http://lucene.apache.org/java/2_9_1/api/contrib-smartcn/org/apache/lucene/analysis/cn/smart/package-summary.html >> >> >> >> So seems like the existing options are a little better than I thought, >> >> though it would be nice to have some docs on properly configuring >> >> these. >> >> >> >> -Peter >> >> >> >> On Tue, Nov 10, 2009 at 6:05 PM, Otis Gospodnetic >> >> wrote: >> >> > Peter, >> >> > >> >> > For CJK and n-grams, I think you don't want the *Edge* n-grams, but >> just >> >> n-grams. >> >> > Before you take the n-gram route, you may want to look at the smart >> Chinese >> >> analyzer in Lucene contrib (I think it works only for Simplified >> Chinese) and >> >> Sen (on java.net). I also spotted a Korean analyzer in the wild a few >> months >> >> back. >> >> > >> >> > Otis >> >> > -- >> >> > Sematext is hiring -- http://sematext.com/about/jobs.html?mls >> >> > Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR >> >> > >> >> > >> >> > >> >> > - Original Message >> >> >> From: Peter Wolanin >> >> >> To: solr-user@lucene.apache.org >> >> >> Sent: Tue, November 10, 2009 4:06:52 PM >> >> >> Subject: any docs on solr.EdgeNGramFilterFactory? >> >> >> >> >> >> This fairly recent blog post: >> >> >> >> >> >> >> >> >> http://www.lucidimagination.com/blog/2009/09/08/auto-suggest-from-popular-queries-using-edgengrams/ >> >> >> >> >> >> describes the use of the solr.EdgeNGramFilterFactory as the tokenizer >> >> >> for the index. I don't see any mention of that tokenizer on the Solr >> >> >> wiki - is it just waiting to be added, or is there any other >> >> >> documentation in addition to the blog post? 
In particular, there was >> >> >> a thread last year about using an N-gram tokenizer to enable >> >> >> reasonable (if not ideal) searching of CJK text, so I'd be curious to >> >> >> know how people are configuring their schema (with this tokenizer?) >> >> >> for that use case. >> >> >> >> >> >> Thanks, >> >> >> >> >> >> Peter >> >> >> >> >> >> -- >> >> >> Peter M. Wolanin, Ph.D. >> >> >> Momentum Specialist, Acquia. Inc. >> >> >> peter.wola...@acquia.com >> >> > >> >> > >> >> >> >> >> >> >> >> -- >> >> Peter M. Wolanin, Ph.D. >> >> Momentum Specialist, Acquia. Inc. >> >> peter.wola...@acquia.com >> > >> > >> >> >> >> -- >> Peter M. Wolanin, Ph.D. >> Momentum Specialist, Acquia. Inc. >> peter.wola...@acquia.com >> > > > > > -- > Robert Muir > rcm...@gmail.com > -- Peter M. Wolanin, Ph.D. Momentum Specialist, Acquia. Inc. peter.wola...@acquia.com
changes to highlighting config or syntax in 1.4?
I'm testing out the final release of Solr 1.4 as compared to the build I have been using from around June.

I'm using the dismax handler for searches. I'm finding that highlighting is completely broken as compared to previously. Much more text is returned than it should be for each string in <lst name="highlighting">, but the search words are never highlighted in that response. Setting usePhraseHighlighter=false makes no difference.

Any pointers appreciated.

-Peter

-- Peter M. Wolanin, Ph.D. Momentum Specialist, Acquia. Inc. peter.wola...@acquia.com
Re: changes to highlighting config or syntax in 1.4?
Apparently one of my conf files was broken - odd that I didn't see any exceptions. Anyhow - excuse my haste, I don't see the problem now.

-Peter

On Fri, Nov 13, 2009 at 11:06 PM, Peter Wolanin wrote: > I'm testing out the final release of Solr 1.4 as compared to the build > I have been using from around June. > > I'm using the dismax handler for searches. I'm finding that > highlighting is completely broken as compared to previously. Much > more text is returned than it should be for each string in > <lst name="highlighting">, but the search words are never highlighted in > that response. Setting usePhraseHighlighter=false makes no > difference. > > Any pointers appreciated. > > -Peter > > -- > Peter M. Wolanin, Ph.D. > Momentum Specialist, Acquia. Inc. > peter.wola...@acquia.com >

-- Peter M. Wolanin, Ph.D. Momentum Specialist, Acquia. Inc. peter.wola...@acquia.com
Re: Newbie Solr questions
Take a look at the example schema - you can have dynamic fields that are used based on wildcard matching to the field name if a field doesn't match the name of an existing field.

-Peter

On Sun, Nov 15, 2009 at 10:50 AM, yz5od2 wrote: > Thanks for the reply: > > I follow the schema.xml concept, but what if my requirement is more dynamic > in nature? I.E. I would like my developers to be able to annotate a POJO and > submit it to the Solr server (embedded) to be indexed according to public > properties OR annotations. Is that possible? > > If that is not possible, can I programmatically define documents and fields > (and the field options) in straight Java? I.E. in pseudo code below... > > // this is made up but this is what I would like to be able to do > SolrDoc document = new SolrDoc(); > SolrField field = new SolrField() > field.isIndexed=true; > field.isStored=true; > field.name = 'myField' > > field.value = myPOJO.getValue(); > > solrServer.index(document); > > > > > > On Nov 15, 2009, at 12:50 AM, Avlesh Singh wrote: > >>> >>> a) Since Solr is built on top of lucene, using SolrJ, can I still >>> directly >>> create custom documents, specify the field specifics etc (indexed, stored >>> etc) and then map POJOs to those documents, similar to just using the >>> straight lucene API? >>> >>> b) I took a quick look at the SolrJ javadocs but did not see anything in >>> there that allowed me to customize if a field is stored, indexed, not >>> indexed etc. How do I do that with SolrJ without having to go directly to >>> the lucene apis? >>> >>> c) The SolrJ beans package. By annotating a POJO with @Field, how exactly >>> does SolrJ treat that field? Indexed/stored, or just indexed? Is there >>> any >>> other way to control this? >>> >> The answer to all your questions above is the magical file called >> schema.xml. For more read here - http://wiki.apache.org/solr/SchemaXml. >> SolrJ is simply a java client to access (read and update from) the solr >> server. >> >> c) If I create a custom index outside of Solr using straight lucene, is it >>> >>> easy to import a pre-exisiting lucene index into a Solr Server? >>> >> As long as the Lucene index matches the definitions in your schema you can >> use the same index. The data however needs to copied into a predictable >> location inside SOLR_HOME. >> >> Cheers >> Avlesh >> >> On Sun, Nov 15, 2009 at 9:26 AM, yz5od2 >> wrote: >> >>> Hi, >>> I am new to Solr but fairly advanced with lucene. >>> >>> In the past I have created custom Lucene search engines that indexed >>> objects in a Java application, so my background is coming from this >>> requirement >>> >>> a) Since Solr is built on top of lucene, using SolrJ, can I still >>> directly >>> create custom documents, specify the field specifics etc (indexed, stored >>> etc) and then map POJOs to those documents, similar to just using the >>> straight lucene API? >>> >>> b) I took a quick look at the SolrJ javadocs but did not see anything in >>> there that allowed me to customize if a field is stored, indexed, not >>> indexed etc. How do I do that with SolrJ without having to go directly to >>> the lucene apis? >>> >>> c) The SolrJ beans package. By annotating a POJO with @Field, how exactly >>> does SolrJ treat that field? Indexed/stored, or just indexed? Is there >>> any >>> other way to control this? >>> >>> c) If I create a custom index outside of Solr using straight lucene, is >>> it >>> easy to import a pre-exisiting lucene index into a Solr Server? >>> >>> thanks! >>> > > -- Peter M. 
Wolanin, Ph.D. Momentum Specialist, Acquia. Inc. peter.wola...@acquia.com
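Concretely, the combination looks something like this - a hedged sketch: the *_s and *_i dynamicField patterns are the ones shipped in the example schema, and whether each value is indexed/stored comes from schema.xml, not from the POJO:

import java.io.IOException;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.beans.Field;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

public class Product {
    @Field("id") public String id;
    @Field("name_s") public String name;   // caught by <dynamicField name="*_s" .../>
    @Field("price_i") public int price;    // caught by <dynamicField name="*_i" .../>

    public static void main(String[] args) throws IOException, SolrServerException {
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
        Product p = new Product();
        p.id = "1"; p.name = "widget"; p.price = 10;
        server.addBean(p);  // maps the annotated fields onto a SolrInputDocument
        server.commit();
    }
}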
using Xinclude with multi-core
I'm trying to take advantage of the Solr 1.4 Xinclude feature to include a different xml fragment (e.g. a different analyzer chain in schema.xml) for each core in a multi-core setup.

When the Xinclude operates on a relative path, it seems to NOT be acting relative to the xml file with the Xinclude statement. Using the jetty example, it looks for a file in example/. Is this a bug in the way Solr invokes Xinclude? If not, is there a variable that contains the instanceDir that can be used? ${solr.instanceDir} or ${solr/instanceDir}

DOMUtil.substituteProperties(doc, loader.getCoreProperties());

I see that I could potentially specify solrcore.properties, http://wiki.apache.org/solr/SolrConfigXml#System_property_substitution in order to determine the correct base path, but this seems overly complicated in terms of what the usual use case would be for Xinclude?

-Peter

-- Peter M. Wolanin, Ph.D. Momentum Specialist, Acquia. Inc. peter.wola...@acquia.com
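For illustration only, an absolute file URI in the href sidesteps the relative-resolution problem entirely (the path is made up, and hard-coding a machine-specific path per core is exactly why an instanceDir variable would be nicer):

<xi:include href="file:///var/solr/multicore/core1/conf/extra-handlers.xml"
            xmlns:xi="http://www.w3.org/2001/XInclude"/>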
is it possible to use Xinclude in schema.xml?
I'm trying to determine if it's possible to use Xinclude to (for example) have a base schema file and then substitute various pieces.

It seems that the schema fieldTypes throw exceptions if there is an unexpected attribute?

SEVERE: java.lang.RuntimeException: schema fieldtype text(org.apache.solr.schema.TextField) invalid arguments:{xml:base=solr/core2/conf/text-analyzer.xml}

This is what I'm trying to do (details of the analyzer chain omitted - nothing unusual) - so the error occurs when the external xml file is actually included:

<fieldType name="text" class="solr.TextField" xmlns:xi="http://www.w3.org/2001/XInclude">
  <xi:include href="text-analyzer.xml">
    <xi:fallback>
      <analyzer type="index">
        ...
      </analyzer>
      <analyzer type="query">
        ...
      </analyzer>
    </xi:fallback>
  </xi:include>
</fieldType>

Where (for testing) the text-analyzer.xml file just looks like the fallback:

<analyzer type="index">
  ...
</analyzer>
<analyzer type="query">
  ...
</analyzer>

-- Peter M. Wolanin, Ph.D. Momentum Specialist, Acquia. Inc. peter.wola...@acquia.com
Re: is it possible to use Xinclude in schema.xml?
Follow-up: it seems the schema parser doesn't barf if you use xinclude with a single analyzer element, but so far seems like it's impossible for a field type.

So this seems to work:

<fieldType name="text" class="solr.TextField" xmlns:xi="http://www.w3.org/2001/XInclude">
  <xi:include href="text-analyzer.xml"/>
  <analyzer type="query">
    ...
  </analyzer>
</fieldType>

with text-analyzer.xml now containing just a single <analyzer type="index"> element.

On Sat, Nov 28, 2009 at 1:40 PM, Peter Wolanin wrote:
> I'm trying to determine if it's possible to use Xinclude to (for
> example) have a base schema file and then substitute various pieces.
>
> It seems that the schema fieldTypes throw exceptions if there is an
> unexpected attribute?
>
> SEVERE: java.lang.RuntimeException: schema fieldtype
> text(org.apache.solr.schema.TextField) invalid
> arguments:{xml:base=solr/core2/conf/text-analyzer.xml}
>
> This is what I'm trying to do (details of the analyzer chain omitted -
> nothing unusual) - so the error occurs when the external xml file is
> actually included:
>
> <fieldType name="text" class="solr.TextField" xmlns:xi="http://www.w3.org/2001/XInclude">
>   <xi:include href="text-analyzer.xml">
>     <xi:fallback>
>       <analyzer type="index">
>         ...
>       </analyzer>
>       <analyzer type="query">
>         ...
>       </analyzer>
>     </xi:fallback>
>   </xi:include>
> </fieldType>
>
> Where (for testing) the text-analyzer.xml file just looks like the fallback:
>
> <analyzer type="index">
>   ...
> </analyzer>
> <analyzer type="query">
>   ...
> </analyzer>
>
> -- Peter M. Wolanin, Ph.D. Momentum Specialist, Acquia. Inc. peter.wola...@acquia.com

-- Peter M. Wolanin, Ph.D. Momentum Specialist, Acquia. Inc. peter.wola...@acquia.com
boosting certain terms within one field?
I've recently started working on the Drupal integration module for SOLR, and we are looking for suggestions for how to address this question: how do we boost the importance of a subset of terms within a field.

For example, we are using the standard request handler for queries, and the default field for keyword searches is a concatenation of the title, body, taxonomy terms, etc.

One "hackish" way I can imagine is that terms we want to boost (for example the title, or text inside h2 tags) could simply be included multiple times in the concatenation. Would this be effective and reasonable?

It seems like the alternative is to try to switch to using the dismax handler, storing the terms that we desire to have different boosts into different fields, all of which are in the list of query fields?

Thanks in advance for your suggestions.

-Peter

-- Peter M. Wolanin, Ph.D. Momentum Specialist, Acquia. Inc. [EMAIL PROTECTED]
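For the dismax alternative at the end, the per-field boosts would live in the handler config, something like the following (the boost numbers are arbitrary, and the field names mirror the ones mentioned above):

<requestHandler name="dismax" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">dismax</str>
    <str name="qf">title^5.0 h2^3.0 taxonomy^2.0 body^1.0</str>
  </lst>
</requestHandler>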
Re: boosting certain terms within one field?
Hi Grant,

Thanks for your feedback. The major short-term downside to switching to dismax with multiple fields would be the required re-writing of our current PHP code - especially our code to handle addition of facet fields to the q parameter. From reading about dismax, it seems we would need to instead use fq to limit the search results to those matching a specific facet value.

Best,

Peter

On Sun, Nov 30, 2008 at 8:43 AM, Grant Ingersoll <[EMAIL PROTECTED]> wrote: > Hi Peter, > > What are the downsides to your last alternative approach below? That seems > like the simplest approach and should work as long as the terms within those > fields do not need to be boosted separately. > > If you want to go the boosting terms route, this is handled via a thing > called Payloads in Lucene. Payloads are an array of bytes that are added > during indexing at the term level through the analysis process. To do this > in Solr, you would need to write your own TokenFilter that adds payloads as > needed. Then, during search, you can take advantage of these payloads by > using the BoostingTermQuery from Lucene. The downside to all of this is > Solr doesn't currently support it, so you would be coding it up yourself. > I'm sure, though, that if you were to start a patch on it, there would be > others who are interested. > > Note, on the payloads. The biggest sticking point, I think, is coming up w/ > an efficient way of encoding the byte array and putting it into the XML > format, such that one can send in payloads when indexing. It's not > particularly hard, but no one has done it yet. > > -Grant > > > On Nov 29, 2008, at 10:45 PM, Peter Wolanin wrote: > >> I've recently started working on the Drupal integration module for >> SOLR, and we are looking for suggestions for how to address this >> question: how do we boost the importance of a subset of terms within >> a field. >> >> For example, we are using the standard request handler for queries, >> and the default field for keyword searches is a concatenation of the >> title, body, taxonomy terms, etc. >> >> One "hackish" way I can imagine is that terms we want to boost (for >> example the title, or text inside h2 tags) could simply be included >> multiple times in the concatenation. Would this be effective and reasonable? >> >> It seems like the alternative is to try to switch to using the dismax >> handler, storing the terms that we desire to have different boosts >> into different fields, all of which are in the list of query fields? >> >> Thanks in advance for your suggestions. >> >> -Peter >> >> -- >> Peter M. Wolanin, Ph.D. >> Momentum Specialist, Acquia. Inc. >> [EMAIL PROTECTED] > > -- > Grant Ingersoll > > Lucene Helpful Hints: > http://wiki.apache.org/lucene-java/BasicsOfPerformance > http://wiki.apache.org/lucene-java/LuceneFAQ > > > > > > > > > > > -- -- Peter M. Wolanin, Ph.D. Momentum Specialist, Acquia. Inc. [EMAIL PROTECTED]
Re: problem index accented character with release version of solr 1.3
We have been having this problem also. and have resorted to just stripping control characters before sending the text for indexing: preg_replace('@[\x00-\x08\x0B\x0C\x0E-\x1F]@', '', $text); -Peter On Tue, Dec 9, 2008 at 7:59 AM, knietzie <[EMAIL PROTECTED]> wrote: > > hi joshua, > > i'm having the same problem as yours. > just curious, have you found any fix for this? > > thnks > > > Joshua Reedy wrote: >> >> I have been using a stable dev version of 1.3 for a few months. >> Today, I began testing the final release version, and I encountered a >> strange problem. >> The only thing that has changed in my setup is the solr code (I didn't >> make any config change or change the schema). >> >> a document has a text field with a value that contains: >> "Andr\005é 3000" >> >> Indexing the document by itself or as part of a batch, produces the >> following error: >> Sep 17, 2008 5:00:27 PM org.apache.solr.common.SolrException log >> SEVERE: com.ctc.wstx.exc.WstxUnexpectedCharException: Illegal >> character ((CTRL-CHAR, code 5)) >> at [row,col {unknown-source}]: [5,205] >> at >> com.ctc.wstx.sr.StreamScanner.throwInvalidSpace(StreamScanner.java:675) >> at >> com.ctc.wstx.sr.BasicStreamReader.readTextSecondary(BasicStreamReader.java:4668) >> at >> com.ctc.wstx.sr.BasicStreamReader.readCoalescedText(BasicStreamReader.java:4126) >> at >> com.ctc.wstx.sr.BasicStreamReader.finishToken(BasicStreamReader.java:3701) >> at >> com.ctc.wstx.sr.BasicStreamReader.safeFinishToken(BasicStreamReader.java:3649) >> at >> com.ctc.wstx.sr.BasicStreamReader.getText(BasicStreamReader.java:809) >> at >> org.apache.solr.handler.XmlUpdateRequestHandler.readDoc(XmlUpdateRequestHandler.java:327) >> at >> org.apache.solr.handler.XmlUpdateRequestHandler.processUpdate(XmlUpdateRequestHandler.java:195) >> at >> org.apache.solr.handler.XmlUpdateRequestHandler.handleRequestBody(XmlUpdateRequestHandler.java:123) >> at >> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131) >> at org.apache.solr.core.SolrCore.execute(SolrCore.java:1204) >> at >> org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:303) >> at >> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:232) >> at >> org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235) >> at >> org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206) >> at >> org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233) >> at >> org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:175) >> at >> org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:128) >> at >> org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102) >> at >> org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109) >> at >> org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:286) >> at >> org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:844) >> at >> org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:583) >> at >> org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:447) >> at java.lang.Thread.run(Thread.java:595) >> >> The latest version of the solr doesn't seem to like control characters >> (\005, in this case), but previous versions handled them (or at least >> ignored them). 
>> >> These characters shouldn't be in my documents, so there's a bug on my >> end to track down. However, I'm wondering if this was an expected >> change or an unintended consequence of recent work . . . >> >> >> >> >> -- >> - >> Be who you are and say what you feel, >> because those who mind don't matter and >> those who matter don't mind. >> -- Dr. Seuss >> >> > > -- > View this message in context: > http://www.nabble.com/problem-index-accented-character-with-release-version-of-solr-1.3-tp19544660p20914244.html > Sent from the Solr - User mailing list archive at Nabble.com. > > -- -- Peter M. Wolanin, Ph.D. Momentum Specialist, Acquia. Inc. [EMAIL PROTECTED]
does this break Solr? dynamicField name="*" type="ignored"
I'm seeing a weird effect with a '*' field. In the example schema.xml, there is a commented out sample: <dynamicField name="*" type="ignored" /> We have this un-commented, and in the schema browser via the admin interface I see that all non-dynamic fields get a type of "ignored". I see this in the Solr admin interface: Field: uid Dynamically Created From Pattern: * Field Type: ignored though 'uid' is explicitly defined as its own field in the schema, not via a dynamic pattern. Is this a bug in the admin interface, or a problem with using this '*' in the schema? Thanks, Peter -- -- Peter M. Wolanin, Ph.D. Momentum Specialist, Acquia. Inc. peter.wola...@acquia.com
Re: does this break Solr? dynamicField name="*" type="ignored"
created issue: https://issues.apache.org/jira/browse/SOLR-929 -Peter On Thu, Dec 18, 2008 at 3:32 PM, Yonik Seeley wrote: > Looks like it's a bug in the schema browser (i.e. just this display, > no the inner workings of Solr). > Could you open a JIRA issue for this? > > -Yonik > > > On Thu, Dec 18, 2008 at 3:20 PM, Peter Wolanin > wrote: >> I'm seeing a weird effect with a '*' field. In the example >> schema.xml, there is a commented out sample: >> >> >> >> >> We have this un-commented, and in the schema browser via the admin >> interface I see that all non-dynamic fields get a type of "ignored". >> >> I see this in the Solr admin interface: >> >> Field: uid >> Dynamically Created From Pattern: * >> Field Type: ignored >> >> though the field definition is: >> >> >> >> Is this a bug in the admin interface, or a problem with using this '*' >> in the schema? >> >> Thanks, >> >> Peter >> >> -- >> -- >> Peter M. Wolanin, Ph.D. >> Momentum Specialist, Acquia. Inc. >> peter.wola...@acquia.com >> > -- -- Peter M. Wolanin, Ph.D. Momentum Specialist, Acquia. Inc. peter.wola...@acquia.com
Re: How can i omit the illegal characters,when indexing the docs?
For documents we are indexing via the PHP client, we are currently using the following regex to strip control characters from each field that might contain them: function apachesolr_strip_ctl_chars($text) { // See: http://w3.org/International/questions/qa-forms-utf-8.html // Printable utf-8 does not include any of these chars below x7F return preg_replace('@[\x00-\x08\x0B\x0C\x0E-\x1F]@', ' ', $text); } -Peter On Fri, Jan 2, 2009 at 3:41 AM, RaghavPrabhu wrote: > > Hi all, > > I am extracting the word document using Apache POI,then generate the xml > doc,which is the document that i want to indexing in the solr. The problem > which i faced was,it thrown the error in the browser is shown below. > > HTTP Status 500 - Illegal character ((CTRL-CHAR, code 8)) at [row,col > {unknown-source}]: [1,1592] > com.ctc.wstx.exc.WstxUnexpectedCharException: Illegal character ((CTRL-CHAR, > code 8)) at [row,col {unknown-source}]: [1,1592] at > com.ctc.wstx.sr.StreamScanner.throwInvalidSpace(StreamScanner.java:675) at > com.ctc.wstx.sr.StreamScanner.throwInvalidSpace(StreamScanner.java:660) at > com.ctc.wstx.sr.BasicStreamReader.readCDataPrimary(BasicStreamReader.java:4240) > at > com.ctc.wstx.sr.BasicStreamReader.nextFromTreeCommentOrCData(BasicStreamReader.java:3280) > at > com.ctc.wstx.sr.BasicStreamReader.nextFromTree(BasicStreamReader.java:2824) > at com.ctc.wstx.sr.BasicStreamReader.next(BasicStreamReader.java:1019) at > org.apache.solr.handler.XmlUpdateRequestHandler.readDoc(XmlUpdateRequestHandler.java:321) > at > org.apache.solr.handler.XmlUpdateRequestHandler.processUpdate(XmlUpdateRequestHandler.java:195) > at > org.apache.solr.handler.XmlUpdateRequestHandler.handleRequestBody(XmlUpdateRequestHandler.java:123) > at > org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131) > at org.apache.solr.core.SolrCore.execute(SolrCore.java:1204) at > org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:303) > at > org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:232) > at > org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235) > at > org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206) > at > org.jboss.web.tomcat.filters.ReplyHeaderFilter.doFilter(ReplyHeaderFilter.java:96) > at > org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235) > at > org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206) > at > org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:230) > at > org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:175) > at > org.jboss.web.tomcat.security.SecurityAssociationValve.invoke(SecurityAssociationValve.java:179) > at > org.jboss.web.tomcat.security.JaccContextValve.invoke(JaccContextValve.java:84) > at > org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127) > at > org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102) > at > org.jboss.web.tomcat.service.jca.CachedConnectionValve.invoke(CachedConnectionValve.java:157) > at > org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109) > at > org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:262) > at > org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:844) > at > 
org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:583) > at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:446) > at java.lang.Thread.run(Thread.java:619) > > The extracted word document contains the special character ( its like a > square box).How can i omit those characters,when i submit the document to > the solr. > > > Thanks in advance, > Regards > Prabhu.K > > > -- > View this message in context: > http://www.nabble.com/How-can-i-omit-the-illegal-characters%2Cwhen-indexing-the-docs--tp21249084p21249084.html > Sent from the Solr - User mailing list archive at Nabble.com. > > -- -- Peter M. Wolanin, Ph.D. Momentum Specialist, Acquia. Inc. peter.wola...@acquia.com
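For completeness, a minimal sketch of how the function above gets applied - the $document array shape here is an assumption for illustration, not the module's actual data structure:

// Strip control characters from every string field before building the update request.
foreach ($document as $field => $value) {
  if (is_string($value)) {
    $document[$field] = apachesolr_strip_ctl_chars($value);
  }
}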
can the TermsComponent be used in combination with fq?
We have been trying to figure out how to construct, for example, a directory page with an overview of available facets for several fields. Looking at the issue and wiki http://wiki.apache.org/solr/TermsComponent https://issues.apache.org/jira/browse/SOLR-877 It would seem like this component would be useful for this. However - we often require that some filtering be applied to search results based on which user is searching (e.g. public vs. private content). Is it possible to apply filtering here, or will we need to do something like running a q=*:*&fq=status:1 and then getting facets? Note - also - the wiki page references a tutorial including this /autocomplete path, but I cannot find any trace of such. I was able to get results similar to the examples on the wiki page by adding a /terms handler to solrconfig.xml with echoParams set to "explicit" and the "terms" component enabled (sketched below). Is this the right way to activate this? Thanks, Peter -- Peter M. Wolanin, Ph.D. Momentum Specialist, Acquia. Inc. peter.wola...@acquia.com
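A minimal configuration along the lines of the wiki example (the /terms handler name and the component declaration follow the wiki page, so treat this as a sketch rather than a tested config):

<searchComponent name="terms" class="solr.TermsComponent"/>

<requestHandler name="/terms" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="echoParams">explicit</str>
  </lst>
  <arr name="components">
    <str>terms</str>
  </arr>
</requestHandler>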
Re: Finding total range of dates for date faceting
It *looks* as though Solr supports returning the results of arbitrary calculations: http://wiki.apache.org/solr/SolrQuerySyntax However, I am so far unable to get any example working except in the context of a dismax bf. It seems like one ought to be able to write a query to return the doc matching the max OR the min of a particular field. -Peter On Tue, Feb 17, 2009 at 5:33 AM, Jacob Singh wrote: > Hi, > > I'm trying to write some code to build a facet list for a date field, > but I don't know what the first and last available dates are. I would > adjust the gap param accordingly. If there is a 10yr stretch between > min(date) and max(date) I'd want to facet by year. If it is a 1 month > gap, I'd want to facet by day. > > Is there a way to do this? > > Thanks, > Jacob > > -- > > +1 510 277-0891 (o) > +91 33 7458 (m) > > web: http://pajamadesign.com > > Skype: pajamadesign > Yahoo: jacobsingh > AIM: jacobsingh > gTalk: jacobsi...@gmail.com > -- Peter M. Wolanin, Ph.D. Momentum Specialist, Acquia. Inc. peter.wola...@acquia.com
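One workaround that should find the endpoints without any special syntax - two rows=1 queries sorted on the field, reading the value off the single returned doc (the field name 'pubdate' is hypothetical, and the field must be indexed/sortable):

select?q=*:*&rows=1&fl=pubdate&sort=pubdate+asc
select?q=*:*&rows=1&fl=pubdate&sort=pubdate+desc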
Re: Store content out of solr
Sure, we are doing essentially that with our Drupal integration module - each search result contains a link to the "real" content, which is stored in MySQL, etc, and presented via the Drupal CMS. http://drupal.org/project/apachesolr -Peter On Tue, Feb 17, 2009 at 11:57 AM, roberto wrote: > Hello, > > We are indexing information from diferent sources so we would like to > centralize the information content so i can retrieve using the ID > provided buy solr? > > Does anyone did something like this, and have some advices ? I > thinking in store the information into a database like mysql ? > > Thanks, > -- > "Without love, we are birds with broken wings." > Morrie > -- Peter M. Wolanin, Ph.D. Momentum Specialist, Acquia. Inc. peter.wola...@acquia.com
make the suggested ignored field multi-valued?
In the example schema.xml, there is a field type 'ignored' which it is suggested can be used with the wildcard * to prevent errors when a document contains fields that don't match any in the schema. My experience recently in using this is that it does not work as desired if the unmatched field is multiValued, and that the suggested * field should be designated multiValued (see the declaration below): https://issues.apache.org/jira/browse/SOLR-1022 Obviously this has no effect out of the box, since the field is commented out. -Peter -- Peter M. Wolanin, Ph.D. Momentum Specialist, Acquia. Inc. peter.wola...@acquia.com
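For reference, the declaration as it would read with the change proposed in that issue (using the stock 'ignored' field type from the example schema):

<dynamicField name="*" type="ignored" multiValued="true" />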
Re: why don't we have a forum for discussion?
If some stuff is asked over and over again, it would be great to grab some reasonable responses and add them to the wiki. I've edited it a few times when I've struggled with what's there and found something that wasn't covered or was out of date - even the best forum or mailing list will not replicate an organized and maintained doc site in terms of ready access to knowledge. -Peter 2009/2/18 Martin Lamothe : > E-mails wouldn't go away with a discussion forum as they have e-mail > notifications tooit could compliment this mailing list... some stuff is > asked over and over and over ... isn't it? With a forum, it would be > possible to say.. go see this post.. .or that thread.. etc... > > Multi-core could use it's own Topic > Scalling could use it's own too > Indexing > Optimizing Indexes > etc...
Suggested hardening of Solr schema.jsp admin interface
My colleague Paul opened this issue and supplied a patch and I commented on it regarding a potential security weakness in the admin interface: https://issues.apache.org/jira/browse/SOLR-1031 -- Peter M. Wolanin, Ph.D. Momentum Specialist, Acquia. Inc. peter.wola...@acquia.com
What is the performance impact of a fq that matches all docs?
We are working on integration with the Drupal CMS, and so are writing code that carries out operations that might be relevant for only a small subset of the sites/indexes that might use the integration module. In this regard, I'm wondering if adding to the query (using the dismax or mlt handlers) a fq that matches all documents would have any impact on performance? I gather that there is caching for the fq matches, but it seems like that would still incur some overhead, especially for a large index? As a more concrete example, suppose each document has a string field that names the role of the user that is allowed to see the content, e.g. 'public', 'registered', 'admin'. Most sites have only public content, but because our code is generic, we might add &fq=role:public to every query. What would the expected performance effect be compared to omitting that fq if, for example, we had a way to determine in advance that all site content matches 'public'? Thanks, Peter -- Peter M. Wolanin, Ph.D. Momentum Specialist, Acquia. Inc. peter.wola...@acquia.com
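To make the question concrete, the client code in question would look something like this - apachesolr_site_is_all_public() is a hypothetical helper, not something the module has today:

$params = array(
  'q' => $keys,
  'qt' => 'dismax',
);
// Skip the role filter when we can prove in advance that all content is public.
if (!apachesolr_site_is_all_public()) {
  $params['fq'][] = 'role:public';
}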
Re: Error with highlighter and UTF-8 chars?
We are using Solr trunk (1.4) - currently " nightly exported - yonik - 2009-02-05 08:06:00" -Peter On Mon, Feb 23, 2009 at 8:07 AM, Koji Sekiguchi wrote: > Jacob, > > What Solr version are you using? There is a bug in SolrHighlighter of Solr > 1.3, > you may want to look at: > > https://issues.apache.org/jira/browse/SOLR-925 > https://issues.apache.org/jira/browse/LUCENE-1500 > > regards, > > Koji > > > Jacob Singh wrote: >> >> Hi, >> >> We ran into a weird one today. We have a document which is written in >> German and everytime we make a query which matches it, we get the >> following: >> >> java.lang.StringIndexOutOfBoundsException: String index out of range: 2822 >>at java.lang.String.substring(String.java:1935) >>at >> org.apache.lucene.search.highlight.Highlighter.getBestTextFragments(Highlighter.java:274) >> >> >> >From source diving it looks like Lucene's highlighter is trying to >> subStr against an offset that is outside the bounds of the body field >> which it is highlighting against. Running a fq against the ID of the >> doucment returns it fine (because no highlighting is done) and I took >> the body and tried to cut the first 2822 chars and while it is near >> the end of the body, it is still in range. >> >> Here is the related code: >> >> startOffset = tokenGroup.matchStartOffset; >> endOffset = tokenGroup.matchEndOffset; >> tokenText = text.substring(startOffset, endOffset); >> >> >> This leads me to believe there is some problem with mb string encoding >> and Lucene's counting. >> >> Any ideas here? Tomcat is configured with UTF-8 btw. >> >> Best, >> Jacob >> >> >> > > -- Peter M. Wolanin, Ph.D. Momentum Specialist, Acquia. Inc. peter.wola...@acquia.com
Re: Error with highlighter and UTF-8 chars?
Here you can see a manifestation of it when trying to highlight with ?q=daß − − − -Community" einfach nicht mehr wahrnimmt. Hätte mir am letzten Montag Nachmittag jemand gesagt, daß ich am Abend − recht, wenn er sagte, daß das wirklich wertvolle an Drupal schlichtweg seine (Entwickler- und Anwender-) − die Entstehungsgeschichte des Portals) auch dokumentiert worden, denn Ihr vermutet schon richtig, daß da You can see the "strong" tags each get offset one character more from where they are supposed to be. -Peter On Mon, Feb 23, 2009 at 8:24 AM, Peter Wolanin wrote: > We are using Solr trunk (1.4) - currently " nightly exported - yonik > - 2009-02-05 08:06:00" > > -Peter > > On Mon, Feb 23, 2009 at 8:07 AM, Koji Sekiguchi wrote: >> Jacob, >> >> What Solr version are you using? There is a bug in SolrHighlighter of Solr >> 1.3, >> you may want to look at: >> >> https://issues.apache.org/jira/browse/SOLR-925 >> https://issues.apache.org/jira/browse/LUCENE-1500 >> >> regards, >> >> Koji >> >> >> Jacob Singh wrote: >>> >>> Hi, >>> >>> We ran into a weird one today. We have a document which is written in >>> German and everytime we make a query which matches it, we get the >>> following: >>> >>> java.lang.StringIndexOutOfBoundsException: String index out of range: 2822 >>> at java.lang.String.substring(String.java:1935) >>> at >>> org.apache.lucene.search.highlight.Highlighter.getBestTextFragments(Highlighter.java:274) >>> >>> >>> >From source diving it looks like Lucene's highlighter is trying to >>> subStr against an offset that is outside the bounds of the body field >>> which it is highlighting against. Running a fq against the ID of the >>> doucment returns it fine (because no highlighting is done) and I took >>> the body and tried to cut the first 2822 chars and while it is near >>> the end of the body, it is still in range. >>> >>> Here is the related code: >>> >>> startOffset = tokenGroup.matchStartOffset; >>> endOffset = tokenGroup.matchEndOffset; >>> tokenText = text.substring(startOffset, endOffset); >>> >>> >>> This leads me to believe there is some problem with mb string encoding >>> and Lucene's counting. >>> >>> Any ideas here? Tomcat is configured with UTF-8 btw. >>> >>> Best, >>> Jacob >>> >>> >>> >> >> > > > > -- > Peter M. Wolanin, Ph.D. > Momentum Specialist, Acquia. Inc. > peter.wola...@acquia.com > -- Peter M. Wolanin, Ph.D. Momentum Specialist, Acquia. Inc. peter.wola...@acquia.com
Re: Error with highlighter and UTF-8 chars?
So - something in the highlighting code is counting bytes when it should be counting characters. Looks like a lucene bug, so I'm surprised others have not hit this before. Probably this is it: https://issues.apache.org/jira/browse/LUCENE-1500 -Peter On Tue, Feb 24, 2009 at 2:22 PM, Peter Wolanin wrote: > Here you can see a manifestation of it when trying to highlight with ?q=daß > > > − > > − > > − > > -Community" einfach nicht mehr wahrnimmt. > Hätte mir am letzten Montag Nachmittag jemand gesagt, daß > ich am Abend > > − > > recht, wenn er sagte, daß das wirklich wertvolle an > Drupal schlichtweg seine (Entwickler- und Anwender-) > > − > > die Entstehungsgeschichte des Portals) auch dokumentiert worden, denn > Ihr vermutet schon richtig, daß da > > > > > > > You can see the "strong" tags each get offset one character more from > where they are supposed to be. > > > -Peter > > > > On Mon, Feb 23, 2009 at 8:24 AM, Peter Wolanin > wrote: >> We are using Solr trunk (1.4) - currently " nightly exported - yonik >> - 2009-02-05 08:06:00" >> >> -Peter >> >> On Mon, Feb 23, 2009 at 8:07 AM, Koji Sekiguchi wrote: >>> Jacob, >>> >>> What Solr version are you using? There is a bug in SolrHighlighter of Solr >>> 1.3, >>> you may want to look at: >>> >>> https://issues.apache.org/jira/browse/SOLR-925 >>> https://issues.apache.org/jira/browse/LUCENE-1500 >>> >>> regards, >>> >>> Koji >>> >>> >>> Jacob Singh wrote: >>>> >>>> Hi, >>>> >>>> We ran into a weird one today. We have a document which is written in >>>> German and everytime we make a query which matches it, we get the >>>> following: >>>> >>>> java.lang.StringIndexOutOfBoundsException: String index out of range: 2822 >>>> at java.lang.String.substring(String.java:1935) >>>> at >>>> org.apache.lucene.search.highlight.Highlighter.getBestTextFragments(Highlighter.java:274) >>>> >>>> >>>> >From source diving it looks like Lucene's highlighter is trying to >>>> subStr against an offset that is outside the bounds of the body field >>>> which it is highlighting against. Running a fq against the ID of the >>>> doucment returns it fine (because no highlighting is done) and I took >>>> the body and tried to cut the first 2822 chars and while it is near >>>> the end of the body, it is still in range. >>>> >>>> Here is the related code: >>>> >>>> startOffset = tokenGroup.matchStartOffset; >>>> endOffset = tokenGroup.matchEndOffset; >>>> tokenText = text.substring(startOffset, endOffset); >>>> >>>> >>>> This leads me to believe there is some problem with mb string encoding >>>> and Lucene's counting. >>>> >>>> Any ideas here? Tomcat is configured with UTF-8 btw. >>>> >>>> Best, >>>> Jacob >>>> >>>> >>>> >>> >>> >> >> >> >> -- >> Peter M. Wolanin, Ph.D. >> Momentum Specialist, Acquia. Inc. >> peter.wola...@acquia.com >> > > > > -- > Peter M. Wolanin, Ph.D. > Momentum Specialist, Acquia. Inc. > peter.wola...@acquia.com > -- Peter M. Wolanin, Ph.D. Momentum Specialist, Acquia. Inc. peter.wola...@acquia.com
Re: Error with highlighter and UTF-8 chars?
Actually, looking at the Lucene source and the trace: java.lang.StringIndexOutOfBoundsException: String index out of range: 2822 at java.lang.String.substring(String.java:1765) at org.apache.lucene.search.highlight.Highlighter.getBestTextFragments(Highlighter.java:274) at org.apache.solr.highlight.DefaultSolrHighlighter.doHighlighting(DefaultSolrHighlighter.java:313) at org.apache.solr.handler.component.HighlightComponent.process(HighlightComponent.java:84) at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:195) at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131) ... I see now that getBestTextFragments() takes in a token stream - and each token in this stream already has start/end positions set. So, the patch at LUCENE-1500 would mitigate the exception, but looks like the real bug is in Solr. -Peter On Tue, Feb 24, 2009 at 4:28 PM, Peter Wolanin wrote: > So - something in the highlighting code is counting bytes when it > should be counting characters. Looks like a lucene bug, so I'm > surprised others have not hit this before. Probably this is it: > https://issues.apache.org/jira/browse/LUCENE-1500 > > -Peter > > > On Tue, Feb 24, 2009 at 2:22 PM, Peter Wolanin > wrote: >> Here you can see a manifestation of it when trying to highlight with ?q=daß >> >> >> − >> >> − >> >> − >> >> -Community" einfach nicht mehr wahrnimmt. >> Hätte mir am letzten Montag Nachmittag jemand gesagt, daß >> ich am Abend >> >> − >> >> recht, wenn er sagte, daß das wirklich wertvolle an >> Drupal schlichtweg seine (Entwickler- und Anwender-) >> >> − >> >> die Entstehungsgeschichte des Portals) auch dokumentiert worden, denn >> Ihr vermutet schon richtig, daß da >> >> >> >> >> >> >> You can see the "strong" tags each get offset one character more from >> where they are supposed to be. >> >> >> -Peter >> >> >> >> On Mon, Feb 23, 2009 at 8:24 AM, Peter Wolanin >> wrote: >>> We are using Solr trunk (1.4) - currently " nightly exported - yonik >>> - 2009-02-05 08:06:00" >>> >>> -Peter >>> >>> On Mon, Feb 23, 2009 at 8:07 AM, Koji Sekiguchi wrote: >>>> Jacob, >>>> >>>> What Solr version are you using? There is a bug in SolrHighlighter of Solr >>>> 1.3, >>>> you may want to look at: >>>> >>>> https://issues.apache.org/jira/browse/SOLR-925 >>>> https://issues.apache.org/jira/browse/LUCENE-1500 >>>> >>>> regards, >>>> >>>> Koji >>>> >>>> >>>> Jacob Singh wrote: >>>>> >>>>> Hi, >>>>> >>>>> We ran into a weird one today. We have a document which is written in >>>>> German and everytime we make a query which matches it, we get the >>>>> following: >>>>> >>>>> java.lang.StringIndexOutOfBoundsException: String index out of range: 2822 >>>>> at java.lang.String.substring(String.java:1935) >>>>> at >>>>> org.apache.lucene.search.highlight.Highlighter.getBestTextFragments(Highlighter.java:274) >>>>> >>>>> >>>>> >From source diving it looks like Lucene's highlighter is trying to >>>>> subStr against an offset that is outside the bounds of the body field >>>>> which it is highlighting against. Running a fq against the ID of the >>>>> doucment returns it fine (because no highlighting is done) and I took >>>>> the body and tried to cut the first 2822 chars and while it is near >>>>> the end of the body, it is still in range. 
>>>>> >>>>> Here is the related code: >>>>> >>>>> startOffset = tokenGroup.matchStartOffset; >>>>> endOffset = tokenGroup.matchEndOffset; >>>>> tokenText = text.substring(startOffset, endOffset); >>>>> >>>>> >>>>> This leads me to believe there is some problem with mb string encoding >>>>> and Lucene's counting. >>>>> >>>>> Any ideas here? Tomcat is configured with UTF-8 btw. >>>>> >>>>> Best, >>>>> Jacob >>>>> >>>>> >>>>> >>>> >>>> >>> >>> >>> >>> -- >>> Peter M. Wolanin, Ph.D. >>> Momentum Specialist, Acquia. Inc. >>> peter.wola...@acquia.com >>> >> >> >> >> -- >> Peter M. Wolanin, Ph.D. >> Momentum Specialist, Acquia. Inc. >> peter.wola...@acquia.com >> > > > > -- > Peter M. Wolanin, Ph.D. > Momentum Specialist, Acquia. Inc. > peter.wola...@acquia.com > -- Peter M. Wolanin, Ph.D. Momentum Specialist, Acquia. Inc. peter.wola...@acquia.com
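A quick illustration of the byte-vs-character mismatch for a string like the one above (PHP, since that's our client side; illustrative only):

$s = "daß";
echo strlen($s);              // 4 - UTF-8 bytes ("ß" is a two-byte character)
echo mb_strlen($s, 'UTF-8');  // 3 - characters

Every multi-byte character ahead of a match pushes the byte offset one further past the character offset, which is exactly the one-character drift visible in the fragments above.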
up/down sides to using compound file format for index?
Trying to set up a server to host multiple Solr cores, we have run into the issue of too many open files a few times. The 2nd ed "Lucene in Action" book suggests using the compound file format to reduce the required number of files when having multiple indexes, but mentions a possible ~10% slow-down when indexing. Are there any other down-sides to this? Seems to work by just changing this line in solrconfig.xml: <useCompoundFile>true</useCompoundFile> -- Peter M. Wolanin, Ph.D. Momentum Specialist, Acquia. Inc. peter.wola...@acquia.com
Re: Query Boosting using both BQ and BF
This doesn't seem to match what I'm seeing in terms of using bq - using any value > 0 increases the score. For example, with no bq: solr title,score,type 2.2 1.6885357 Building a killer search for Drupal wikipage 1.5547959 New Solr module available for testing story 1.408378 Check out the Solr project! story Versus with a bq < 1, the scores of matching docs are still increased compared to using no bq: type:story^0.5 on solr title,score,type 2.2 1.6885297 Building a killer search for Drupal wikipage 1.5585454 New Solr module available for testing story 1.4121282 Check out the Solr project! story On Sun, Mar 8, 2009 at 9:48 AM, Otis Gospodnetic wrote: > > Also note that the following is not doing what you want: > > -is_mp_parent_b:true^50.0 > > You want something like: > > is_mp_parent_b:true^0.20 > > for negative boosting use a boost that is less than 1.0. > > Otis-- > Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch > > > > - Original Message >> From: "Dean Missikowski (Consultant), CLSA" >> To: solr-user@lucene.apache.org >> Sent: Sunday, March 8, 2009 3:30:28 AM >> Subject: RE: Query Boosting using both BQ and BF >> >> Some more experiments have helped me answer my own questions. >> >> > Q1. Can anyone confirm whether bf and bq can both >> > be used together in solrconfig.xml? >> yes >> >> > Q3. Can I have multiple bq parameters? If so, do I >> > space-separate them as a single bq, or provide >> > multiple BQs? >> Yes, multiple bq parameters work, space-separating multiple query terms >> in a single bq also works. >> >> Here's a snippet of my solrconfig.xml: >> >> published_date_d:[NOW-3MONTHS/DAY TO NOW/DAY+1DAY]^5.2 OR >> (published_date_d:[NOW-12MONTHS/DAY TO NOW/DAY+1DAY] AND >> report_type_id_i:1004)^10.0 OR (published_date_d:[NOW-6MONTHS/DAY TO >> NOW/DAY+1DAY] AND is_printed_b:true)^4.0 >> is_mp_parent_b:false^10.0 >> recip(rord(published_date_d),20,5000,5)^5.5 >> >> >> -Original Message- >> From: Dean Missikowski (Consultant), CLSA >> Sent: 08/03/2009 12:01 PM >> To: solr-user@lucene.apache.org >> Subject: Query Boosting using both BQ and BF >> >> Hi, >> >> I have a couple of questions related to query boosting using the dismax >> request handler. I'm using a recent 1.4 build (to take advantage of >> omitTf), and have also tried all of this with 1.3. >> >> To apply a query-time boost to the previous 3 months of documents in my >> index I use: >> >> published_date_d:[NOW-3MONTHS/DAY TO NOW/DAY+1DAY]^10.2 >> >> >> And, to provide a boosting that helps rank recently documents higher I >> use: >> recip(rord(published_date_d),20,5000,5)^5.5 >> >> This seems to be working well. >> >> But I have more boosting requirements. For example, I need to boost >> documents that are tagged as printed. So, I tried to add another bq >> parameter: >> >> is_printed_b:true^4.0 >> >> Also, tried to append this space-separated all in one bq parameter like >> this: >> published_date_d:[NOW-3MONTHS/DAY TO NOW/DAY+1DAY]^10.2 >> is_printed_b:true^4.0 >> >> Lastly, I need to apply a negative boost to documents of a certain type, >> so I use: >> -is_mp_parent_b:true^50.0 >> >> Not sure if it matters, but I have >> defaultOperator="AND"/> in schema.xml >> >> None of those variations return expected results (it's like the bq is >> being applied as a filter instead of just applying boosts). >> >> Q1. Can anyone confirm whether bf and bq can both be used together in >> solrconfig.xml? >> Q2. Is there a way I can do ths using only BF? How? >> Q3. Can I have multiple bq parameters? 
If so, do I space-separate them >> as a single bq, or provide multiple BQs? >> Q3. Am I formulating my BQs that use Boolean fields correctly? >> >> Any help or insights much appreciated, >> >> Thanks Dean >> >> CLSA CLEAN & GREEN: Please consider our environment before printing this >> email. >> The content of this communication is subject to CLSA Legal and Regulatory >> Notices. >> These can be viewed at https://www.clsa.com/disclaimer.html or sent to you >> upon >> request. > > -- Peter M. Wolanin, Ph.D. Momentum Specialist, Acquia. Inc. peter.wola...@acquia.com
Re: ExtractingRequestHandler and SolrRequestHandler issue
I had problems with this when trying to set this up with multiple cores - I had to set the shared lib as sharedLib="lib" on the <solr> element in example/solr/solr.xml in order for it to find the jars in example/solr/lib -Peter On Wed, Apr 22, 2009 at 11:43 AM, Grant Ingersoll wrote: > > On Apr 20, 2009, at 12:46 PM, francisco treacy wrote: > >> Additionally, here's what I've got in example/lib: > > These need to be in your Solr home lib, not example/lib. I sometimes get > confused on this one, too, forgetting that I need to go down a few more > directories. The example/lib directory is where the Jetty stuff lives, > example/solr/lib is the lib where the plugins go. In fact, if you run "ant > example" from the top level (or contrib/extraction) it should place the JARs > in the right places for the example. > >> >> >> apache-solr-cell-nightly.jar bcmail-jdk14-132.jar >> commons-lang-2.1.jar icu4j-3.8.jar log4j-1.2.14.jar >> poi-3.5-beta5.jar slf4j-api-1.5.5.jar >> xml-apis-1.0.b2.jar >> apache-solr-core-nightly.jar bcprov-jdk14-132.jar >> commons-logging-1.0.4.jar jetty-6.1.3.jar nekohtml-1.9.9.jar >> poi-ooxml-3.5-beta5.jar slf4j-jdk14-1.5.5.jar >> xmlbeans-2.3.0.jar >> apache-solr-solrj-nightly.jar commons-codec-1.3.jar dom4j-1.6.1.jar >> jetty-util-6.1.3.jar ooxml-schemas-1.0.jar >> poi-scratchpad-3.5-beta5.jar tika-0.3.jar >> asm-3.1.jar commons-io-1.4.jar >> fontbox-0.1.0-dev.jar jsp-2.1 pdfbox-0.7.3.jar >> servlet-api-2.5-6.1.3.jar xercesImpl-2.8.1.jar >> >> Actually I wasn't very accurate. Following the wiki didn't suffice. I >> had to add other jars, in order to avoid ClassNotFoundExceptions at >> startup. These are >> >> apache-solr-core-nightly.jar >> apache-solr-solrj-nightly.jar >> slf4j-api-1.5.5.jar >> slf4j-jdk14-1.5.5.jar >> >> even while using solr nightly war (in example/webapps). >> >> Perhaps something wrong with jar versions? 
>> >> Francisco >> >> >> 2009/4/20 francisco treacy : >>> >>> Hi Grant, >>> >>> Here is the full stacktrace: >>> >>> 20-Apr-2009 12:36:39 org.apache.solr.common.SolrException log >>> SEVERE: java.lang.ClassCastException: >>> org.apache.solr.handler.extraction.ExtractingRequestHandler cannot be >>> cast to org.apache.solr.request.SolrRequestHandler >>> at >>> org.apache.solr.core.RequestHandlers$1.create(RequestHandlers.java:154) >>> at >>> org.apache.solr.core.RequestHandlers$1.create(RequestHandlers.java:163) >>> at >>> org.apache.solr.util.plugin.AbstractPluginLoader.load(AbstractPluginLoader.java:141) >>> at >>> org.apache.solr.core.RequestHandlers.initHandlersFromConfig(RequestHandlers.java:171) >>> at org.apache.solr.core.SolrCore.(SolrCore.java:535) >>> at >>> org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:122) >>> at >>> org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:69) >>> at >>> org.mortbay.jetty.servlet.FilterHolder.doStart(FilterHolder.java:99) >>> at >>> org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:40) >>> at >>> org.mortbay.jetty.servlet.ServletHandler.initialize(ServletHandler.java:594) >>> at org.mortbay.jetty.servlet.Context.startContext(Context.java:139) >>> at >>> org.mortbay.jetty.webapp.WebAppContext.startContext(WebAppContext.java:1218) >>> at >>> org.mortbay.jetty.handler.ContextHandler.doStart(ContextHandler.java:500) >>> at >>> org.mortbay.jetty.webapp.WebAppContext.doStart(WebAppContext.java:448) >>> at >>> org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:40) >>> at >>> org.mortbay.jetty.handler.HandlerCollection.doStart(HandlerCollection.java:147) >>> at >>> org.mortbay.jetty.handler.ContextHandlerCollection.doStart(ContextHandlerCollection.java:161) >>> at >>> org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:40) >>> at >>> org.mortbay.jetty.handler.HandlerCollection.doStart(HandlerCollection.java:147) >>> at >>> org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:40) >>> at >>> org.mortbay.jetty.handler.HandlerWrapper.doStart(HandlerWrapper.java:117) >>> at org.mortbay.jetty.Server.doStart(Server.java:210) >>> at >>> org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:40) >>> at org.mortbay.xml.XmlConfiguration.main(XmlConfiguration.java:929) >>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) >>> at >>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) >>> at >>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) >>> at java.lang.reflect.Method.invoke(Method.java:616) >>> at org.mortbay.start.Main.invokeMain(Main.java:183) >>> at org.mortbay.start.Main.start(Main.java:497) >>> at org.mortbay.start.Main.main(Main.java:115) >>> >>> Thanks >>> >>> Francisco >>> >
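For reference, a minimal multi-core solr.xml with the shared lib set (the core names and instanceDir values are just placeholders):

<solr persistent="false" sharedLib="lib">
  <cores adminPath="/admin/cores">
    <core name="core0" instanceDir="core0" />
    <core name="core1" instanceDir="core1" />
  </cores>
</solr>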
bug? No highlighting results with dismax and q.alt=*:*
For the Drupal Apache Solr Integration module, we are exploring the possibility of doing facet browsing - since we are using dismax as the default handler, this would mean issuing a query with an empty q and falling back to q.alt='*:*' or some other q.alt that matches all docs. However, I notice when I do this that we do not get any highlights back in the results despite defining a highlight alternate field. In contrast, if I force the standard request handler then I do get text back from the highlight alternate field: select/?q=*:*&qt=standard&hl=true&hl.fl=body&hl.alternateField=body&hl.maxAlternateFieldLength=256 However, I then lose the nice dismax features of weighting the results using bq and bf parameters. So, is this a bug or the intended behavior? The relevant fragment of the solrconfig.xml is sketched below. Full solrconfig.xml and other files: http://cvs.drupal.org/viewvc.py/drupal/contributions/modules/apachesolr/?pathrev=DRUPAL-6--1 -- Peter M. Wolanin, Ph.D. Momentum Specialist, Acquia. Inc. peter.wola...@acquia.com
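A reconstruction of that defaults block, with parameter names inferred from the surviving values (hl.snippets and hl.mergeContiguous in particular are guesses; the highlighting parameters otherwise match the query above), sitting inside our default request handler:

<lst name="defaults">
  <str name="defType">dismax</str>
  <str name="q.alt">*:*</str>
  <str name="hl">true</str>
  <str name="hl.fl">body</str>
  <int name="hl.snippets">3</int>
  <str name="hl.mergeContiguous">true</str>
  <str name="hl.alternateField">body</str>
  <int name="hl.maxAlternateFieldLength">256</int>
</lst>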
Re: bug? No highlighting results with dismax and q.alt=*:*
Possibly this issue is related: https://issues.apache.org/jira/browse/SOLR-825 Though it seems that might affect the standard handler, while what I'm seeing is more specific to the dismax handler. -Peter On Thu, May 7, 2009 at 8:27 PM, Peter Wolanin wrote: > For the Drupal Apache Solr Integration module, we are exploring the > possibility of doing facet browsing - since we are using dismax as > the default handler, this would mean issuing a query with an empty q > and falling back to to q.alt='*:*' or some other q.alt that matches > all docs. > > However, I notice when I do this that we do not get any highlights > back in the results despite defining a highlight alternate field. > > In contrast, if I force the standard request handler then I do get > text back from the highlight alternate field: > > select/?q=*:*&qt=standard&hl=true&hl.fl=body&hl.alternateField=body&hl.maxAlternateFieldLength=256 > > However, I then loose the nice dismax features of weighting the > results using bq and bf parameters. So, is this a bug or the intended > behavior? > > The relevant fragment of the solrconfig.xml is this: > > > > dismax > > *:* > > > true > body > 3 > true > > body > 256 > > > Full solrconfig.xml and other files: > http://cvs.drupal.org/viewvc.py/drupal/contributions/modules/apachesolr/?pathrev=DRUPAL-6--1 > > -- > Peter M. Wolanin, Ph.D. > Momentum Specialist, Acquia. Inc. > peter.wola...@acquia.com >
Re: bug? No highlighting results with dismax and q.alt=*:*
Well, in this case we want to match all documents, so I'm not sure that can be accomplished with dismax other than using a q.alt ? -Peter On Fri, May 8, 2009 at 1:32 PM, Marc Sturlese wrote: > > I have experienced it before... maybe you can manage something similar to > your q.alt using the params q and qf. Highlight will work in that case (I > sorted it out doing that) > > Peter Wolanin-2 wrote: >> >> Possibly this issue is related: >> https://issues.apache.org/jira/browse/SOLR-825 >> >> Though it seems that might affect the standard handler, while what I'm >> seeing is more sepcific to the dismax handler. >> >> -Peter >> >> On Thu, May 7, 2009 at 8:27 PM, Peter Wolanin >> wrote: >>> For the Drupal Apache Solr Integration module, we are exploring the >>> possibility of doing facet browsing - since we are using dismax as >>> the default handler, this would mean issuing a query with an empty q >>> and falling back to to q.alt='*:*' or some other q.alt that matches >>> all docs. >>> >>> However, I notice when I do this that we do not get any highlights >>> back in the results despite defining a highlight alternate field. >>> >>> In contrast, if I force the standard request handler then I do get >>> text back from the highlight alternate field: >>> >>> select/?q=*:*&qt=standard&hl=true&hl.fl=body&hl.alternateField=body&hl.maxAlternateFieldLength=256 >>> >>> However, I then loose the nice dismax features of weighting the >>> results using bq and bf parameters. So, is this a bug or the intended >>> behavior? >>> >>> The relevant fragment of the solrconfig.xml is this: >>> >>> >> default="true"> >>> >>> dismax >>> >>> *:* >>> >>> >>> true >>> body >>> 3 >>> true >>> >>> body >>> 256 >>> >>> >>> Full solrconfig.xml and other files: >>> http://cvs.drupal.org/viewvc.py/drupal/contributions/modules/apachesolr/?pathrev=DRUPAL-6--1 >>> >>> -- >>> Peter M. Wolanin, Ph.D. >>> Momentum Specialist, Acquia. Inc. >>> peter.wola...@acquia.com >>> >> >> >> >> -- >> Peter M. Wolanin, Ph.D. >> Momentum Specialist, Acquia. Inc. >> peter.wola...@acquia.com >> >> > > -- > View this message in context: > http://www.nabble.com/bug--No-highlighting-results-with-dismax-and-q.alt%3D*%3A*-tp23438048p23450189.html > Sent from the Solr - User mailing list archive at Nabble.com. > > -- Peter M. Wolanin, Ph.D. Momentum Specialist, Acquia. Inc. peter.wola...@acquia.com
Re: Replication master+slave
Indeed - that looks nice - having some kind of conditional includes would make many things easier. -Peter On Wed, May 13, 2009 at 4:22 PM, Otis Gospodnetic wrote: > > This looks nice and simple. I don't know enough about this stuff to see any > issues. If there are no issues.? > > Otis > -- > Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch > > > > - Original Message >> From: Bryan Talbot >> To: solr-user@lucene.apache.org >> Sent: Wednesday, May 13, 2009 11:26:41 AM >> Subject: Re: Replication master+slave >> >> I see that Nobel's final comment in SOLR-1154 is that config files need to be >> able to include snippets from external files. In my limited testing, a >> simple >> patch to enable XInclude support seems to work. >> >> >> >> --- src/java/org/apache/solr/core/Config.java (revision 774137) >> +++ src/java/org/apache/solr/core/Config.java (working copy) >> @@ -100,8 +100,10 @@ >> if (lis == null) { >> lis = loader.openConfig(name); >> } >> - javax.xml.parsers.DocumentBuilder builder = >> DocumentBuilderFactory.newInstance().newDocumentBuilder(); >> - doc = builder.parse(lis); >> + javax.xml.parsers.DocumentBuilderFactory dbf = >> DocumentBuilderFactory.newInstance(); >> + dbf.setNamespaceAware(true); >> + dbf.setXIncludeAware(true); >> + doc = dbf.newDocumentBuilder().parse(lis); >> >> DOMUtil.substituteProperties(doc, loader.getCoreProperties()); >> } catch (ParserConfigurationException e) { >> >> >> >> This allows a clause like this to include the contents of replication.xml if >> it >> exists. If it's not found an exception will be thrown. >> >> >> >> href="http://localhost:8983/solr/corename/admin/file/?file=replication.xml"; >> xmlns:xi="http://www.w3.org/2001/XInclude";> >> >> >> >> If the file is optional and no exception should be thrown if the file is >> missing, simply include a fallback action: in this case the fallback is empty >> and does nothing. >> >> >> >> href="http://localhost:8983/solr/forum_en/admin/file/?file=replication.xml"; >> xmlns:xi="http://www.w3.org/2001/XInclude";> >> >> >> >> >> -Bryan >> >> >> >> >> On May 12, 2009, at May 12, 8:05 PM, Jian Han Guo wrote: >> >> > I was looking at the same problem, and had a discussion with Noble. You can >> > use a hack to achieve what you want, see >> > >> > https://issues.apache.org/jira/browse/SOLR-1154 >> > >> > Thanks, >> > >> > Jianhan >> > >> > >> > On Tue, May 12, 2009 at 5:13 PM, Bryan Talbot wrote: >> > >> >> So how are people managing solrconfig.xml files which are largely the same >> >> other than differences for replication? >> >> >> >> I don't think it's a "good thing" to maintain two copies of the same file >> >> and I'd like to avoid that. Maybe enabling the XInclude feature in >> >> DocumentBuilders would make it possible to modularize configuration files >> >> to >> >> make this possible? >> >> >> >> >> >> >> http://java.sun.com/j2se/1.5.0/docs/api/javax/xml/parsers/DocumentBuilderFactory.html#setXIncludeAware(boolean) >> >> >> >> >> >> -Bryan >> >> >> >> >> >> >> >> >> >> >> >> On May 12, 2009, at May 12, 11:43 AM, Shalin Shekhar Mangar wrote: >> >> >> >> On Tue, May 12, 2009 at 10:42 PM, Bryan Talbot >> wrote: >> >>> >> >>> For replication in 1.4, the wiki at >> http://wiki.apache.org/solr/SolrReplication says that a node can be both >> the master and a slave: >> >> A node can act as both master and slave. In that case both the master >> and >> slave configuration lists need to be present inside the >> ReplicationHandler >> requestHandler in the solrconfig.xml. >> >> What does this mean? 
Does the core then poll itself for updates? >> >> >>> >> >>> >> >>> No. This type of configuration is meant for "repeaters". Suppose there >> >>> are >> >>> slaves in multiple data-centers (say data center A and B). There is >> >>> always >> >>> a >> >>> single master (say in A). One of the slaves in B is used as a master for >> >>> the >> >>> other slaves in B. Therefore, this one slave in B is both a master as >> >>> well >> >>> as the slave. >> >>> >> >>> >> >>> >> I'd like to have a single set of configuration files that are shared by >> masters and slaves and avoid duplicating configuration details in >> multiple >> files (one for master and one for slave) to ease management and >> failover. >> Is this possible? >> >> >> >>> You wouldn't want the master to be a slave. So I guess you'd need to have >> >>> a >> >>> separate file. Also, it needs to be a separate file so that the slave >> >>> does >> >>> not become a master when the solrconfig.xml is replicated. >> >>> >> >>> >> >>> >> When I attempt to setup a multi server master-slave configuration and >> include both master and slave replication configuration options, I into >> some >> problems. I'm running a nightly build from May 7. >> >> >>
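Written out, Bryan's include-with-empty-fallback looks roughly like this in solrconfig.xml (the href pointing at an optional replication.xml is taken from his examples above; the empty xi:fallback makes a missing file a no-op instead of an error):

<xi:include href="http://localhost:8983/solr/corename/admin/file/?file=replication.xml"
            xmlns:xi="http://www.w3.org/2001/XInclude">
  <xi:fallback/>
</xi:include>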
Re: Solr memory requirements?
I think that if you have in your index any documents with norms, you will still use norms for those fields even if the schema is changed later. Did you wipe and re-index after all your schema changes? -Peter On Fri, May 15, 2009 at 9:14 PM, vivek sar wrote: > Some more info, > > Profiling the heap dump shows > "org.apache.lucene.index.ReadOnlySegmentReader" as the biggest object > - taking up almost 80% of total memory (6G) - see the attached screen > shot for a smaller dump. There is some norms object - not sure where > are they coming from as I've omitnorms=true for all indexed records. > > I also noticed that if I run a query - let's say generic query that > hits 100million records and then follow up with a specific query - > which hits only 1 record, the second query causes the increase in > heap. > > Looks like there are few bytes being loaded into memory for each > document - I've checked the schema all indexes have omitNorms=true, > all caches are commented out - still looking to see what else might > put things in memory which don't get collected by GC. > > I also saw, https://issues.apache.org/jira/browse/SOLR- for Solr > 1.4 (which I'm using). Not sure if that can cause any problem. I do > use range queries for dates - would that have any effect? > > Any other ideas? > > Thanks, > -vivek > > On Thu, May 14, 2009 at 8:38 PM, vivek sar wrote: >> Thanks Mark. >> >> I checked all the items you mentioned, >> >> 1) I've omitnorms=true for all my indexed fields (stored only fields I >> guess doesn't matter) >> 2) I've tried commenting out all caches in the solrconfig.xml, but >> that doesn't help much >> 3) I've tried commenting out the first and new searcher listeners >> settings in the solrconfig.xml - the only way that helps is that at >> startup time the memory usage doesn't spike up - that's only because >> there is no auto-warmer query to run. But, I noticed commenting out >> searchers slows down any other queries to Solr. >> 4) I don't have any sort or facet in my queries >> 5) I'm not sure how to change the "Lucene term interval" from Solr - >> is there a way to do that? >> >> I've been playing around with this memory thing the whole day and have >> found that it's the search that's hogging the memory. Any time there >> is a search on all the records (800 million) the heap consumption >> jumps by 5G. This makes me think there has to be some configuration in >> Solr that's causing some terms per document to be loaded in memory. >> >> I've posted my settings several times on this forum, but no one has >> been able to pin point what configuration might be causing this. If >> someone is interested I can attach the solrconfig and schema files as >> well. 
Here are the settings again under Query tag, >> >> >> 1024 >> true >> 50 >> 200 >> >> false >> 2 >> >> >> and schema, >> >> > required="true" omitNorms="true" compressed="false"/> >> >> > compressed="false"/> >> > omitNorms="true" compressed="false"/> >> > omitNorms="true" compressed="false"/> >> > omitNorms="true" compressed="false"/> >> > default="NOW/HOUR" compressed="false"/> >> > omitNorms="true" compressed="false"/> >> > omitNorms="true" compressed="false"/> >> > compressed="false"/> >> > compressed="false"/> >> > omitNorms="true" compressed="false"/> >> > omitNorms="true" compressed="false"/> >> > omitNorms="true" compressed="false"/> >> > omitNorms="true" compressed="false"/> >> > omitNorms="true" compressed="false"/> >> > compressed="false"/> >> > compressed="false"/> >> > compressed="false"/> >> > omitNorms="true" compressed="false"/> >> > compressed="false"/> >> > default="NOW/HOUR" omitNorms="true"/> >> >> >> > omitNorms="true" multiValued="true"/> >> >> Any help is greatly appreciated. >> >> Thanks, >> -vivek >> >> On Thu, May 14, 2009 at 6:22 PM, Mark Miller wrote: >>> 800 million docs is on the high side for modern hardware. >>> >>> If even one field has norms on, your talking almost 800 MB right there. And >>> then if another Searcher is brought up well the old one is serving (which >>> happens when you update)? Doubled. >>> >>> Your best bet is to distribute across a couple machines. >>> >>> To minimize you would want to turn off or down caching, don't facet, don't >>> sort, turn off all norms, possibly get at the Lucene term interval and raise >>> it. Drop on deck searchers setting. Even then, 800 million...time to >>> distribute I'd think. >>> >>> vivek sar wrote: Some update on this issue, 1) I attached jconsole to my app and monitored the memory usage. During indexing the memory usage goes up and down, which I think is normal. The memory remains around the min heap size (4 G) for indexing, but as soon as I run a search the tenured heap usage jumps up to 6G and remains there. Subsequent searches increases the heap usage even more until it reaches the max (8G) - after which everything (indexing and searching becomes slow). Th
exceptions when using existing index with latest build
Building Solr last night from updated svn, I'm now getting the exception below when I use any fq parameter searching a pre-existing index. So far, I could not fix it by tweaking config files, so I had to delete and re-index. I note that Solr was recently updated to the latest lucene build, so maybe something broke in the index format? Here's the relevant part of the trace: org.apache.lucene.index.ReadOnlySegmentReader cannot be cast to org.apache.solr.search.SolrIndexReader java.lang.ClassCastException: org.apache.lucene.index.ReadOnlySegmentReader cannot be cast to org.apache.solr.search.SolrIndexReader at org.apache.solr.search.SortedIntDocSet$2.getDocIdSet(SortedIntDocSet.java:530) at org.apache.lucene.search.IndexSearcher.doSearch(IndexSearcher.java:237) at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:221) at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:212) at org.apache.lucene.search.Searcher.search(Searcher.java:150) at org.apache.solr.search.SolrIndexSearcher.getDocListNC(SolrIndexSearcher.java:1032) at org.apache.solr.search.SolrIndexSearcher.getDocListC(SolrIndexSearcher.java:894) at org.apache.solr.search.SolrIndexSearcher.search(SolrIndexSearcher.java:337) at org.apache.solr.handler.component.QueryComponent.process(QueryComponent.java:176) at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:195) at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131) at org.apache.solr.core.SolrCore.execute(SolrCore.java:1328) -- Peter M. Wolanin, Ph.D. Momentum Specialist, Acquia. Inc. peter.wola...@acquia.com
Re: Recover crashed solr index
you can use the lucene jar with solr to invoke the CheckIndex method - this will possibly allow you to recover if you pass the -fix param. You may lose some docs, however, so this is only viable if you can, for example, query to check what's missing. The command looks like (from the root of the solr svn checkout): java -ea:org.apache.lucene -cp lib/lucene-core-2.9-dev.jar org.apache.lucene.index.CheckIndex [path to index directory] For example, to check the example index: java -ea:org.apache.lucene -cp lib/lucene-core-2.9-dev.jar org.apache.lucene.index.CheckIndex example/solr/data/index/ -Peter On Mon, May 25, 2009 at 4:42 AM, Wang Guangchen wrote: > Hi everyone, > > I have 8m docs to index, and each doc is around 50kb. The solr crashed in > the middle of indexing. error message said that one of the file in the data > directory is missing. I don't know why this is happened. > > So right now I have to find a way to recover the index to avoid re-index. Is > there anyone know any tools or method to recover the crashed index? Please > help. > > Thanks a lot. > > Regards > GC > -- Peter M. Wolanin, Ph.D. Momentum Specialist, Acquia. Inc. peter.wola...@acquia.com
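To actually attempt a repair, the same invocation takes -fix as a trailing argument - back up the index directory first, since segments with unrecoverable docs get dropped:

java -ea:org.apache.lucene -cp lib/lucene-core-2.9-dev.jar org.apache.lucene.index.CheckIndex example/solr/data/index/ -fix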
NPE when unloading an absent core
Is this a known bug? When I try to unload a core that does not exist, Solr throws a NullPointerException java.lang.NullPointerException at org.apache.solr.handler.admin.CoreAdminHandler.handleUnloadAction(CoreAdminHandler.java:319) at org.apache.solr.handler.admin.CoreAdminHandler.handleRequestBody(CoreAdminHandler.java:125) at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131) at org.apache.solr.servlet.SolrDispatchFilter.handleAdminRequest(SolrDispatchFilter.java:301) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:174) at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235) at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206) at com.acquia.search.HmacFilter.doFilter(HmacFilter.java:62) at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235) at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206) at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233) at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:175) at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:128) at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102) at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109) at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:568) at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:286) at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:844) at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:583) at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:447) at java.lang.Thread.run(Thread.java:619) -- Peter M. Wolanin, Ph.D. Momentum Specialist, Acquia. Inc. peter.wola...@acquia.com
Re: NPE when unloading an absent core
I did not find any relevant issue, so here's a new issue with a patch: https://issues.apache.org/jira/browse/SOLR-1200 -Peter On Wed, Jun 3, 2009 at 4:56 PM, Peter Wolanin wrote: > Is this a known bug? When I try to unload a core that does not exist, > Solr throws a NullPointerException > > java.lang.NullPointerException > at > org.apache.solr.handler.admin.CoreAdminHandler.handleUnloadAction(CoreAdminHandler.java:319) > at > org.apache.solr.handler.admin.CoreAdminHandler.handleRequestBody(CoreAdminHandler.java:125) > at > org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131) > at > org.apache.solr.servlet.SolrDispatchFilter.handleAdminRequest(SolrDispatchFilter.java:301) > at > org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:174) > at > org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235) > at > org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206) -- Peter M. Wolanin, Ph.D. Momentum Specialist, Acquia. Inc. peter.wola...@acquia.com
Re: Dismax request handler and highlighting
I had the same problem - I think the answer is that highlighting is not currently supported with q.alt and dismax. http://www.nabble.com/bug--No-highlighting-results-with-dismax-and-q.alt%3D*%3A*-td23438048.html#a23438048 -Peter On Sun, Jun 7, 2009 at 7:51 AM, Fouad Mardini wrote: > Hello, > > I am having problems with the dismax request handler and highlighting. > The following query works as intended > > http://localhost:8983/solr/select?indent=on&q=myquery&start=0&rows=10&fl=id%2Cscore&qt=standard&wt=standard&hl=true&hl.fl=myfield > > whereas > > http://localhost:8983/solr/select?indent=on&q.alt=myquery&start=0&rows=10&fl=id%2Cscore&qt=dismax&wt=standard&hl=true&hl.fl=myfield > > I am using dismax since i need boost functions. > Furthermore, using the q parameter with dismax doesn't seem to work with me, > debug gives the following output > > myquery > +() () > > is there a setting somewhere that i need to set? > > I am building SOLR right out of svn. > > Thanks, > Fouad > -- Peter M. Wolanin, Ph.D. Momentum Specialist, Acquia. Inc. peter.wola...@acquia.com
can Trie fields be stored?
Looking at the new examples of solr.TrieField http://svn.apache.org/repos/asf/lucene/solr/trunk/example/solr/conf/schema.xml I see that all have indexed="true" stored="false" in the field type definition. Does this mean that you cannot ever store a value for one of these fields? I.e. if I want to do a range query and also return the values, I need to store the values in a separate field? Thanks, Peter -- Peter M. Wolanin, Ph.D. Momentum Specialist, Acquia. Inc. peter.wola...@acquia.com
multi-core, autocommit and resource use
A question for anyone familiar with the details of the time-based autocommit mechanism in Solr: if I am running several cores on the same server and send updates to each core at the same time, what happens? If all the cores have their autocommit time run out at the same time, will every core try to conduct operations (e.g. opening new searchers, merges, other things?) at the same time and thus cause resource issues? I think I understand that all the pending changes are on disk already, so the "commit" that happens when the time is up is really just opening new searchers that include the added documents. Thanks, Peter -- Peter M. Wolanin, Ph.D. Momentum Specialist, Acquia. Inc. peter.wola...@acquia.com
Re: multi-core, autocommit and resource use
So for now would it make sense to spread out the autocommit times for the different cores? Thanks. -Peter On Thu, Jun 18, 2009 at 7:07 PM, Yonik Seeley wrote: > On Thu, Jun 18, 2009 at 4:27 PM, Peter Wolanin > wrote: >> I think I understand >> that all the pending changes are on disk already, so the "commit" that >> happens when the time is up is really just opening new searchers that >> include the added documents. > > Only some of the pending changes may be on disk - a solr level commit > involves closing the IndexWriter which flushes everything to disk, and > then a new IndexReader is opened to read those changes. > > This will be improved in future versions such that an IndexReader can > be opened *before* all of the changes have been flushed to disk (work > on near-real-time indexing/searching in Lucene is progressing). > > -Yonik > http://www.lucidimagination.com > -- Peter M. Wolanin, Ph.D. Momentum Specialist, Acquia. Inc. peter.wola...@acquia.com
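Concretely, that would just mean giving each core's solrconfig.xml a different commit window in the update handler - the values below are arbitrary examples:

<!-- core A -->
<autoCommit>
  <maxTime>60000</maxTime>
</autoCommit>

<!-- core B: offset so the commits don't all coincide -->
<autoCommit>
  <maxTime>90000</maxTime>
</autoCommit>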
Re: facets: case and accent insensitive sort
Seems like this might be approached using a Lucene payload? For example where the original string is stored as the payload and available in the returned facets for display purposes? Payloads are byte arrays stored with Terms on Fields. See https://issues.apache.org/jira/browse/LUCENE-755 Solr seems to have support for a few example payloads already like NumericPayloadTokenFilter Almost any way you approach this it seems like there are potentially problems since you might have multiple combinations of case and accent mapping to the same case-less accent-less value that you want to use for sorting (and I assume for counting) your facets? -Peter On Fri, Jun 26, 2009 at 9:02 AM, Sébastien Lamy wrote: > Shalin Shekhar Mangar a écrit : >> >> On Fri, Jun 26, 2009 at 6:02 PM, Sébastien Lamy wrote: >> >> >>> >>> If I use a copyField to store into a string type, and facet on that, my >>> problem remains: >>> The facets are sorted case and accent sensitive. And I want an >>> *insensitive* sort. >>> If I use a copyField to store into a type with no accents and case (e.g >>> alphaOnlySort), then solr return me facet values with no accents and no >>> case. And I want the facet values returned by solr to *have accents and >>> case*. >>> >> >> Ah, of course you are right. There is no way to do this right now except >> at >> the client side. >> > > Thank you for your response. > Would it be easy to modify Solr to behave like I want. Where should I start > to investigate? > -- Peter M. Wolanin, Ph.D. Momentum Specialist, Acquia. Inc. peter.wola...@acquia.com
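One concrete, if untested, way to try the payload idea in 1.4: Solr ships solr.DelimitedPayloadTokenFilterFactory, so ending an analyzer chain with <filter class="solr.DelimitedPayloadTokenFilterFactory" encoder="identity" delimiter="|"/> would let you index tokens like cafe|Café and keep the original accented form as a byte-array payload on the folded term - though you would still need custom query-side code to read the payloads back out for display, and this is only a sketch of the approach, not something I've verified for faceting.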
Select tika output for extract-only?
I had been assuming that I could choose among possible tika output formats when using the extracting request handler in extract-only mode as if from the CLI with the tika jar: -x or --xml Output XHTML content (default) -h or --html Output HTML content -t or --text Output plain text content -m or --metadata Output only metadata However, looking at the docs and source, it seems that only the xml option is available (hard-coded) in ExtractingDocumentLoader: serializer = new XMLSerializer(writer, new OutputFormat("XML", "UTF-8", true)); In addition, it seems that the metadata is always appended to the response. Are there any open issues relating to this, or opinions on whether adding additional flexibility to the response format would be of interest for 1.4? Thanks, Peter -- Peter M. Wolanin, Ph.D. Momentum Specialist, Acquia. Inc. peter.wola...@acquia.com
Re: Select tika output for extract-only?
Ok, thanks. I played with it enough to get plain text out at least, but I'll wait for the resolution of SOLR-284 -Peter On Sun, Jul 12, 2009 at 9:20 AM, Yonik Seeley wrote: > Peter, I'm hacking up solr cell right now, trying to simplify the > parameters and fix some bugs (see SOLR-284) > A quick patch to specify the output format should make it into 1.4 - > but you may want to wait until I finish. > > -Yonik > http://www.lucidimagination.com > > On Sat, Jul 11, 2009 at 5:39 PM, Peter Wolanin > wrote: >> I had been assuming that I could choose among possible tika output >> formats when using the extracting request handler in extract-only mode >> as if from the CLI with the tika jar: >> >> -x or --xml Output XHTML content (default) >> -h or --html Output HTML content >> -t or --text Output plain text content >> -m or --metadata Output only metadata >> >> However, looking at the docs and source, it seems that only the xml >> option is available (hard-coded) in ExtractingDocumentLoader: >> >> serializer = new XMLSerializer(writer, new OutputFormat("XML", "UTF-8", >> true)); >> >> In addition, it seems that the metadata is always appended to the response. >> >> Are there any open issues relating to this, or opinions on whether >> adding additional flexibility to the response format would be of >> interest for 1.4? >> >> Thanks, >> >> Peter >> >> -- >> Peter M. Wolanin, Ph.D. >> Momentum Specialist, Acquia. Inc. >> peter.wola...@acquia.com >> > -- Peter M. Wolanin, Ph.D. Momentum Specialist, Acquia. Inc. peter.wola...@acquia.com
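In the meantime, extracting plain text directly with Tika outside Solr is straightforward. A self-contained sketch, assuming the Tika 0.4-era parse(InputStream, ContentHandler, Metadata) signature:

import java.io.FileInputStream;
import java.io.InputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;

// Extract plain text (rather than the XHTML that ExtractingDocumentLoader
// hard-codes) by calling Tika directly.
public class PlainTextExtract {
  public static void main(String[] args) throws Exception {
    InputStream in = new FileInputStream(args[0]);
    BodyContentHandler text = new BodyContentHandler();
    Metadata metadata = new Metadata();
    new AutoDetectParser().parse(in, text, metadata);
    System.out.println(text.toString());
  }
}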
lucene or Solr bug with dismax?
I have been getting exceptions thrown when users try to send boolean queries into the dismax handler. In particular, with a leading 'OR'. I'm really not sure why this happens - I thought the dismax parser ignored AND/OR? I'm using rev 779609 in case there were recent changes to this. Is this a known issue? Jul 13, 2009 1:47:06 PM org.apache.solr.common.SolrException log SEVERE: org.apache.solr.common.SolrException: org.apache.lucene.queryParser.ParseException: Cannot parse 'OR vti OR bin OR vti OR aut OR author OR dll': Encountered " "OR "" at line 1, column 0. Was expecting one of: ... "+" ... "-" ... "(" ... "*" ... ... ... ... ... "[" ... "{" ... ... ... "*" ... at org.apache.solr.handler.component.QueryComponent.prepare(QueryComponent.java:110) at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:174) at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131) -- Peter M. Wolanin, Ph.D. Momentum Specialist, Acquia. Inc. peter.wola...@acquia.com
Re: lucene or Solr bug with dismax?
Indeed - I assumed that only the "+" and "-" characters had any special meaning when parsing dismax queries and that all other content would be treated just as keywords. That seems to be how it's described in the dismax documentation? Looks like this is a relevant issue (is there another)? https://issues.apache.org/jira/browse/SOLR-874 -Peter On Mon, Jul 13, 2009 at 4:12 PM, Mark Miller wrote: > It doesn't ignore OR and AND, though it probably should. I think there is a > JIRA issue for it somewhere. > > On Mon, Jul 13, 2009 at 4:10 PM, Peter Wolanin > wrote: > >> I can still generate this error with Solr built from svn trunk just now. >> >> http://localhost:8983/solr/select/?qt=dismax&q=OR+vti+OR+foo >> >> I'm doubly perplexed by this since 'or' is in the stopwords file. >> >> -Peter >> >> On Mon, Jul 13, 2009 at 3:15 PM, Peter Wolanin >> wrote: >> > I have been getting exceptions thrown when users try to send boolean >> > queries into the dismax handler. In particular, with a leading 'OR'. >> > I'm really not sure why this happens - I thought the dsimax parser >> > ignored AND/OR? >> > >> > I'm using rev 779609 in case there were recent changes to this. Is >> > this a known issue? >> > >> > >> > Jul 13, 2009 1:47:06 PM org.apache.solr.common.SolrException log >> > SEVERE: org.apache.solr.common.SolrException: >> > org.apache.lucene.queryParser.ParseException: Cannot parse 'OR vti OR >> > bin OR vti OR aut OR author OR dll': Encountered " "OR "" at line >> > 1, column 0. >> > Was expecting one of: >> > ... >> > "+" ... >> > "-" ... >> > "(" ... >> > "*" ... >> > ... >> > ... >> > ... >> > ... >> > "[" ... >> > "{" ... >> > ... >> > ... >> > "*" ... >> > >> > at >> org.apache.solr.handler.component.QueryComponent.prepare(QueryComponent.java:110) >> > at >> org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:174) >> > at >> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131) >> > >> > >> > >> > -- >> > Peter M. Wolanin, Ph.D. >> > Momentum Specialist, Acquia. Inc. >> > peter.wola...@acquia.com >> > >> >> >> >> -- >> Peter M. Wolanin, Ph.D. >> Momentum Specialist, Acquia. Inc. >> peter.wola...@acquia.com >> > > > > -- > -- > - Mark > > http://www.lucidimagination.com > -- Peter M. Wolanin, Ph.D. Momentum Specialist, Acquia. Inc. peter.wola...@acquia.com
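Until the parser handles this, a blunt client-side workaround is to lowercase bare boolean operators before sending the query - Lucene's grammar only treats the uppercase forms as operators, so the lowercase forms pass through as ordinary terms (and "or" is then dropped as a stopword anyway). A sketch:

// Neutralize bare AND/OR/NOT in user input before passing it to dismax;
// lowercase forms are parsed as plain terms, not operators.
public class QuerySanitizer {
  public static String sanitize(String q) {
    return q.replaceAll("\\bAND\\b", "and")
            .replaceAll("\\bOR\\b", "or")
            .replaceAll("\\bNOT\\b", "not");
  }

  public static void main(String[] args) {
    // Prints: or vti or bin or vti or aut
    System.out.println(sanitize("OR vti OR bin OR vti OR aut"));
  }
}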
Re: lucene or Solr bug with dismax?
I can still generate this error with Solr built from svn trunk just now. http://localhost:8983/solr/select/?qt=dismax&q=OR+vti+OR+foo I'm doubly perplexed by this since 'or' is in the stopwords file. -Peter On Mon, Jul 13, 2009 at 3:15 PM, Peter Wolanin wrote: > I have been getting exceptions thrown when users try to send boolean > queries into the dismax handler. In particular, with a leading 'OR'. > I'm really not sure why this happens - I thought the dsimax parser > ignored AND/OR? > > I'm using rev 779609 in case there were recent changes to this. Is > this a known issue? > > > Jul 13, 2009 1:47:06 PM org.apache.solr.common.SolrException log > SEVERE: org.apache.solr.common.SolrException: > org.apache.lucene.queryParser.ParseException: Cannot parse 'OR vti OR > bin OR vti OR aut OR author OR dll': Encountered " "OR "" at line > 1, column 0. > Was expecting one of: > ... > "+" ... > "-" ... > "(" ... > "*" ... > ... > ... > ... > ... > "[" ... > "{" ... > ... > ... > "*" ... > > at > org.apache.solr.handler.component.QueryComponent.prepare(QueryComponent.java:110) > at > org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:174) > at > org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131) > > > > -- > Peter M. Wolanin, Ph.D. > Momentum Specialist, Acquia. Inc. > peter.wola...@acquia.com > -- Peter M. Wolanin, Ph.D. Momentum Specialist, Acquia. Inc. peter.wola...@acquia.com
Re: Multivalued fields and scoring/sorting
Assuming that you know the unique ID when constructing the query (which it sounds like you do) why not try a boost query with a high boost for 2 and a lower boost for 1 - then the default sort by score should match your desired ordering, and this order can be further tweaked with other bf or bq arguments. -Peter On Thu, Jul 16, 2009 at 9:15 AM, Matt Schraeder wrote: > The first number is a unique ID that points to a particular customer, > the second is a value. It basically tells us whether or not a customer > already has that product or not. The main use of it is to be able to > search our product listing for products the customer does not already > have. > > The alternative would be to put that in a second index, but that would > mean that I would be doing two searches for every single search I want > to complete, which I am not sure would be a very good option. > avl...@gmail.com 7/16/2009 12:04:53 AM >>> > > The harsh reality of life is that you cannot sort on multivalued > fields. > If you can explain your domain problem (the significance of numbers > "818", > "2" etc), maybe people can come up with an alternate index design which > fits > into your use cases. > > Cheers > Avlesh > > On Thu, Jul 16, 2009 at 1:18 AM, Matt Schraeder > wrote: > >> I am trying to come up with a way to sort (or score, and sort based > on >> the score) of a multivalued field. I was looking at FunctionQueries > and >> saw fieldvalue, but as that only works on single valued fields that >> doesn't help me. >> >> The field is as follows: >> >> > sortMissingLast="true" omitNorms="true"> >> >> >> >> >> >> >> >> >> >> >> > multiValued="true" /> >> >> The actual data that gets put in this field is a string consisting of > a >> number, a space, and a 1 or a 2. For example: >> >> "818 2" >> "818 1" >> "950 1" >> "1022 2" >> >> I want to be able to give my search results given a boost if a >> particular document contains "818 2" and a smaller boost if the > document >> contains "818 1" but not "818 2". >> >> The end result would be documents sorted as follows: >> >> 1) Documents with "818 2" >> 2) Documents with "818 1" but not "818 2" >> 3) Documents that contain neither "818 2" nor "818 1" >> >> Is this possible with solr? How would I go about doing this? >> > > -- Peter M. Wolanin, Ph.D. Momentum Specialist, Acquia. Inc. peter.wola...@acquia.com
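Concretely, that would look something like &qt=dismax&q=...&bq=ownership:"818 2"^10 ownership:"818 1"^2 on the request ("ownership" is a stand-in for whatever the field is actually called, and the boost values need tuning). Documents matching neither clause still match the main query; they just score below the boosted ones, which gives the three-tier ordering described.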
Re: Wikipedia or reuters like index for testing facets?
AWS provides some standard data sets, including an extract of all wikipedia content: http://developer.amazonwebservices.com/connect/entry.jspa?externalID=2345&categoryID=249 Looks like it's not being updated often, so this or another AWS data set could be a consistent basis for benchmarking? -Peter On Wed, Jul 15, 2009 at 2:21 PM, Jason Rutherglen wrote: > Yeah that's what I was thinking of as an alternative, use enwiki > and randomly generate facet data along with it. However for > consistent benchmarking the random data would need to stay the > same so that people could execute the same benchmark > consistently in their own environment. > > On Tue, Jul 14, 2009 at 6:28 PM, Mark Miller wrote: >> Why don't you just randomly generate the facet data? Thats prob the best way >> right? You can control the uniques and ranges. >> >> On Wed, Jul 15, 2009 at 1:21 AM, Grant Ingersoll wrote: >> >>> Probably not as generated by the EnwikiDocMaker, but the WikipediaTokenizer >>> in Lucene can pull out richer syntax which could then be Teed/Sinked to >>> other fields. Things like categories, related links, etc. Mostly, though, >>> I was just commenting on the fact that it isn't hard to at least use it for >>> getting docs into Solr. >>> >>> -Grant >>> >>> On Jul 14, 2009, at 7:38 PM, Jason Rutherglen wrote: >>> >>> You think enwiki has enough data for faceting? On Tue, Jul 14, 2009 at 2:56 PM, Grant Ingersoll wrote: > At a min, it is trivial to use the EnWikiDocMaker and then send the doc > over > SolrJ... > > On Jul 14, 2009, at 4:07 PM, Mark Miller wrote: > > On Tue, Jul 14, 2009 at 3:36 PM, Jason Rutherglen < >> jason.rutherg...@gmail.com> wrote: >> >> Is there a standard index like what Lucene uses for contrib/benchmark >>> for >>> executing faceted queries over? Or maybe we can randomly generate one >>> that >>> works in conjunction with wikipedia? That way we can execute real world >>> queries against faceted data. Or we could use the Lucene/Solr mailing >>> lists >>> and other data (ala Lucid's faceted site) as a standard index? >>> >>> >> I don't think there is any standard set of docs for solr testing - there >> is >> not a real benchmark contrib - though I know more than a few of us have >> hacked up pieces of Lucene benchmark to work with Solr - I think I've >> done >> it twice now ;) >> >> Would be nice to get things going. I was thinking the other day: I >> wonder >> how hard it would be to make Lucene Benchmark generic enough to accept >> Solr >> impls and Solr algs? >> >> It does a lot that would suck to duplicate. >> >> -- >> -- >> - Mark >> >> http://www.lucidimagination.com >> > > -- > Grant Ingersoll > http://www.lucidimagination.com/ > > Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using > Solr/Lucene: > http://www.lucidimagination.com/search > > > >>> -- >>> Grant Ingersoll >>> http://www.lucidimagination.com/ >>> >>> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using >>> Solr/Lucene: >>> http://www.lucidimagination.com/search >>> >>> >> >> >> -- >> -- >> - Mark >> >> http://www.lucidimagination.com >> > -- Peter M. Wolanin, Ph.D. Momentum Specialist, Acquia. Inc. peter.wola...@acquia.com
Re: spellcheck with misspelled words in index
I think you can just tell the spellchecker to only supply "more popular" suggestions, which would naturally omit these rare misspellings: <str name="spellcheck.onlyMorePopular">true</str> -Peter On Wed, Jul 15, 2009 at 7:30 PM, Jay Hill wrote: > We had the same thing to deal with recently, and a great solution was posted > to the list. Create a stopwords filter on the field you're using for your > spell checking, and then populate a custom stopwords file with known > misspelled words: > > <fieldType ... positionIncrementGap="100"> > ... > <filter class="solr.StopFilterFactory" > words="misspelled_words.txt"/> > ... > </fieldType> > > Your spell field would look like this: > <field ... multiValued="true"/> > > Then add words like "cusine" to misspelled_words.txt > > -Jay > > > On Tue, Jul 14, 2009 at 11:40 PM, Chris Williams wrote: >> Hi, >> I'm having some trouble getting the correct results from the >> spellcheck component. I'd like to use it to suggest correct product >> titles on our site, however some of our products have misspellings in >> them outside of our control. For example, there's 2 products with the >> misspelled word "cusine" (and 25k with the correct spelling >> "cuisine"). So if someone searches for the word "cusine" on our site, >> I would like to show the 2 misspelled products, and a suggestion with >> "Did you mean cuisine?". >> >> However, I can't seem to ever get any spelling suggestions when I >> search by the word "cusine", and correctlySpelled is always true. >> Misspelled words that don't appear in the index work fine. >> >> I noticed that setting onlyMorePopular to true will return suggestions >> for the misspelled word, but I've found that it doesn't work great for >> other words and produces suggestions too often for correctly spelled >> words. >> >> I incorrectly had thought that by setting thresholdTokenFrequency >> higher on my spelling dictionary that these misspellings would not >> appear in my spelling index and thus I would get suggestions for them, >> but as I see now, the spellcheck doesn't quite work like that. >> >> Is there any way to somehow get spelling suggestions to work for these >> misspellings in my index if they have a low frequency? >> >> Thanks in advance, >> Chris >> > -- Peter M. Wolanin, Ph.D. Momentum Specialist, Acquia. Inc. peter.wola...@acquia.com
Re: Obtaining SOLR index size on disk
Actually, if you have a server enabled as a replication master, the stats.jsp page reports the index size, so that information is available in some cases. -Peter On Sat, Jul 18, 2009 at 8:14 AM, Erik Hatcher wrote: > > On Jul 17, 2009, at 8:45 PM, J G wrote: >> >> Is it possible to obtain the SOLR index size on disk through the SOLR API? >> I've read through the docs and mailing list questions but can't seem to find >> the answer. > > No, but it'd be a great addition to the /admin/system handler which returns > lots of other useful trivia like the free memory, ulimit, uptime, and such. > > Erik > > -- Peter M. Wolanin, Ph.D. Momentum Specialist, Acquia. Inc. peter.wola...@acquia.com
Re: SOLR: Replication
Related to the difference between rsync and native Solr replication - we are seeing issues with Solr 1.4 where search queries that come in during a replication request hang for an excessive amount of time (up to hundreds of seconds for a result that normally takes ~50 ms). We are replicating pretty often (every 90 sec for multiple cores to one slave server), but still did not think that replication would make the master server unable to handle search requests. Is there some configuration option we are missing which would handle this situation better? Thanks, Peter On Sun, Jan 3, 2010 at 11:27 AM, Fuad Efendi wrote: > Thank you Yonik, excellent WIKI! I'll try without APR, I believe it's > environmental issue; 100Mbps switched should do 10 times faster (current > replica speed is 1Mbytes/sec) > > >> -Original Message- >> From: ysee...@gmail.com [mailto:ysee...@gmail.com] On Behalf Of Yonik >> Seeley >> Sent: January-03-10 10:03 AM >> To: solr-user@lucene.apache.org >> Subject: Re: SOLR: Replication >> >> On Sat, Jan 2, 2010 at 11:35 PM, Fuad Efendi wrote: >> > I tried... I set APR to improve performance... server is slow while >> replica; >> > but "top" shows only 1% of I/O wait... it is probably environment >> specific; >> >> So you're saying that stock tomcat (non-native APR) was also 10 times >> slower? >> >> > but the same happened in my home-based network, rsync was 10 times >> faster... >> > I don't know details of HTTP-replica, it could be base64 or something >> like >> > that; RAM-buffer, flush to disk, etc. >> >> The HTTP replication is using binary. >> If you look here, it was benchmarked to be nearly as fast as rsync: >> http://wiki.apache.org/solr/SolrReplication >> >> It does do a fsync to make sure that the files are on disk after >> downloading, but that shouldn't make too much difference. >> >> -Yonik >> http://www.lucidimagination.com > > -- Peter M. Wolanin, Ph.D. Momentum Specialist, Acquia. Inc. peter.wola...@acquia.com
Re: SOLR Performance Tuning: Pagination
At the NOVA Apache Lucene/Solr Meetup last May, one of the speakers from Near Infinity (Aaron McCurry I think) mentioned that he had a patch for lucene that enabled unlimited depth memory-efficient paging. Is anyone in contact with him? -Peter On Thu, Dec 24, 2009 at 11:27 AM, Grant Ingersoll wrote: > > On Dec 24, 2009, at 11:09 AM, Fuad Efendi wrote: > >> I used pagination for a while till found this... >> >> >> I have filtered query ID:[* TO *] returning 20 millions results (no >> faceting), and pagination always seemed to be fast. However, fast only with >> low values for start=12345. Queries like start=28838540 take 40-60 seconds, >> and even cause OutOfMemoryException. > > Yeah, deep pagination in Lucene/Solr can be problematic due to the Priority > Queue management. See http://issues.apache.org/jira/browse/LUCENE-2127 and > the linked discussion on java-dev. > >> >> I use highlight, faceting on nontokenized "Country" field, standard handler. >> >> >> It even seems to be a bug... >> >> >> Fuad Efendi >> +1 416-993-2060 >> http://www.linkedin.com/in/liferay >> >> Tokenizer Inc. >> http://www.tokenizer.ca/ >> Data Mining, Vertical Search >> >> >> >> > > -- > Grant Ingersoll > http://www.lucidimagination.com/ > > Search the Lucene ecosystem using Solr/Lucene: > http://www.lucidimagination.com/search > > -- Peter M. Wolanin, Ph.D. Momentum Specialist, Acquia. Inc. peter.wola...@acquia.com
Re: Indexing the latests MS Office documents
You must have been searching old documentation - I think Tika 0.3+ has support for the new MS formats, but don't take my word for it - why not build Tika and try it? -Peter On Sun, Jan 3, 2010 at 7:00 PM, Roland Villemoes wrote: > Hi All, > > Anyone who knows how to index the latest MS office documents like .docx and > .xlsx ? > > From searching it seems like Tika only supports the earlier formats .doc and > .xls > > > > med venlig hilsen/best regards > > Roland Villemoes > Tel: (+45) 22 69 59 62 > E-Mail: mailto:r...@alpha-solutions.dk > > -- Peter M. Wolanin, Ph.D. Momentum Specialist, Acquia. Inc. peter.wola...@acquia.com
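The quickest check is the runnable Tika jar itself, e.g. java -jar tika-app-0.4.jar -t report.docx (the exact jar name depends on the version you build, and the file name here is invented) - if plain text comes back, the format is supported.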
dramatic load from stats.jsp page
The attached screenshot shows the transition on a master search server when we updated from a Solr 1.4 dev build (revision 779609 from 2009-05-28) to the Solr 1.4.0 released code. Every 3 hours we have a cron task to log some of the data from the stats.jsp page from each core (about 100 cores, most of which are small indexes). You can see there is a dramatic spiking of the load after the update - I think due to added reporting on that page, such as from the Lucene FieldCache. Is this amount of increased load expected from the stats.jsp page, or would you consider this a bug? Other than creating a custom jsp page with just a subset of this data, I don't see any way in Solr to query and display specific stats of interest for one core via the REST interface - am I missing something? Thanks, Peter -- Peter M. Wolanin, Ph.D. Momentum Specialist, Acquia. Inc. peter.wola...@acquia.com
Re: internal XML parser used in Solr
Config.java (which parses e.g. solrconfig.xml) in the solr core code has: import org.w3c.dom.Document; import org.w3c.dom.Node; import org.xml.sax.SAXException; import org.apache.solr.common.SolrException; import org.apache.solr.common.util.DOMUtil; import javax.xml.parsers.*; import javax.xml.xpath.XPath; import javax.xml.xpath.XPathFactory; import javax.xml.xpath.XPathConstants; import javax.xml.xpath.XPathExpressionException; import javax.xml.namespace.QName; On Tue, Jan 5, 2010 at 10:05 AM, Smith G wrote: > Hello , > There are some project specific schema xml files which should > be parsed. I have used Jdom API for the same. But it seems more clean > to shift to xml parser used by Solr itself. I have gone through source > codes.Its a bit confusing. I have found javax.xml package and also > org.xml.sax package . May I know which API should I use so that there > is no need to add some external jar file to the solr-lib . I am also > looking for the jar file ( in solr ) in which xml parser api is > included. > Thanks > -- Peter M. Wolanin, Ph.D. Momentum Specialist, Acquia. Inc. peter.wola...@acquia.com
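In other words, everything Config.java needs ships with the JDK (JAXP plus the bundled DOM/SAX interfaces), so no extra jar is required. A minimal self-contained sketch in the same style - the XPath expression is only an illustration against a solrconfig-like file:

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

// Parse an XML config file and pull out one node using only the JDK's
// built-in JAXP classes - the same packages Config.java imports.
public class ParseConfig {
  public static void main(String[] args) throws Exception {
    Document doc = DocumentBuilderFactory.newInstance()
        .newDocumentBuilder().parse(args[0]);
    XPath xpath = XPathFactory.newInstance().newXPath();
    Node node = (Node) xpath.evaluate("/config/abortOnConfigurationError",
        doc, XPathConstants.NODE);
    System.out.println(node == null ? "not found" : node.getTextContent());
  }
}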
Re: SOLR Performance Tuning: Pagination
Great - this issue? https://issues.apache.org/jira/browse/LUCENE-2127 Sounds like it would be a real win for lucene. -Peter On Thu, Jan 7, 2010 at 4:12 PM, Otis Gospodnetic wrote: > Peter - Aaron just commented on a recent Solr issue (reading large result > sets) and mentioned his patch. > So far he has 2 x +1 from Grant and me to stick his patch in JIRA. > > Otis > -- > Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch > > > > - Original Message >> From: Peter Wolanin >> To: solr-user@lucene.apache.org >> Sent: Sun, January 3, 2010 3:37:01 PM >> Subject: Re: SOLR Performance Tuning: Pagination >> >> At the NOVA Apache Lucene/Solr Meetup last May, one of the speakers >> from Near Infinity (Aaron McCurry I think) mentioned that he had a >> patch for lucene that enabled unlimited depth memory-efficient paging. >> Is anyone in contact with him? >> >> -Peter >> >> On Thu, Dec 24, 2009 at 11:27 AM, Grant Ingersoll wrote: >> > >> > On Dec 24, 2009, at 11:09 AM, Fuad Efendi wrote: >> > >> >> I used pagination for a while till found this... >> >> >> >> >> >> I have filtered query ID:[* TO *] returning 20 millions results (no >> >> faceting), and pagination always seemed to be fast. However, fast only >> >> with >> >> low values for start=12345. Queries like start=28838540 take 40-60 >> >> seconds, >> >> and even cause OutOfMemoryException. >> > >> > Yeah, deep pagination in Lucene/Solr can be problematic due to the Priority >> Queue management. See http://issues.apache.org/jira/browse/LUCENE-2127 and >> the >> linked discussion on java-dev. >> > >> >> >> >> I use highlight, faceting on nontokenized "Country" field, standard >> >> handler. >> >> >> >> >> >> It even seems to be a bug... >> >> >> >> >> >> Fuad Efendi >> >> +1 416-993-2060 >> >> http://www.linkedin.com/in/liferay >> >> >> >> Tokenizer Inc. >> >> http://www.tokenizer.ca/ >> >> Data Mining, Vertical Search >> >> >> >> >> >> >> >> >> > >> > -- >> > Grant Ingersoll >> > http://www.lucidimagination.com/ >> > >> > Search the Lucene ecosystem using Solr/Lucene: >> http://www.lucidimagination.com/search >> > >> > >> >> >> >> -- >> Peter M. Wolanin, Ph.D. >> Momentum Specialist, Acquia. Inc. >> peter.wola...@acquia.com > > -- Peter M. Wolanin, Ph.D. Momentum Specialist, Acquia. Inc. peter.wola...@acquia.com
Re: Solr 1.4 - stats page slow
I recently noticed the same sort of thing. The attached screenshot shows the transition on a search server when we updated from a Solr 1.4 dev build (revision 779609 from 2009-05-28) to the Solr 1.4.0 released code. Every 3 hours we have a cron task to log some of the data from the stats.jsp page from each core (about 100 cores, most of which are small indexes). You can see there is a dramatic spiking of the load after the update - I think due to added reporting on that page such as from the lucene field cache. Is this amount of load expected? -Peter On Thu, Dec 24, 2009 at 12:23 PM, Jay Hill wrote: > Also, what is your heap size and the amount of RAM on the machine? > > I've also noticed that, when watching memory usage through JConsole or > YourKit while loading the stats page, the memory usage spikes dramatically - > are you seeing this as well? > > -Jay > > On Thu, Dec 24, 2009 at 9:12 AM, Jay Hill wrote: > >> I've noticed this as well, usually when working with a large field cache. I >> haven't done in-depth analysis of this yet, but it seems like when the stats >> page is trying to pull data from a large field cache it takes quite a long >> time. >> >> Are you doing a lot of sorting? If so, what are the field types of the >> fields you're sorting on? How large is the index both in document count and >> file size? >> >> Another approach to get data from the Solr instance would be to use JMX. >> And I've been working on a request handler (started by Erik Hatcher) that >> will provide the same information as the stats page, but a little more >> efficiently. I may try to put up a patch with this soon. >> >> -Jay >> >> >> >> On Wed, Dec 23, 2009 at 6:43 AM, Stephen Weiss wrote: >> >>> We've been using Solr 1.4 for a few days now and one slight downside we've >>> noticed is the stats page comes up very slowly for some reason - sometimes >>> more than 10 seconds. We call this programmatically to retrieve the last >>> commit date so that we can keep users from committing too frequently. This >>> means some of our administration pages are now taking a long time to load. >>> Is there anything we should be doing to ensure that this page comes up >>> quickly? I see some notes on this back in October but it looks like that >>> update should already be applied by now. Or, better yet, is there now a >>> better way to just retrieve the last commit date from Solr without pulling >>> all of the statistics? >>> >>> Thanks in advance. >>> >>> -- >>> Steve >>> >> >> > -- Peter M. Wolanin, Ph.D. Momentum Specialist, Acquia. Inc. peter.wola...@acquia.com
Re: Solr 1.4 - stats page slow
Ah sorry - didn't realize attachments were stripped. Here's a web version: http://img.skitch.com/20100108-t99a1emmar32w9gkcfcius8afm.png -Peter On Thu, Jan 7, 2010 at 9:53 PM, Otis Gospodnetic wrote: > I'd love to see the screenshot, but it didn't come through - got stripped by > ML manager. Maybe upload it somewhere? > Otis > -- > Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch > > > > ----- Original Message >> From: Peter Wolanin >> To: solr-user@lucene.apache.org >> Sent: Thu, January 7, 2010 9:32:26 PM >> Subject: Re: Solr 1.4 - stats page slow >> >> I recently noticed the same sort of thing. >> >> The attached screenshot shows the transition on a search server >> when we updated from a Solr 1.4 dev build (revision 779609 from >> 2009-05-28) to the Solr 1.4.0 released code. Every 3 hours we have a >> cron task to log some of the data from the stats.jsp page from each >> core (about 100 cores, most of which are small indexes). >> >> You can see there is a dramatic spiking of the load after the update - >> I think due to added reporting on that page such as from the lucene >> field cache. Is this amount of load expected? >> >> -Peter >> >> On Thu, Dec 24, 2009 at 12:23 PM, Jay Hill wrote: >> > Also, what is your heap size and the amount of RAM on the machine? >> > >> > I've also noticed that, when watching memory usage through JConsole or >> > YourKit while loading the stats page, the memory usage spikes dramatically >> > - >> > are you seeing this as well? >> > >> > -Jay >> > >> > On Thu, Dec 24, 2009 at 9:12 AM, Jay Hill wrote: >> > >> >> I've noticed this as well, usually when working with a large field cache. >> >> I >> >> haven't done in-depth analysis of this yet, but it seems like when the >> >> stats >> >> page is trying to pull data from a large field cache it takes quite a long >> >> time. >> >> >> >> Are you doing a lot of sorting? If so, what are the field types of the >> >> fields you're sorting on? How large is the index both in document count >> >> and >> >> file size? >> >> >> >> Another approach to get data from the Solr instance would be to use JMX. >> >> And I've been working on a request handler (started by Erik Hatcher) that >> >> will provide the same information as the stats page, but a little more >> >> efficiently. I may try to put up a patch with this soon. >> >> >> >> -Jay >> >> >> >> >> >> >> >> On Wed, Dec 23, 2009 at 6:43 AM, Stephen Weiss wrote: >> >> >> >>> We've been using Solr 1.4 for a few days now and one slight downside >> >>> we've >> >>> noticed is the stats page comes up very slowly for some reason - >> >>> sometimes >> >>> more than 10 seconds. We call this programmatically to retrieve the last >> >>> commit date so that we can keep users from committing too frequently. >> >>> This >> >>> means some of our administration pages are now taking a long time to >> >>> load. >> >>> Is there anything we should be doing to ensure that this page comes up >> >>> quickly? I see some notes on this back in October but it looks like that >> >>> update should already be applied by now. Or, better yet, is there now a >> >>> better way to just retrieve the last commit date from Solr without >> >>> pulling >> >>> all of the statistics? >> >>> >> >>> Thanks in advance. >> >>> >> >>> -- >> >>> Steve >> >>> >> >> >> >> >> > >> >> >> >> -- >> Peter M. Wolanin, Ph.D. >> Momentum Specialist, Acquia. Inc. >> peter.wola...@acquia.com > > -- Peter M. Wolanin, Ph.D. Momentum Specialist, Acquia. Inc. peter.wola...@acquia.com
Re: Basic questions about Solr cost in programming time
Having worked quite a bit on the Drupal integration - here's my quick take: If you have someone help you the first time, you can have a basic implementation running in Jetty in about 15 minutes. On your own, a couple of hours maybe. For a non-public site (intranet) with modest traffic and no requirements for high availability, that is likely going to hold you for a while. If you are not already using tomcat6 and want a more robust deployment, getting that right will take you a couple days' work, I'd guess. There are already some options for indexing/searching documents via the Drupal integration, but that's still a little rough. Of course, we'd also be happy to have you get Drupal support and a hosted Solr index from us at Acquia. http://acquia.com/products-services/acquia-search-features However, I don't think you'll readily be able to use our service with Jive at the moment - you don't really describe why you'd be using both Jive and Drupal. If you are not doing any customization and compiling the Java isn't something you enjoy, I'd think the certified distribution is a fine place to start, and with it you can get Lucid's free PDF book, which is, I think, by far the best and most comprehensive Solr 1.4 reference work that exists at the moment. -Peter On Tue, Jan 26, 2010 at 3:00 PM, Jeff Crump wrote: > Hi, > I hope this message is OK for this list. > > I'm looking into search solutions for an intranet site built with Drupal. > Eventually we'd like to scale to enterprise search, which would include the > Drupal site, a document repository, and Jive SBS (collaboration software). > I'm interested in Lucene/Solr because of its scalability, faceted search and > optimization features, and because it is free. Our problem is that we are a > non-profit organization with only three very busy programmers/sys admins > supporting our employees around the world. > > To help me argue for Solr in terms of total cost, I'm hoping that members of > this list can share their insights about the following: > > * About how many hours of programming did it take you to set up your > instance of Lucene/Solr (not counting time spent on optimization)? > > * Are there any disadvantages of going with a certified distribution rather > than the standard distribution? > > > Thanks and best regards, > Jeff > > Jeff Crump > jcr...@hq.mercycorps.org > -- Peter M. Wolanin, Ph.D. Momentum Specialist, Acquia. Inc. peter.wola...@acquia.com
Re: Solr 1.4 - stats page slow
Sorry for not following up sooner - it's been a busy last couple of weeks. We do see a significant insanity count - could this be due to updating indexes from the dev Solr build? E.g. on one server I see 61, and entries like: SUBREADER: Found caches for decendents of org.apache.lucene.index.readonlydirectoryrea...@2b8d6cbf+created 'org.apache.lucene.index.readonlydirectoryrea...@2b8d6cbf'=>'created',class org.apache.lucene.search.FieldCache$StringIndex,null=>org.apache.lucene.search.FieldCache$StringIndex#2002656056 (size =~ 74.4 KB) 'org.apache.lucene.store.niofsdirectory$niofsindexin...@47adeb94'=>'created',class org.apache.lucene.search.FieldCache$StringIndex,null=>org.apache.lucene.search.FieldCache$StringIndex#1099177573 (size =~ 74.4 KB) SUBREADER: Found caches for decendents of org.apache.lucene.index.readonlydirectoryrea...@d0340a9+created 'org.apache.lucene.index.readonlydirectoryrea...@d0340a9'=>'created',class org.apache.lucene.search.FieldCache$StringIndex,null=>org.apache.lucene.search.FieldCache$StringIndex#868132357 (size =~ 831.2 KB) 'org.apache.lucene.store.niofsdirectory$niofsindexin...@78802615'=>'created',class org.apache.lucene.search.FieldCache$StringIndex,null=>org.apache.lucene.search.FieldCache$StringIndex#1542727931 (size =~ 831.2 KB) And I think it's higher on the one associated with the screenshot. Using the Lucene CheckIndex tool does not show any errors. Most of what we want is returned by the Luke handler, except for the pending adds and deletes and the index size. I can hack around this by creating a greatly reduced stats.jsp, but I'd also like to understand what we are experiencing. -Peter On Fri, Jan 8, 2010 at 1:38 PM, Mark Miller wrote: > Yonik Seeley wrote: >> On Fri, Jan 8, 2010 at 1:03 PM, Mark Miller wrote: >> >>> It should be fixed in trunk, but that was after 1.4. Currently, it >>> should only do it if it sees insanity - which there shouldn't be any >>> with stock Solr. >>> >> >> http://svn.apache.org/viewvc/lucene/solr/tags/release-1.4.0/src/java/org/apache/solr/search/SolrFieldCacheMBean.java >> http://svn.apache.org/viewvc?view=revision&revision=826788 >> Seems like it's there? Or was it a different commit? >> >> Perhaps there is just real insanity... which may be unavoidable at >> this point since not everything in solr is done per-segment yet. >> >> -Yonik >> http://www.lucidimagination.com >> > > Your right - when looking at the Solr release date, I quickly took the > 10 as October - but it was 11/10, so it is in 1.4. > > So people seeing this should also being seeing an insanity count over one. > > I'd think that would be rarer than one this sounds like though ... whats > left that could cause insanity? > > We should prob switch to never calculating the size unless an explicit > param is pass to the stats page. > > > -- > - Mark > > http://www.lucidimagination.com > > > > -- Peter M. Wolanin, Ph.D. Momentum Specialist, Acquia. Inc. peter.wola...@acquia.com
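As an aside, most of the per-core numbers can be had from the Luke request handler instead, e.g. http://localhost:8983/solr/core0/admin/luke?numTerms=0&wt=json (the core name is illustrative). With numTerms=0 the request is cheap and does not walk the FieldCache the way stats.jsp does, though as noted it won't report pending adds/deletes or the on-disk index size.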
Re: schema.xml and Xinclude
It doesn't really work with the schema.xml - I beat my head on it for a few hours not long ago - maybe I sent an e-mail to this list about it? Yes, here: http://www.lucidimagination.com/search/document/ba68aa6f2f7702c3/is_it_possible_to_use_xinclude_in_schema_xml -Peter On Wed, Jan 6, 2010 at 8:36 AM, Patrick Sauts wrote: > As the ... sections in schema.xml are the same between all our indexes, I'd like to > make them an XInclude, so I tried: > > <schema ... xmlns:xi="http://www.w3.org/2001/XInclude"> > > <xi:include ... /> > ... > > My syntax might not be correct? > Or is it not possible yet? > > Thank you again for your time. > > Patrick. -- Peter M. Wolanin, Ph.D. Momentum Specialist, Acquia. Inc. peter.wola...@acquia.com
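For contrast, XInclude is supported in solrconfig.xml in 1.4, with syntax along the lines of <xi:include href="shared-handlers.xml" xmlns:xi="http://www.w3.org/2001/XInclude"/> (the included file name here is invented) - the trouble discussed above is specific to schema.xml.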
Re: Solr 1.4 - stats page slow
Yes, we do have some fields (like the creation date) that we use for both sorting and faceting. -Peter On Tue, Jan 26, 2010 at 8:55 PM, Yonik Seeley wrote: > On Tue, Jan 26, 2010 at 8:49 PM, Peter Wolanin > wrote: >> Sorry for not following up sooner- been a busy last couple weeks. >> >> We do see a significant instanity count - could this be due to >> updating indexes from the dev Solr build? E.g. on one server I see > > Do you both sort (or use a function query) and facet on the "created" field? > Faceting on single-valued fields is still currently done at the > top-level reader, while sorting and function queries are at a segment > level. > > -Yonik > http://www.lucidimagination.com > -- Peter M. Wolanin, Ph.D. Momentum Specialist, Acquia. Inc. peter.wola...@acquia.com
Re: Solr/Drupal Integration - Query Question
Can you tell me more about the rord() performance issues? I'm one of the maintainers of the Drupal module, so I'd like to switch if there is a better option. Thanks, Peter On Wed, Feb 10, 2010 at 12:00 AM, Lance Norskog wrote: > The admin/form.jsp is supposed to prepopulate fl= with '*,score' which > means bring back all fields and the calculated relevance score. > > This is the Drupal search, decoded. I changed the %2B to + signs for > readability. Have a look at the filter query fq= and the facet date > range. > > Also, in Solr 1.4 the 'rord' function has become very slow. So the > Drupal integration needs some updating anyway. > > INFO: [] webapp=/solr path=/select > params={spellcheck=true& > spellcheck.q=video& > fl=id,nid,title,comment_count,type,created,changed,score,path,url,uid,name,ss_image_relative& > > bf=recip(rord(created),4,19,19)^200.0& > > &hl.simple.post=& > hl.simple.pre=&hl=&version=1.2& > hl.fragsize=& > hl.fl=& > hl.snippets=& > > facet=true&facet.limit=20& > facet.field=uid&facet.field=type&facet.field=language& > facet.mincount=1& > > fq=(nodeaccess_all:0+OR+hash:c13a544eb3ac)& > qf=name^3.0&facet.date=changed& > json.nl=map&wt=json& > > f.changed.facet.date.start=2010-02-09T07:01:14Z/HOUR& > f.changed.facet.date.end=2010-02-09T17:44:16Z+1HOUR/HOUR& > f.changed.facet.date.gap=+1HOUR > > rows=10&start=0&facet.sort=true& > q=video} > hits=0 status=0 QTime=0 > > On Tue, Feb 9, 2010 at 1:28 PM, jaybytez wrote: >> >> I know this is not Drupal, but thought this question maybe more around the >> Solr query. >> >> For instance, I pulled down LucidImaginations Solr install, just like the >> apache solr install and ran the example solr and loaded the documents from >> the exampledocs. >> >> I can go to: >> >> http://localhost:8983/solr/admin/ >> >> And search for video and get responses >> >> But on my solr if I go to the full interface and use the defaults, I get no >> results back because of search fields, etc. >> >> http://localhost:8983/solr/admin/form.jsp >> >> So my admin Solr search query looks like this when searching "video": >> >> Feb 9, 2010 1:25:49 PM org.apache.solr.core.SolrCore execute >> INFO: [] webapp=/solr path=/select >> params={explainOther=&fl=&indent=on&start=0&q=video&hl.fl=&qt=&wt=&fq=&version=2.2&rows=10} >> hits=2 status=0 QTime=0 >> >> But if I go into Drupal and search "video", this is the query and no results >> come back: >> >> Feb 9, 2010 1:27:33 PM org.apache.solr.core.SolrCore execute >> INFO: [] webapp=/solr path=/select >> params={spellcheck=true&f.changed.facet.date.start=2010-02-09T07:01:14Z/HOUR&facet=true&facet.limit=20&spellcheck.q=video&hl.simple.pre=&hl=&version=1.2&fl=id,nid,title,comment_count,type,created,changed,score,path,url,uid,name,ss_image_relative&bf=recip(rord(created),4,19,19)^200.0&f.changed.facet.date.gap=%2B1HOUR&hl.simple.post=&facet.field=uid&facet.field=type&facet.field=language&fq=(nodeaccess_all:0+OR+hash:c13a544eb3ac)&hl.fragsize=&facet.mincount=1&qf=name^3.0&facet.date=changed&hl.fl=&json.nl=map&wt=json&f.changed.facet.date.end=2010-02-09T17:44:16Z%2B1HOUR/HOUR&rows=10&hl.snippets=&start=0&facet.sort=true&q=video} >> hits=0 status=0 QTime=0 >> >> Any thoughts on the search query that gets generated by the Drupal/Solr >> module? >> >> Thanks...jay >> -- >> View this message in context: >> http://old.nabble.com/Solr-Drupal-Integration---Query-Question-tp27522362p27522362.html >> Sent from the Solr - User mailing list archive at Nabble.com. >> >> > > > > -- > Lance Norskog > goks...@gmail.com > -- Peter M. 
Wolanin, Ph.D. Momentum Specialist, Acquia. Inc. peter.wola...@acquia.com
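For the archives: the usual 1.4-era replacement for a recip(rord(created),...) boost is the millisecond-based form recip(ms(NOW,created),3.16e-11,1,1) - ms() avoids the ord/rord FieldCache cost, and the 3.16e-11 constant scales milliseconds to roughly one unit per year. This assumes "created" is a trie-based date field, which ms() requires.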
Re: Solr/Drupal Integration - Query Question
The Drupal schema and solrconfig and the example schema and solrconfig have different fields and defaults, and likely Drupal won't find the fields its looking for and might not be even using the right query perser. -Peter On Thu, Feb 11, 2010 at 3:19 PM, jaybytez wrote: > > So I got it to work by running the drupal cron.php. > > I was originally trying to use the exampledocs, indexing that content, and > making that index available to the Drupal solr. > > But it might just be that they are different indexes? And that's why I > wasn't get responses. > > One quick question, the Drupal/Solr Facets are awesome, the only thing is > the URLs are escaped and seem to cause problems when I click the link. Is > this most likely an encoding issue or something in Solr that is causing > these links to be created poorly? > > For instance: > > http://localhost:8080/search/apachesolr_search/drupal?filters=tid%3A1%20tid%3A3%20%28nodeaccess_all%3A0%20OR%20hash%3Ac13a544eb3ac%29 > > This returns no results and produces the following error in Solr (is this > error related to http://issues.apache.org/jira/browse/SOLR-1231): > > Feb 11, 2010 12:18:58 PM org.apache.solr.common.SolrException log > SEVERE: org.apache.solr.common.SolrException: > org.apache.lucene.queryParser.ParseException: Cannot parse > 'hash:c13a544eb3ac)': Encountered " ")" ") "" at line 1, column 17. > Was expecting one of: > > ... > ... > ... > "+" ... > "-" ... > "(" ... > "*" ... > "^" ... > ... > ... > ... > ... > ... > "[" ... > "{" ... > ... > > at > org.apache.solr.handler.component.QueryComponent.prepare(QueryComponent.java:108) > at > org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:174) > at > org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131) > at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316) > at > org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338) > at > org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241) > at > org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1089) > at > org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:365) > at > org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216) > at > org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181) > at > org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:712) > at > org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:405) > at > org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:211) > at > org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114) > at > org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:139) > at org.mortbay.jetty.Server.handle(Server.java:285) > at > org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:502) > at > org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:821) > at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:513) > at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:208) > at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:378) > at > org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:226) > at > org.mortbay.thread.BoundedThreadPool$PoolThread.run(BoundedThreadPool.java:442) > Caused by: org.apache.lucene.queryParser.ParseException: Cannot parse > 'hash:c13a544eb3ac)': Encountered " ")" ") "" at line 1, column 17. 
> Was expecting one of: > > ... > ... > ... > "+" ... > "-" ... > "(" ... > "*" ... > "^" ... > ... > ... > ... > ... > ... > "[" ... > "{" ... > ... > > at > org.apache.lucene.queryParser.QueryParser.parse(QueryParser.java:205) > at > org.apache.solr.search.LuceneQParser.parse(LuceneQParserPlugin.java:78) > at org.apache.solr.search.QParser.getQuery(QParser.java:131) > at > org.apache.solr.handler.component.QueryComponent.prepare(QueryComponent.java:103) > ... 22 more > Caused by: org.apache.lucene.queryParser.ParseException: Encountered " ")" > ") "" at line 1, column 17. > Was expecting one of: > > ... > ... > ... > "+" ... > "-" ... > "(" ... > "*" ... > "^" ... > ... > ... > ... > ... > ... > "[" ... > "{" ... > ... > > at > org.apache.lucene.queryParser.QueryParser.generateParseException(QueryParser.java:1846) > at > org.apache.lucene.queryParser.QueryParser.jj_consume_token(QueryParser.java:1728) > at > org.apache.lucene.queryParser.QueryParser.TopLevelQuery(QueryParser.java:1255) > at > org.apache.lucene.queryParser.
Solr 1.4 bug? search fails but analyzer indicates a match
Ran into an odd situation today searching for a string like a domain name containing a '.', the Solr 1.4 analyzer tells me that I will get a match, but when I enter the search either in the client or directly in Solr, the search fails. Our default handler is dismax, but this also fails with the standard handler. So I'm wondering if this is a known issue, or am I missing something subtle in the analysis chain? Solr is 1.4.0 that I built. test string: Identi.ca queries that fail: IdentiCa, Identi.ca, Identi-ca query that matches: Identi ca I would expect all the queries that fail to match. Looking at the schema browser, the index contains the expected terms: identica, identi, ca schema in use is: http://drupalcode.org/viewvc/drupal/contributions/modules/apachesolr/schema.xml?revision=1.1.2.1.2.34&content-type=text%2Fplain&view=co&pathrev=DRUPAL-6--1 Screen shots: analysis: http://img.skitch.com/20100327-nt1uc1ctykgny28n8bgu99h923.png dismax search: http://img.skitch.com/20100327-byiduuiry78caka7q5smsw7fp.png dismax search: http://img.skitch.com/20100327-gckm8uhjx3t7px31ygfqc2ugdq.png standard search: http://img.skitch.com/20100327-usqyqju1d12ymcpb2cfbtdwyh.png -- Peter M. Wolanin, Ph.D. Momentum Specialist, Acquia. Inc. peter.wola...@acquia.com
Re: Solr 1.4 bug? search fails but analyzer indicates a match
Hi Mitch, I am also seeing this locally with the exact same solr.war, solrconfig.xml, and schema.xml running under Jetty, as well as on 2 different production servers with the same content indexed. So this is really weird - this seems to be influenced by the surrounding text: "would be great to have support for Identi.ca on the follow block" fails to match "Identi.ca", but putting the content on its own or in another sentence: "Support Identi.ca" the search matches. More testing suggests the word "for" is the problem. I don't see an exception or error. Could be a problem with how stopwords are removed? -Peter On Sat, Mar 27, 2010 at 1:19 PM, MitchK wrote: > > Hi Peter, > > have you tried to reindex your data and did you do a commit? > If you changed anything, have you restarted your Solr-server? > > I can't understand why this problem occurs, since the example seem to work > at analysis.jsp. > > Kind regards > - Mitch > -- > View this message in context: > http://n3.nabble.com/Solr-1-4-bug-search-fails-but-analyzer-indicates-a-match-tp680066p680313.html > Sent from the Solr - User mailing list archive at Nabble.com. > -- Peter M. Wolanin, Ph.D. Momentum Specialist, Acquia. Inc. peter.wola...@acquia.com
Re: Solr 1.4 bug? search fails but analyzer indicates a match
If I empty the stopword file and re-index, all expected matches happen. So maybe that provides a further suggestion of where the problem is. This certainly feels like a Solr bug (or lucene bug?). -Peter On Sat, Mar 27, 2010 at 3:05 PM, Peter Wolanin wrote: > Hi Mitch, > > I am also seeing this locally with the exact same solr.war, > solrconfig.xml, and schema.xml running under Jetty, as well as on 2 > different production servers with the same content indexed. > > So this is really weird - this seems to be influenced by the surrounding text: > > "would be great to have support for Identi.ca on the follow block" > > fails to match "Identi.ca", but putting the content on its own or in > another sentence: > > "Support Identi.ca" > > the search matches. More testing suggests the word "for" is the > problem. I don't see an exception or error. Could be a problem with > how stopwords are removed? > > -Peter > > > On Sat, Mar 27, 2010 at 1:19 PM, MitchK wrote: >> >> Hi Peter, >> >> have you tried to reindex your data and did you do a commit? >> If you changed anything, have you restarted your Solr-server? >> >> I can't understand why this problem occurs, since the example seem to work >> at analysis.jsp. >> >> Kind regards >> - Mitch >> -- >> View this message in context: >> http://n3.nabble.com/Solr-1-4-bug-search-fails-but-analyzer-indicates-a-match-tp680066p680313.html >> Sent from the Solr - User mailing list archive at Nabble.com. >> > > > > -- > Peter M. Wolanin, Ph.D. > Momentum Specialist, Acquia. Inc. > peter.wola...@acquia.com > -- Peter M. Wolanin, Ph.D. Momentum Specialist, Acquia. Inc. peter.wola...@acquia.com
Re: Solr 1.4 bug? search fails but analyzer indicates a match
The output on the analysis screen does look correct. Here are 2 screen shots: empty stopwords: http://img.skitch.com/20100327-rcsjdih4bn3y8ahajqa5wjwybd.png standard stopwords: http://img.skitch.com/20100327-1w5ct1wr25jkir4sji8kumefn1.png -Peter On Sat, Mar 27, 2010 at 4:13 PM, MitchK wrote: > > Peter, > > if you are right, please outcomment the stopword filter to make clear, that > the problem is really a problem of how the stopword filter deletes > stopwords. > > Is the output correct, if you enter "would be great to have support for > Identi.ca on the follow block" in the query-label at the analysis.jsp? Can > you make a screenshot for this sentence? > > - Mitch > -- > View this message in context: > http://n3.nabble.com/Solr-1-4-bug-search-fails-but-analyzer-indicates-a-match-tp680066p680530.html > Sent from the Solr - User mailing list archive at Nabble.com. > -- Peter M. Wolanin, Ph.D. Momentum Specialist, Acquia. Inc. peter.wola...@acquia.com
Re: Solr 1.4 bug? search fails but analyzer indicates a match
The stopwords stanza looks like: <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/> which is the same as the example schema http://svn.apache.org/repos/asf/lucene/solr/branches/branch-1.4/example/solr/conf/schema.xml Changing this to enablePositionIncrements="false" seems to make the searching work as expected. Is it incorrect to have that directive here, or is this a bug? -Peter On Sat, Mar 27, 2010 at 4:25 PM, Peter Wolanin wrote: > The output on the analysis screen does look correct. Here are 2 screen shots: > > empty stopwords: http://img.skitch.com/20100327-rcsjdih4bn3y8ahajqa5wjwybd.png > > standard stopwords: > http://img.skitch.com/20100327-1w5ct1wr25jkir4sji8kumefn1.png > > -Peter > > On Sat, Mar 27, 2010 at 4:13 PM, MitchK wrote: >> >> Peter, >> >> if you are right, please outcomment the stopword filter to make clear, that >> the problem is really a problem of how the stopword filter deletes >> stopwords. >> >> Is the output correct, if you enter "would be great to have support for >> Identi.ca on the follow block" in the query-label at the analysis.jsp? Can >> you make a screenshot for this sentence? >> >> - Mitch >> -- >> View this message in context: >> http://n3.nabble.com/Solr-1-4-bug-search-fails-but-analyzer-indicates-a-match-tp680066p680530.html >> Sent from the Solr - User mailing list archive at Nabble.com. >> -- Peter M. Wolanin, Ph.D. Momentum Specialist, Acquia. Inc. peter.wola...@acquia.com
Re: Solr 1.4 bug? search fails but analyzer indicates a match
Discussing this with Mark Miller in IRC - we are homing in on the problem. Looks as though Identi.ca is treated as a phrase query, as if I had quoted it like "Identi ca". That phrase search also fails. I had expected that Identi.ca would be the same as Identi ca (i.e. 2 separate tokens, not a phrase). -Peter On Sat, Mar 27, 2010 at 4:32 PM, Peter Wolanin wrote: > The stopwords stanza looks like: > > <filter class="solr.StopFilterFactory" > ignoreCase="true" > words="stopwords.txt" > enablePositionIncrements="true" > /> > > Which is the same as the example schema > http://svn.apache.org/repos/asf/lucene/solr/branches/branch-1.4/example/solr/conf/schema.xml > > changing this to enablePositionIncrements="false" seems to make the > searching work as expected. Is it incorrect to have that directive > here, or is this a bug? > > -Peter > > > On Sat, Mar 27, 2010 at 4:25 PM, Peter Wolanin > wrote: >> The output on the analysis screen does look correct. Here are 2 screen shots: >> >> empty stopwords: >> http://img.skitch.com/20100327-rcsjdih4bn3y8ahajqa5wjwybd.png >> >> standard stopwords: >> http://img.skitch.com/20100327-1w5ct1wr25jkir4sji8kumefn1.png >> >> -Peter >> >> On Sat, Mar 27, 2010 at 4:13 PM, MitchK wrote: >>> >>> Peter, >>> >>> if you are right, please outcomment the stopword filter to make clear, that >>> the problem is really a problem of how the stopword filter deletes >>> stopwords. >>> >>> Is the output correct, if you enter "would be great to have support for >>> Identi.ca on the follow block" in the query-label at the analysis.jsp? Can >>> you make a screenshot for this sentence? >>> >>> - Mitch >>> -- >>> View this message in context: >>> http://n3.nabble.com/Solr-1-4-bug-search-fails-but-analyzer-indicates-a-match-tp680066p680530.html >>> Sent from the Solr - User mailing list archive at Nabble.com. >>> >> >> >> >> -- >> Peter M. Wolanin, Ph.D. >> Momentum Specialist, Acquia. Inc. >> peter.wola...@acquia.com >> > > > > -- > Peter M. Wolanin, Ph.D. > Momentum Specialist, Acquia. Inc. > peter.wola...@acquia.com > -- Peter M. Wolanin, Ph.D. Momentum Specialist, Acquia. Inc. peter.wola...@acquia.com
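Roughly what seems to be happening (illustrative, not a verified trace): with enablePositionIncrements="true", removing "for" leaves a position hole, so "Identi.ca" reaches the WordDelimiterFilter carrying a position increment of 2, and the 1.4 filter mishandles that increment when assigning positions to the split subwords - identi and ca can end up non-adjacent in the index, so the phrase query "identi ca", which requires adjacency, never matches.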
Re: Solr 1.4 bug? search fails but analyzer indicates a match
Created a new issue: https://issues.apache.org/jira/browse/SOLR-1852

Further discussion there.

-Peter

On Sat, Mar 27, 2010 at 5:51 PM, Peter Wolanin wrote:
> Discussing this with Mark Miller in IRC - we are homing in on the problem.
>
> Looks as though Identi.ca is treated as a phrase query, as if I had
> quoted it like "Identi ca". That phrase search also fails. I had
> expected that Identi.ca would be treated the same as Identi ca (i.e. 2
> separate tokens, not a phrase).
>
> -Peter
>
> On Sat, Mar 27, 2010 at 4:32 PM, Peter Wolanin wrote:
>> The stopwords stanza looks like:
>>
>> <filter class="solr.StopFilterFactory"
>>         ignoreCase="true"
>>         words="stopwords.txt"
>>         enablePositionIncrements="true"
>>         />
>>
>> Which is the same as the example schema:
>> http://svn.apache.org/repos/asf/lucene/solr/branches/branch-1.4/example/solr/conf/schema.xml
>>
>> Changing this to enablePositionIncrements="false" seems to make the
>> searching work as expected. Is it incorrect to have that directive
>> here, or is this a bug?
>>
>> -Peter
>>
>> On Sat, Mar 27, 2010 at 4:25 PM, Peter Wolanin wrote:
>>> The output on the analysis screen does look correct. Here are 2 screen
>>> shots:
>>>
>>> empty stopwords:
>>> http://img.skitch.com/20100327-rcsjdih4bn3y8ahajqa5wjwybd.png
>>>
>>> standard stopwords:
>>> http://img.skitch.com/20100327-1w5ct1wr25jkir4sji8kumefn1.png
>>>
>>> -Peter
>>>
>>> On Sat, Mar 27, 2010 at 4:13 PM, MitchK wrote:
>>>>
>>>> Peter,
>>>>
>>>> if you are right, please comment out the stopword filter to make clear
>>>> that the problem really is a problem of how the stopword filter deletes
>>>> stopwords.
>>>>
>>>> Is the output correct if you enter "would be great to have support for
>>>> Identi.ca on the follow block" in the query field on analysis.jsp? Can
>>>> you make a screenshot for this sentence?
>>>>
>>>> - Mitch
>>>> --
>>>> View this message in context:
>>>> http://n3.nabble.com/Solr-1-4-bug-search-fails-but-analyzer-indicates-a-match-tp680066p680530.html
>>>> Sent from the Solr - User mailing list archive at Nabble.com.
>>>>
>>>
>>> --
>>> Peter M. Wolanin, Ph.D.
>>> Momentum Specialist, Acquia. Inc.
>>> peter.wola...@acquia.com
>>>
>>
>> --
>> Peter M. Wolanin, Ph.D.
>> Momentum Specialist, Acquia. Inc.
>> peter.wola...@acquia.com
>>
>
> --
> Peter M. Wolanin, Ph.D.
> Momentum Specialist, Acquia. Inc.
> peter.wola...@acquia.com
>

--
Peter M. Wolanin, Ph.D.
Momentum Specialist, Acquia. Inc.
peter.wola...@acquia.com
Re: Solr 1.4 bug? search fails but analyzer indicates a match
I think it is clearly a bug - see the comments on the issue by Robert Muir:
https://issues.apache.org/jira/browse/SOLR-1852

The patch is Mark Miller's backport of Robert's fixes for other
WordDelimiterFilter problems in Solr trunk. Those fixes also fix this
bug as a side effect.

-Peter

On Sun, Mar 28, 2010 at 4:09 AM, MitchK wrote:
>
> Peter,
>
> following your discussion, I was a bit confused: is this still a bug, or is
> the behaviour correct (since enablePositionIncrements is set to true), and
> what changes does the patch make?
>
> Does the patch fit all your needs (matches on "identi ca", "identica",
> "identi-ca", "identi.ca")?
>
> - Mitch
> --
> View this message in context:
> http://n3.nabble.com/Solr-1-4-bug-search-fails-but-analyzer-indicates-a-match-tp680066p681185.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>

--
Peter M. Wolanin, Ph.D.
Momentum Specialist, Acquia. Inc.
peter.wola...@acquia.com
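On Mitch's question about which variants match: the WordDelimiterFilterFactory options below are the usual knobs for this. This is an illustrative sketch of standard options, not the exact configuration touched by the patch:

<!-- Illustrative settings, not the patched config itself:
       generateWordParts="1"  splits on the dot: identi.ca -> identi, ca
       catenateWords="1"      also emits the joined token: identica
       preserveOriginal="1"   also keeps identi.ca exactly as typed -->
<filter class="solr.WordDelimiterFilterFactory"
        generateWordParts="1"
        catenateWords="1"
        preserveOriginal="1"/>

With all three enabled at index time, queries for identi.ca, identi-ca, identi ca, and identica should all be able to match, provided the position handling fixed by the patch is in place.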
Re: Evangelism
A very abbreviated list of sites using Apache Solr + Drupal is here:
http://drupal.org/node/447564

-Peter

On Thu, Apr 29, 2010 at 2:10 PM, Daniel Baughman wrote:
> Hi, I'm new to the list here.
>
> I'd like to steer someone in the direction of Solr, and I see the list of
> companies using Solr, but none have a "powered by Solr" logo or anything.
>
> Does anyone have any great links with evidence of majorly successful Solr
> projects?
>
> Thanks in advance,
>
> Dan B.
>

--
Peter M. Wolanin, Ph.D.
Momentum Specialist, Acquia. Inc.
peter.wola...@acquia.com