Re: Free Webinar - Apache Lucene 2.9: Technical Overview of New Features

2009-09-19 Thread Tricia Williams
Is there a way to sign up without handing over so much personal 
information to www.eventsvc.com?


Thanks,
Tricia


Erik Hatcher wrote:

Free Webinar: Apache Lucene 2.9: Discover the Powerful New Features
---

Join us for a free and in-depth technical webinar with Grant 
Ingersoll, co-founder of Lucid Imagination and chair of the Apache 
Lucene PMC.


Thursday, September 24th 2009
11:00AM - 12 NOON PDT / 2:00 - 3:00PM EDT
Click on the link below to sign up
http://www.eventsvc.com/lucidimagination/092409?trk=WR-SEP2009B-AP

Lucene 2.9 offers a rich set of new features and performance 
improvements alongside plentiful fixes and optimizations. If you are a 
Java developer building search applications with the Lucene search 
library, this webinar provides the insights you need to harness this 
important update to Apache Lucene.


Grant will present and discuss key technical features and innovations 
including:

o Real time/Per segment searching and caching
o Built in numeric range support with trie structure for speed and 
simplified programming

o Reduced search latency and improved index efficiency

Join us for a free webinar.
Thursday, September 24th 2009
11:00 AM - NOON PDT / 2:00 - 3:00 PM EDT
http://www.eventsvc.com/lucidimagination/092409?trk=WR-SEP2009B-AP





Why isn't the DateField implementation of ISO 8601 broader?

2009-10-01 Thread Tricia Williams

Hi All,

   I'm working with data that has multiple date precisions, most of 
which do not have a time associated with them: centuries (like the 
1800's), years (like 1867), and years/months (like 1918-11).  I'm 
able to sort and search using a workaround where we store the date as a 
string CCYYMM where YYMM are optional.


   I was hoping to be able to tie this into the DateField type so that 
it becomes possible to facet on them without much work and duplication 
of data.  Unfortunately it requires the "canonical representation of 
dateTime" which means the time part of the string is mandatory.


   My question is why isn't the DateField implementation of ISO 8601 
broader so that it could include YYYY and YYYY-MM as acceptable date 
strings?  What would it take to do so?  Are there any work-arounds for 
faceting by century, year, month without creating new fields in my 
schema?  The last resort would be to create these new fields but I'm 
hoping to leverage the power of the DateField and the trie to replace 
range stuff.


Thanks,
Tricia

Some interesting observations from tinkering with the DateFieldTest:

   * 2003-03-00T00:00:00Z becomes 2003-02-28T00:00:00Z
   * 2008-03-00T00:00:00Z becomes 2008-02-29T00:00:00Z
   * 2003-00-00T00:00:00Z becomes 2002-11-30T00:00:00Z
   * 2000-00-00T00:00:00Z becomes 1999-11-30T00:00:00Z
   * 1979-00-31T00:00:00Z becomes 1978-12-31T00:00:00Z
   * 2005-04-00T00:00:00Z becomes 2005-03-31T00:00:00Z
   * 1850-10-00T00:00:00Z becomes 1850-09-30T00:00:00Z

The rounding /YEAR, /MONTH, etc. artificially imposes extra precision 
that the original data wouldn't have.  In any case where the month is 
zero, weird rounding happens.
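
(Not Solr code, but a quick sketch of the lenient java.text date parsing 
that produces exactly these rollovers -- I assume something equivalent is 
happening inside DateField's parser:)

   import java.text.SimpleDateFormat;
   import java.util.TimeZone;

   public class LenientDateDemo {
       public static void main(String[] args) throws Exception {
           // SimpleDateFormat is lenient by default, so day 00 means "the day
           // before the 1st" and month 00 means "the month before January"
           SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss'Z'");
           fmt.setTimeZone(TimeZone.getTimeZone("UTC"));
           // prints 2003-02-28T00:00:00Z -- the zero day rolled back into February
           System.out.println(fmt.format(fmt.parse("2003-03-00T00:00:00Z")));
           // prints 2002-11-30T00:00:00Z -- zero month and zero day both rolled back
           System.out.println(fmt.format(fmt.parse("2003-00-00T00:00:00Z")));
       }
   }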


ExtractingRequestHandler unknown field 'stream_source_info'

2009-10-01 Thread Tricia Williams

Hi All,

   I'm trying Solr Cell outside of the example and running into trouble 
because I can't refer to 
http://wiki.apache.org/solr/ExtractingRequestHandler (the wiki's down).  
After realizing I needed to copy all the jars from /example/solr/lib to 
my index's /lib dir, I am now hitting this particular wall:


INFO: [] webapp=/solr path=/update/extract 
params={myfile=MHGL016341T.pdf&commit=true&literal.id=MHGL.1634} 
status=0 QTime=5967
1-Oct-2009 10:06:34 AM 
org.apache.solr.update.processor.LogUpdateProcessor finish

INFO: {} 0 260248
1-Oct-2009 10:06:38 AM org.apache.solr.common.SolrException log
SEVERE: org.apache.solr.common.SolrException: ERROR:unknown field 
'stream_source_info'
at 
org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:289)
at 
org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:60)
at 
org.apache.solr.handler.extraction.ExtractingDocumentLoader.doAdd(ExtractingDocumentLoader.java:118)
at 
org.apache.solr.handler.extraction.ExtractingDocumentLoader.addDoc(ExtractingDocumentLoader.java:123)
at 
org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:192)
at 
org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54)
at 
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)

at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
at 
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
at 
org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1089)
at 
org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:365)
at 
org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
at 
org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
at 
org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:712)
at 
org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:405)
at 
org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:211)
at 
org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
at 
org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:139)

at org.mortbay.jetty.Server.handle(Server.java:285)
at 
org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:502)
at 
org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:835)

at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:641)
at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:202)
at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:378)
at 
org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:226)
at 
org.mortbay.thread.BoundedThreadPool$PoolThread.run(BoundedThreadPool.java:442)

while running:
curl 
"http://localhost:8983/solr/update/extract?literal.id=MHGL.1634&commit=true" 
-F "myfile=@MHGL016341T.pdf"


It feels like I'm not mapping something correctly either in my POST 
request or in solrconfig.xml/schema.xml.  I can see that 
STREAM_SOURCE_INFO is supposed to be an internal field from the code but 
I'm not following why it would cause this error.


Any suggestions would be appreciated.

Many Thanks,
Tricia


Re: ExtractingRequestHandler unknown field 'stream_source_info'

2009-10-01 Thread Tricia Williams

If the wiki isn't working
https://www.packtpub.com/article/indexing-data-solr-1.4-enterprise-search-server-2 


gave me more information.  The LucidImagination article helps too.

Now that the wiki is up again it is more obvious that I need to add a 
couple of configuration entries (the values involved being "fulltext" 
and "text") to my solrconfig.xml.
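
Roughly, the kind of thing that goes in the /update/extract handler's 
defaults -- the exact field names here ("fulltext", the "attr_" prefix) 
are illustrative rather than a copy of my config:

   <requestHandler name="/update/extract"
                   class="org.apache.solr.handler.extraction.ExtractingRequestHandler">
     <lst name="defaults">
       <!-- send Tika's extracted body to an existing schema field -->
       <str name="fmap.content">fulltext</str>
       <!-- prefix metadata fields (e.g. stream_source_info) that aren't in the schema;
            needs a matching dynamicField such as attr_* in schema.xml -->
       <str name="uprefix">attr_</str>
     </lst>
   </requestHandler>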

Tricia


Re: ExtractingRequestHandler unknown field 'stream_source_info'

2009-10-01 Thread Tricia Williams

Thanks Lance,

  I have lucid's search as one of my open search tools in my browser.  
Generally pretty useful (especially the ability to filter) but it's not 
of much help when the tool points out that the best info is on the wiki 
and the link to the wiki reveals that it can't be reached.  This is the 
second time in a couple of weeks I've seen the wiki down.  Is there an 
ongoing problem?  I do appreciate the tip though.


Tricia

Lance Norskog wrote:

For future reference, the Solr & Lucene wikis and mailing lists are
indexed on http://www.lucidimagination.com/search/

On Thu, Oct 1, 2009 at 11:40 AM, Tricia Williams
 wrote:
  

If the wiki isn't working


https://www.packtpub.com/article/indexing-data-solr-1.4-enterprise-search-server-2
  

gave me more information.  The LucidImagination article helps too.

Now that the wiki is up again it is more obvious that I need to add a 
couple of configuration entries (the values involved being "fulltext" 
and "text") to my solrconfig.xml.

Tricia






  




Re: Why isn't the DateField implementation of ISO 8601 broader?

2009-10-06 Thread Tricia Williams
Thanks for making me think about this a little bit deeper, Hoss.  
Comments in-line.


Chris Hostetter wrote:
because those would be ambiguous.  if you just indexed field:2001-03 would 
you expect it to match field:[2001-02-28T00:00:00Z TO 
2001-03-13T00:00:00Z] ... what about date faceting, what should the 
counts be if you facet per day?
  


I would expect field:2001-03 to be a hit on a partial match such as 
field:[2001-02-28T00:00:00Z TO 2001-03-13T00:00:00Z].  I suppose that my 
expectation would be that field:2001-03 would be counted once per day 
for each day in its range. It would follow that a user looking for 
documents relating to 1919 might also be interested in 1910.  But 
conversely a user looking for documents relating to 1919 might really 
only want documents specifically related to 1919.  Maybe the 
implementation would be smart (or configurable) about precision so that 
it wouldn't be counted when the precision asked to be represented by 
facets had more significant figures than the indexed/stored value.  
Maybe there would be another facet category at each precision for 
"others" -- the documents that have less precision than the current date 
facet precision.  I'm envisioning a hierarchical system that starts 
general with century with click-throughs drilling down eventually to days.


...your expectations may be different than everyone else's.  by requiring 
that the dates be explicit there is no ambiguity, you are in control of 
the behavior.
  


I can see your point, but surely there are others out there with 
non-explicit date data?  Does my use case make sense to anyone else?


you can always just index the first date of whatever block of time (month, 
year, century, etc.) and then facet normally.


  
Until a better solution presents itself we've gone the route of creating 
more fields for faceting on different blocks of time.  So fields for 
century, decade, year, month, and day will let us facet on each of these 
time periods as needed.  Documents with dates with less precision will 
not show up in date facets with more precision.  I was hoping there was 
an elegant hack for faceting on prefix of a defined number of characters 
(prefix=*, prefix=**, prefix=***, ...) without having to explicitly 
specify ..., prefix=188, prefix=189, prefix=190, prefix=191, ...
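
For what it's worth, a rough sketch of what those extra schema.xml fields 
might look like (the field names are mine; how they get populated -- 
copyField plus analysis, or on the client side -- is a separate question):

   <field name="date_century" type="string" indexed="true" stored="false"/>
   <field name="date_decade"  type="string" indexed="true" stored="false"/>
   <field name="date_year"    type="string" indexed="true" stored="false"/>
   <field name="date_month"   type="string" indexed="true" stored="false"/>
   <field name="date_day"     type="string" indexed="true" stored="false"/>

and then facet with facet.field=date_century, facet.field=date_decade, and 
so on, as the user drills down.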


Regards,
Tricia


Re: Why isn't the DateField implementation of ISO 8601 broader?

2009-10-07 Thread Tricia Williams

Chris Hostetter wrote:

: I would expect field:2001-03 to be a hit on a partial match such as
: field:[2001-02-28T00:00:00Z TO 2001-03-13T00:00:00Z].  I suppose that my
: expectation would be that field:2001-03 would be counted once per day for each
: day in its range. It would follow that a user looking for documents relating

...meanwhile someone else might expect it not to match unless the ambiguous 
date is entirely contained within the range being queried on.
  
If implemented in DateField I guess this behaviour would need to be 
configurable.
(your implication of counting once per day would have pretty weird results 
on faceting by the way)
  
I agree.  It would be possible to have one document hit on a query but 
have hundreds of facet categories with a count of one under this 
scheme.  I'm leaning towards the scenario I described where the document 
would be counted once in an "other" facet category if it is relevant 
through rounding.
with unambiguous dates, you can have exactly what you want just by being a 
little more verbose when indexing/querying (and someone else can have 
exactly what they want by being equally verbose using slightly different 
options/queries)


in your case: i would suggest that you use two fields: date_low and 
date_high ... when you have an exact date (down to the smallest level of 
granularity you care about) you put the same value in both fields, when 
you have an ambiguous value (like 2001-03) you put the largest value 
possible in date_high and the lowest value possible in date_low (ie: 
date_low:2001-03-01T00:00:00Z & date_high:2001-03-31T23:59:59.999Z) then a 
query for anything *overlapping* the range from feb28 to march 13 would 
be...


+date_low:[* TO 2001-03-13T00:00:00Z] +date_high:[2001-02-28T00:00:00Z TO *]

...it works for ambiguous dates, and it works for exact dates.

(someone else who only wants to see matches if the ranges *completely* 
overlap would just swap which end point they queried against which field)
  
We've had a really similar solution in place for range queries for a 
while.  Our current problem is really faceting.


Thanks,
Tricia


Highlighting Output

2008-08-11 Thread Tricia Williams

Martin,

I've been over some of the same thoughts you present here in the last 
few years.  The path of least resistance ended up being to deal with the 
highlighting portion of OCRed images outside of Solr.  That's not to say 
it couldn't or shouldn't be done differently.  I briefly even pursued a 
similar course of action evident in 
https://issues.apache.org/jira/browse/SOLR-386.  This would make it 
easier if you wanted to write your own highlighter.


I'm interested to see what others think of your suggestions.  I've 
forwarded this to the solr-user list.


Tricia

 Original Message 
Subject:Highlighting Output
Date:   Mon, 11 Aug 2008 17:21:55 -0400
From:   Martin Owens <[EMAIL PROTECTED]>
To: 	Tricia Williams <[EMAIL PROTECTED]>, 
[EMAIL PROTECTED]




Hello Solr Users,

I've been thinking about the highlighting functionality in Solr. I
recently had the good fortune to be helped by Tricia Williams with
payload issues relating to highlighting.

What I see though is that the highlighting functionality is heavily tied
to the fragment (highlight context) functionality. This actually makes
it interesting to write a plain highlight method that just returns meta
data (so some other process can do the actual highlighting in some
custom fashion).

So is it worthwhile to make sure that solr is able to do multiple
different kinds of highlighting, even if it means passing meta data back
in the request? Should we have standard ways to index and read back
payload information if we're dealing with pages, books, co-ordinates
(for highlighting images) and other meta data which is used for
highlights (char offset, term offset, et cetera)? I also noticed much of
the highlighting code to do with fragments being duplicated in custom
code.

Other thoughts? Does this make things more complex for normal
highlighting?

Best Regards, Martin Owens




Re: Highlighting Output

2008-08-12 Thread Tricia Williams

Martin,

You may want to follow Mark Miller's effort 
https://issues.apache.org/jira/browse/LUCENE-1286 as it develops -- 
perhaps even help with it.  He's developing a Lucene highlighter which 
would "run through query terms by using their offsets" making 
highlighting large documents much more time efficient.  I would be 
interested to see something like this end up as a Solr highlighting option.


Revisiting some of your original thoughts:

What I see though is that the highlighting functionality is heavily tied
to the fragment (highlight context) functionality. This actually makes
it interesting to write a plain highlight method that just returns meta
data (so some other process can do the actual highlighting in some
custom fashion).

So is it worthwhile to make sure that solr is able to do multiple
different kinds of highlighting, even if it means passing meta data back
in the request? Should we have standard ways to index and read back
payload information if we're dealing with pages, books, co-ordinates
(for highlighting images) and other meta data which is used for
highlights (char offset, term offset, et cetera)? I also noticed much of
the highlighting code to do with fragments being duplicated in custom
code.
My idea for highlighting based on 
https://issues.apache.org/jira/browse/SOLR-380 was to include the 
coordinates for highlighting images as just another attribute in the 
input xml.  Then the PayloadComponent will give the coordinates 
associated with a given query as part of the xpath.  I have written some 
code beyond what is posted there that takes some extra parameters and 
reconstructs the xpath into useful results based on the granularity of 
the information that is requested (roughly based on xquery).  Is that a 
"standard" enough way or is there something else you're thinking about?


If you find anything thing I've contributed useful feel free to improve 
it for the benefit of those that use Solr and Lucene.


Tricia


Re: solr on ubuntu 8.04

2008-10-02 Thread Tricia Williams
I haven't tried installing the ubuntu package, but the releases from 
apache.org come with an example that contains a directory called "solr" 
which contains a directory called "conf" where schema.xml and 
solrconfig.xml are important.   Is it possible these files do not exist 
in the path?


Tricia

Jack Bates wrote:

No sweat - did you install the Ubuntu solr package or the solr.war from
http://lucene.apache.org/solr/?

When you say it doesn't work, what exactly do you mean?

On Thu, 2008-10-02 at 07:43 -0700, [EMAIL PROTECTED] wrote:
  

Hi Jack,
Really I would love if you could help me about it ... and tell me what you have 
in your file
./var/lib/tomcat5.5/webapps
./usr/share/tomcat5.5/webapps

It doesn't work I dont know why :(
Thanks a lot
Johanna

Jack Bates-2 wrote:


Thanks for your suggestions. I have now tried installing Solr on two
different machines. On one machine I installed the Ubuntu solr-tomcat5.5
package, and on the other I simply dropped "solr.war"
into /var/lib/tomcat5.5/webapps

Both machines are running Tomcat 5.5

I get the same error message on both machines:

SEVERE: Exception starting filter SolrRequestFilter
java.lang.NoClassDefFoundError: Could not initialize class
org.apache.solr.core.SolrConfig

The full error message is attached.

I can confirm that the /usr/share/solr/WEB-INF/lib/apache-solr-1.2.0.jar
jar file contains: org/apache/solr/core/SolrConfig.class 

- however I do not know why Tomcat does not find it. 


Thanks again, Jack

  

Hardy has solr packages already. You might want to look how they packaged
solr if you cannot move to that version.
Did you just drop the war file? Or did you use JNDI? You probably need to
configure solr/home, and maybe fiddle with
securitymanager stuff.

Albert

On Thu, May 1, 2008 at 6:46 PM, Jack Bates  freezone.co.uk>
wrote:



I am trying to evaluate Solr for an open source records management
project to which I contribute: http://code.google.com/p/qubit-toolkit/

I installed the Ubuntu solr-tomcat5.5 package:
http://packages.ubuntu.com/hardy/solr-tomcat5.5

- and pointed my browser at: http://localhost:8180/solr/admin (The
Ubuntu and Debian Tomcat packages run on port 8180)

However, in response I get a Tomcat 404: The requested
resource(/solr/admin) is not available.

This differs from the response I get accessing a random URL:
http://localhost:8180/foo/bar

- which displays a blank page.

From this I gather that the solr-tomcat5.5 package installed
*something*, but that it's misconfigured or missing something.
Unfortunately I lack the Java / Tomcat experience to track down this
problem. Can someone recommend where to look, to learn why the Ubuntu
solr-tomcat5.5 package is not working?

I started an Ubuntu wiki page to eventually describe the process of
installing Solr on Ubuntu: https://wiki.ubuntu.com/Solr

Thanks, Jack
  

Apr 25, 2008 4:46:41 PM org.apache.catalina.core.StandardContext
filterStart
SEVERE: Exception starting filter SolrRequestFilter
java.lang.NoClassDefFoundError: Could not initialize class
org.apache.solr.core.SolrConfig
at
org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:74)
at
org.apache.catalina.core.ApplicationFilterConfig.getFilter(ApplicationFilterConfig.java:221)
at
org.apache.catalina.core.ApplicationFilterConfig.setFilterDef(ApplicationFilterConfig.java:302)
at
org.apache.catalina.core.ApplicationFilterConfig.<init>(ApplicationFilterConfig.java:78)
at
org.apache.catalina.core.StandardContext.filterStart(StandardContext.java:3635)
at
org.apache.catalina.core.StandardContext.start(StandardContext.java:4222)
at
org.apache.catalina.core.ContainerBase.addChildInternal(ContainerBase.java:760)
at
org.apache.catalina.core.ContainerBase.access$0(ContainerBase.java:744)
at
org.apache.catalina.core.ContainerBase$PrivilegedAddChild.run(ContainerBase.java:144)
at java.security.AccessController.doPrivileged(Native Method)
at
org.apache.catalina.core.ContainerBase.addChild(ContainerBase.java:738)
at org.apache.catalina.core.StandardHost.addChild(StandardHost.java:544)
at
org.apache.catalina.startup.HostConfig.deployDescriptor(HostConfig.java:626)
at
org.apache.catalina.startup.HostConfig.deployDescriptors(HostConfig.java:553)
at 
org.apache.catalina.startup.HostConfig.deployApps(HostConfig.java:488)
at org.apache.catalina.startup.HostConfig.start(HostConfig.java:1138)
at
org.apache.catalina.startup.HostConfig.lifecycleEvent(HostConfig.java:311)
at
org.apache.catalina.util.LifecycleSupport.fireLifecycleEvent(LifecycleSupport.java:120)
at org.apache.catalina.core.ContainerBase.start(ContainerBase.java:1022)
at org.apache.catalina.core.StandardHost.start(StandardHost.java:736)
at org.apache.catalina.core.ContainerBase.start(ContainerBase.java:1014)
at 
org.apache.catalina.core.StandardEngine.start(StandardEng

SolrPluginRepository

2008-10-02 Thread Tricia Williams

Hi All,

   I didn't see anywhere to share the plugin I created for my multipart 
work (see https://issues.apache.org/jira/browse/SOLR-380 for more).  So 
I created one here: http://wiki.apache.org/solr/SolrPluginRepository.  
I'm open to other ways of sharing plugins.


Tricia


Re: [VOTE] Community Logo Preferences

2008-11-23 Thread Tricia Williams

https://issues.apache.org/jira/secure/attachment/12394282/solr2_maho_impression.png
https://issues.apache.org/jira/secure/attachment/12394366/solr3_maho.png
https://issues.apache.org/jira/secure/attachment/12394264/apache_solr_a_red.jpg
https://issues.apache.org/jira/secure/attachment/12394266/apache_solr_b_red.jpg
https://issues.apache.org/jira/secure/attachment/12394218/solr-solid.png


Dates in Solr

2008-12-10 Thread Tricia Williams

Hi All,

  I'm curious about what people have done with dates.

We Require:

 1. multiple granularities to query and facet on: by year, by
year/month, by year/month/day
 2. sortability: sort/order by date
 3. time typically isn't important to us
 4. some of these items don't have a day or month associated with them
 5. possibly consider seasonal like publications with "FALL" as a date

This is the bulk of what I found documented in the mailing list and wiki:

  * http://www.nabble.com/dates---times-td10417533.html#a10421952
  * http://wiki.apache.org/jakarta-lucene/LargeScaleDateRangeProcessing
  * 
http://wiki.apache.org/solr/SimpleFacetParameters#head-068dc96b0dac1cfc7264fe85528d7df5bf391acd 



o 
http://lucene.apache.org/solr/api/org/apache/solr/util/DateMathParser.html


o 
http://lucene.apache.org/solr/api/org/apache/solr/schema/DateField.html


  * any queries on those fields (typically range queries) should use
either the Complete ISO 8601 Date syntax that field supports, or
the DateMath Syntax to get relative dates

This is great and valuable.  I would like to be able to use the existing 
functionality but I'm not sure how I can use the DateField to specify a 
year without a time (what I guess would actually be a range of time) for 
a document.  Any ideas?


Tricia


Re: Dates in Solr

2008-12-10 Thread Tricia Williams

Hi Otis,

   Absolutely, I missed that nugget.  I didn't think of using prefix 
filters/queries.  This works really well with how we had already stored 
dates in a MMDD string.  Thanks for pointing me in the right direction.
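
(For example -- the field name is just illustrative -- with dates stored as 
yyyyMMdd-style strings, a prefix query such as

   q=date_str:1918*

matches documents stored with year, year/month, or year/month/day precision 
for 1918.)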


Tricia

Otis Gospodnetic wrote:

Tricia,

I think you might have missed the key nugget at the bottom of 
http://wiki.apache.org/jakarta-lucene/DateRangeQueries

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



- Original Message 
  

From: Tricia Williams <[EMAIL PROTECTED]>
To: solr-user@lucene.apache.org
Sent: Wednesday, December 10, 2008 12:12:11 PM
Subject: Dates in Solr

Hi All,

  I'm curious about what people have done with dates.

We Require:

1. multiple granularities to query and facet on: by year, by
year/month, by year/month/day
2. sortability: sort/order by date
3. time typically isn't important to us
4. some of these items don't have a day or month associated with them
5. possibly consider seasonal like publications with "FALL" as a date

This is the bulk of what I found documented in the mailing list and wiki:

  * http://www.nabble.com/dates---times-td10417533.html#a10421952
  * http://wiki.apache.org/jakarta-lucene/LargeScaleDateRangeProcessing
  * 
http://wiki.apache.org/solr/SimpleFacetParameters#head-068dc96b0dac1cfc7264fe85528d7df5bf391acd 



o 
http://lucene.apache.org/solr/api/org/apache/solr/util/DateMathParser.html


o 
http://lucene.apache.org/solr/api/org/apache/solr/schema/DateField.html


  * any queries on those fields (typically range queries) should use
either the Complete ISO 8601 Date syntax that field supports, or
the DateMath Syntax to get relative dates

This is great and valuable.  I would like to be able to use the existing 
functionality but I'm not sure how I can use the DateField to specify a year 
without a time (what I guess would actually be a range of time) for a document.  
Any ideas?


Tricia




  




Scheduling DIH

2009-03-25 Thread Tricia Williams

Hello,

   Is there a best way to schedule the DataImportHandler?  The idea 
being to schedule a delta-import every Sunday morning at 7am, or perhaps 
every hour, without human intervention.  Writing a cron job to do this 
wouldn't be difficult.  I'm just wondering: is this a built-in feature?
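
(For the record, the cron route is just something like the line below -- 
host, port, and handler path are whatever your setup uses:)

   # crontab entry: run a DIH delta-import every Sunday at 7am
   0 7 * * 0  curl -s "http://localhost:8983/solr/dataimport?command=delta-import" > /dev/null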


Tricia


DIH: using variables in nested entities

2010-03-12 Thread Tricia Williams

Hi All,

   The DataImportHandler is the most fantastic thing that has recently 
come to Solr.  Thank you.


   I'm noticing that when I use variables in nested entities, 
square brackets are wrapped around the variable value when they are 
used.  For example, ${x.url} used in the "tika" entity below resolves as 
[http://publicdomain.ca/content/Sample.pdf] (note the square brackets), 
so I get the error in my log:



SEVERE: Exception thrown while getting data
java.net.MalformedURLException: no protocol: 
[http://publicdomain.ca/content/Sample.pdf]
at java.net.URL.<init>(URL.java:567)
at java.net.URL.<init>(URL.java:464)
at java.net.URL.<init>(URL.java:413)
at org.apache.solr.handler.dataimport.BinURLDataSource.getData(BinURLDataSource.java:78)
at org.apache.solr.handler.dataimport.BinURLDataSource.getData(BinURLDataSource.java:38)
at org.apache.solr.handler.dataimport.TikaEntityProcessor.nextRow(TikaEntityProcessor.java:98)


   I encountered this previously when I tried to concatenate fields 
from different entities into one field.  I worked around this by 
gathering fields with an xsl.  Not being able to resolve the url for 
Tika is a little more problematic.


   *Is this a bug?  If not, how do I remove the brackets so that I can 
use my variable as it was meant?*


<dataConfig>
   <dataSource type="FileDataSource" name="fileReader"/>
   <dataSource type="BinURLDataSource" name="bin"/>
   <document>
      <entity name="f" processor="FileListEntityProcessor"
              baseDir="/home/pgwillia/content" dataSource="null" fileName=".*xml"
              rootEntity="false">
         <entity name="x" processor="XPathEntityProcessor" dataSource="fileReader"
                 transformer="TemplateTransformer,RegexTransformer"
                 forEach="/RDF/Description" url="${f.fileAbsolutePath}">
...
            <field column="url" regex="http://privatedomain:8080/content/"
                   replaceWith="http://publicdomain.ca/content/"/>
            <entity name="tika" processor="TikaEntityProcessor"
                    url="${x.url}" dataSource="bin" format="text">
               ...
            </entity>
         </entity>
      </entity>
   </document>
</dataConfig>


Many thanks,
Tricia


Re: DIH: using variables in nested entities

2010-03-13 Thread Tricia Williams
For anyone interested, my issue (I think) was because I had specified 
the url field as a multivalued field.   I wasn't able to create a test 
case that emulated my problem.  This guess is based on gradual fiddling 
with my configs.


My concern is no longer pressing but I do have a couple questions for 
the devs to think about:


  1. How should a multivalued field be treated in a child entity?  The
 use case would be the one I presented where I intend url to be
 multivalued.  I'm thinking a for-each type construct should apply.
  2. How should a multivalued field be formatted or custom formatted if
 you intend to use the content of a field in another field,
 possibly nested?



Tricia Williams wrote:

Hi All,

   The DataImportHandler is the most fantastic thing that has recently 
come to Solr.  Thank you.


   I'm noticing that when I use variables in nested entities, 
square brackets are wrapped around the variable value when they are 
used.  For example, ${x.url} used in the "tika" entity below resolves 
as [http://publicdomain.ca/content/Sample.pdf] (note the square 
brackets), so I get the error in my log:



SEVERE: Exception thrown while getting data
java.net.MalformedURLException: no protocol: 
[http://publicdomain.ca/content/Sample.pdf]
at java.net.URL.<init>(URL.java:567)
at java.net.URL.<init>(URL.java:464)
at java.net.URL.<init>(URL.java:413)
at org.apache.solr.handler.dataimport.BinURLDataSource.getData(BinURLDataSource.java:78)
at org.apache.solr.handler.dataimport.BinURLDataSource.getData(BinURLDataSource.java:38)
at org.apache.solr.handler.dataimport.TikaEntityProcessor.nextRow(TikaEntityProcessor.java:98)


   I encountered this previously when I tried to concatenate fields 
from different entities into one field.  I worked around this by 
gathering fields with an xsl.  Not being able to resolve the url for 
Tika is a little more problematic.


   *Is this a bug?  If not, how do I remove the brackets so that I can 
use my variable as it was meant?*


<dataConfig>
   <dataSource type="FileDataSource" name="fileReader"/>
   <dataSource type="BinURLDataSource" name="bin"/>
   <document>
      <entity name="f" processor="FileListEntityProcessor"
              baseDir="/home/pgwillia/content" dataSource="null" fileName=".*xml"
              rootEntity="false">
         <entity name="x" processor="XPathEntityProcessor" dataSource="fileReader"
                 transformer="TemplateTransformer,RegexTransformer"
                 forEach="/RDF/Description" url="${f.fileAbsolutePath}">
...
            <field column="url" regex="http://privatedomain:8080/content/"
                   replaceWith="http://publicdomain.ca/content/"/>
            <entity name="tika" processor="TikaEntityProcessor"
                    url="${x.url}" dataSource="bin" format="text">
               ...
            </entity>
         </entity>
      </entity>
   </document>
</dataConfig>




Many thanks,
Tricia





Re: [jira] Commented: (SOLR-380) There's no way to convert search results into page-level hits of a "structured document".

2007-10-19 Thread Tricia Williams


I echo the apology for using JIRA to work out ideas on this.

Just thinking out loud here:

  * Is there any reason why the page id should be an integer?  I mean
could the page identifier be an alphanumeric string? 
  * Ideally our project would like to store some page level meta-data

(especially a URL link to page content).  Would this be contorting
the use of a field too much?  If we stored the URL in a dynamic
field URL_*, how would we retrieve this at query time? 
  * Is there a way to alter FieldType to use the Composite design

pattern?  (http://en.wikipedia.org/wiki/Composite_pattern)  In
this way a document could be composed of fields, which could be
composed of fields.  For example: The monograph is a document, a
page in the monograph is a field in the document, the text on the
page is a field in the field, a single piece of metadata for the
page is a field in the field, etc. ( monograph
( page ( fulltext, page_metadata_1, page_metadata_2, etc ),
monograph_metadata_1, monograph_metadata_2, etc ) ).  Maybe what
I'm trying to describe is that Documents can contain Documents?

Following the path of least resistance, I think the first step is to 
create a highlighter which returns positions instead of highlighted 
text.  The next step would be to create an Analyzer and/or Filter and/or 
Tokenizer, as well as a FieldType which creates the page mappings. The 
last step (and the one I am least certain of how it could work) is to 
evolve the position highlighter to get the position to page mapping and 
group the positions by page (number or id) or alternately just write out 
the page (number or id) and drop the position.


Tricia

Binkley, Peter wrote:

(I'm taking this discussion to solr-user, as Mike Klaas suggested; sorry
for using JIRA for it. Previous discussion is at
https://issues.apache.org/jira/browse/SOLR-380).

I think the requirements I mentioned in a comment
(https://issues.apache.org/jira/browse/SOLR-380#action_12535296) justify
abandoning the one-page-per-document approach. The increment-gap
approach would break the cross-page searching, and would involve about
as much work as the stored map, since the gap would have to vary
depending on the number of terms on each page, wouldn't it? (if there
are 100 terms on page one, the gap has to be 900 to get page two to
start at 1000 - or can you specify the absolute position you want for a
term?). 


I think the problem of indexing books (or any text with arbitrary
subdivisions) is common enough that a generic approach like this would
be useful to more people than just me, and justifies some enhancements
within Solr to make the solution easy to reuse; but maybe when we've
figured out the best approach it will become clear how much of it is
worth packing into Solr.

Assuming the two-field approach works
(https://issues.apache.org/jira/browse/SOLR-380#action_12535755), then
what we're really talking about is two things: a token filter to
generate and store the map, and a process like the highlighter to
generate the output. Suppose the map is stored in tokens with the
starting term position for each page, like this:

0:1
345:2
827:3

The output function would then imitate the highlighter to discover term
positions, use the map (either by loading all its terms or by doing
lookups) to convert them to page positions, and generate the appropriate
output. I'm not clear where that output process should live, but we can
just imitate the highlighter.
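
Just to make the lookup step concrete, a rough sketch (class and method 
names here are illustrative, not from any patch) of going from a matched 
term position to a page number using that map:

   import java.util.TreeMap;

   public class PageMap {
       private final TreeMap<Integer, Integer> startToPage = new TreeMap<Integer, Integer>();

       // entries like "0:1", "345:2", "827:3"
       public void addEntry(String entry) {
           String[] parts = entry.split(":");
           startToPage.put(Integer.valueOf(parts[0]), Integer.valueOf(parts[1]));
       }

       // the page containing a term position is the one with the greatest start <= position
       public int pageForPosition(int termPosition) {
           return startToPage.floorEntry(termPosition).getValue();
       }

       public static void main(String[] args) {
           PageMap map = new PageMap();
           map.addEntry("0:1");
           map.addEntry("345:2");
           map.addEntry("827:3");
           System.out.println(map.pageForPosition(400));   // prints 2
       }
   }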

(and just to clarify roles: Tricia's the one who'll actually be coding
this, if it's feasible; I'm just helping to think out requirements and
approaches based on a project in hand.)

Peter
  






Re: AW: What is the best way to index xml data preserving the mark up?

2007-11-08 Thread Tricia Williams

Hi Dave,

This sounds like what I've been trying to work out with 
https://issues.apache.org/jira/browse/SOLR-380.  The idea that I'm 
running with right now is indexing the xml and storing the data in the 
xml tags as a Payload.  Payload is a relatively new idea from  Lucene.  
A custom SolrHighlighter provides position hits (our need for this is 
highlighting on an image while searching the OCR text of the image) and 
some context to where they appear in the document using the stored Payload.


Tricia

David Neubert wrote:

Chris

I'll try to track down your Jira issue.

(2) sounds very helpful -- I am only 2 days old in SOLR/Lucene experience, but know 
what I need -- and basically it's to search by the main granules in an xml document, 
which usually turn out to be, for books: book (rarely), chapter (more often), 
paragraph (often), sentence (often).  Then there are niceties like chapter titles, 
headings, etc. but I can live without that -- it seems like if you can exploit 
the text nodes of arbitrary XML you are looking good; if not, you've got a lot of 
machination in front of you.

Seems like Lucene/SOLR is geared to take record and non-xml-oriented content 
and put it into XML format for ingest -- but really can't digest XML content 
itself at all without significant setup and constraints.  I am surprised -- but 
I could really use it for my project big time.

A related problem I am having (which I will probably repost separately) 
is boolean searches across fields with multiple values.  At this point, because 
of my workarounds for Lucene (to this point) I am indexing paragraphs as 
single documents with multiple fields, thinking I could copy the sentences to 
text.  In that way, I can search the text field (for the paragraph) -- and search 
the sentence field -- for sentence granularity.  The problem is that a search for 
sentence:foo AND sentence:bar matches if foo matches in any sentence of the 
paragraph and bar also matches in any sentence of the paragraph.  I need it to 
match only if foo and bar are found in the same sentence.  If this can't be done, 
it looks like I will have to index paragraphs as documents, and redundantly index 
sentences as unique documents.  Again, I will post this question separately 
immediately.

Thanks,

Dave
  




Payloads in Solr

2007-11-17 Thread Tricia Williams

Hi All,

   I was wondering how Solr people feel about the inclusion of Payload 
functionality in the Solr codebase?


   From a recent message to the [EMAIL PROTECTED] mailing list:
  I'm working on the issue 
https://issues.apache.org/jira/browse/SOLR-380 which is a feature 
request that allows one to index a "Structured Document" which is 
anything that can be represented by XML in order to provide more 
context to hits in the result set.  This allows us to do things like 
query the index for "Canada" and be able to not only say that that 
query matched a document titled "Some Nonsense" but also that the 
query term appeared on page 7 of chapter 1.  We can then take this one 
step further and markup/highlight the image of this page based on our 
OCR and position hit.

For example:

<book title="Some Nonsense"><chapter title="One"><page name="1">Some 
text from page one of a book.</page><page name="7">Some more text from 
page seven of a book. Oh and I'm from Canada.</page></chapter></book>


  I accomplished this by creating a custom Tokenizer which strips the 
xml elements and stores them as a Payload at each of the Tokens 
created from the character data in the input.  The payload is the 
string that describes the XPath at that location.  So for "Canada" the 
payload is "/book[title='Some 
Nonsense']/chapter[title='One']/page[name='7']"
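
A minimal sketch of that idea (not the actual SOLR-380 patch; just the 
Lucene 2.3-era Payload API used to tag a token with its location string):

   import org.apache.lucene.analysis.Token;
   import org.apache.lucene.index.Payload;

   class XPathPayloadHelper {
       // currentXPath is whatever location the tokenizer has tracked so far, e.g.
       // "/book[title='Some Nonsense']/chapter[title='One']/page[name='7']"
       static Token tagWithLocation(Token token, String currentXPath) throws Exception {
           token.setPayload(new Payload(currentXPath.getBytes("UTF-8")));
           return token;
       }
   }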


  The other part of this work is the SolrHighlighter which is less 
important to this list.  I retrieve the TermPositions for the Query's 
Terms and use the TermPosition functionality to get back the payload 
for the hits and build output which shows hit positions categorized by 
the payload they are associated with. 
   Using Payloads requires me to include lucene-core-2.3-dev.jar, which 
might be a barrier.  Also, using my Tokenizer with Solr-specific 
TokenFilter(s) loses the Payload at modified tokens.  I probably 
shouldn't generalize this but I suspect it is true.  My only issue has 
come from the WordDelimiterFilter so far.


In the following example I will denote a token by {pos,<text>,<payload>}:


input: Dog, and Cat

XmlPayloadTokenizer:
{1,<Dog,>,<payload>},{2,<and>,<payload>},{3,<Cat>,<payload>}


StopFilter:
{1,<Dog,>,<payload>},{2,<Cat>,<payload>}


WordDelimiterFilter:
{1,<Dog>,<>} {2,<Cat>,<payload>}
LowerCaseFilter:
{1,<dog>,<>} {2,<cat>,<payload>}

   Should I create a JIRA issue about the Filters and post a patch?

Thanks,
Tricia


Re: Payloads in Solr

2007-11-18 Thread Tricia Williams

Thanks for your comments, Yonik!

All for it... depending on what one means by "payload functionality" of course.
We should probably hold off on adding a new lucene version to Solr
until the Payload API has stabilized (it will most likely be changing
very soon).

  
It sounds like Lucene 2.3 is going to be released soonish 
(http://www.nabble.com/How%27s-2.3-doing--tf4802426.html#a13740605).  As 
best I can tell it will include the Payload stuff marked experimental.  
The new Lucene version will have many improvements besides Payloads 
which would benefit Solr (examples galore in CHANGES.txt 
http://svn.apache.org/viewvc/lucene/java/trunk/CHANGES.txt?view=log).  
So I find it hard to believe that the new release will not be included.  
I recognize that the experimental status would be worrisome.  What will 
it take to get Payloads to the place that they would be accepted for use 
in the Solr community?  You probably know more about the projected 
changes to the API than I.  Care to fill me in or suggest who I should 
ask?  On the [EMAIL PROTECTED] list Grant Ingersoll 
suggested that the Payload object would be done away with and the API 
would just deal with byte arrays directly.

That's a lot of data to associate with every token... I wonder how
others have accomplished this?
One could compress it with a dictionary somewhere.
I wonder if one could index special begin_tag and end_tag tokens, and
somehow use span queries?

  
I agree that is a lot of data to associate with every token - especially 
since the data is repetitive in nature.  Erik Hatcher suggested I store 
a representation of the structure of the document in a separate field, 
store a numeric representation of the mapping of the token to the 
structure as the payload for each token, and do a lookup at query time 
based on the numeric mapping in the payload at the position hit to get 
the structure/context back for the token.


I'm also wondering how others have accomplished this.  Grant Ingersoll 
noted that one of the original use cases was XPath queries so I'm 
particularly interested in finding out if anyone has implemented that, 
and how.

Yes, this will be an issue for many custom tokenizers that don't yet
know about payloads but that create tokens.  It's not clear what to do
in some cases when multiple tokens are created from one... should
identical payloads be created for the new tokens... it depends on what
the semantics of those payloads are.

  
I suppose that it is only fair to take this on a case by case basis.  
Maybe we will have to write new TokenFilters for each Tokenizer that 
uses Payloads (but I sure hope not!).  Maybe we can build some optional 
configuration options into the TokenFilter constructor that guide their 
behavior with regard to Payloads.  Maybe there is something stored in 
the TokenStream that dictates how the Payloads are handled by the 
TokenFilters.  Maybe there is no case where identical payloads would not 
be created for new tokens and we can just change the TokenFilter to deal 
with payloads directly in a uniform way.


Tricia


Re: Payloads, Tokenizers, and Filters. Oh My!

2007-11-18 Thread Tricia Williams
I apologize for cross-posting but  I believe both Solr and Lucene users 
and developers should be concerned with this.  I am not aware of a 
better way to reach both communities.


In this email I'm looking for comments on:

   * Do TokenFilters belong in the Solr code base at all?
   * How to deal with TokenFilters that add new Tokens to the stream?
   * How to patch TokenFilters and Tokenizers using the model of
 LUCENE-969 in the Solr code base and in Lucene contrib?

Earlier in this thread I identified that at least one TokenFilter is 
eating Payloads (WordDelimiterFilter).


Yonik pointed out:

Yes, this will be an issue for many custom tokenizers that don't yet
know about payloads but that create tokens.  It's not clear what to do
in some cases when multiple tokens are created from one... should
identical payloads be created for the new tokens... it depends on what
the semantics of those payloads are.

And I responded: 
I suppose that it is only fair to take this on a case by case basis.  
Maybe we will have to write new TokenFilters for each Tokenizer that 
uses Payloads (but I sure hope not!).  Maybe we can build some 
optional configuration options into the TokenFilter constructor that 
guide their behavior with regard to Payloads.  Maybe there is 
something stored in the TokenStream that dictates how the Payloads are 
handled by the TokenFilters.  Maybe there is no case where identical 
payloads would not be created for new tokens and we can just change 
the TokenFilter to deal with payloads directly in a uniform way. 


I thought it might be useful to figure out which existing TokenFilters 
need to know about Payloads.  To this end I have taken an inventory of 
the TokenFilters out there.  I think it is fair to categorize them by 
Add (A), Delete (D), Modify (M), Observe (O):


org.apache.solr.analysis.HyphenatedWordsFilter, DM
org.apache.solr.analysis.KeepWordFilter, D
org.apache.solr.analysis.LengthFilter, D
org.apache.solr.analysis.PatternReplaceFilter, M
org.apache.solr.analysis.PhoneticFilter, AM
org.apache.solr.analysis.RemoveDuplicatesTokenFilter, D
org.apache.solr.analysis.SynonymFilter, ADM
org.apache.solr.analysis.TrimFilter, M
org.apache.solr.analysis.WordDelimiterFilter, AM
org.apache.lucene.analysis.CachingTokenFilter, O
org.apache.lucene.analysis.ISOLatin1AccentFilter, M
org.apache.lucene.analysis.LengthFilter, D
org.apache.lucene.analysis.LowerCaseFilter, M
org.apache.lucene.analysis.PorterStemFilter, M
org.apache.lucene.analysis.StopFilter, D
org.apache.lucene.analysis.standard.StandardFilter, M
org.apache.lucene.analysis.br.BrazilianStemFilter, M
org.apache.lucene.analysis.cn.ChineseFilter, D
org.apache.lucene.analysis.de.GermanStemFilter, M
org.apache.lucene.analysis.el.GreekLowerCaseFilter, M
org.apache.lucene.analysis.fr.ElisionFilter, M
org.apache.lucene.analysis.fr.FrenchStemFilter, M
org.apache.lucene.analysis.ngram.EdgeNGramTokenFilter, AM
org.apache.lucene.analysis.ngram.NGramTokenFilter, AM
org.apache.lucene.analysis.nl.DutchStemFilter, M
org.apache.lucene.analysis.ru.RussianLowerCaseFilter, M
org.apache.lucene.analysis.ru.RussianStemFilter, M
org.apache.lucene.analysis.th.ThaiWordFilter, AM
org.apache.lucene.analysis.snowball.SnowballFilter, M

Some characteristics of Add (A), Delete (D), Modify (M), Observe (O)
Add: new Token() and buffer of Tokens to consider before addressing 
input.next()

Delete: loop ignoring tokens based on some criteria
Modify: new Token(), or use of Token set methods
Observe: rare CachingTokenFilter

The categories of TokenFilters that are affected by Payloads are add and 
modify.  The default behavior of TokenFilters which only delete or 
observe is to return the Token fed through intact, hence the Payload will 
remain intact.
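
To make the concern with the add and modify categories concrete, here is a 
minimal sketch (not taken from any existing filter) of the step a modifying 
filter has to remember when it emits a brand-new Token:

   import org.apache.lucene.analysis.Token;

   class PayloadCopyExample {
       // build a replacement token from an original one without dropping its payload
       static Token replaceText(Token orig, String newText) {
           Token t = new Token(newText, orig.startOffset(), orig.endOffset(), orig.type());
           t.setPositionIncrement(orig.getPositionIncrement());
           t.setPayload(orig.getPayload());   // the step that payload-unaware filters skip
           return t;
       }
   }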


Maybe the Lucene community has thought about this problem?  I noticed 
that the org.apache.lucene.analysis TokenFilters in the modify category 
(there are none in the add category) refrain from using new Token().  
That led me to the comment in the JavaDocs:


*NOTE:* As of 2.3, Token stores the term text internally as a 
malleable char[] termBuffer instead of String termText. The indexing 
code and core tokenizers have been changed to re-use a single Token 
instance, changing its buffer and other fields in-place as the Token 
is processed. This provides substantially better indexing performance 
as it saves the GC cost of new'ing a Token and String for every term. 
The APIs that accept String termText are still available but a warning 
about the associated performance cost has been added (below). The 
termText() method has been deprecated.


Tokenizers and filters should try to re-use a Token instance when 
possible for best performance, by implementing the 
TokenStream.next(Token)


Re: Payloads in Solr

2007-11-19 Thread Tricia Williams

Yonik Seeley wrote:


http://www.nabble.com/Payload-API-tf4828837.html#a13815548

  
http://www.nabble.com/new-Token-API-tf4828894.html#a13815702


  
Thanks for these links.  I didn't even realize you had started these 
conversations. 


Thank you!
Tricia


ResponseBuilder public flags

2008-03-17 Thread Tricia Williams

Hi,

   I'm working on a custom SearchComponent to display context stored in 
payloads.  I noticed that both the FacetComponent and the 
HighlightComponent are tightly coupled with the ResponseBuilder through 
the frequent use of doFacet and doHighlight.  If I am building a 
component with similar functionality to highlighting/faceting that will 
need to check a similar flag, how can I do this as a plugin (i.e. without 
making any modification to the ResponseBuilder)?


   How are people feeling about the stability of this API?  Is this the 
right way to approach this?


Thanks,
Tricia



Re: Extending XmlRequestHandler

2008-05-08 Thread Tricia Williams

I frequently use the Solr API: http://lucene.apache.org/solr/api/index.html

Tricia

Alexander Ramos Jardim wrote:

Sorry for the stupid question, but I could not find Solr API code. Could
anyone point me where do I find it?

2008/5/8 Alexander Ramos Jardim <[EMAIL PROTECTED]>:

  

Nice,
Thank you. I will try this out.

2008/5/8 Ryan McKinley <[EMAIL PROTECTED]>:




The XML format is fixed, and there is not a good way to change it.  If you
can transform your custom docs via XSLT, down the line this may be possible
 (it currently is not)

If you really need to index your custom XML format, write your own
RequestHandler modeled on XmlRequestHandler, but (most likely) not extending
it.




On May 8, 2008, at 5:29 PM, Alexander Ramos Jardim wrote:

  

Hello,

I want to know how I set the xml file format that XmlRequestHandler
understands. Should I extend it, or can it be done via some
configuration, maybe some xml file describing the template it should
understand?

I understand the easiest way to do that is getting the original xml file
and converting it to the expected format via XQuery or XSLT. After that
I would post the file. I could extend XmlRequestHandler, call the
appropriate method and run the correct method from the original
XmlRequestHandler, right?

--
Alexander Ramos Jardim


  

--
Alexander Ramos Jardim






  




Re: Fwd: Grouping products

2008-05-14 Thread Tricia Williams
Perhaps the Synonym Filter would work for this.  
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters will tell 
you more.
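
A rough illustration (field and file names here are just examples, not 
from this thread) of how a synonyms.txt entry could collapse the 
model-name variants below onto one token at index time:

   # synonyms.txt
   canon ip1300, printer canon ip 1300, ip1300 canon printer black => canon_ip1300

with the filter wired into the field's analyzer in schema.xml:

   <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
           ignoreCase="true" expand="false"/>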


Tricia

Otis Gospodnetic wrote:

Hi Vender,

Solr can't do the grouping for you.  Solr can do the searching/finding for you, 
but it won't be able to recognize different model names and figure out which 
ones represent the same product.


Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


- Original Message 
  

From: Vender Livre <[EMAIL PROTECTED]>
To: solr-user@lucene.apache.org
Sent: Wednesday, May 14, 2008 5:01:01 PM
Subject: Fwd: Grouping products

-- Forwarded message --
From: Vender Livre 
Date: Wed, May 14, 2008 at 5:59 PM

Subject: Grouping products
To: [EMAIL PROTECTED]


Hi, I'm working in a software that must group similar products.

For example:

CANON IP1300

PRINTER CANON IP 1300

IP1300 CANON PRINTER BLACK

the app should group these three names, because they are the same product.
Someone told me SOLR should solve my problem. Is this true? Where could I
learn more about it?

Thanks




  




Re: An unusual question for the experts -- *term* boosting for individual documents?

2008-06-06 Thread Tricia Williams
Payloads could be the answer, but I don't think there is any crossover 
with what I've been working on with Payloads 
(https://issues.apache.org/jira/browse/SOLR-380 has what I last posted, 
which is pretty much what we're using now.  I've also posted the related 
SOLR-532 and SOLR-522).


What you would have to do is write a custom Tokenizer or TokenFilter 
which takes your input, breaks into tokens and then adds the numeric 
value as a payload.  Assuming your input is actually something like:

cat:0.99 dog:0.42 car:0.00
you could write a TokenFilter which builds on the WhitespaceTokenizer to 
break each token on ":" using the first part as the token value and the 
second part as the token's payload.  I think the APIs are pretty clear 
if you are looking for help.
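
For the record, a rough sketch of that filter -- untested, written against 
the Lucene 2.3/2.4-era Token API, and the class name is made up:

   import java.io.IOException;
   import org.apache.lucene.analysis.Token;
   import org.apache.lucene.analysis.TokenFilter;
   import org.apache.lucene.analysis.TokenStream;
   import org.apache.lucene.index.Payload;

   public class DelimitedWeightPayloadFilter extends TokenFilter {
       public DelimitedWeightPayloadFilter(TokenStream input) {
           super(input);
       }

       public Token next(Token reusableToken) throws IOException {
           Token t = input.next(reusableToken);
           if (t == null) return null;
           String text = new String(t.termBuffer(), 0, t.termLength()); // e.g. "cat:0.99"
           int i = text.lastIndexOf(':');
           if (i >= 0) {
               // keep "cat" as the token text ...
               String term = text.substring(0, i);
               t.setTermBuffer(term.toCharArray(), 0, term.length());
               // ... and stash the weight's float bits as a 4-byte payload
               int bits = Float.floatToIntBits(Float.parseFloat(text.substring(i + 1)));
               byte[] bytes = new byte[] {
                   (byte) (bits >>> 24), (byte) (bits >>> 16), (byte) (bits >>> 8), (byte) bits
               };
               t.setPayload(new Payload(bytes));
           }
           return t;
       }
   }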


I haven't looked at all at how you can query/boost using payloads, but 
if Grant says that integrating the BoostingTermQuery isn't all that hard 
I would believe him.


Good Luck,
Tricia

Grant Ingersoll wrote:
Hmmm, if I understand your question correctly, I think Lucene's 
payloads are what you are after.


Lucene does support Payloads (i.e. per term storage in the index.  See 
the BoostingTermQuery in Lucene and the Token class setPayload() 
method).  However, this doesn't do much for you in Solr as of yet 
without some work on your own.  I think Tricia Williams has been 
working on payloads and Solr, but I don't know that anything has been 
posted.  The tricky part, I believe, is how to handle indexing, 
integrating the BoostingTermQuery isn't all that hard, I don't 
think.   Also note, there isn't anything in Solr preventing the use of 
payloads, but there probably is a decent amount to do to hook them in.


HTH,
Grant



On Jun 5, 2008, at 4:52 PM, Andreas von Hessling wrote:


Hi there!
As a Solr newbie who has however worked with Lucene before, I have an 
unusual question for the experts:


Question:

Can I, and if so, how do I perform index-time term boosting in 
documents where each boost-value is not the same for all documents 
(no global boosting of a given term) but instead can be 
per-document?  In other words:  I understand there's a way to specify 
term boost values for search queries, but is that also possible for 
indexed documents?



Here's what I'm fundamentally trying to do:

I want to index and search over documents that have a special, 
associative-array-like property:
Each document has a list of unique words, and each word has a numeric 
value between 0 and 1.  These values express similarity in the 
dimensions with this word/name.  For example "cat": 0.99 is similar 
to "cat: 0.98", but not to "cat": 0.21.  All documents have the same 
set of words, and there are lots of them: about 1 million.  (If 
necessary, I can reduce the number of words to tens of thousands,  
but then the documents would not share the same set of words any 
more).  Most of the word values for a typical document are 0.00.

Example:
Documents in the index:
d1:
cat: 0.99
dog: 0.42
car: 0.00

d2:
cat: 0.02
dog: 0.00
car: 0.00

Incoming search query (with these numeric term-boosts):
q:
cat: 0.99
dog: 0.11
car: 0.00 (not specified in query)

The ideal result would be that q matches d1 much more than d2.


Here's my analysis of my situation and potential solutions:

- because I have so many words, I cannot use a separate field for 
each word, this would overload Solr/Lucene.  This is unfortunate, 
because I know there is index-time boosting on a per-field basis 
(reference: 
http://wiki.apache.org/solr/SolrRelevancyFAQ#head-d846ae0059c4e6b7f0d0bb2547ac336a8f18ac2f), 
and because I could have used Function Queries (reference: 
http://wiki.apache.org/solr/FunctionQuery).
- As a (stupid) workaround, I could convert my documents into pure 
text: the numeric values would be translated from e.g. "cat": 0.99 to 
repeat the word "cat" 99 times.  This would be done for a particular 
document for all words and the text would be then used for regular 
scoring in Solr.  This approach seems doable, but inefficient and far 
from elegant.



Am I reinventing the wheel here or is what I'm trying to do something 
fundamentally different than what Solr and Lucene has to offer?


Any comments highly appreciated.  What can I do about this?


Thanks,

Andreas


--
Grant Ingersoll
http://www.lucidimagination.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ












Using Lucene index in Solr

2006-06-21 Thread Tricia Williams

Hi,

  I was wondering if there are any major differences in building an index 
using Lucene and Solr.  If there are no substantial differences, how would one 
go about using an existing index created using Lucene in Solr?


Thanks,
Tricia



Re: Using Lucene index in Solr

2006-06-21 Thread Tricia Williams
So I've modified schema.xml to account for my lucene index.  I've created 
a field type for my custom analyzer "text_lu", created fields for those in 
my index, and changed the defaultSearchField.  The index I want to use is 
in the data/index folder.
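
Roughly, the sort of thing I mean in schema.xml (the analyzer class below 
is a stand-in for my custom analyzer, and the field name is just an example):

   <fieldtype name="text_lu" class="solr.TextField">
     <analyzer class="org.example.MyLuceneAnalyzer"/>
   </fieldtype>
   ...
   <field name="fulltext" type="text_lu" indexed="true" stored="true"/>
   ...
   <defaultSearchField>fulltext</defaultSearchField>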


Now I want to use the admin page to query my old index.  I fill in the 
Query text box and press the search button.  I recieve the following 
message:

XML Parsing Error: syntax error
Location: 
http://localhost:8080/solr/select/?stylesheet=&q=alberta&version=2.1&start=0&rows=10&indent=on

Line Number 1, Column 1:java.lang.NullPointerException

When I try to PING:
HTTP Status 500 - java.lang.NullPointerException at 
org.apache.solr.search.SolrQueryParser.<init>(SolrQueryParser.java:38) at 
org.apache.solr.search.QueryParsing.parseQuery(QueryParsing.java:47) at 
org.apache.solr.request.StandardRequestHandler.handleRequest(StandardRequestHandler.java:90) 
at org.apache.solr.core.SolrCore.execute(SolrCore.java:592) at 
org.apache.jsp.admin.ping_jsp._jspService(ping_jsp.java:70) at

etc

Does anyone have an intuitive notion as to whether these exceptions are 
generated because of the custom analyzer that I am using or because of the 
changes I have made to schema.xml?  What is the best way to debug my 
instance of Solr?


Any help is much appreciated,
Tricia

On Wed, 21 Jun 2006, Yonik Seeley wrote:


On 6/21/06, Tricia Williams <[EMAIL PROTECTED]> wrote:

   I was wondering if there are any major differences in building an index
using Lucene and Solr.  If there is no substantial differences, how would 
one

go about using an existing index created using Lucene in Solr?


You can definitely do that for the majority of indices w/o writing
any code... you just need to make sure the schema matches what is in
the index (make the analyzers for the field types compatible, etc).

If you have access to the source code that built the index, start
there.  If you don't then open up the index with Luke to see what you
can find out.
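
To see what is actually in an existing index without Luke, something along 
these lines should also work (a rough sketch against the Lucene 2.x API; the 
index path is an assumption, point it at wherever your index really lives):

import java.util.Collection;
import org.apache.lucene.index.IndexReader;

// Lists the field names present in an existing Lucene index so that
// matching field/type declarations can be added to Solr's schema.xml.
public class ListIndexFields {
    public static void main(String[] args) throws Exception {
        IndexReader reader = IndexReader.open("solr/data/index"); // assumed path
        Collection fields = reader.getFieldNames(IndexReader.FieldOption.ALL);
        for (Object name : fields) {
            System.out.println(name);
        }
        reader.close();
    }
}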

-Yonik




Secure Solr

2006-06-21 Thread Tricia Williams

Hi All,

   It seems to me that the way documents are indexed and managed via 
Solr using HTTP GET requests leaves your index open to malicious attacks, 
since anyone with the right syntax and some information about your index 
could commit changes to it.  Is there some mechanism in Solr that 
prevents this kind of attack?


Thanks,
Tricia



Using alternate index directory

2006-07-11 Thread Tricia Williams

Hi All,

   I would like to use two Lucene indexes built outside of Solr in a Solr 
project.  The indexes do not live in the same path as the Solr project, 
nor do the resulting files live in a folder called 'index'.  Does Solr 
allow for this?  If not, where should I look in order to adapt it to my 
desired behavior?


Thanks,
Tricia


Cyrillic characters

2006-07-18 Thread Tricia Williams

Hi all,

   I'm trying to adapt our old cocoon/lucene based web search application 
to one that is more solrish.  Our old web app was capable of searching for 
queries with cyrillic characters in them.  I'm finding that, using the 
packaged example admin interface, entering a query with a string of 
cyrillic characters causes a java.lang.ArrayIndexOutOfBoundsException. 
I've also noted that the url built from the search form is not utf-8 
encoded.  So obviously if I try to manipulate the query string by 
inserting a utf-8 encoded string in the q= parameter, the values are 
interpreted incorrectly, and as such I cannot use this approach as a 
work-around.  My sample query is: Канада (the english word _canada_ 
translated into russian) or 
%D0%9A%D0%B0%D0%BD%D0%B0%D0%B4%D0%B0 (utf-8) or 
%26%231050%3B%26%231072%3B%26%231085%3B%26%231072%3B%26%231076%3B%26%231072%3B 
(solr url encoding)


   I would appreciate any advice or suggestions that would allow me 
to search for cyrillics in solr.  If anyone knows why solr is behaving as 
it does, a brief explanation of what causes this strange encoding 
behaviour, and what the encoding actually is (unicode?), would be helpful.  
If anyone else has forced solr to accept utf-8 encoded q= parameters with 
success, I would love to know how you did it.


Thanks in advance!
Tricia

ps.  I am using mozilla firefox as my main browser, which leads to the 
behaviour I reported above.  IE 6.0 works fine for cyrillics, although 
there is still a strange but different encoding (%CA%E0%ED%E0%E4%E0 for 
the same query as before).
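
For reference, both of the percent-encoded forms above can be reproduced with 
plain java.net.URLEncoder; this sketch only illustrates which charset yields 
which query string (that IE 6 is actually using windows-1251 here is an 
inference, not something verified):

import java.net.URLEncoder;

public class EncodeQuery {
    public static void main(String[] args) throws Exception {
        String word = "\u041A\u0430\u043D\u0430\u0434\u0430"; // "Канада"
        // UTF-8 percent-encoding, i.e. what Solr expects in the q= parameter:
        System.out.println(URLEncoder.encode(word, "UTF-8"));
        // prints %D0%9A%D0%B0%D0%BD%D0%B0%D0%B4%D0%B0
        // Legacy single-byte cyrillic encoding, matching what IE 6 appears to send:
        System.out.println(URLEncoder.encode(word, "windows-1251"));
        // prints %CA%E0%ED%E0%E4%E0
    }
}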


Re: Cyrillic characters

2006-07-19 Thread Tricia Williams

Hi Yonik,

   I was incorrect to describe it as _solr encoding_.  Hoss suggested that 
it might be a form error - I haven't checked this yet but it sounds 
plausible.  What I called the _solr url encoding_ was the q= parameter 
translated into numeric character reference (&#...;) encoding in the url.  
As I mention in my ps, this translated value is not the same as when I use 
IE to post the same form values.


   You mentioned in another earlier post that q=h%c3%e9 would find 
matching hits.  My experience shows that while the UTF-8 encoded query 
doesn't generate any exceptions, no results are matched.  However 
q=h%e9llo would find matching results (the result set I'd match in Luke). 
So assuming that I can fix the form encoding errors so that the characters 
are encoded as UTF-8, I believe that I would continue to return incorrect 
results.  Will cyrillic characters be treated any differently than the 
diacritic in your example?


   I have solr running in tomcat 5.5.17.

Thanks for all your help,
Tricia


On Tue, 18 Jul 2006, Yonik Seeley wrote:


On 7/18/06, Tricia Williams <[EMAIL PROTECTED]> wrote:

 My sample query is: Канада (the english word _canada_
translated into russian) or
%D0%9A%D0%B0%D0%BD%D0%B0%D0%B4%D0%B0 (utf-8) or
%26%231050%3B%26%231072%3B%26%231085%3B%26%231072%3B%26%231076%3B%26%231072%3B
(solr url encoding)


Hi Tricia,
Could you clarify what you mean by "solr url encoding"?  Where do you see 
this?

The servlet container decodes URLs, and I'm not sure where in Solr
that URLs are encoded.

-Yonik



add/update index

2006-07-27 Thread Tricia Williams

Hi,

   I have created a process which uses xsl to convert my data to the form 
indicated in the examples so that it can be added to the index as the solr 
tutorial indicates:

<add>
  <doc>
    <field name="...">value</field>
    ...
  </doc>
</add>

   In some cases the xsl process will create a field element with no data 
(i.e. an empty element like <field name="..."/>).  Is this considered bad 
input that will not be accepted?  Or is this something that solr should 
deal with?  Currently for each field element with no data I receive the 
message:

java.lang.NullPointerException
 at org.apache.solr.update.DocumentBuilder.addField(DocumentBuilder.java:78)
 at org.apache.solr.update.DocumentBuilder.addField(DocumentBuilder.java:74)
 at org.apache.solr.core.SolrCore.readDoc(SolrCore.java:917)
 at org.apache.solr.core.SolrCore.update(SolrCore.java:685)
 at org.apache.solr.servlet.SolrUpdateServlet.doPost(SolrUpdateServlet.java:52)
 ...


   Just curious if the gurus out there think I should deal with the null 
values in my xsl process or if this can be dealt with in solr itself?


Thanks,
Tricia

ps.  Thanks for the timely fix for the UTF-8 issue!


Re: add/update index

2006-07-27 Thread Tricia Williams

Thanks Yonik,

   That's exactly what I needed to know.  I'll adapt my xsl process to 
omit null values.


Tricia

On Thu, 27 Jul 2006, Yonik Seeley wrote:


On 7/27/06, Tricia Williams <[EMAIL PROTECTED]> wrote:

Hi,

I have created a process which uses xsl to convert my data to the form
indicated in the examples so that it can be added to the index as the solr
tutorial indicates:

<add>
  <doc>
    <field name="...">value</field>
    ...
  </doc>
</add>

In some cases the xsl process will create a field element with no data
(i.e. an empty element like <field name="..."/>).  Is this considered bad
input that will not be accepted?


If the desired semantics are "the field doesn't exist" or "null value"
then yes.  There isn't a way to represent a field without a value in
Lucene except to not add the field for that document.  If it's totally
ignored, it probably shouldn't be in the XML.

Now, one might think we could drop fields with no value, but that's
problematic because it goes against the XML standard:

http://www.w3.org/TR/REC-xml/#sec-starttags
[Definition: An element with no content is said to be empty.] The
representation of an empty element is either a start-tag immediately
followed by an end-tag, or an empty-element tag. [Definition: An
empty-element tag takes a special form:]

So <field name="..."></field> and <field name="..."/> are supposed to be 
equivalent.  Given that, it does look like Solr should treat 
<field name="..."/> like a zero-length string (but that's not what you 
wanted, right?)

-Yonik



Re: Indexing UTF-8

2006-08-10 Thread Tricia Williams
I no longer remember when or where this came up, but when using Tomcat 
there is a known character encoding problem when you expect utf-8.  In 
Tomcat's $TOMCAT_HOME/conf/server.xml, for the port you're running Solr on, 
ensure URIEncoding="UTF-8" is set on the connector element, e.g.:

<Connector port="8080" ... URIEncoding="UTF-8" />

This has solved some of my encoding problems.

Hope this helps,
Tricia

On Thu, 10 Aug 2006, Andrew May wrote:


Hi,

I'm trying to index some UTF-8 data, but I'm experiencing some problems.

I'm using the 28th July nightly build, which I believe contains all the 
recent fixes for making the administration webapp use UTF-8. I've tried 
running in both the provided Jetty instance and Tomcat 5.5.17.


I've indexed both using the post.sh script (i.e. curl) and HttpClient both 
with the same results.


I'm specifically concentrating on one author name that has been causing 
problems:

Ayyıldız, Turhan
(I'm encoding this email as UTF-8 in the hope that comes through OK)

What I'm seeing coming back from Solr is:
AyyÄ±ldÄ±z, Turhan
The Turkish undotted lowercase i character (U+0131) is instead appearing as a 
latin capital A with diaeresis (U+00C4) followed by a plus-minus sign (U+00B1).


Using Luke to look at the index directly the field appears as:
AyyÄ±ldÄ±z, Turhan
Which, assuming Luke is displaying this correctly (Ä± is ı), means 
something happened in the posting of the data or the indexing.


I'm completely out of my depth when it comes to character encodings, so I 
don't know whether I'm doing something stupid, mis-configuring something, or 
whether this is a genuine problem not of my own making.


Any thoughts?

Thanks,

Andrew


Re: possible FAQ - lucene interop

2007-01-17 Thread Tricia Williams

Hi Michael,

   What Solr is really doing is building a Lucene index.  In most cases a 
Java developer should be able to access the index that Solr built through 
the IndexReader/IndexSearcher classes, pointed at the location of that 
index.  See the Lucene API for details on these and other classes. 
The default index location is solr/data/index, relative to where you 
start the servlet container which is running Solr.
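
As a rough sketch of what that looks like with the Lucene 2.x API (the index 
path, field names, and query term below are placeholders, not anything your 
particular index is guaranteed to contain):

import org.apache.lucene.document.Document;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;

// Opens the index Solr built and runs a simple term query against it.
public class SearchSolrIndex {
    public static void main(String[] args) throws Exception {
        IndexSearcher searcher = new IndexSearcher("solr/data/index"); // assumed path
        TopDocs results = searcher.search(
                new TermQuery(new Term("title", "lucene")), null, 10);
        for (int i = 0; i < results.scoreDocs.length; i++) {
            Document doc = searcher.doc(results.scoreDocs[i].doc);
            System.out.println(results.scoreDocs[i].score + "  " + doc.get("id"));
        }
        searcher.close();
    }
}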


Hope you find that helpful,
Tricia


On Wed, 17 Jan 2007, Michael Kimsal wrote:


Hello all:

We've got one java-based project at work using lucene.  I'm looking to use
solr as a search system for some other projects at work.  Once data is
indexed in solr, can we get at it using standard lucene libraries?  I know
how I want to use solr, but if the java devs need to get at the data as
well, I'd rather that 1) they be able to use their existing tech and skills
and 2) I not have to reindex everything in lucene-only indexes.

I've read the FAQs and some of the mailing list and couldn't find this
question addressed.

Thanks.

--
Michael Kimsal
http://webdevradio.com