PatternTokenizer failure

2011-11-28 Thread Jay Luker
Hi all,

I'm trying to use PatternTokenizer and not getting expected results.
Not sure where the failure lies. What I'm trying to do is split my
input on whitespace, except in cases where the whitespace is preceded
by a hyphen character. To do this I'm using a negative lookbehind
assertion in the pattern, i.e. "(?<!-)\s+". Here is what I expect:

"foo bar" -> ["foo","bar"] - OK
"foo \n bar" -> ["foo","bar"] - OK
"foo- bar" -> ["foo- bar"] - OK
"foo-\nbar" -> ["foo-\nbar"] - OK
"foo- \n bar" -> ["foo- \n bar"] - FAILS

Here's a test case that demonstrates the failure:

public void testPattern() throws Exception {
    Map<String, String> args = new HashMap<String, String>();
    args.put( PatternTokenizerFactory.GROUP, "-1" );
    args.put( PatternTokenizerFactory.PATTERN, "(?<!-)\\s+" );
    ...
}

The assertion fails with "expected:<...> but was:<...>".
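
As a sanity check of the pattern itself, outside of Solr, here is a
minimal plain java.util.regex sketch (the class name is just
illustrative). Note that in the mixed run " \n " only the first
whitespace character is directly preceded by the hyphen, so a split
match can still start at the newline, which may be what the tokenizer
is doing:

import java.util.Arrays;
import java.util.regex.Pattern;

public class LookbehindSplitCheck {
    public static void main(String[] args) {
        // Split on whitespace runs that do not start right after a hyphen.
        Pattern p = Pattern.compile("(?<!-)\\s+");
        String[] inputs = { "foo bar", "foo \n bar", "foo- bar", "foo-\nbar", "foo- \n bar" };
        for (String s : inputs) {
            // "foo- \n bar" prints [foo- , bar]: the lookbehind only blocks a
            // match at the space immediately after the hyphen, not at the '\n'.
            System.out.println(Arrays.toString(p.split(s)));
        }
    }
}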

Am I doing something wrong? Incorrect expectations? Or could this be a bug?

Thanks,
--jay


Re: InvalidTokenOffsetsException when using MappingCharFilterFactory, DictionaryCompoundWordTokenFilterFactory and Highlighting

2011-11-30 Thread Jay Luker
I am having a similar issue with OffsetExceptions during highlighting.
In all of the explanations and bug reports I'm reading there is a
mention that this is the result of a problem with HTMLStripCharFilter.
But my analysis chains don't (that I'm aware of) make use of
HTMLStripCharFilter, so can someone explain what else might be going
on? Or is it acknowledged that the bug may exist elsewhere?

Thanks,
--jay

On Fri, Nov 11, 2011 at 4:37 AM, Vadim Kisselmann
 wrote:
> Hi Edwin, Chris
>
> it's an old bug. I have big problems too with OffsetExceptions when I use
> highlighting or Carrot.
> It looks like a problem with HTMLStripCharFilter.
> The patch doesn't work.
>
> https://issues.apache.org/jira/browse/LUCENE-2208
>
> Regards
> Vadim
>
>
>
> 2011/11/11 Edwin Steiner 
>
>> I just entered a bug: https://issues.apache.org/jira/browse/SOLR-2891
>>
>> Thanks & regards, Edwin
>>
>> On Nov 7, 2011, at 8:47 PM, Chris Hostetter wrote:
>>
>> >
>> > : finally I want to use Solr highlighting. But there seems to be a
>> problem
>> > : if I combine the char filter and the compound word filter in
>> combination
>> > : with highlighting (an
>> > : org.apache.lucene.search.highlight.InvalidTokenOffsetsException is
>> > : raised).
>> >
>> > Definitely sounds like a bug somewhere in dealing with the offsets.
>> >
>> > can you please file a Jira, and include all of the data you have provided
>> > here?  it would also be helpful to know what the analysis tool says about
>> > the various attributes of your tokens at each stage of the analysis?
>> >
>> > : SEVERE: org.apache.solr.common.SolrException:
>> org.apache.lucene.search.highlight.InvalidTokenOffsetsException: Token fall
>> exceeds length of provided text sized 12
>> > :     at
>> org.apache.solr.highlight.DefaultSolrHighlighter.doHighlightingByHighlighter(DefaultSolrHighlighter.java:469)
>> > :     at
>> org.apache.solr.highlight.DefaultSolrHighlighter.doHighlighting(DefaultSolrHighlighter.java:378)
>> > :     at
>> org.apache.solr.handler.component.HighlightComponent.process(HighlightComponent.java:116)
>> > :     at
>> org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:194)
>> > :     at
>> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
>> > :     at org.apache.solr.core.SolrCore.execute(SolrCore.java:1360)
>> > :     at
>> org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:356)
>> > :     at
>> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:252)
>> > :     at
>> org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:243)
>> > :     at
>> org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:210)
>> > :     at
>> org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:224)
>> > :     at
>> org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:175)
>> > :     at
>> org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:462)
>> > :     at
>> org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:164)
>> > :     at
>> org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:100)
>> > :     at
>> org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:851)
>> > :     at
>> org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:118)
>> > :     at
>> org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:405)
>> > :     at
>> org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:278)
>> > :     at
>> org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:515)
>> > :     at
>> org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:302)
>> > :     at
>> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>> > :     at
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>> > :     at java.lang.Thread.run(Thread.java:680)
>> > : Caused by:
>> org.apache.lucene.search.highlight.InvalidTokenOffsetsException: Token fall
>> exceeds length of provided text sized 12
>> > :     at
>> org.apache.lucene.search.highlight.Highlighter.getBestTextFragments(Highlighter.java:228)
>> > :     at
>> org.apache.solr.highlight.DefaultSolrHighlighter.doHighlightingByHighlighter(DefaultSolrHighlighter.java:462)
>> > :     ... 23 more
>> >
>> >
>> > -Hoss
>>
>>
>


RegexQuery performance

2011-12-08 Thread Jay Luker
Hi,

I am trying to provide a means to search our corpus of nearly 2
million fulltext astronomy and physics articles using regular
expressions. A small percentage of our users need to be able to
locate, for example, certain types of identifiers that are present
within the fulltext (grant numbers, dataset identifiers, etc.).

My straightforward attempts to do this using RegexQuery have been
successful only in the sense that I get the results I'm looking for.
The performance, however, is pretty terrible, with most queries taking
five minutes or longer. Is this the performance I should expect
considering the size of my index and the massive number of terms? Are
there any alternative approaches I could try?

Things I've already tried:
  * reducing the sheer number of terms by adding a LengthFilter,
min=6, to my index analysis chain
  * swapping in the JakartaRegexpCapabilities

Things I intend to try if no one has any better suggestions:
  * chunk up the index and search concurrently, either by sharding or
using a RangeQuery based on document id

Any suggestions appreciated.

Thanks,
--jay


Re: RegexQuery performance

2011-12-10 Thread Jay Luker
Hi Erick,

On Fri, Dec 9, 2011 at 12:37 PM, Erick Erickson  wrote:
> Could you show us some examples of the kinds of things
> you're using regex for? I.e. the raw text and the regex you
> use to match the example?

Sure!

An example identifier would be "IRAS-A-FPA-3-RDR-IMPS-V6.0", which
identifies a particular Planetary Data System data set. Another
example is "ULY-J-GWE-8-NULL-RESULTS-V1.0". These kind of strings
frequently appear in the references section of the articles, so the
context looks something like,

" ... rvey. IRAS-A-FPA-3-RDR-IMPS-V6.0, NASA Planetary Data System
Tholen, D. J. 1989, in Asteroids II, ed ... "

The simple & straightforward regex I've been using is
/[A-Z0-9:\-]+V\d+\.\d+/. There may be a smarter regex approach but I
haven't put my mind to it because I assumed the primary performance
issue was elsewhere.
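
For concreteness, here is a small sketch (plain java.util.regex,
nothing Solr-specific) of that pattern run against the reference-text
snippet above, just to show what the expression is meant to pull out:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class IdentifierRegexDemo {
    public static void main(String[] args) {
        Pattern id = Pattern.compile("[A-Z0-9:\\-]+V\\d+\\.\\d+");
        String text = " ... rvey. IRAS-A-FPA-3-RDR-IMPS-V6.0, NASA Planetary Data System"
                + " Tholen, D. J. 1989, in Asteroids II, ed ... ";
        Matcher m = id.matcher(text);
        while (m.find()) {
            System.out.println(m.group()); // prints IRAS-A-FPA-3-RDR-IMPS-V6.0
        }
    }
}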

> The reason I ask is that perhaps there are other approaches,
> especially thinking about some clever analyzing at index time.
>
> For instance, perhaps NGrams are an option. Perhaps
> just making WordDelimiterFilterFactory do its tricks. Perhaps.

WordDelimiter does help in the sense that if you search for a specific
identifier you will usually find fairly accurate results, even for
cases where the hyphens resulted in the term being broken up. But I'm
not sure how WordDelimiter can help if I want to search for a pattern.

I tried a few tweaks to the index, like putting a minimum character
count on terms, making sure WordDelimiter's preserveOriginal is
turned on, and indexing without lowercasing so that I don't have to use
Pattern.CASE_INSENSITIVE. Performance was not improved significantly.

The new RegexpQuery mentioned by R. Muir looks promising, but I
haven't built an instance of trunk yet to try it out. Any other
suggestions appreciated.
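
For reference, a rough sketch of what that might look like with trunk's
RegexpQuery (untested on my end; Lucene's regexp syntax is
automaton-based rather than java.util.regex, so the digit classes are
spelled out and the escaping may need adjusting):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.RegexpQuery;

public class RegexpQuerySketch {
    public static Query identifierQuery(String field) {
        // Roughly the same expression as above, e.g. it should match
        // identifiers like IRAS-A-FPA-3-RDR-IMPS-V6.0.
        return new RegexpQuery(new Term(field, "[A-Z0-9:\\-]+V[0-9]+\\.[0-9]+"));
    }
}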

Thanks!
--jay


> In other words, this could be an "XY problem"
>
> Best
> Erick
>
> On Thu, Dec 8, 2011 at 11:14 AM, Robert Muir  wrote:
>> On Thu, Dec 8, 2011 at 11:01 AM, Jay Luker  wrote:
>>> Hi,
>>>
>>> I am trying to provide a means to search our corpus of nearly 2
>>> million fulltext astronomy and physics articles using regular
>>> expressions. A small percentage of our users need to be able to
>>> locate, for example, certain types of identifiers that are present
>>> within the fulltext (grant numbers, dataset identifiers, etc.).
>>>
>>> My straightforward attempts to do this using RegexQuery have been
>>> successful only in the sense that I get the results I'm looking for.
>>> The performance, however, is pretty terrible, with most queries taking
>>> five minutes or longer. Is this the performance I should expect
>>> considering the size of my index and the massive number of terms? Are
>>> there any alternative approaches I could try?
>>>
>>> Things I've already tried:
>>>  * reducing the sheer number of terms by adding a LengthFilter,
>>> min=6, to my index analysis chain
>>>  * swapping in the JakartaRegexpCapabilities
>>>
>>> Things I intend to try if no one has any better suggestions:
>>>  * chunk up the index and search concurrently, either by sharding or
>>> using a RangeQuery based on document id
>>>
>>> Any suggestions appreciated.
>>>
>>
>> This RegexQuery is not really scalable in my opinion: it's always
>> linear in the number of terms, except in super-rare circumstances where
>> it can compute a "common prefix" (and slow to boot).
>>
>> You can try svn trunk's RegexpQuery <-- don't forget the "p" -- from
>> Lucene core instead (it works from the query parser: /[ab]foo/,
>> myfield:/bar/, etc.)
>>
>> The performance is faster, but keep in mind it's only as good as the
>> regular expressions: if the regular expression is something like
>> /.*foo.*/, then it's just as slow as a wildcard of *foo*.
>>
>> --
>> lucidimagination.com


Re: RegexQuery performance

2011-12-12 Thread Jay Luker
On Sat, Dec 10, 2011 at 9:25 PM, Erick Erickson  wrote:
> My off-the-top-of-my-head notion is you implement a
> Filter whose job is to emit some "special" tokens when
> you find strings like this that allow you to search without
> regexes. For instance, in the example you give, you could
> index something like...oh... I don't know, ###VER### as
> well as the "normal" text of "IRAS-A-FPA-3-RDR-IMPS-V6.0".
> Now, when searching for docs with the pattern you used
> as an example, you look for ###VER### instead. I guess
> it all depends on how many regexes you need to allow.
> This wouldn't work at all if you allow users to put in arbitrary
> regexes, but if you have a small enough number of patterns
> you'll allow, something like this could work.

This is a great suggestion. I think the number of users that need this
feature, as well as the variety of regexes that would be used, is small
enough that it could definitely work. It turns it into a problem of
collecting the necessary regexes, plus the UI details.
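
A rough sketch of what such a filter might look like, assuming the
versioned-identifier pattern from earlier in the thread (the class name
and marker token are illustrative, and this is untested):

import java.io.IOException;
import java.util.regex.Pattern;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.util.AttributeSource;

public final class IdentifierMarkerFilter extends TokenFilter {
    private static final Pattern VERSIONED_ID =
        Pattern.compile("[A-Z0-9:\\-]+V\\d+\\.\\d+");

    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
    private final PositionIncrementAttribute posIncAtt =
        addAttribute(PositionIncrementAttribute.class);
    private AttributeSource.State pending; // saved state for the marker token

    public IdentifierMarkerFilter(TokenStream input) {
        super(input);
    }

    @Override
    public boolean incrementToken() throws IOException {
        if (pending != null) {
            // Emit ###VER### at the same position as the matching term.
            restoreState(pending);
            pending = null;
            termAtt.setEmpty().append("###VER###");
            posIncAtt.setPositionIncrement(0);
            return true;
        }
        if (!input.incrementToken()) {
            return false;
        }
        if (VERSIONED_ID.matcher(termAtt).matches()) {
            pending = captureState();
        }
        return true;
    }

    @Override
    public void reset() throws IOException {
        super.reset();
        pending = null;
    }
}

Searches for the marker token (###VER###) would then find any document
containing a versioned identifier without needing a regex query at all.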

Thanks!
--jay


NumericRangeQuery: what am I doing wrong?

2011-12-14 Thread Jay Luker
I can't get NumericRangeQuery or TermQuery to work on my integer "id"
field. I feel like I must be missing something obvious.

I have a test index that has only two documents, id:9076628 and
id:8003001. The id field uses Solr's stock "int" type (a trie int
field with precisionStep="0").
A MatchAllDocsQuery will return the 2 documents, but any queries I try
on the id field return no results. For instance,

public void testIdRange() throws IOException {
    Query q = NumericRangeQuery.newIntRange("id", 1, 1000, true, true);
    System.out.println("query: " + q);
    assertEquals(2, searcher.search(q, 5).totalHits);
}

public void testIdSearch() throws IOException {
    Query q = new TermQuery(new Term("id", "9076628"));
    System.out.println("query: " + q);
    assertEquals(1, searcher.search(q, 5).totalHits);
}

Both tests fail with totalHits being 0. This is using solr/lucene
trunk, but I tried also with 3.2 and got the same results.

What could I be doing wrong here?

Thanks,
--jay


Re: NumericRangeQuery: what am I doing wrong?

2011-12-14 Thread Jay Luker
On Wed, Dec 14, 2011 at 2:04 PM, Erick Erickson  wrote:
> Hmmm, seems like it should work, but there are two things you might try:
> 1> just execute the query in Solr: id:[1 TO 100]. Does that work?

Yep, that works fine.

> 2> I'm really grasping at straws here, but it's *possible* that you
>     need to use the same precisionstep as tint (8?)? There's a
>     constructor that takes precisionStep as a parameter, but the
>     default is 4 in the 3.x code.

Ah-ha, that was it. I did not notice the alternate constructor. The
field was originally indexed with Solr's default "int" type, which has
precisionStep="0" (i.e., don't index at different precision levels).
The equivalent value for the NumericRangeQuery constructor is 32. This
isn't exactly intuitive, but I was able to figure it out with a careful
reading of the javadoc.
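
For the record, a sketch of the corrected test (the key change is
passing precisionStep=32 explicitly, which means one indexed term per
int value, matching a field indexed with Solr's precisionStep="0" int
type; the range here is just widened to cover the two test ids):

public void testIdRange() throws IOException {
    Query q = NumericRangeQuery.newIntRange("id", 32, 1, 10000000, true, true);
    assertEquals(2, searcher.search(q, 5).totalHits);
}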

Thanks!
--jay


Re: NumericRangeQuery: what am I doing wrong?

2011-12-15 Thread Jay Luker
On Wed, Dec 14, 2011 at 5:02 PM, Chris Hostetter
 wrote:
>
> I'm a little lost in this thread ... if you are programmatically
> constructing a NumericRangeQuery object to execute in the JVM against a
> Solr index, that suggests you are writing some sort of Solr plugin (or
> embedding Solr in some way)

It's not you; it's me. I'm just doing weird things, partly, I'm sure,
due to ignorance, but sometimes out of expediency. I was experimenting
with ways to do a NumericRangeFilter, and the tests I was trying used
the Lucene API to query a Solr index, so I didn't have access
to the IndexSchema. Also my question might have been better directed
at the lucene-general list to avoid confusion.

Thanks,
--jay


Re: Autocommit not happening

2010-07-23 Thread Jay Luker
For the sake of any future googlers I'll report my own clueless but
thankfully brief struggle with autocommit.

There are two parts to the story: Part One is where I realized my
<autoCommit> config was not contained within my <updateHandler>. In
Part Two I realized I had simply mistyped one of the element names
inside the <autoCommit> block.

--jay

On Fri, Jul 23, 2010 at 2:35 PM, John DeRosa  wrote:
> On Jul 23, 2010, at 9:37 AM, John DeRosa wrote:
>
>> Hi! I'm a Solr newbie, and I don't understand why autocommits aren't 
>> happening in my Solr installation.
>>
>
> [snip]
>
> "Never mind"... I have discovered my boneheaded mistake. It's so silly, I 
> wish I could retract my question from the archives.
>
>


documentCache clarification

2010-10-27 Thread Jay Luker
Hi all,

The solr wiki says this about the documentCache: "The more fields you
store in your documents, the higher the memory usage of this cache
will be."

OK, but if i have enableLazyFieldLoading set to true and in my request
parameters specify "fl=id", then the number of fields per document
shouldn't affect the memory usage of the document cache, right?

Thanks,
--jay


Re: documentCache clarification

2010-10-27 Thread Jay Luker
(btw, I'm running 1.4.1)

It looks like my assumption was wrong. Regardless of the fields
selected using the "fl" parameter and the enableLazyFieldLoading
setting, Solr apparently fetches from disk and caches all the fields
in the document (or maybe just those that are stored="true" in my
schema). My evidence for this is the documentCache stats reported by
solr/admin. If I request "rows=10&fl=id" followed by
"rows=10&fl=id,title" I would expect to see the 2nd request result in
a 2nd insert to the cache, but instead I see that the 2nd request hits
the cache from the 1st request. "rows=10&fl=*" does the same thing.
i.e., the first request, even though I have
enableLazyFieldLoading=true and I'm only asking for the ids, fetches
the entire document from disk and inserts into the documentCache.
Subsequent requests, regardless of which fields I actually select,
don't hit the disk but are loaded from the documentCache. Is this
really the expected behavior and/or am I misunderstanding something?

A 2nd question: while watching these stats I noticed something else
weird with the queryResultCache. It seems that inserts to the
queryResultCache depend on the number of rows requested. For example,
an initial request (solr restarted, clean cache, etc) with rows=10
will result in a insert. A 2nd request of the same query with
rows=1000 will result in a cache hit. However if you reverse that
order, starting with a clean cache, an initial request for rows=1000
will *not* result in an insert to queryResultCache. I have tried
various increments--10, 100, 200, 500--and it seems the magic number
is somewhere between 200 (cache insert) and 500 (no insert). Can
someone explain this?

Thanks,
--jay

On Wed, Oct 27, 2010 at 10:54 AM, Markus Jelsma
 wrote:
> I was wondering about this too some time ago. I've found more information
> in SOLR-52 and some correspondence on this one, but it didn't give me a
> definitive answer:
>
> [1]: https://issues.apache.org/jira/browse/SOLR-52
> [2]: http://www.mail-archive.com/solr-...@lucene.apache.org/msg01185.html
>
> On Wednesday 27 October 2010 16:39:44 Jay Luker wrote:
>> Hi all,
>>
>> The solr wiki says this about the documentCache: "The more fields you
>> store in your documents, the higher the memory usage of this cache
>> will be."
>>
>> OK, but if i have enableLazyFieldLoading set to true and in my request
>> parameters specify "fl=id", then the number of fields per document
>> shouldn't affect the memory usage of the document cache, right?
>>
>> Thanks,
>> --jay
>
> --
> Markus Jelsma - CTO - Openindex
> http://www.linkedin.com/in/markus17
> 050-8536600 / 06-50258350
>


Re: documentCache clarification

2010-10-28 Thread Jay Luker
On Wed, Oct 27, 2010 at 9:13 PM, Chris Hostetter
 wrote:
>
> : schema.) My evidence for this is the documentCache stats reported by
> : solr/admin. If I request "rows=10&fl=id" followed by
> : "rows=10&fl=id,title" I would expect to see the 2nd request result in
> : a 2nd insert to the cache, but instead I see that the 2nd request hits
> : the cache from the 1st request. "rows=10&fl=*" does the same thing.
>
> your evidence is correct, but your interpretation is incorrect.
>
> the objects in the documentCache are Lucene Documents, which contain a
> List of Field references.  when enableLazyFieldLoading=true is set, a
> Document fetched from the IndexReader only contains the Fields
> specified in the fl, and all other Fields are marked
> as "LOAD_LAZY".
>
> When there is a cache hit on that uniqueKey at a later date, the Fields
> already loaded are used directly if requested, but the Fields marked
> LOAD_LAZY are (you guessed it) lazy loaded from the IndexReader and then
> the Document updates the reference to the newly actualized fields (which
> are no longer marked LOAD_LAZY)
>
> So with different "fl" params, the same Document Object is continually
> used, but the Fields in that Document grow as the fields requested (using
> the "fl" param) change.

Great stuff. Makes sense. Thanks for the clarification, and if no one
objects I'll update the wiki with some of this info.

I'm still not clear on this statement from the wiki's description of
the documentCache: "(Note: This cache cannot be used as a source for
autowarming because document IDs will change when anything in the
index changes so they can't be used by a new searcher.)"

Can anyone elaborate a bit on that? I think I've read it at least 10
times and I'm still unable to draw a mental picture. I'm wondering if
the document IDs referred to are the ones I'm defining in my schema,
or are they the underlying Lucene ids, i.e. the ones that, according
to the Lucene in Action book, are "relative within each segment"?


> : will *not* result in an insert to queryResultCache. I have tried
> : various increments--10, 100, 200, 500--and it seems the magic number
> : is somewhere between 200 (cache insert) and 500 (no insert). Can
> : someone explain this?
>
> In addition to the <queryResultMaxDocsCached> config option already
> mentioned (which controls whether a DocList is cached based on its size)
> there is also the <queryResultWindowSize> config option which may confuse
> your cache observations.  if the window size is "50" and you ask for
> start=0&rows=10 what actually gets cached is "0-50" (assuming there are
> more than 50 results) so a subsequent request for start=10&rows=10 will be
> a cache hit.

Just so I'm clear, does the queryResultCache operate in a similar
manner to the documentCache as to what is actually cached? In other
words, is it the caching of the docList object that is reported in the
cache statistics hits/inserts numbers? And that object would get
updated with a new set of ordered doc ids on subsequent, larger
requests. (I'm flailing a bit to articulate the question, I know). For
example, if my queryResultMaxDocsCached is set to 200 and I issue a
request with rows=500, then I won't get a docList object entry in the
queryResultCache. However, if I issue a request with rows=10, I will
get an insert, and then a later request for rows=500 would re-use and
update that original cached docList object. Right? And would it be
updated with the full list of 500 ordered doc ids or only 200?

Thanks,
--jay


Re: documentCache clarification

2010-10-29 Thread Jay Luker
On Thu, Oct 28, 2010 at 7:27 PM, Chris Hostetter
 wrote:

> The queryResultCache is keyed on <query, sort, filters> and the
> value is a "DocList" object ...
>
> http://lucene.apache.org/solr/api/org/apache/solr/search/DocList.html
>
> Unlike the Document objects in the documentCache, the DocLists in the
> queryResultCache never get modified (technically Solr doesn't actually
> modify the Documents either, the Document just keeps track of its fields
> and updates itself as lazy-loaded fields are needed)
>
> if a DocList containing results 0-10 is put in the cache, it's not
> going to be of any use for a query with start=50.  but if it contains 0-50
> it *can* be used if start < 50 and rows < 50 -- that's where the
> queryResultWindowSize comes in.  if you use start=0&rows=10, but your
> window size is 50, SolrIndexSearcher will (under the covers) use
> start=0&rows=50 and put that in the cache, returning a "slice" from 0-10
> for your query.  the next query asking for 10-20 will be a cache hit.

This makes sense but still doesn't explain what I'm seeing in my cache
stats. When I issue a request with rows=10 the stats show an insert
into the queryResultCache. If I send the same query, this time with
rows=1000, I would not expect to see a cache hit but I do. So it seems
like there must be something useful in whatever gets cached on the
first request for rows=10 for it to be re-used by the request for
rows=1000.

--jay


Using jetty's GzipFilter in the example solr.war

2010-11-13 Thread Jay Luker
Hi,

I thought I'd try turning on gzip compression but I can't seem to get
Jetty's GzipFilter to actually compress my responses. I unpacked the
example solr.war and tried adding variations of the following to the
web.xml (and then re-jarred it), but as far as I can tell, Jetty isn't
actually compressing anything.


<filter>
  <filter-name>GZipFilter</filter-name>
  <display-name>Jetty's GZip Filter</display-name>
  <description>Filter that zips all the content on-the-fly</description>
  <filter-class>org.mortbay.servlet.GzipFilter</filter-class>
  <init-param>
    <param-name>mimeTypes</param-name>
    <param-value>*</param-value>
  </init-param>
</filter>

<filter-mapping>
  <filter-name>GZipFilter</filter-name>
  <url-pattern>*</url-pattern>
</filter-mapping>

I've also tried explicitly listing mime-types and assigning the
filter-mapping using <servlet-name>. I can see that the GzipFilter is
being loaded when I add -DDEBUG to the Jetty startup command. But as
far as I can tell from looking at the response headers nothing is
being gzipped. I'm expecting to see "Content-Encoding: gzip" in the
response headers.

Anyone successfully gotten this to work?

Thanks,
--jay


Re: Using jetty's GzipFilter in the example solr.war

2010-11-15 Thread Jay Luker
On Sun, Nov 14, 2010 at 12:49 AM, Kiwi de coder  wrote:
> try to put your filter at the top of web.xml (instead of the middle or
> bottom). I tried this for a few days and it was the only simple solution
> (not sure whether putting it at the top is required by the spec or is a bug).

Thank you.

An explanation of why this worked is probably better explored on the
Jetty list, but, for the record, it did.

--jay


Sending binary data as part of a query

2011-01-28 Thread Jay Luker
Hi all,

Here is what I am interested in doing: I would like to send a
compressed integer bitset as a query to Solr. The bitset integers
represent my document ids, and what I want to get back is the
facet data for those documents.

I have successfully created a QueryComponent class that, assuming it
has the integer bitset, can turn that into the necessary DocSetFilter
to pass to the searcher, get back the facets, etc. That part all works
right now because I'm using either canned or randomly generated
bitsets on the server side.

What I'm unsure how to do is actually send this compressed bitset from
a client to solr as part of the query. From what I can tell, the Solr
API classes that are involved in handling binary data as part of a
request assume that the data is a document to be added. For instance,
extending ContentStreamHandlerBase requires implementing some kind of
document loader and an UpdateRequestProcessorChain and a bunch of
other stuff that I don't really think I should need. Is there a
simpler way? Anyone tried or succeeded in doing anything similar to
this?

Thanks,
--jay


Re: Sending binary data as part of a query

2011-02-01 Thread Jay Luker
On Mon, Jan 31, 2011 at 9:22 PM, Chris Hostetter
 wrote:

> that class should probably have been named ContentStreamUpdateHandlerBase
> or something like that -- it tries to encapsulate the logic that most
> RequestHandlers using ContentStreams (for updating) need to worry about.
>
> Your QueryComponent (as used by SearchHandler) should be able to access
> the ContentStreams the same way that class does ... call
> req.getContentStreams().
>
> Sending a binary stream from a remote client depends on how the client is
> implemented -- you can do it via HTTP using the POST body (with or w/o
> multi-part mime) in any language you want. If you are using SolrJ you may
> again run into an assumption that using ContentStreams means you are doing
> an "Update" but that's just a vernacular thing ... something like a
> ContentStreamUpdateRequest should work just as well for a query (as long
> as you set the necessary params and/or request handler path)
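
In plain Java, sending the compressed bitset in the POST body as
suggested above might look something like this (a sketch only; the URL,
handler registration, and payload layout are placeholders, and the
server side has to read the stream with the same conventions):

import java.io.DataOutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.zip.DeflaterOutputStream;

public class BitsetPostSketch {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://localhost:8983/solr/select?qt=bitset&facet=true&facet.field=author");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type", "application/octet-stream");
        // zlib-compress a stream of 4-byte big-endian ints, one per document id.
        DataOutputStream out =
            new DataOutputStream(new DeflaterOutputStream(conn.getOutputStream()));
        int[] docIds = { 42, 1138, 90210 }; // illustrative ids
        for (int id : docIds) {
            out.writeInt(id);
        }
        out.close();
        System.out.println("HTTP " + conn.getResponseCode());
    }
}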

Thanks for the help. I was just about to reply to my own question for
the benefit of future googlers when I noticed your response. :)

I actually got this working, much the way you suggest. The client is
Python. I created a gist with the script I used for testing [1].

On the Solr side my QueryComponent grabs the stream, uses
jzlib.ZInputStream to inflate (decompress) the data, then translates the
incoming integers in the bitset (my Solr schema.xml integer ids) to the
Lucene ids and creates a docSetFilter with them.
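
A rough sketch of that server-side piece (illustrative only -- the
class and the payload layout are not from the gist, and it uses
java.util.zip instead of jzlib):

import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.InflaterInputStream;

import org.apache.solr.common.util.ContentStream;
import org.apache.solr.handler.component.QueryComponent;
import org.apache.solr.handler.component.ResponseBuilder;

public class BitsetQueryComponent extends QueryComponent {
    @Override
    public void prepare(ResponseBuilder rb) throws IOException {
        Iterable<ContentStream> streams = rb.req.getContentStreams();
        if (streams != null) {
            List<Integer> ids = new ArrayList<Integer>();
            for (ContentStream cs : streams) {
                DataInputStream in =
                    new DataInputStream(new InflaterInputStream(cs.getStream()));
                try {
                    while (true) {
                        ids.add(in.readInt()); // 4-byte big-endian ints
                    }
                } catch (EOFException endOfStream) {
                    // done reading this stream
                } finally {
                    in.close();
                }
            }
            // ...translate the schema ids to Lucene doc ids and build the
            // DocSet filter here...
        }
        super.prepare(rb);
    }
}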

Very relieved to get this working as it's the basis of a talk I'm
giving next week [2]. :-)

--jay

[1] https://gist.github.com/806397
[2] http://code4lib.org/conference/2011/luker


Help with parsing configuration using SolrParams/NamedList

2011-02-16 Thread Jay Luker
Hi,

I'm trying to use a CustomSimilarityFactory and pass in per-field
options from the schema.xml, like so:

<similarity class="...">
  <lst name="field_a">
    <int name="min">500</int>
    <int name="max">1</int>
    <float name="steepness">0.5</float>
  </lst>
  <lst name="field_b">
    <int name="min">500</int>
    <int name="max">2</int>
    <float name="steepness">0.5</float>
  </lst>
</similarity>

My problem is I am utterly failing to figure out how to parse this
nested option structure within my CustomSimilarityFactory class. I
know that the settings are available as a SolrParams object within the
getSimilarity() method. I'm convinced I need to convert to a NamedList
using params.toNamedList(), but my java fu is too feeble to code the
dang thing. The closest I seem to get is the top level as a NamedList
where the keys are "field_a" and "field_b", but then my values are
strings, e.g., "{min=500,max=1,steepness=0.5}".

Anyone who could dash off a quick example of how to do this?

Thanks,
--jay


Highlight snippets for a set of known documents

2011-03-31 Thread Jay Luker
Hi all,

I'm trying to get highlight snippets for a set of known documents and
I must be doing something wrong because it's only sort of working.

Say my query is "foobar" and I already know that docs 1, 5 and 11 are
matches. Now I want to retrieve the highlight snippets for the term
"foobar" for docs 1, 5 and 11. What I assumed would work was something
like: "...&q=foobar&fq={!q.op=OR}id:1 id:5 id:11...". This returns
numfound=3 in the response, but I only get the highlight snippets for
document id:1. What am I doing wrong?

Thanks,
--jay


Re: Highlight snippets for a set of known documents

2011-04-01 Thread Jay Luker
It turns out the answer is I'm a moron; I had an unnoticed "&rows=1"
nestled in the querystring I was testing with.

Anyway, thanks for replying!

--jay

On Fri, Apr 1, 2011 at 4:25 AM, Stefan Matheis
 wrote:
> Jay,
>
> i'm not sure, but did you try it w/ brackets?
> q=foobar&fq={!q.op=OR}(id:1 id:5 id:11)
>
> Regards
> Stefan
>
> On Thu, Mar 31, 2011 at 6:40 PM, Jay Luker  wrote:
>> Hi all,
>>
>> I'm trying to get highlight snippets for a set of known documents and
>> I must be doing something wrong because it's only sort of working.
>>
>> Say my query is "foobar" and I already know that docs 1, 5 and 11 are
>> matches. Now I want to retrieve the highlight snippets for the term
>> "foobar" for docs 1, 5 and 11. What I assumed would work was something
>> like: "...&q=foobar&fq={!q.op=OR}id:1 id:5 id:11...". This returns
>> numfound=3 in the response, but I only get the highlight snippets for
>> document id:1. What am I doing wrong?
>>
>> Thanks,
>> --jay
>>
>


UIMA example setup w/o OpenCalais

2011-04-07 Thread Jay Luker
Hi,

I would like to experiment with the UIMA contrib package, but I have
issues with the OpenCalais service's ToS and would rather not use it.
Is there a way to adapt the UIMA example setup to use only the
AlchemyAPI service? I tried simply leaving out the OpenCalais api key
but I get exceptions thrown during indexing.

Thanks,
--jay


Re: UIMA example setup w/o OpenCalais

2011-04-08 Thread Jay Luker
Thank you, that worked.

For the record, my objection to the OpenCalais service is that their
ToS states that they will "retain a copy of the metadata submitted by
you", and that by submitting data to the service you "grant Thomson
Reuters a non-exclusive perpetual, sublicensable, royalty-free license
to that metadata." The AlchemyAPI service ToS states only that they
retain the *generated* metadata.

Just a warning to anyone else thinking of experimenting with Solr & UIMA.

--jay

On Fri, Apr 8, 2011 at 6:45 AM, Tommaso Teofili
 wrote:
> Hi Jay,
> you should be able to do so by simply removing the OpenCalaisAnnotator from
> the execution pipeline, i.e. by commenting out line 124 of the file:
> solr/contrib/uima/src/main/resources/org/apache/uima/desc/OverridingParamsExtServicesAE.xml
> Hope this helps,
> Tommaso
>
> 2011/4/7 Jay Luker 
>
>> Hi,
>>
>> I would like to experiment with the UIMA contrib package, but I have
>> issues with the OpenCalais service's ToS and would rather not use it.
>> Is there a way to adapt the UIMA example setup to use only the
>> AlchemyAPI service? I tried simply leaving out the OpenCalais api key
>> but I get exceptions thrown during indexing.
>>
>> Thanks,
>> --jay
>>
>


tika/pdfbox knobs & levers

2011-04-13 Thread Jay Luker
Hi all,

I'm wondering if there are any knobs or levers I can set in
solrconfig.xml that affect how PDFBox text extraction is performed by
the extraction handler. I would like to take advantage of PDFBox's
ability to normalize diacritics and ligatures [1], but that doesn't
seem to be the default behavior. Is there a way to enable this?

Thanks,
--jay

[1] 
http://pdfbox.apache.org/apidocs/index.html?org/apache/pdfbox/util/TextNormalize.html


Re: Text Only Extraction Using Solr and Tika

2011-05-05 Thread Jay Luker
Hi Emyr,

You could try using the "extractOnly=true" parameter [1]. Of course,
you'll need to repost the extracted text manually.

--jay

[1] http://wiki.apache.org/solr/ExtractingRequestHandler#Extract_Only


On Thu, May 5, 2011 at 9:36 AM, Emyr James  wrote:
> Hi All,
>
> I have solr and tika installed and am happily extracting and indexing
> various files.
> Unfortunately on some Word documents it blows up since it tries to
> auto-generate a 'title' field but my title field in the schema is
> single-valued.
>
> Here is my config for the extract handler...
>
> <requestHandler name="/update/extract"
>     class="org.apache.solr.handler.extraction.ExtractingRequestHandler">
>   <lst name="defaults">
>     <str name="uprefix">ignored_</str>
>   </lst>
> </requestHandler>
>
> Is there a config option to make it only extract text, or ideally to allow
> me to specify which metadata fields to accept?
>
> E.g. I'd like to use any author metadata it finds but to not use any title
> metadata it finds as I want title to be single valued and set explicitly
> using a literal.title in the post request.
>
> I did look around for some docs but all I can find are very basic examples.
> There's no comprehensive configuration documentation out there as far as I
> can tell.
>
>
> ALSO...
>
> I get some other bad responses coming back such as...
>
> Apache Tomcat/6.0.28 - Error report
> HTTP Status 500 - org.apache.poi.poifs.filesystem.DirectoryNode.iterator()Ljava/util/Iterator;
>
> java.lang.NoSuchMethodError:
> org.apache.poi.poifs.filesystem.DirectoryNode.iterator()Ljava/util/Iterator;
>    at
> org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:168)
>    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
>    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
>    at
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135)
>    at
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:148)
>    at
> org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:190)
>    at
> org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54)
>    at
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
>    at
> org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:233)
>    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
>    at
> org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
>    at
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
>    at
> org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
>    at
> org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
>    at
> org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
>    at
> org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
>    at
> org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
>    at
> org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
>    at
> org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
>    at
> org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:298)
>    at
> org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:857)
>    at
> org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:588)
>    at
> org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:489)
>    at java.lang.Thread.run(Thread.java:636)
> type Status
> reportmessage
> org.apache.poi.poifs.filesystem.DirectoryNode.iterator()Ljava/util/Iterator;
>
> For the above my url was...
>
>  http://localhost:8080/solr/update/extract?literal.id=3922&defaultField=content&fmap.content=content&uprefix=ignored_&stream.contentType=application%2Fvnd.ms-powerpoint&commit=true&literal.title=Reactor+cycle+141&literal.notes=&literal.tag=UCN_production&literal.author=Maurits+van+der+Grinten
>
> I guess there's something special I need to be able to process PowerPoint
> files? Maybe I need to get the latest Apache POI? Any suggestions
> welcome...
>
>
> Regards,
>
> Emyr
>


Re: Solr performance

2011-05-11 Thread Jay Luker
On Wed, May 11, 2011 at 7:07 AM, javaxmlsoapdev  wrote:
> I have some 25-odd fields with "stored=true" in schema.xml. Retrieving
> 5,000 records takes a few seconds. I also tried passing "fl" to include
> only one field in the response, but the response time is still the same.
> What are the things to look at to tune the performance?


Confirm that you have enableLazyFieldLoading set to true in
solrconfig.xml. I suspect it is since that's the default.

Is the request taking a few seconds the first time but returning
quickly on subsequent requests?

Also, may or may not be relevant, but you might find a few bits of
info in this thread enlightening:
http://lucene.472066.n3.nabble.com/documentCache-clarification-td1780504.html

--jay


Re: Document has fields with different update frequencies: how best to model

2011-06-10 Thread Jay Luker
Take a look at ExternalFileField [1]. It's meant for exactly what you
want to do here.

FYI, there is an issue with caching of the external values introduced
in v1.4 but, thankfully, resolved in v3.2 [2]

--jay

[1] 
http://lucene.apache.org/solr/api/org/apache/solr/schema/ExternalFileField.html
[2] https://issues.apache.org/jira/browse/SOLR-2536


On Fri, Jun 10, 2011 at 12:54 PM, lee carroll
 wrote:
> Hi,
> We have a document type which has fields which are pretty static. Say
> they change once every 6 months. But the same document has a field
> which changes hourly.
> What are the best approaches to index this document ?
>
> Eg
> Hotel ID (static) , Hotel Description (static and costly to get from a
> url etc), FromPrice (changes hourly)
>
> Option 1
> Index hourly as a single document and don't worry about the unneeded
> field updates
>
> Option 2
> Split into 2 document types and index independently. This would
> require the front end application to query multiple times?
> doc1
> ID,Description,DocType
> doc2
> ID,HotelID,Price,DocType
>
> application performs searches based on hotel attributes
> for each hotel match issue query to get price
>
>
> Any other options ? Can you query across documents ?
>
> We run 1.4.1, we could maybe update to 3.2 but I don't think I could
> swing to trunk for JOIN feature (if that indeed is JOIN's use case)
>
> Thanks in advance
>
> PS Am I just worrying about de-normalised data and should sort the
> source data out maybe by caching and get over it ...?
>
> cheers Lee c
>


Re: Document has fields with different update frequencies: how best to model

2011-06-11 Thread Jay Luker
You are correct that ExternalFileField values can only be used in
query functions (i.e. scoring, basically). Sorry for firing off that
answer without reading your use case more carefully.

I'd be inclined towards giving your Option #1 a try, but that's
without knowing much about the scale of your app, size of your index,
documents, etc. Unneeded field updates are only a problem if they're
causing performance problems, right? Otherwise, trying to avoid them
seems like premature optimization.

--jay

On Sat, Jun 11, 2011 at 5:26 AM, lee carroll
 wrote:
> Hi Jay
> I thought external file field could not be returned as a field but
> only used in scoring.
> Trunk has pseudo-fields which can take a function value, but we can't
> move to trunk.
>
> Also it's a more general question around schema design: what happens if
> you have several fields with different update frequencies? It does not
> seem external file field is the use case for this.
>
>
>
> On 10 June 2011 20:13, Jay Luker  wrote:
>> Take a look at ExternalFileField [1]. It's meant for exactly what you
>> want to do here.
>>
>> FYI, there is an issue with caching of the external values introduced
>> in v1.4 but, thankfully, resolved in v3.2 [2]
>>
>> --jay
>>
>> [1] 
>> http://lucene.apache.org/solr/api/org/apache/solr/schema/ExternalFileField.html
>> [2] https://issues.apache.org/jira/browse/SOLR-2536
>>
>>
>> On Fri, Jun 10, 2011 at 12:54 PM, lee carroll
>>  wrote:
>>> Hi,
>>> We have a document type which has fields which are pretty static. Say
>>> they change once every 6 month. But the same document has a field
>>> which changes hourly
>>> What are the best approaches to index this document ?
>>>
>>> Eg
>>> Hotel ID (static) , Hotel Description (static and costly to get from a
>>> url etc), FromPrice (changes hourly)
>>>
>>> Option 1
>>> Index hourly as a single document and don't worry about the unneeded
>>> field updates
>>>
>>> Option 2
>>> Split into 2 document types and index independently. This would
>>> require the front end application to query multiple times?
>>> doc1
>>> ID,Description,DocType
>>> doc2
>>> ID,HotelID,Price,DocType
>>>
>>> application performs searches based on hotel attributes
>>> for each hotel match issue query to get price
>>>
>>>
>>> Any other options ? Can you query across documents ?
>>>
>>> We run 1.4.1, we could maybe update to 3.2 but I don't think I could
>>> swing to trunk for JOIN feature (if that indeed is JOIN's use case)
>>>
>>> Thanks in advance
>>>
>>> PS Am I just worrying about de-normalised data and should sort the
>>> source data out maybe by caching and get over it ...?
>>>
>>> cheers Lee c
>>>
>>
>


WordDelimiterFilter preserveOriginal & position increment

2012-10-23 Thread Jay Luker
Hi,

I'm having an issue with the WDF preserveOriginal="1" setting and the
matching of a phrase query. Here's an example of the text that is
being indexed:

"...obtained with the Southern African Large Telescope,SALT..."

A lot of our text is extracted from PDFs, so this kind of formatting
junk is very common.

The phrase query that is failing is:

"Southern African Large Telescope"

From looking at the analysis debugger I can see that the WDF is
getting the term "Telescope,SALT" and correctly splitting on the
comma. The problem seems to be that the original term is given the 1st
position, e.g.:

Pos  Term
1  Southern
2  African
3  Large
4  Telescope,SALT  <-- original term
5  Telescope
6  SALT

Only by adding a phrase slop of "~1" do I get a match.

I realize that the WDF is behaving correctly in this case (or at least
I can't imagine a rational alternative). But I'm curious if anyone can
suggest a way to work around this issue that doesn't involve adding
phrase query slop.

Thanks,
--jay


Re: WordDelimiterFilter preserveOriginal & position increment

2012-10-23 Thread Jay Luker
Bah... While attempting to duplicate this on our 4.x instance I
realized I was mis-reading the analysis output. In the example I
mentioned it was actually a SynonymFilter in the analysis chain that
was affecting the term position (we have several synonyms for
"telescope").

Regardless, it seems to not be a problem in 4.x.

Thanks,
--jay

On Tue, Oct 23, 2012 at 10:45 AM, Shawn Heisey  wrote:
> On 10/23/2012 8:16 AM, Jay Luker wrote:
>>
>>  From looking at the analysis debugger I can see that the WDF is
>> getting the term "Telescope,SALT" and correctly splitting on the
>> comma. The problem seems to be that the original term is given the 1st
>> position, e.g.:
>>
>> Pos  Term
>> 1  Southern
>> 2  African
>> 3  Large
>> 4  Telescope,SALT  <-- original term
>> 5  Telescope
>> 6  SALT
>
>
> Jay, I have WDF with preserveOriginal turned on.  I get the following from
> WDF parsing in the analysis page on either 3.5 or 4.1-SNAPSHOT, and the
> analyzer shows that all four of the query words are found in consecutive
> fields.  On the new version, I had to slide a scrollbar to the right to see
> the last term.  Visually they were not in consecutive fields on the new
> version (they were on 3.5), but the position number says otherwise.
>
>
> 1  Southern
> 2  African
> 3  Large
> 4  Telescope,SALT
> 4  Telescope
> 5  SALT
> 5  TelescopeSALT
>
> My full WDF parameters:
> index: {preserveOriginal=1, splitOnCaseChange=1, generateNumberParts=1,
> catenateWords=1, splitOnNumerics=1, stemEnglishPossessive=1,
> luceneMatchVersion=LUCENE_35, generateWordParts=1, catenateAll=0,
> catenateNumbers=1}
> query: {preserveOriginal=1, splitOnCaseChange=1, generateNumberParts=1,
> catenateWords=0, splitOnNumerics=1, stemEnglishPossessive=1,
> luceneMatchVersion=LUCENE_35, generateWordParts=1, catenateAll=0,
> catenateNumbers=0}
>
> I understand from other messages on the mailing list that I should not have
> preserveOriginal on the query side, but I have not yet changed it.
>
> If your position numbers really are what you indicated, you may have found a
> bug.  I have not tried the released 4.0.0 version, I expect to deploy from
> the 4.x branch under development.
>
> Thanks,
> Shawn
>