PatternTokenizer failure
Hi all,

I'm trying to use PatternTokenizer and I'm not getting the expected results; I'm not sure where the failure lies. What I'm trying to do is split my input on whitespace, except in cases where the whitespace is preceded by a hyphen character. To do this I'm using a negative lookbehind assertion in the pattern, i.e. "(?<!-)\s+". These are the results I expect:

"foo bar" -> ["foo","bar"] - OK
"foo \n bar" -> ["foo","bar"] - OK
"foo- bar" -> ["foo- bar"] - OK
"foo-\nbar" -> ["foo-\nbar"] - OK
"foo- \n bar" -> ["foo- \n bar"] - FAILS

Here's a test case that demonstrates the failure:

public void testPattern() throws Exception {
    Map args = new HashMap();
    args.put( PatternTokenizerFactory.GROUP, "-1" );
    args.put( PatternTokenizerFactory.PATTERN, "(?<!-)\\s+" );
    ...
}

The assertion failure looks like: expected:<...> but was:<...>

Am I doing something wrong? Incorrect expectations? Or could this be a bug?

Thanks,
--jay
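For reference, the split behavior can be reproduced with plain java.util.regex outside of Solr. The sketch below assumes the pattern really is the negative lookbehind "(?<!-)\s+" described above; it shows that for "foo- \n bar" the regex itself produces ["foo- ", "bar"], because the lookbehind only guards the first whitespace character after the hyphen -- a match can still start at the '\n', whose preceding character is a space. If the tokenizer follows plain split semantics, the last expectation above may be the surprising part.

import java.util.Arrays;
import java.util.regex.Pattern;

public class LookbehindSplitDemo {
    public static void main(String[] args) {
        // split on whitespace not immediately preceded by '-'
        Pattern p = Pattern.compile("(?<!-)\\s+");
        String[] inputs = {"foo bar", "foo \n bar", "foo- bar", "foo-\nbar", "foo- \n bar"};
        for (String s : inputs) {
            System.out.println("\"" + s.replace("\n", "\\n") + "\" -> "
                    + Arrays.toString(p.split(s)));
        }
        // "foo- \n bar" prints [foo- , bar]: the run of whitespace is split at
        // the '\n' even though the run begins right after the hyphen.
    }
}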
Re: InvalidTokenOffsetsException when using MappingCharFilterFactory, DictionaryCompoundWordTokenFilterFactory and Highlighting
I am having a similar issue with OffsetExceptions during highlighting. All of the explanations and bug reports I'm reading mention that this is the result of a problem with HTMLStripCharFilter. But my analysis chains don't (that I'm aware of) make use of HTMLStripCharFilter, so can someone explain what else might be going on? Or is it acknowledged that the bug may exist elsewhere?

Thanks,
--jay

On Fri, Nov 11, 2011 at 4:37 AM, Vadim Kisselmann wrote:
> Hi Edwin, Chris
>
> it's an old bug. I have big problems too with OffsetExceptions when I use
> Highlighting, or Carrot.
> It looks like a problem with HTMLStripCharFilter.
> Patch doesn't work.
>
> https://issues.apache.org/jira/browse/LUCENE-2208
>
> Regards
> Vadim
>
> 2011/11/11 Edwin Steiner
>
>> I just entered a bug: https://issues.apache.org/jira/browse/SOLR-2891
>>
>> Thanks & regards, Edwin
>>
>> On Nov 7, 2011, at 8:47 PM, Chris Hostetter wrote:
>>
>> > : finally I want to use Solr highlighting. But there seems to be a problem
>> > : if I combine the char filter and the compound word filter in combination
>> > : with highlighting (an
>> > : org.apache.lucene.search.highlight.InvalidTokenOffsetsException is
>> > : raised).
>> >
>> > Definitely sounds like a bug somewhere in dealing with the offsets.
>> >
>> > can you please file a Jira, and include all of the data you have provided
>> > here? it would also be helpful to know what the analysis tool says about
>> > the various attributes of your tokens at each stage of the analysis?
>> >
>> > : SEVERE: org.apache.solr.common.SolrException: org.apache.lucene.search.highlight.InvalidTokenOffsetsException: Token fall exceeds length of provided text sized 12
>> > :   at org.apache.solr.highlight.DefaultSolrHighlighter.doHighlightingByHighlighter(DefaultSolrHighlighter.java:469)
>> > :   at org.apache.solr.highlight.DefaultSolrHighlighter.doHighlighting(DefaultSolrHighlighter.java:378)
>> > :   at org.apache.solr.handler.component.HighlightComponent.process(HighlightComponent.java:116)
>> > :   at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:194)
>> > :   at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
>> > :   at org.apache.solr.core.SolrCore.execute(SolrCore.java:1360)
>> > :   at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:356)
>> > :   at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:252)
>> > :   at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:243)
>> > :   at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:210)
>> > :   at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:224)
>> > :   at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:175)
>> > :   at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:462)
>> > :   at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:164)
>> > :   at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:100)
>> > :   at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:851)
>> > :   at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:118)
>> > :   at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:405)
>> > :   at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:278)
>> > :   at org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:515)
>> > :   at org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:302)
>> > :   at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>> > :   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>> > :   at java.lang.Thread.run(Thread.java:680)
>> > : Caused by: org.apache.lucene.search.highlight.InvalidTokenOffsetsException: Token fall exceeds length of provided text sized 12
>> > :   at org.apache.lucene.search.highlight.Highlighter.getBestTextFragments(Highlighter.java:228)
>> > :   at org.apache.solr.highlight.DefaultSolrHighlighter.doHighlightingByHighlighter(DefaultSolrHighlighter.java:462)
>> > : ... 23 more
>> >
>> >
>> > -Hoss
>>
>
RegexQuery performance
Hi,

I am trying to provide a means to search our corpus of nearly 2 million fulltext astronomy and physics articles using regular expressions. A small percentage of our users need to be able to locate, for example, certain types of identifiers that are present within the fulltext (grant numbers, dataset identifiers, etc).

My straightforward attempts to do this using RegexQuery have been successful only in the sense that I get the results I'm looking for. The performance, however, is pretty terrible, with most queries taking five minutes or longer. Is this the performance I should expect considering the size of my index and the massive number of terms? Are there any alternative approaches I could try?

Things I've already tried:
* reducing the sheer number of terms by adding a LengthFilter, min=6, to my index analysis chain
* swapping in the JakartaRegexpCapabilities

Things I intend to try if no one has any better suggestions:
* chunk up the index and search concurrently, either by sharding or using a RangeQuery based on document id

Any suggestions appreciated.

Thanks,
--jay
Re: RegexQuery performance
Hi Erick,

On Fri, Dec 9, 2011 at 12:37 PM, Erick Erickson wrote:
> Could you show us some examples of the kinds of things
> you're using regex for? I.e. the raw text and the regex you
> use to match the example?

Sure! An example identifier would be "IRAS-A-FPA-3-RDR-IMPS-V6.0", which identifies a particular Planetary Data System data set. Another example is "ULY-J-GWE-8-NULL-RESULTS-V1.0". These kinds of strings frequently appear in the references section of the articles, so the context looks something like,

" ... rvey. IRAS-A-FPA-3-RDR-IMPS-V6.0, NASA Planetary Data System Tholen, D. J. 1989, in Asteroids II, ed ... "

The simple & straightforward regex I've been using is /[A-Z0-9:\-]+V\d+\.\d+/. There may be a smarter regex approach, but I haven't put my mind to it because I assumed the primary performance issue was elsewhere.

> The reason I ask is that perhaps there are other approaches,
> especially thinking about some clever analyzing at index time.
>
> For instance, perhaps NGrams are an option. Perhaps
> just making WordDelimiterFilterFactory do its tricks.

Perhaps. WordDelimiter does help in the sense that if you search for a specific identifier you will usually find fairly accurate results, even for cases where the hyphens resulted in the term being broken up. But I'm not sure how WordDelimiter can help if I want to search for a pattern.

I tried a few tweaks to the index, like putting a minimum character count for terms, making sure WordDelimiter's preserveOriginal is turned on, and indexing without lowercasing so that I don't have to use Pattern.CASE_INSENSITIVE. Performance was not improved significantly.

The new RegexpQuery mentioned by R. Muir looks promising, but I haven't built an instance of trunk yet to try it out.

Any other suggestions appreciated.

Thanks!
--jay

> In other words, this could be an "XY problem"
>
> Best
> Erick
>
> On Thu, Dec 8, 2011 at 11:14 AM, Robert Muir wrote:
>> On Thu, Dec 8, 2011 at 11:01 AM, Jay Luker wrote:
>>> Hi,
>>>
>>> I am trying to provide a means to search our corpus of nearly 2
>>> million fulltext astronomy and physics articles using regular
>>> expressions. A small percentage of our users need to be able to
>>> locate, for example, certain types of identifiers that are present
>>> within the fulltext (grant numbers, dataset identifiers, etc).
>>>
>>> My straightforward attempts to do this using RegexQuery have been
>>> successful only in the sense that I get the results I'm looking for.
>>> The performance, however, is pretty terrible, with most queries taking
>>> five minutes or longer. Is this the performance I should expect
>>> considering the size of my index and the massive number of terms? Are
>>> there any alternative approaches I could try?
>>>
>>> Things I've already tried:
>>> * reducing the sheer number of terms by adding a LengthFilter,
>>> min=6, to my index analysis chain
>>> * swapping in the JakartaRegexpCapabilities
>>>
>>> Things I intend to try if no one has any better suggestions:
>>> * chunk up the index and search concurrently, either by sharding or
>>> using a RangeQuery based on document id
>>>
>>> Any suggestions appreciated.
>>>
>> This RegexQuery is not really scalable in my opinion; it's always
>> linear to the number of terms except in super-rare circumstances where
>> it can compute a "common prefix" (and slow to boot).
>>
>> You can try svn trunk's RegexpQuery <-- don't forget the "p" -- instead,
>> from lucene core (it works from the queryparser: /[ab]foo/, myfield:/bar/,
>> etc.)
>>
>> The performance is faster, but keep in mind it's only as good as the
>> regular expressions; if the regular expressions are like /.*foo.*/,
>> then it's just as slow as a wildcard of *foo*.
>>
>> --
>> lucidimagination.com
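For future reference, constructing one of these programmatically is just a matter of wrapping a Term, as in the sketch below. The field name "body" is a placeholder, and the pattern is rewritten for Lucene's automaton-based regexp syntax (explicit [0-9] classes rather than \d); the same expression should also work from the query parser as body:/[A-Z0-9:\-]+V[0-9]+\.[0-9]+/.

import org.apache.lucene.index.Term;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.RegexpQuery;

public class RegexpQueryDemo {
    public static void main(String[] args) {
        // Lucene's regexp syntax is automaton-based, so spell out the digit
        // class instead of using \d as in java.util.regex.
        Query q = new RegexpQuery(new Term("body", "[A-Z0-9:\\-]+V[0-9]+\\.[0-9]+"));
        System.out.println(q); // hand this to an IndexSearcher as usual
    }
}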
Re: RegexQuery performance
On Sat, Dec 10, 2011 at 9:25 PM, Erick Erickson wrote:
> My off-the-top-of-my-head notion is you implement a
> Filter whose job is to emit some "special" tokens when
> you find strings like this that allow you to search without
> regexes. For instance, in the example you give, you could
> index something like...oh... I don't know, ###VER### as
> well as the "normal" text of "IRAS-A-FPA-3-RDR-IMPS-V6.0".
> Now, when searching for docs with the pattern you used
> as an example, you look for ###VER### instead. I guess
> it all depends on how many regexes you need to allow.
> This wouldn't work at all if you allow users to put in arbitrary
> regexes, but if you have a small enough number of patterns
> you'll allow, something like this could work.

This is a great suggestion. I think the number of users that need this feature, as well as the variety of regexes that would be used, is small enough that it could definitely work. It turns the problem into one of collecting the necessary regexes, plus the UI details.

Thanks!
--jay
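Here is a rough sketch of the kind of index-time filter Erick describes: when a token matches a known identifier pattern, it emits an extra marker token ("###VER###") stacked at the same position. The class name, the marker string, and the single hard-coded pattern are illustrative, it assumes the Lucene/Solr 3.x attribute-based TokenStream API, and a corresponding factory would still be needed to hook it into an analysis chain in schema.xml.

import java.io.IOException;
import java.util.regex.Pattern;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.util.AttributeSource;

public final class VersionMarkerFilter extends TokenFilter {
    // matches identifiers like IRAS-A-FPA-3-RDR-IMPS-V6.0
    private static final Pattern ID_PATTERN = Pattern.compile("[A-Z0-9:\\-]+V\\d+\\.\\d+");

    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
    private final PositionIncrementAttribute posIncrAtt = addAttribute(PositionIncrementAttribute.class);
    private AttributeSource.State pending;

    public VersionMarkerFilter(TokenStream input) {
        super(input);
    }

    @Override
    public boolean incrementToken() throws IOException {
        if (pending != null) {
            // emit the marker token at the same position as the matched identifier
            restoreState(pending);
            pending = null;
            termAtt.setEmpty().append("###VER###");
            posIncrAtt.setPositionIncrement(0);
            return true;
        }
        if (!input.incrementToken()) {
            return false;
        }
        if (ID_PATTERN.matcher(termAtt).matches()) {
            pending = captureState();
        }
        return true;
    }

    @Override
    public void reset() throws IOException {
        super.reset();
        pending = null;
    }
}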
NumericRangeQuery: what am I doing wrong?
I can't get NumericRangeQuery or TermQuery to work on my integer "id" field. I feel like I must be missing something obvious.

I have a test index that has only two documents, id:9076628 and id:8003001. The id field is defined in my schema using Solr's default "int" field type.

A MatchAllDocsQuery will return the 2 documents, but any queries I try on the id field return no results. For instance,

public void testIdRange() throws IOException {
    Query q = NumericRangeQuery.newIntRange("id", 1, 1000, true, true);
    System.out.println("query: " + q);
    assertEquals(2, searcher.search(q, 5).totalHits);
}

public void testIdSearch() throws IOException {
    Query q = new TermQuery(new Term("id", "9076628"));
    System.out.println("query: " + q);
    assertEquals(1, searcher.search(q, 5).totalHits);
}

Both tests fail with totalHits being 0. This is using solr/lucene trunk, but I tried also with 3.2 and got the same results. What could I be doing wrong here?

Thanks,
--jay
Re: NumericRangeQuery: what am I doing wrong?
On Wed, Dec 14, 2011 at 2:04 PM, Erick Erickson wrote:
> Hmmm, seems like it should work, but there are two things you might try:
> 1> just execute the query in Solr. id:[1 TO 100]. Does that work?

Yep, that works fine.

> 2> I'm really grasping at straws here, but it's *possible* that you
> need to use the same precisionstep as tint (8?)? There's a
> constructor that takes precisionStep as a parameter, but the
> default is 4 in the 3.x code.

Ah-ha, that was it. I did not notice the alternate constructor. The field was originally indexed with solr's default "int" type, which has precisionStep="0" (i.e., don't index at different precision levels). The equivalent value for the NumericRangeQuery constructor is 32. This isn't exactly intuitive, but I was able to figure it out with a careful reading of the javadoc.

Thanks!
--jay
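To make the fix concrete, here is a small sketch contrasting the two constructors; the field name and range values echo the example above. The key detail is that the precisionStep passed to NumericRangeQuery has to match the one the field was indexed with (Solr's "int" type declares precisionStep="0", which corresponds to passing 32 on the Lucene side).

import org.apache.lucene.search.NumericRangeQuery;
import org.apache.lucene.search.Query;

public class PrecisionStepDemo {
    public static void main(String[] args) {
        // Default variant: assumes the field was indexed with Lucene's
        // default precisionStep of 4 -- this is what failed above.
        Query defaultStep = NumericRangeQuery.newIntRange("id", 1, 1000, true, true);

        // Solr's "int" type indexes a single full-precision term
        // (precisionStep="0"), which maps to 32 in the Lucene constructor.
        Query fullPrecision = NumericRangeQuery.newIntRange("id", 32, 1, 1000, true, true);

        System.out.println(defaultStep);
        System.out.println(fullPrecision);
    }
}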
Re: NumericRangeQuery: what am I doing wrong?
On Wed, Dec 14, 2011 at 5:02 PM, Chris Hostetter wrote:
>
> I'm a little lost in this thread ... if you are programatically constructing
> a NumericRangeQuery object to execute in the JVM against a Solr index,
> that suggests you are writing some sort of Solr plugin (or embedding
> solr in some way)

It's not you; it's me. I'm just doing weird things, partly, I'm sure, due to ignorance, but sometimes out of expediency. I was experimenting with ways to do a NumericRangeFilter, and the tests I was trying used the Lucene API to query a Solr index, so I didn't have access to the IndexSchema. Also, my question might have been better directed at the lucene-general list to avoid confusion.

Thanks,
--jay
Re: Autocommit not happening
For the sake of any future googlers I'll report my own clueless but thankfully brief struggle with autocommit. There are two parts to the story: Part One is where I realized my autocommit config was not contained within my <updateHandler> block. In Part Two I realized I had typed "<autocommit>" rather than "<autoCommit>".

--jay

On Fri, Jul 23, 2010 at 2:35 PM, John DeRosa wrote:
> On Jul 23, 2010, at 9:37 AM, John DeRosa wrote:
>
>> Hi! I'm a Solr newbie, and I don't understand why autocommits aren't
>> happening in my Solr installation.
>>
>
> [snip]
>
> "Never mind"... I have discovered my boneheaded mistake. It's so silly, I
> wish I could retract my question from the archives.
>
documentCache clarification
Hi all,

The solr wiki says this about the documentCache: "The more fields you store in your documents, the higher the memory usage of this cache will be."

OK, but if I have enableLazyFieldLoading set to true and in my request parameters specify "fl=id", then the number of fields per document shouldn't affect the memory usage of the document cache, right?

Thanks,
--jay
Re: documentCache clarification
(btw, I'm running 1.4.1)

It looks like my assumption was wrong. Regardless of the fields selected using the "fl" parameter and the enableLazyFieldLoading setting, solr apparently fetches from disk and caches all the fields in the document (or maybe just those that are stored="true" in my schema).

My evidence for this is the documentCache stats reported by solr/admin. If I request "rows=10&fl=id" followed by "rows=10&fl=id,title" I would expect to see the 2nd request result in a 2nd insert to the cache, but instead I see that the 2nd request hits the cache from the 1st request. "rows=10&fl=*" does the same thing. i.e., the first request, even though I have enableLazyFieldLoading=true and I'm only asking for the ids, fetches the entire document from disk and inserts it into the documentCache. Subsequent requests, regardless of which fields I actually select, don't hit the disk but are loaded from the documentCache. Is this really the expected behavior and/or am I misunderstanding something?

A 2nd question: while watching these stats I noticed something else weird with the queryResultCache. It seems that inserts to the queryResultCache depend on the number of rows requested. For example, an initial request (solr restarted, clean cache, etc.) with rows=10 will result in an insert. A 2nd request of the same query with rows=1000 will result in a cache hit. However, if you reverse that order, starting with a clean cache, an initial request for rows=1000 will *not* result in an insert to the queryResultCache. I have tried various increments--10, 100, 200, 500--and it seems the magic number is somewhere between 200 (cache insert) and 500 (no insert). Can someone explain this?

Thanks,
--jay

On Wed, Oct 27, 2010 at 10:54 AM, Markus Jelsma wrote:
> I wondered about this too some time ago. I've found more information
> in SOLR-52 [1] and some correspondence on this one [2], but it didn't give me a
> definitive answer..
>
> [1]: https://issues.apache.org/jira/browse/SOLR-52
> [2]: http://www.mail-archive.com/solr-...@lucene.apache.org/msg01185.html
>
> On Wednesday 27 October 2010 16:39:44 Jay Luker wrote:
>> Hi all,
>>
>> The solr wiki says this about the documentCache: "The more fields you
>> store in your documents, the higher the memory usage of this cache
>> will be."
>>
>> OK, but if i have enableLazyFieldLoading set to true and in my request
>> parameters specify "fl=id", then the number of fields per document
>> shouldn't affect the memory usage of the document cache, right?
>>
>> Thanks,
>> --jay
>
> --
> Markus Jelsma - CTO - Openindex
> http://www.linkedin.com/in/markus17
> 050-8536600 / 06-50258350
>
Re: documentCache clarification
On Wed, Oct 27, 2010 at 9:13 PM, Chris Hostetter wrote:
>
> : schema.) My evidence for this is the documentCache stats reported by
> : solr/admin. If I request "rows=10&fl=id" followed by
> : "rows=10&fl=id,title" I would expect to see the 2nd request result in
> : a 2nd insert to the cache, but instead I see that the 2nd request hits
> : the cache from the 1st request. "rows=10&fl=*" does the same thing.
>
> your evidence is correct, but your interpretation is incorrect.
>
> the objects in the documentCache are lucene Documents, which contain a
> List of Field references. when enableLazyFieldLoading=true is set and
> there is a documentCache miss, the Document fetched from the IndexReader only
> contains the Fields specified in the fl, and all other Fields are marked
> as "LOAD_LAZY".
>
> When there is a cache hit on that uniqueKey at a later date, the Fields
> already loaded are used directly if requested, but the Fields marked
> LOAD_LAZY are (you guessed it) lazy loaded from the IndexReader and then
> the Document updates the reference to the newly actualized fields (which
> are no longer marked LOAD_LAZY)
>
> So with different "fl" params, the same Document Object is continually
> used, but the Fields in that Document grow as the fields requested (using
> the "fl" param) change.

Great stuff. Makes sense. Thanks for the clarification, and if no one objects I'll update the wiki with some of this info.

I'm still not clear on this statement from the wiki's description of the documentCache: "(Note: This cache cannot be used as a source for autowarming because document IDs will change when anything in the index changes so they can't be used by a new searcher.)"

Can anyone elaborate a bit on that? I think I've read it at least 10 times and I'm still unable to draw a mental picture. I'm wondering if the document IDs referred to are the ones I'm defining in my schema, or are they the underlying lucene ids, i.e. the ones that, according to the Lucene in Action book, are "relative within each segment"?

> : will *not* result in an insert to queryResultCache. I have tried
> : various increments--10, 100, 200, 500--and it seems the magic number
> : is somewhere between 200 (cache insert) and 500 (no insert). Can
> : someone explain this?
>
> In addition to the queryResultMaxDocsCached config option already
> mentioned (which controls whether a DocList is cached based on its size)
> there is also the queryResultWindowSize config option which may confuse
> your cache observations. if the window size is "50" and you ask for
> start=0&rows=10 what actually gets cached is "0-50" (assuming there are
> more than 50 results) so a subsequent request for start=10&rows=10 will be
> a cache hit.

Just so I'm clear, does the queryResultCache operate in a similar manner to the documentCache as to what is actually cached? In other words, is it the caching of the DocList object that is reported in the cache statistics hits/inserts numbers? And would that object get updated with a new set of ordered doc ids on subsequent, larger requests? (I'm flailing a bit to articulate the question, I know.)

For example, if my queryResultMaxDocsCached is set to 200 and I issue a request with rows=500, then I won't get a DocList entry in the queryResultCache. However, if I issue a request with rows=10, I will get an insert, and then a later request for rows=500 would re-use and update that original cached DocList. Right? And would it be updated with the full list of 500 ordered doc ids or only 200?

Thanks,
--jay
Re: documentCache clarification
On Thu, Oct 28, 2010 at 7:27 PM, Chris Hostetter wrote:
> The queryResultCache is keyed on the query (plus sort and filters) and the
> value is a "DocList" object ...
>
> http://lucene.apache.org/solr/api/org/apache/solr/search/DocList.html
>
> Unlike the Document objects in the documentCache, the DocLists in the
> queryResultCache never get modified (technically Solr doesn't actually
> modify the Documents either, the Document just keeps track of its fields
> and updates itself as Lazy Load fields are needed)
>
> if a DocList containing results 0-10 is put in the cache, it's not
> going to be of any use for a query with start=50. but if it contains 0-50
> it *can* be used if start < 50 and rows < 50 -- that's where the
> queryResultWindowSize comes in. if you use start=0&rows=10, but your
> window size is 50, SolrIndexSearcher will (under the covers) use
> start=0&rows=50 and put that in the cache, returning a "slice" from 0-10
> for your query. the next query asking for 10-20 will be a cache hit.

This makes sense but still doesn't explain what I'm seeing in my cache stats. When I issue a request with rows=10 the stats show an insert into the queryResultCache. If I send the same query, this time with rows=1000, I would not expect to see a cache hit, but I do. So it seems like there must be something useful in whatever gets cached on the first request for rows=10 for it to be re-used by the request for rows=1000.

--jay
Using jetty's GzipFilter in the example solr.war
Hi,

I thought I'd try turning on gzip compression but I can't seem to get jetty's GzipFilter to actually compress my responses. I unpacked the example solr.war and tried adding variations of the following to the web.xml (and then re-jarred it), but as far as I can tell, jetty isn't actually compressing anything.

<filter>
  <filter-name>GZipFilter</filter-name>
  <display-name>Jetty's GZip Filter</display-name>
  <description>Filter that zips all the content on-the-fly</description>
  <filter-class>org.mortbay.servlet.GzipFilter</filter-class>
  <init-param>
    <param-name>mimeTypes</param-name>
    <param-value>*</param-value>
  </init-param>
</filter>
<filter-mapping>
  <filter-name>GZipFilter</filter-name>
  <url-pattern>*</url-pattern>
</filter-mapping>

I've also tried explicitly listing mime-types and assigning the filter-mapping using <servlet-name>. I can see that the GzipFilter is being loaded when I add -DDEBUG to the jetty startup command. But as far as I can tell from looking at the response headers, nothing is being gzipped. I'm expecting to see "Content-Encoding: gzip" in the response headers.

Anyone successfully gotten this to work?

Thanks,
--jay
Re: Using jetty's GzipFilter in the example solr.war
On Sun, Nov 14, 2010 at 12:49 AM, Kiwi de coder wrote:
> try putting your filter at the top of web.xml (instead of the middle or
> bottom). I tried this for a few days and it's a simple solution (not sure
> whether putting it at the top is required by the spec or is a bug).

Thank you. An explanation of why this worked is probably better explored on the jetty list, but, for the record, it did.

--jay
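For anyone verifying this, a quick way to check whether responses are actually compressed is to request one with an Accept-Encoding header and inspect the response headers; the GzipFilter only kicks in when the client advertises gzip support. The URL below assumes the example Jetty setup on port 8983.

import java.net.HttpURLConnection;
import java.net.URL;

public class GzipCheck {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://localhost:8983/solr/select?q=*:*");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        // without this header the filter will not compress the response
        conn.setRequestProperty("Accept-Encoding", "gzip");
        System.out.println("HTTP status:      " + conn.getResponseCode());
        System.out.println("Content-Encoding: " + conn.getHeaderField("Content-Encoding"));
        conn.disconnect();
    }
}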
Sending binary data as part of a query
Hi all,

Here is what I am interested in doing: I would like to send a compressed integer bitset as a query to solr. The bitset integers represent my document ids, and the results I want to get back are the facet data for those documents.

I have successfully created a QueryComponent class that, assuming it has the integer bitset, can turn that into the necessary DocSetFilter to pass to the searcher, get back the facets, etc. That part all works right now because I'm using either canned or randomly generated bitsets on the server side.

What I'm unsure how to do is actually send this compressed bitset from a client to solr as part of the query. From what I can tell, the Solr API classes that are involved in handling binary data as part of a request assume that the data is a document to be added. For instance, extending ContentStreamHandlerBase requires implementing some kind of document loader and an UpdateRequestProcessorChain and a bunch of other stuff that I don't really think I should need.

Is there a simpler way? Anyone tried or succeeded in doing anything similar to this?

Thanks,
--jay
Re: Sending binary data as part of a query
On Mon, Jan 31, 2011 at 9:22 PM, Chris Hostetter wrote:
> that class should probably have been named ContentStreamUpdateHandlerBase
> or something like that -- it tries to encapsulate the logic that most
> RequestHandlers using ContentStreams (for updating) need to worry about.
>
> Your QueryComponent (as used by SearchHandler) should be able to access
> the ContentStreams the same way that class does ... call
> req.getContentStreams().
>
> Sending a binary stream from a remote client depends on how the client is
> implemented -- you can do it via HTTP using the POST body (with or w/o
> multi-part mime) in any language you want. If you are using SolrJ you may
> again run into an assumption that using ContentStreams means you are doing
> an "Update" but that's just a vernacular thing ... something like a
> ContentStreamUpdateRequest should work just as well for a query (as long
> as you set the necessary params and/or request handler path)

Thanks for the help. I was just about to reply to my own question for the benefit of future googlers when I noticed your response. :)

I actually got this working, much the way you suggest. The client is python. I created a gist with the script I used for testing [1]. On the solr side my QueryComponent grabs the stream, uses jzlib.ZInputStream to do the deflating, then translates the incoming integers in the bitset (my solr schema.xml integer ids) to the lucene ids and creates a docSetFilter with them.

Very relieved to get this working as it's the basis of a talk I'm giving next week [2]. :-)

--jay

[1] https://gist.github.com/806397
[2] http://code4lib.org/conference/2011/luker
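For future reference, a rough sketch of what the server-side piece described above might look like. It assumes, hypothetically, that the client zlib-compresses a sequence of 4-byte big-endian ints and POSTs them as the request body; the inflating here uses java.util.zip rather than jzlib, and the translation of the ids into a DocSetFilter is left out since it follows the approach already described.

import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.InflaterInputStream;

import org.apache.solr.common.util.ContentStream;
import org.apache.solr.handler.component.QueryComponent;
import org.apache.solr.handler.component.ResponseBuilder;

public class BitsetQueryComponent extends QueryComponent {

    @Override
    public void prepare(ResponseBuilder rb) throws IOException {
        Iterable<ContentStream> streams = rb.req.getContentStreams();
        if (streams != null) {
            for (ContentStream cs : streams) {
                // the POST body is assumed to be a zlib-compressed run of 4-byte ints
                DataInputStream in =
                        new DataInputStream(new InflaterInputStream(cs.getStream()));
                List<Integer> ids = new ArrayList<Integer>();
                try {
                    while (true) {
                        ids.add(in.readInt());
                    }
                } catch (EOFException done) {
                    // end of the compressed stream
                } finally {
                    in.close();
                }
                // ...translate these schema-level ids to lucene docids and
                // build the DocSetFilter as described above...
            }
        }
        super.prepare(rb);
    }
}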
Help with parsing configuration using SolrParams/NamedList
Hi,

I'm trying to use a CustomSimilarityFactory and pass in per-field options from the schema.xml, like so:

<similarity class="...CustomSimilarityFactory">
  <lst name="field_a">
    <int name="min">500</int>
    <int name="max">1</int>
    <float name="steepness">0.5</float>
  </lst>
  <lst name="field_b">
    <int name="min">500</int>
    <int name="max">2</int>
    <float name="steepness">0.5</float>
  </lst>
</similarity>

My problem is I am utterly failing to figure out how to parse this nested option structure within my CustomSimilarityFactory class. I know that the settings are available as a SolrParams object within the getSimilarity() method. I'm convinced I need to convert to a NamedList using params.toNamedList(), but my java fu is too feeble to code the dang thing. The closest I seem to get is the top level as a NamedList where the keys are "field_a" and "field_b", but then my values are strings, e.g., "{min=500,max=1,steepness=0.5}".

Anyone who could dash off a quick example of how to do this?

Thanks,
--jay
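One hedged way to cope, assuming (as observed above) that each nested <lst> reaches getSimilarity() flattened into its "{min=500,max=1,steepness=0.5}" string form, is to parse those strings back into per-field maps; if a particular Solr version hands over real NamedList values instead, an instanceof check and a cast would replace the string parsing. A rough sketch:

import java.util.HashMap;
import java.util.Map;

import org.apache.solr.common.params.SolrParams;
import org.apache.solr.common.util.NamedList;

public class SimilarityOptionParser {

    // Turns {"field_a" -> "{min=500,max=1,steepness=0.5}", ...} into
    // {"field_a" -> {min -> 500.0, max -> 1.0, steepness -> 0.5}, ...}
    public static Map<String, Map<String, Float>> parse(SolrParams params) {
        Map<String, Map<String, Float>> options = new HashMap<String, Map<String, Float>>();
        NamedList<?> top = params.toNamedList();
        for (int i = 0; i < top.size(); i++) {
            String field = top.getName(i);
            // e.g. "{min=500,max=1,steepness=0.5}"
            String flat = top.getVal(i).toString().replaceAll("[{}]", "");
            Map<String, Float> opts = new HashMap<String, Float>();
            for (String pair : flat.split(",")) {
                String[] kv = pair.split("=");
                opts.put(kv[0].trim(), Float.valueOf(kv[1].trim()));
            }
            options.put(field, opts);
        }
        return options;
    }
}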
Highlight snippets for a set of known documents
Hi all,

I'm trying to get highlight snippets for a set of known documents and I must be doing something wrong because it's only sort of working.

Say my query is "foobar" and I already know that docs 1, 5 and 11 are matches. Now I want to retrieve the highlight snippets for the term "foobar" for docs 1, 5 and 11. What I assumed would work was something like: "...&q=foobar&fq={!q.op=OR}id:1 id:5 id:11...". This returns numfound=3 in the response, but I only get the highlight snippets for document id:1. What am I doing wrong?

Thanks,
--jay
Re: Highlight snippets for a set of known documents
It turns out the answer is I'm a moron; I had an unnoticed "&rows=1" nestled in the querystring I was testing with. Anyway, thanks for replying!

--jay

On Fri, Apr 1, 2011 at 4:25 AM, Stefan Matheis wrote:
> Jay,
>
> i'm not sure, but did you try it w/ brackets?
> q=foobar&fq={!q.op=OR}(id:1 id:5 id:11)
>
> Regards
> Stefan
>
> On Thu, Mar 31, 2011 at 6:40 PM, Jay Luker wrote:
>> Hi all,
>>
>> I'm trying to get highlight snippets for a set of known documents and
>> I must be doing something wrong because it's only sort of working.
>>
>> Say my query is "foobar" and I already know that docs 1, 5 and 11 are
>> matches. Now I want to retrieve the highlight snippets for the term
>> "foobar" for docs 1, 5 and 11. What I assumed would work was something
>> like: "...&q=foobar&fq={!q.op=OR}id:1 id:5 id:11...". This returns
>> numfound=3 in the response, but I only get the highlight snippets for
>> document id:1. What am I doing wrong?
>>
>> Thanks,
>> --jay
>>
>
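For completeness, here is what the working request can look like sketched with the 3.x-era SolrJ API. The server URL, the field name "content", and the ids are placeholders; the key detail is making sure rows covers all of the known documents (the stray &rows=1 was the culprit above).

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class KnownDocHighlights {
    public static void main(String[] args) throws Exception {
        CommonsHttpSolrServer server =
                new CommonsHttpSolrServer("http://localhost:8983/solr");

        SolrQuery query = new SolrQuery("foobar");
        // restrict to the documents we already know match
        query.addFilterQuery("{!q.op=OR}id:1 id:5 id:11");
        query.setHighlight(true);
        query.addHighlightField("content");
        // rows must be large enough to return all of the known docs
        query.setRows(10);

        QueryResponse rsp = server.query(query);
        System.out.println(rsp.getHighlighting());
    }
}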
UIMA example setup w/o OpenCalais
Hi,

I'd like to experiment with the UIMA contrib package, but I have issues with the OpenCalais service's ToS and would rather not use it. Is there a way to adapt the UIMA example setup to use only the AlchemyAPI service? I tried simply leaving out the OpenCalais api key but I get exceptions thrown during indexing.

Thanks,
--jay
Re: UIMA example setup w/o OpenCalais
Thank you, that worked.

For the record, my objection to the OpenCalais service is that their ToS states that they will "retain a copy of the metadata submitted by you", and that by submitting data to the service you "grant Thomson Reuters a non-exclusive perpetual, sublicensable, royalty-free license to that metadata." The AlchemyAPI service ToS states only that they retain the *generated* metadata. Just a warning to anyone else thinking of experimenting with Solr & UIMA.

--jay

On Fri, Apr 8, 2011 at 6:45 AM, Tommaso Teofili wrote:
> Hi Jay,
> you should be able to do so by simply removing the OpenCalaisAnnotator from
> the execution pipeline, commenting out line 124 of the file:
> solr/contrib/uima/src/main/resources/org/apache/uima/desc/OverridingParamsExtServicesAE.xml
> Hope this helps,
> Tommaso
>
> 2011/4/7 Jay Luker
>
>> Hi,
>>
>> I'd like to experiment with the UIMA contrib package, but I have
>> issues with the OpenCalais service's ToS and would rather not use it.
>> Is there a way to adapt the UIMA example setup to use only the
>> AlchemyAPI service? I tried simply leaving out the OpenCalais api key
>> but I get exceptions thrown during indexing.
>>
>> Thanks,
>> --jay
>>
>
tika/pdfbox knobs & levers
Hi all,

I'm wondering if there are any knobs or levers I can set in solrconfig.xml that affect how pdfbox text extraction is performed by the extraction handler. I would like to take advantage of pdfbox's ability to normalize diacritics and ligatures [1], but that doesn't seem to be the default behavior. Is there a way to enable this?

Thanks,
--jay

[1] http://pdfbox.apache.org/apidocs/index.html?org/apache/pdfbox/util/TextNormalize.html
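One fallback, if no such knob is exposed through the extraction handler, is to do the extraction and normalization outside Solr and post the resulting text yourself. A minimal sketch, assuming PDFBox 1.x and using java.text.Normalizer (NFKC folds ligatures such as "ﬁ" into "fi") rather than pdfbox's own TextNormalize:

import java.io.File;
import java.text.Normalizer;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.util.PDFTextStripper;

public class ExtractAndNormalize {
    public static void main(String[] args) throws Exception {
        PDDocument doc = PDDocument.load(new File(args[0]));
        try {
            String raw = new PDFTextStripper().getText(doc);
            // NFKC compatibility normalization folds ligatures and
            // recomposes decomposed diacritics
            String normalized = Normalizer.normalize(raw, Normalizer.Form.NFKC);
            System.out.println(normalized);
        } finally {
            doc.close();
        }
    }
}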
Re: Text Only Extraction Using Solr and Tika
Hi Emyr,

You could try using the "extractOnly=true" parameter [1]. Of course, you'll need to repost the extracted text manually.

--jay

[1] http://wiki.apache.org/solr/ExtractingRequestHandler#Extract_Only

On Thu, May 5, 2011 at 9:36 AM, Emyr James wrote:
> Hi All,
>
> I have solr and tika installed and am happily extracting and indexing
> various files.
> Unfortunately on some word documents it blows up since it tries to
> auto-generate a 'title' field but my title field in the schema is single
> valued.
>
> Here is my config for the extract handler...
>
> class="org.apache.solr.handler.extraction.ExtractingRequestHandler">
> ignored_
>
> Is there a config option to make it only extract text, or ideally to allow
> me to specify which metadata fields to accept?
>
> E.g. I'd like to use any author metadata it finds but to not use any title
> metadata it finds, as I want title to be single valued and set explicitly
> using a literal.title in the post request.
>
> I did look around for some docs but all I can find are very basic examples.
> There's no comprehensive configuration documentation out there as far as I
> can tell.
>
> ALSO...
>
> I get some other bad responses coming back such as...
>
> Apache Tomcat/6.0.28 - Error report - HTTP Status 500 -
> org.apache.poi.poifs.filesystem.DirectoryNode.iterator()Ljava/util/Iterator;
>
> java.lang.NoSuchMethodError:
> org.apache.poi.poifs.filesystem.DirectoryNode.iterator()Ljava/util/Iterator;
> at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:168)
> at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
> at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
> at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135)
> at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:148)
> at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:190)
> at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54)
> at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
> at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:233)
> at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
> at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
> at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
> at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
> at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
> at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
> at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
> at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
> at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
> at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
> at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:298)
> at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:857)
> at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:588)
> at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:489)
> at java.lang.Thread.run(Thread.java:636)
>
> type Status report
> message org.apache.poi.poifs.filesystem.DirectoryNode.iterator()Ljava/util/Iterator;
>
> For the above my url was...
>
> http://localhost:8080/solr/update/extract?literal.id=3922&defaultField=content&fmap.content=content&uprefix=ignored_&stream.contentType=application%2Fvnd.ms-powerpoint&commit=true&literal.title=Reactor+cycle+141&literal.notes=&literal.tag=UCN_production&literal.author=Maurits+van+der+Grinten
>
> I guess there's something special I need to be able to process power point
> files? Maybe I need to get the latest apache POI? Any suggestions
> welcome...
>
> Regards,
>
> Emyr
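A sketch of what the extractOnly route can look like from the 3.x-era SolrJ API (the server URL and file name are placeholders): with extractOnly=true the handler returns the extracted text and metadata in the response instead of indexing anything, so a second, normal add is still needed afterwards.

import java.io.File;

import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;
import org.apache.solr.common.util.NamedList;

public class ExtractOnlyDemo {
    public static void main(String[] args) throws Exception {
        CommonsHttpSolrServer server =
                new CommonsHttpSolrServer("http://localhost:8080/solr");

        ContentStreamUpdateRequest req =
                new ContentStreamUpdateRequest("/update/extract");
        req.addFile(new File("slides.ppt"));
        // return the extracted text/metadata instead of indexing the document
        req.setParam("extractOnly", "true");

        NamedList<Object> result = server.request(req);
        System.out.println(result);
    }
}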
Re: Solr performance
On Wed, May 11, 2011 at 7:07 AM, javaxmlsoapdev wrote:
> I have some 25 odd fields with "stored=true" in schema.xml. Retrieving back
> 5,000 records takes a few secs. I also tried passing "fl" and only
> including one field in the response, but the response time is still the same.
> What are the things to look at to tune the performance?

Confirm that you have enableLazyFieldLoading set to true in solrconfig.xml. I suspect you do, since that's the default. Is the request taking a few seconds the first time, but returning quickly on subsequent requests?

Also, this may or may not be relevant, but you might find a few bits of info in this thread enlightening: http://lucene.472066.n3.nabble.com/documentCache-clarification-td1780504.html

--jay
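If it helps to rule out the client side, a request that really only pulls the id field looks something like this in SolrJ (the server URL and field name are placeholders); with lazy field loading enabled, only that stored field should be read from disk on a documentCache miss.

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class FlOnlyIdQuery {
    public static void main(String[] args) throws Exception {
        CommonsHttpSolrServer server =
                new CommonsHttpSolrServer("http://localhost:8983/solr");

        SolrQuery q = new SolrQuery("*:*");
        q.setFields("id");   // equivalent to fl=id
        q.setRows(5000);

        QueryResponse rsp = server.query(q);
        System.out.println("docs returned:   " + rsp.getResults().size());
        System.out.println("query time (ms): " + rsp.getQTime());
    }
}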
Re: Document has fields with different update frequencies: how best to model
Take a look at ExternalFileField [1]. It's meant for exactly what you want to do here.

FYI, there is an issue with caching of the external values introduced in v1.4 but, thankfully, resolved in v3.2 [2].

--jay

[1] http://lucene.apache.org/solr/api/org/apache/solr/schema/ExternalFileField.html
[2] https://issues.apache.org/jira/browse/SOLR-2536

On Fri, Jun 10, 2011 at 12:54 PM, lee carroll wrote:
> Hi,
> We have a document type which has fields which are pretty static. Say
> they change once every 6 months. But the same document has a field
> which changes hourly.
> What are the best approaches to index this document?
>
> E.g.
> Hotel ID (static), Hotel Description (static and costly to get from a
> url etc), FromPrice (changes hourly)
>
> Option 1
> Index hourly as a single document and don't worry about the unneeded
> field updates
>
> Option 2
> Split into 2 document types and index independently. This would
> require the front end application to query multiple times?
> doc1
> ID, Description, DocType
> doc2
> ID, HotelID, Price, DocType
>
> The application performs searches based on hotel attributes;
> for each hotel match it issues a query to get the price.
>
> Any other options? Can you query across documents?
>
> We run 1.4.1. We could maybe update to 3.2, but I don't think I could
> swing to trunk for the JOIN feature (if that indeed is JOIN's use case).
>
> Thanks in advance
>
> PS Am I just worrying about de-normalised data and should sort the
> source data out, maybe by caching, and get over it...?
>
> cheers Lee c
>
Re: Document has fields with different update frequencies: how best to model
You are correct that ExternalFileField values can only be used in query functions (i.e. scoring, basically). Sorry for firing off that answer without reading your use case more carefully.

I'd be inclined towards giving your Option #1 a try, but that's without knowing much about the scale of your app, size of your index, documents, etc. Unneeded field updates are only a problem if they're causing performance problems, right? Otherwise, trying to avoid them seems like premature optimization.

--jay

On Sat, Jun 11, 2011 at 5:26 AM, lee carroll wrote:
> Hi Jay
> I thought external file field could not be returned as a field but
> only used in scoring.
> Trunk has pseudo-fields which can take a function value, but we can't
> move to trunk.
>
> Also it's a more general question around schema design: what happens if
> you have several fields with different update frequencies? It does not
> seem external file field is the use case for this.
>
> On 10 June 2011 20:13, Jay Luker wrote:
>> Take a look at ExternalFileField [1]. It's meant for exactly what you
>> want to do here.
>>
>> FYI, there is an issue with caching of the external values introduced
>> in v1.4 but, thankfully, resolved in v3.2 [2]
>>
>> --jay
>>
>> [1] http://lucene.apache.org/solr/api/org/apache/solr/schema/ExternalFileField.html
>> [2] https://issues.apache.org/jira/browse/SOLR-2536
>>
>> On Fri, Jun 10, 2011 at 12:54 PM, lee carroll wrote:
>>> Hi,
>>> We have a document type which has fields which are pretty static. Say
>>> they change once every 6 months. But the same document has a field
>>> which changes hourly.
>>> What are the best approaches to index this document?
>>>
>>> E.g.
>>> Hotel ID (static), Hotel Description (static and costly to get from a
>>> url etc), FromPrice (changes hourly)
>>>
>>> Option 1
>>> Index hourly as a single document and don't worry about the unneeded
>>> field updates
>>>
>>> Option 2
>>> Split into 2 document types and index independently. This would
>>> require the front end application to query multiple times?
>>> doc1
>>> ID, Description, DocType
>>> doc2
>>> ID, HotelID, Price, DocType
>>>
>>> The application performs searches based on hotel attributes;
>>> for each hotel match it issues a query to get the price.
>>>
>>> Any other options? Can you query across documents?
>>>
>>> We run 1.4.1. We could maybe update to 3.2, but I don't think I could
>>> swing to trunk for the JOIN feature (if that indeed is JOIN's use case).
>>>
>>> Thanks in advance
>>>
>>> PS Am I just worrying about de-normalised data and should sort the
>>> source data out, maybe by caching, and get over it...?
>>>
>>> cheers Lee c
>>>
>
WordDelimiterFilter preserveOriginal & position increment
Hi,

I'm having an issue with the WDF preserveOriginal="1" setting and the matching of a phrase query. Here's an example of the text that is being indexed:

"...obtained with the Southern African Large Telescope,SALT..."

A lot of our text is extracted from PDFs, so this kind of formatting junk is very common. The phrase query that is failing is: "Southern African Large Telescope"

From looking at the analysis debugger I can see that the WDF is getting the term "Telescope,SALT" and correctly splitting on the comma. The problem seems to be that the original term is given the 1st position, e.g.:

Pos  Term
1    Southern
2    African
3    Large
4    Telescope,SALT  <-- original term
5    Telescope
6    SALT

Only by adding a phrase slop of "~1" do I get a match. I realize that the WDF is behaving correctly in this case (or at least I can't imagine a rational alternative). But I'm curious if anyone can suggest a way to work around this issue that doesn't involve adding phrase query slop.

Thanks,
--jay
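A quick way to double-check what positions actually end up in the index, outside the admin analysis page, is to run the field's analyzer directly and print each token with its position. This is a generic sketch using the Lucene 3.x attribute API; the analyzer, field name, and text are whatever you pass in.

import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

public class PositionDump {
    // prints "position<TAB>term" for every token the analyzer produces
    public static void dump(Analyzer analyzer, String field, String text) throws Exception {
        TokenStream ts = analyzer.tokenStream(field, new StringReader(text));
        CharTermAttribute termAtt = ts.addAttribute(CharTermAttribute.class);
        PositionIncrementAttribute posIncrAtt = ts.addAttribute(PositionIncrementAttribute.class);
        ts.reset();
        int pos = 0;
        while (ts.incrementToken()) {
            pos += posIncrAtt.getPositionIncrement();
            System.out.println(pos + "\t" + termAtt.toString());
        }
        ts.end();
        ts.close();
    }
}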
Re: WordDelimiterFilter preserveOriginal & position increment
Bah... While attempting to duplicate this on our 4.x instance I realized I was mis-reading the analysis output. In the example I mentioned, it was actually a SynonymFilter in the analysis chain that was affecting the term position (we have several synonyms for "telescope"). Regardless, it seems to not be a problem in 4.x.

Thanks,
--jay

On Tue, Oct 23, 2012 at 10:45 AM, Shawn Heisey wrote:
> On 10/23/2012 8:16 AM, Jay Luker wrote:
>>
>> From looking at the analysis debugger I can see that the WDF is
>> getting the term "Telescope,SALT" and correctly splitting on the
>> comma. The problem seems to be that the original term is given the 1st
>> position, e.g.:
>>
>> Pos  Term
>> 1    Southern
>> 2    African
>> 3    Large
>> 4    Telescope,SALT  <-- original term
>> 5    Telescope
>> 6    SALT
>
> Jay, I have WDF with preserveOriginal turned on. I get the following from
> WDF parsing in the analysis page on either 3.5 or 4.1-SNAPSHOT, and the
> analyzer shows that all four of the query words are found in consecutive
> fields. On the new version, I had to slide a scrollbar to the right to see
> the last term. Visually they were not in consecutive fields on the new
> version (they were on 3.5), but the position number says otherwise.
>
> 1  Southern
> 2  African
> 3  Large
> 4  Telescope,SALT
> 4  Telescope
> 5  SALT
> 5  TelescopeSALT
>
> My full WDF parameters:
> index: {preserveOriginal=1, splitOnCaseChange=1, generateNumberParts=1,
> catenateWords=1, splitOnNumerics=1, stemEnglishPossessive=1,
> luceneMatchVersion=LUCENE_35, generateWordParts=1, catenateAll=0,
> catenateNumbers=1}
> query: {preserveOriginal=1, splitOnCaseChange=1, generateNumberParts=1,
> catenateWords=0, splitOnNumerics=1, stemEnglishPossessive=1,
> luceneMatchVersion=LUCENE_35, generateWordParts=1, catenateAll=0,
> catenateNumbers=0}
>
> I understand from other messages on the mailing list that I should not have
> preserveOriginal on the query side, but I have not yet changed it.
>
> If your position numbers really are what you indicated, you may have found a
> bug. I have not tried the released 4.0.0 version, I expect to deploy from
> the 4.x branch under development.
>
> Thanks,
> Shawn
>