Re: Unable to get offsets using AtomicReader.termPositionsEnum(Term)

2014-03-10 Thread Robert Muir
Hello, I think you are confused between two different index
structures, probably because of the name of the options in solr.

1. indexing term vectors: this means given a document, you can go
lookup a miniature "inverted index" just for that document. That means
each document has "term vectors" which has a term dictionary of the
terms in that one document, and optionally things like positions and
character offsets. This can be useful if you are examining *many
terms* for just a few documents. For example: the MoreLikeThis use
case. In solr this is activated with termVectors=true. To additionally
store positions/offsets information inside the term vectors, it's
termPositions=true and termOffsets=true, respectively.

2. indexing character offsets: this means given a term, you can get
the offset information "along with" each position that matched. So
really you can think of this as a special form of a payload. This is
useful if you are examining *many documents* for just a few terms. For
example, many highlighting use cases. In solr this is activated with
storeOffsetsWithPositions=true. It is unrelated to term vectors.
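
At the Lucene level, the difference looks roughly like this (a sketch against
the 4.x FieldType API; the Solr attribute each call corresponds to is noted,
and field setup itself is normally done in schema.xml rather than in code):

  import org.apache.lucene.document.FieldType;
  import org.apache.lucene.document.TextField;
  import org.apache.lucene.index.FieldInfo.IndexOptions;

  // 1. term vectors: a per-document mini inverted index
  FieldType withVectors = new FieldType(TextField.TYPE_NOT_STORED);
  withVectors.setStoreTermVectors(true);          // termVectors=true
  withVectors.setStoreTermVectorPositions(true);  // termPositions=true
  withVectors.setStoreTermVectorOffsets(true);    // termOffsets=true

  // 2. offsets stored in the postings lists themselves
  FieldType withPostingsOffsets = new FieldType(TextField.TYPE_NOT_STORED);
  withPostingsOffsets.setIndexOptions(            // storeOffsetsWithPositions=true
      IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS);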

Hopefully this helps.
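
As an aside, if you only need the offsets of one term in one document and you
indexed term vectors with positions and offsets, a rough, untested sketch of
seeking that term directly (field and term names made up) is:

  Terms vector = reader.getTermVector(docId, "myfield");
  if (vector != null) {
    TermsEnum termsEnum = vector.iterator(null);
    if (termsEnum.seekExact(new BytesRef("myterm"))) {
      DocsAndPositionsEnum dpEnum = termsEnum.docsAndPositions(null, null);
      dpEnum.nextDoc();                // the only "document" in this mini index
      for (int i = 0; i < dpEnum.freq(); i++) {
        dpEnum.nextPosition();         // must advance before reading offsets
        System.out.println(dpEnum.startOffset() + "-" + dpEnum.endOffset());
      }
    }
  }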

On Mon, Mar 10, 2014 at 9:32 PM, Jefferson French  wrote:
> This looks like a codec issue, but I'm not sure how to address it. I've
> found that a different instance of DocsAndPositionsEnum is instantiated
> between my code and Solr's TermVectorComponent.
>
> Mine:
> org.apache.lucene.codecs.lucene41.Lucene41PostingsReader$EverythingEnum
> Solr: 
> org.apache.lucene.codecs.compressing.CompressingTermVectorsReader$TVDocsEnum
>
> As far as I can tell, I've only used Lucene/Solr 4.6, so I'm not sure where
> the Lucene 4.1 reference comes from. I've searched through the Solr config
> files and can't see where to change the codec, but shouldn't the reader use
> the same codec as used when the index was created?
>
>
> On Fri, Mar 7, 2014 at 1:37 PM, Jefferson French wrote:
>
>> We have an API on top of Lucene 4.6 that I'm trying to adapt to running
>> under Solr 4.6. The problem is although I'm getting the correct offsets
>> when the index is created by Lucene, the same method calls always return -1
>> when the index is created by Solr. In the latter case I can see the
>> character offsets via Luke, and I can even get them from Solr when I access
>> the /tvrh search handler, which uses the TermVectorComponent class.
>>
>> This is roughly how I'm reading character offsets in my Lucene code:
>>
>>> AtomicReader reader = ...
>>> Term term = ...
>>> DocsAndPositionsEnum postings = reader.termPositionsEnum(term);
>>> while (postings.nextDoc() != DocsAndPositionsEnum.NO_MORE_DOCS) {
>>>   for (int i = 0; i < postings.freq(); i++) {
>>>     System.out.println("start:" + postings.startOffset());
>>>     System.out.println("end:" + postings.endOffset());
>>>   }
>>> }
>>
>>
>> Notice that I want the values for a single term. When run against an index
>> created by Solr, the above calls to startOffset() and endOffset() return
>> -1. Solr's TermVectorComponent prints the correct offsets like this
>> (paraphrased):
>>
>>> IndexReader reader = searcher.getIndexReader();
>>> Terms vector = reader.getTermVector(docId, field);
>>> TermsEnum termsEnum = vector.iterator(null);
>>> BytesRef text;
>>> DocsAndPositionsEnum dpEnum = null;
>>> while((text = termsEnum.next()) != null) {
>>>   String term = text.utf8ToString();
>>>   int freq = (int) termsEnum.totalTermFreq();
>>>   dpEnum = termsEnum.docsAndPositions(null, dpEnum);
>>>   dpEnum.nextDoc();
>>>   for (int i = 0; i < freq; i++) {
>>>     final int pos = dpEnum.nextPosition();
>>>     System.out.println("start:" + dpEnum.startOffset());
>>>     System.out.println("end:" + dpEnum.endOffset());
>>>   }
>>> }
>>
>>
>> but in this case it is getting the offsets per doc ID, rather than a
>> single term, which is what I want.
>>
>> Could anyone tell me:
>>
>>1. Why I'm not able to get the offsets using my first example, and/or
>>2. A better way to get the offsets for a given term?
>>
>> Thanks.
>>
>>Jeff
>>
>>
>>
>>
>>
>>
>>
>>
>>


[ANNOUNCE] Apache Solr 4.7.2 released.

2014-04-15 Thread Robert Muir
April 2014, Apache Solr™ 4.7.2 available

The Lucene PMC is pleased to announce the release of Apache Solr 4.7.2

Solr is the popular, blazing fast, open source NoSQL search platform
from the Apache Lucene project. Its major features include powerful
full-text search, hit highlighting, faceted search, dynamic
clustering, database integration, rich document (e.g., Word, PDF)
handling, and geospatial search. Solr is highly scalable, providing
fault tolerant distributed search and indexing, and powers the search
and navigation features of many of the world's largest internet sites.

Solr 4.7.2 is available for immediate download at:

http://lucene.apache.org/solr/mirrors-solr-latest-redir.html

Solr 4.7.2 includes 2 bug fixes, as well as Lucene 4.7.2 and its bug fixes.

See the CHANGES.txt file included with the release for a full list of
changes and further details.

Please report any feedback to the mailing lists
(http://lucene.apache.org/solr/discussion.html)

Note: The Apache Software Foundation uses an extensive mirroring
network for distributing releases. It is possible that the mirror you
are using may not have replicated the release yet. If that is the
case, please try another mirror. This also goes for Maven access.


[ANNOUNCE] Apache Solr 4.8.1 released

2014-05-20 Thread Robert Muir
May 2014, Apache Solr™ 4.8.1 available

The Lucene PMC is pleased to announce the release of Apache Solr 4.8.1

Solr is the popular, blazing fast, open source NoSQL search platform
from the Apache Lucene project. Its major features include powerful
full-text search, hit highlighting, faceted search, dynamic
clustering, database integration, rich document (e.g., Word, PDF)
handling, and geospatial search. Solr is highly scalable, providing
fault tolerant distributed search and indexing, and powers the search
and navigation features of many of the world's largest internet sites.

Solr 4.8.1 is available for immediate download at:

http://lucene.apache.org/solr/mirrors-solr-latest-redir.html

Solr 4.8.1 includes 10 bug fixes, as well as Lucene 4.8.1 and its bug fixes.

See the CHANGES.txt file included with the release for a full list of
changes and further details.

Please report any feedback to the mailing lists
(http://lucene.apache.org/solr/discussion.html)

Note: The Apache Software Foundation uses an extensive mirroring
network for distributing releases. It is possible that the mirror you
are using may not have replicated the release yet. If that is the
case, please try another mirror. This also goes for Maven access.


[ANNOUNCE] Apache Solr 4.9.0 released

2014-06-25 Thread Robert Muir
25 June 2014, Apache Solr™ 4.9.0 available

The Lucene PMC is pleased to announce the release of Apache Solr 4.9.0

Solr is the popular, blazing fast, open source NoSQL search platform
from the Apache Lucene project. Its major features include powerful
full-text search, hit highlighting, faceted search, dynamic
clustering, database integration, rich document (e.g., Word, PDF)
handling, and geospatial search.  Solr is highly scalable, providing
fault tolerant distributed search and indexing, and powers the search
and navigation features of many of the world's largest internet sites.

Solr 4.9.0 is available for immediate download at:
  http://lucene.apache.org/solr/mirrors-solr-latest-redir.html

See the CHANGES.txt file included with the release for a full list of
details.

Solr 4.9.0 Release Highlights:

* Numerous optimizations for doc values search-time performance

* Allow a client application to request the minimum achieved replication
  factor for an update request (single or batch) by sending an optional
  parameter "min_rf".

* Query re-ranking support with the new ReRankingQParserPlugin.

* A new [child ...] DocTransformer for optionally including Block-Join
  descendant documents inline in the results of a search.

* A new (default) Lucene49NormsFormat to better compress certain cases
  such as very short fields.


Solr 4.9.0 also includes many other new features as well as numerous
optimizations and bugfixes of the corresponding Apache Lucene release.

Please report any feedback to the mailing lists
(http://lucene.apache.org/solr/discussion.html)

Note: The Apache Software Foundation uses an extensive mirroring network
for distributing releases.  It is possible that the mirror you are using
may not have replicated the release yet.  If that is the case, please
try another mirror.  This also goes for Maven access.

On behalf of the Lucene PMC,
Happy Searching


Re: Background merge errors with Solr 4.4.0 on Optimize call

2013-10-29 Thread Robert Muir
I think it's a bug, but that's just my opinion. I sent a patch to dev@
for thoughts.

On Tue, Oct 29, 2013 at 6:09 PM, Erick Erickson  wrote:
> Hmmm, so you're saying that merging indexes where a field
> has been removed isn't handled. So you have some documents
> that do have a "what" field, but your schema doesn't have it,
> is that true?
>
> It _seems_ like you could get by by putting the _what_ field back
> into your schema, just not sending any data to it in new docs.
>
> I'll let others who understand merging better than me chime in on
> whether this is a case that should be handled or a bug. I pinged the
> dev list to see what the opinion is
>
> Best,
> Erick
>
>
> On Mon, Oct 28, 2013 at 6:39 PM, Matthew Shapiro  wrote:
>
>> Sorry for reposting after I just sent in a reply, but I just looked at the
>> error trace closer and noticed
>>
>>
>>1. Caused by: java.lang.IllegalArgumentException: no such field what
>>
>>
>> The 'what' field was removed by request of the customer as they wanted the
>> logic behind what gets queried in the "what" field to be code side instead
>> of solr side (for easier changing without having to re-index everything.  I
>> didn't feel strongly either way and since they are paying me, I took it
>> out).
>>
>> This makes me wonder if its crashing while merging because a field that
>> used to be there is now gone.  However, this seems odd to me as Solr
>> doesn't even let me delete the old data and instead its leaving my
>> collection in an extremely bad state, with the only remedy I can think of
>> is to nuke the index at the filesystem level.
>>
>> If this is indeed the cause of the crash, is the only way to delete a field
>> to first completely empty your index first?
>>
>>
>> On Mon, Oct 28, 2013 at 6:34 PM, Matthew Shapiro  wrote:
>>
>> > Thanks for your response.
>> >
>> > You were right, solr is logging to the catalina.out file for tomcat.
>>  When
>> > I click the optimize button in solr's admin interface the following logs
>> > are written: http://apaste.info/laup
>> >
>> > About JVM memory, solr's admin interface is listing JVM memory at 3.1%
>> > (221.7MB is dark grey, 512.56MB light grey and 6.99GB total).
>> >
>> >
>> > On Mon, Oct 28, 2013 at 6:29 AM, Erick Erickson > >wrote:
>> >
> >> For Tomcat, the Solr log output is often put into catalina.out
> >> as a default, so the output might be there. You can
>> >> configure Solr to send the logs most anywhere you
>> >> please, but without some specific setup
>> >> on your part the log output just goes to the default
>> >> for the servlet.
>> >>
>> >> I took a quick glance at the code but since the merges
>> >> are happening in the background, there's not much
>> >> context for where that error is thrown.
>> >>
>> >> How much memory is there for the JVM? I'm grasping
>> >> at straws a bit...
>> >>
>> >> Erick
>> >>
>> >>
>> >> On Sun, Oct 27, 2013 at 9:54 PM, Matthew Shapiro 
>> wrote:
>> >>
>> >> > I am working at implementing solr to work as the search backend for
>> our
>> >> web
>> >> > system.  So far things have been going well, but today I made some
>> >> schema
>> >> > changes and now things have broken.
>> >> >
>> >> > I updated the schema.xml file and reloaded the core (via the admin
>> >> > interface).  No errors were reported in the logs.
>> >> >
>> >> > I then pushed 100 records to be indexed.  A call to Commit afterwards
>> >> > seemed fine, however my next call for Optimize caused the following
>> >> errors:
>> >> >
>> >> > java.io.IOException: background merge hit exception:
>> >> > _2n(4.4):C4263/154 _30(4.4):C134 _32(4.4):C10 _31(4.4):C10 into _37
>> >> > [maxNumSegments=1]
>> >> >
>> >> > null:java.io.IOException: background merge hit exception:
>> >> > _2n(4.4):C4263/154 _30(4.4):C134 _32(4.4):C10 _31(4.4):C10 into _37
>> >> > [maxNumSegments=1]
>> >> >
>> >> >
>> >> > Unfortunately, googling for background merge hit exception came up
>> >> > with 2 thing: a corrupt index or not enough free space.  The host
>> >> > machine that's hosting solr has 227 out of 229GB free (according to df
>> >> > -h), so that's not it.
>> >> >
>> >> >
>> >> > I then ran CheckIndex on the index, and got the following results:
>> >> > http://apaste.info/gmGU
>> >> >
>> >> >
>> >> > As someone who is new to solr and lucene, as far as I can tell this
>> >> > means my index is fine. So I am coming up at a loss. I'm somewhat sure
>> >> > that I could probably delete my data directory and rebuild it but I am
>> >> > more interested in finding out why is it having issues, what is the
>> >> > best way to fix it, and what is the best way to prevent it from
>> >> > happening when this goes into production.
>> >> >
>> >> >
>> >> > Does anyone have any advice that may help?
>> >> >
>> >> >
>> >> > As an aside, i do not have a stacktrace for you because the solr admin
>> >> > page isn't giving me one.  I tried looking in my logs file in my solr
>> >> > directory, but it does not contain any logs.  I opened up my
>> >> > ~/tomcat/lib/log4

Re: Why do people want to deploy to Tomcat?

2013-11-13 Thread Robert Muir
which example? there are so many.

On Wed, Nov 13, 2013 at 1:00 PM, Mark Miller  wrote:
> RE: the example folder
>
> It’s something I’ve been pushing towards moving away from for a long time - 
> see https://issues.apache.org/jira/browse/SOLR-3619 Rename 'example' dir to 
> 'server' and pull examples into an 'examples’ directory
>
> Part of a push I’ve been on to own the Container level (people are now on 
> board with that for 5.0), add start scripts, and other niceties that we 
> should have but don’t yet.
>
> Even our config files should move away from being an “example” and end up 
> more like a default starting template. Like a database, it should be simple 
> to create a collection without needing to deal with config - you want to deal 
> with the config when you need to, not face it all up front every time it is 
> time to create a new collection.
>
> IMO, the name example is historical - most people already use it this way, 
> the name just confuses matters.
>
> - Mark
>
>
> On Nov 13, 2013, at 12:30 PM, Shawn Heisey  wrote:
>
>> On 11/13/2013 5:29 AM, Dmitry Kan wrote:
>>> Reading that people have considered deploying "example" folder is slightly
>>> strange to me. No wonder they are confused and confuse their ops.
>>
>> I do use the stripped jetty included in the example, but my setup is not a 
>> straight copy of the example directory. I removed a lot of it and changed 
>> how jars get loaded.  I built my own init script from scratch, tailored for 
>> my setup.
>>
>> I'll start a new thread with my init script and some info about how I 
>> installed Solr.
>>
>> Thanks,
>> Shawn
>>
>


Re: Bad fieldNorm when using morphologic synonyms

2013-12-06 Thread Robert Muir
Your analyzer needs to set positionIncrement correctly: sounds like it's broken.
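
For example, a filter that stacks a stem on top of the original word usually
looks something like this sketch (the stem() call is a stand-in for your
morphologic stemmer); the important part is the zero position increment on
the injected token:

  import java.io.IOException;
  import org.apache.lucene.analysis.TokenFilter;
  import org.apache.lucene.analysis.TokenStream;
  import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
  import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
  import org.apache.lucene.util.AttributeSource;

  public final class StemStackFilter extends TokenFilter {
    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
    private final PositionIncrementAttribute posIncAtt = addAttribute(PositionIncrementAttribute.class);
    private AttributeSource.State saved;
    private String pendingStem;

    public StemStackFilter(TokenStream input) {
      super(input);
    }

    @Override
    public boolean incrementToken() throws IOException {
      if (pendingStem != null) {
        restoreState(saved);                    // reuse offsets etc. of the original word
        termAtt.setEmpty().append(pendingStem);
        posIncAtt.setPositionIncrement(0);      // stacked on the same position
        pendingStem = null;
        return true;
      }
      if (!input.incrementToken()) {
        return false;
      }
      String stem = stem(termAtt.toString());   // hypothetical stemmer call
      if (stem != null && !stem.contentEquals(termAtt)) {
        pendingStem = stem;
        saved = captureState();
      }
      return true;
    }

    private String stem(String s) {
      return null;                              // plug in the real morphologic stemmer
    }
  }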

On Thu, Dec 5, 2013 at 1:53 PM, Isaac Hebsh  wrote:
> Hi,
> we implemented a morphologic analyzer, which stems words on index time.
> For some reasons, we index both the original word and the stem (on the same
> position, of course).
> The stemming is done on a specific language, so other languages are not
> stemmed at all.
>
> Because of that, two documents with the same amount of terms, may have
> different termVector size. document which contains many words that being
> stemmed, will have a double sized termVector. This behaviour affects the
> relevance score in a BAD way. the fieldNorm of these documents reduces
> thier score. This is NOT the wanted behaviour in our case.
>
> We are looking for a way to "mark" the stemmed words (on index time, of
> course) so they won't affect the fieldNorm. Do such a way exist?
>
> Do you have another idea?


Re: Bad fieldNorm when using morphologic synonyms

2013-12-06 Thread Robert Muir
Term vectors have nothing to do with any of this.

Please fix your analyzer first. If you want to add a synonym, it
should have a position increment of zero.

I bet exact phrase queries aren't working correctly either.

On Fri, Dec 6, 2013 at 12:50 AM, Isaac Hebsh  wrote:
> 1) positions look all right (for me).
> 2) fieldNorm is determined by the size of the termVector, isn't it? the
> termVector size isn't affected by the positions.
>
>
> On Fri, Dec 6, 2013 at 10:46 AM, Robert Muir  wrote:
>
>> Your analyzer needs to set positionIncrement correctly: sounds like its
>> broken.
>>
>> On Thu, Dec 5, 2013 at 1:53 PM, Isaac Hebsh  wrote:
>> > Hi,
>> > we implemented a morphologic analyzer, which stems words on index time.
>> > For some reasons, we index both the original word and the stem (on the
>> same
>> > position, of course).
>> > The stemming is done on a specific language, so other languages are not
>> > stemmed at all.
>> >
>> > Because of that, two documents with the same amount of terms, may have
>> > different termVector size. A document which contains many words that are being
>> > stemmed will have a double-sized termVector. This behaviour affects the
>> > relevance score in a BAD way: the fieldNorm of these documents reduces
>> > their score. This is NOT the wanted behaviour in our case.
>> >
>> > We are looking for a way to "mark" the stemmed words (on index time, of
>> > course) so they won't affect the fieldNorm. Do such a way exist?
>> >
>> > Do you have another idea?
>>


Re: Bad fieldNorm when using morphologic synonyms

2013-12-08 Thread Robert Muir
It's accurate; you are wrong.

Please look at setDiscountOverlaps in your similarity. This is really
easy to understand.
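
For reference, this is the knob being referred to, roughly (Solr exposes it
through the similarity configured in the schema; this is just the Lucene-level
setter):

  import org.apache.lucene.search.similarities.DefaultSimilarity;

  DefaultSimilarity sim = new DefaultSimilarity();
  // when true, tokens with positionIncrement == 0 (overlaps) do not count
  // toward the length used to compute the fieldNorm
  sim.setDiscountOverlaps(true);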

On Sun, Dec 8, 2013 at 7:23 AM, Manuel Le Normand
 wrote:
> Robert, you last reply is not accurate.
> It's true that the field norms and termVectors are independent. But this
> issue of higher norms for this case is expected with well assigned
> positions. The LengthNorm is assigned as FieldInvertState.length which is
> the count of incrementToken and not num of positions! It is the case for
> wordDelimiterFilter or ReversedWildcardFilter which do change the norm when
> expanding a term.


Re: Bad fieldNorm when using morphologic synonyms

2013-12-09 Thread Robert Muir
No, it's turned on by default in the default similarity.

As I said, all that is necessary is to fix your analyzer to emit the
proper position increments.

On Mon, Dec 9, 2013 at 12:27 PM, Manuel Le Normand
 wrote:
> In order to set discountOverlaps to true you must have added the
> <similarity> configuration to the schema.xml, which
> is commented out by default!
>
> As by default this param is false, the above situation is expected with
> correct positioning, as said.
>
> In order to fix the field norms you'd have to reindex with the similarity
> class which initializes the param to true.
>
> Cheers,
> Manu


Re: Tracking down the input that hits an analysis chain bug

2014-01-03 Thread Robert Muir
This exception comes from OffsetAttributeImpl (e.g. you don't need to
index anything to reproduce it).

Maybe you have a missing clearAttributes() call (your tokenizer
'returns true' without calling that first)? This could explain it, if
something like a StopFilter is also present in the chain: basically
the offsets overflow.

the test stuff in BaseTokenStreamTestCase should be able to detect
this as well...
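
For reference, the usual shape of incrementToken() is something like this
fragment (helper names here are hypothetical); the key point is calling
clearAttributes() before populating attributes for each returned token:

  @Override
  public boolean incrementToken() throws IOException {
    if (!readNextToken()) {              // hypothetical end-of-input check
      return false;
    }
    clearAttributes();                   // reset leftovers from the previous token
    termAtt.append(currentTokenText());  // hypothetical
    offsetAtt.setOffset(correctOffset(tokenStart()), correctOffset(tokenEnd())); // hypothetical
    return true;
  }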

On Fri, Jan 3, 2014 at 1:56 PM, Benson Margulies  wrote:
> Using Solr Cloud with 4.3.1.
>
> We've got a problem with a tokenizer that manifests as calling
> OffsetAtt.setOffsets() with invalid inputs. OK, so, we want to figure out
> what input provokes our code into getting into this pickle.
>
> The problem happens on SolrCloud nodes.
>
> The problem manifests as this sort of thing:
>
> Jan 3, 2014 6:05:33 PM org.apache.solr.common.SolrException log
> SEVERE: java.lang.IllegalArgumentException: startOffset must be
> non-negative, and endOffset must be >= startOffset,
> startOffset=-1811581632,endOffset=-1811581632
>
> How could we get a document ID so that we can tell which document was being
> processed?


[ANNOUNCE] Apache Solr 4.6.1 released.

2014-01-28 Thread Robert Muir
January 2014, Apache Solr™ 4.6.1 available

The Lucene PMC is pleased to announce the release of Apache Solr 4.6.1

Solr is the popular, blazing fast, open source NoSQL search platform
from the Apache Lucene project. Its major features include powerful
full-text search, hit highlighting, faceted search, dynamic
clustering, database integration, rich document (e.g., Word, PDF)
handling, and geospatial search. Solr is highly scalable, providing
fault tolerant distributed search and indexing, and powers the search
and navigation features of many of the world's largest internet sites.

Solr 4.6.1 is available for immediate download at:

http://lucene.apache.org/solr/mirrors-solr-latest-redir.html

Solr 4.6.1 includes 29 bug fixes and one optimization as well as
Lucene 4.6.1 and its bug fixes.

See the CHANGES.txt file included with the release for a full list of
changes and further details.

Please report any feedback to the mailing lists
(http://lucene.apache.org/solr/discussion.html)

Note: The Apache Software Foundation uses an extensive mirroring
network for distributing releases. It is possible that the mirror you
are using may not have replicated the release yet. If that is the
case, please try another mirror. This also goes for Maven access.


Re: Problems with ICUCollationField

2014-02-19 Thread Robert Muir
you need the solr analysis-extras jar in your classpath, too.



On Wed, Feb 19, 2014 at 6:45 AM, Thomas Fischer  wrote:

> Hello,
>
> I'm migrating to solr 4.6.1 and have problems with the ICUCollationField
> (apache-solr-ref-guide-4.6.pdf, pp. 31 and 100).
>
> I get consistently the error message
> Error loading class 'solr.ICUCollationField'.
> even after
> INFO: Adding
> 'file:/srv/solr4.6.1/contrib/analysis-extras/lib/icu4j-49.1.jar' to
> classloader
> and
> INFO: Adding
> 'file:/srv/solr4.6.1/contrib/analysis-extras/lucene-libs/lucene-analyzers-icu-4.6.1.jar'
> to classloader.
>
> Am I missing something?
>
> In solr's subversion I found
>
> /SVN/solr/contrib/analysis-extras/src/java/org/apache/solr/schema/ICUCollationField.java
> but no corresponding class in solr4.6.1's contrib folder.
>
> Best
> Thomas
>
>


Re: Problems with ICUCollationField

2014-02-19 Thread Robert Muir
you need the solr analysis-extras jar itself, too.



On Wed, Feb 19, 2014 at 8:25 AM, Thomas Fischer  wrote:

> Hello Robert,
>
> I already added
> contrib/analysis-extras/lib/
> and
> contrib/analysis-extras/lucene-libs/
> via lib directives in solrconfig, this is why the classes mentioned are
> loaded.
>
> Do you know which jar is supposed to contain the ICUCollationField?
>
> Best regards
> Thomas
>
>
>
> Am 19.02.2014 um 13:54 schrieb Robert Muir:
>
> > you need the solr analysis-extras jar in your classpath, too.
> >
> >
> >
> > On Wed, Feb 19, 2014 at 6:45 AM, Thomas Fischer 
> wrote:
> >
> >> Hello,
> >>
> >> I'm migrating to solr 4.6.1 and have problems with the ICUCollationField
> >> (apache-solr-ref-guide-4.6.pdf, pp. 31 and 100).
> >>
> >> I get consistently the error message
> >> Error loading class 'solr.ICUCollationField'.
> >> even after
> >> INFO: Adding
> >> 'file:/srv/solr4.6.1/contrib/analysis-extras/lib/icu4j-49.1.jar' to
> >> classloader
> >> and
> >> INFO: Adding
> >>
> 'file:/srv/solr4.6.1/contrib/analysis-extras/lucene-libs/lucene-analyzers-icu-4.6.1.jar'
> >> to classloader.
> >>
> >> Am I missing something?
> >>
> >> I solr's subversion I found
> >>
> >>
> /SVN/solr/contrib/analysis-extras/src/java/org/apache/solr/schema/ICUCollationField.java
> >> but no corresponding class in solr4.6.1's contrib folder.
> >>
> >> Best
> >> Thomas
> >>
> >>
>
>


Re: Problems with ICUCollationField

2014-02-19 Thread Robert Muir
Hmm, for standardization of text fields, collation might be a little
awkward.

For your german umlauts, what do you mean by standardize? is this to
achieve equivalency of e.g. oe to ö in your search terms?

In that case, a simpler approach would be to put
GermanNormalizationFilterFactory in your chain:
http://lucene.apache.org/core/4_6_1/analyzers-common/org/apache/lucene/analysis/de/GermanNormalizationFilter.html
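
Roughly what such a chain looks like at the Lucene level (in Solr you would
declare the corresponding factories on your fieldType instead; this is just a
sketch):

  import java.io.Reader;
  import org.apache.lucene.analysis.Analyzer;
  import org.apache.lucene.analysis.TokenStream;
  import org.apache.lucene.analysis.Tokenizer;
  import org.apache.lucene.analysis.core.LowerCaseFilter;
  import org.apache.lucene.analysis.de.GermanNormalizationFilter;
  import org.apache.lucene.analysis.standard.StandardTokenizer;
  import org.apache.lucene.util.Version;

  Analyzer analyzer = new Analyzer() {
    @Override
    protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
      Tokenizer source = new StandardTokenizer(Version.LUCENE_46, reader);
      TokenStream result = new LowerCaseFilter(Version.LUCENE_46, source);
      result = new GermanNormalizationFilter(result); // folds umlauts and the ae/oe/ue spellings
      return new TokenStreamComponents(source, result);
    }
  };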


On Wed, Feb 19, 2014 at 9:16 AM, Thomas Fischer  wrote:

> Thanks, that helps!
>
> I'm trying to migrate from the now deprecated ICUCollationKeyFilterFactory
> I used before to the ICUCollationField.
> Is there any description how to achieve this?
>
> First tries now yield
>
> ICUCollationField does not support specifying an analyzer.
>
> which makes it complicated since I used the ICUCollationKeyFilterFactory
> to standardize my text fields (in particular because of German Umlauts).
> But an ICUCollationField without LowerCaseFilter, a WhitespaceTokenizer, a
> LetterTokenizer, etc. doesn't do me much good, I'm afraid.
> Or is this somehow wrapped into the ICUCollationField?
>
> I didn't find ICUCollationField  in the solr wiki and not much information
> in the reference.
> And the hint
>
> "solr.ICUCollationField is included in the Solr analysis-extras contrib -
> see solr/contrib/analysis-extras/README.txt for instructions on which jars
> you need to add to your SOLR_HOME/lib in order to use it."
>
> is misleading insofar as this README.txt doesn't mention the
> solr-analysis-extras-4.6.1.jar in dist.
>
> Best
> Thomas
>
>
> Am 19.02.2014 um 14:27 schrieb Robert Muir:
>
> > you need the solr analysis-extras jar itself, too.
> >
> >
> >
> > On Wed, Feb 19, 2014 at 8:25 AM, Thomas Fischer 
> wrote:
> >
> >> Hello Robert,
> >>
> >> I already added
> >> contrib/analysis-extras/lib/
> >> and
> >> contrib/analysis-extras/lucene-libs/
> >> via lib directives in solrconfig, this is why the classes mentioned are
> >> loaded.
> >>
> >> Do you know which jar is supposed to contain the ICUCollationField?
> >>
> >> Best regards
> >> Thomas
> >>
> >>
> >>
> >> Am 19.02.2014 um 13:54 schrieb Robert Muir:
> >>
> >>> you need the solr analysis-extras jar in your classpath, too.
> >>>
> >>>
> >>>
> >>> On Wed, Feb 19, 2014 at 6:45 AM, Thomas Fischer 
> >> wrote:
> >>>
> >>>> Hello,
> >>>>
> >>>> I'm migrating to solr 4.6.1 and have problems with the
> ICUCollationField
> >>>> (apache-solr-ref-guide-4.6.pdf, pp. 31 and 100).
> >>>>
> >>>> I get consistently the error message
> >>>> Error loading class 'solr.ICUCollationField'.
> >>>> even after
> >>>> INFO: Adding
> >>>> 'file:/srv/solr4.6.1/contrib/analysis-extras/lib/icu4j-49.1.jar' to
> >>>> classloader
> >>>> and
> >>>> INFO: Adding
> >>>>
> >>
> 'file:/srv/solr4.6.1/contrib/analysis-extras/lucene-libs/lucene-analyzers-icu-4.6.1.jar'
> >>>> to classloader.
> >>>>
> >>>> Am I missing something?
> >>>>
> >>>> I solr's subversion I found
> >>>>
> >>>>
> >>
> /SVN/solr/contrib/analysis-extras/src/java/org/apache/solr/schema/ICUCollationField.java
> >>>> but no corresponding class in solr4.6.1's contrib folder.
> >>>>
> >>>> Best
> >>>> Thomas
> >>>>
> >>>>
> >>
> >>
>
>


Re: Problems with ICUCollationField

2014-02-19 Thread Robert Muir
On Wed, Feb 19, 2014 at 10:33 AM, Thomas Fischer  wrote:

>
> > Hmm, for standardization of text fields, collation might be a little
> > awkward.
>
> I arrived there after using custom rules for a while (see
> "RuleBasedCollator" on http://wiki.apache.org/solr/UnicodeCollation) and
> then being told
> "For better performance, less memory usage, and support for more locales,
> you can add the analysis-extras contrib and use
> ICUCollationKeyFilterFactory instead." (on the same page under "ICU
> Collation").
>
> > For your german umlauts, what do you mean by standardize? is this to
> > achieve equivalency of e.g. oe to ö in your search terms?
>
> That is the main point, but I might also need the additional normalization
> of combined characters like
> o+  ̈ = ö and probably similar constructions for other languages (like
> Hungarian).
>

Sure but using collation to get normalization is pretty overkill too. Maybe
try ICUNormalizer2Filter? This gives you better control over the
normalization anyway.
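
A sketch of that, assuming lucene-analyzers-icu (and its icu4j dependency) are
on the classpath and "tokenizer" is whatever Tokenizer starts your chain; the
two lines below are alternatives, not a chain:

  import com.ibm.icu.text.Normalizer2;
  import org.apache.lucene.analysis.TokenStream;
  import org.apache.lucene.analysis.icu.ICUNormalizer2Filter;

  TokenStream nfkcFolded = new ICUNormalizer2Filter(tokenizer);  // default: NFKC + case folding
  TokenStream nfcOnly = new ICUNormalizer2Filter(tokenizer,
      Normalizer2.getInstance(null, "nfc", Normalizer2.Mode.COMPOSE)); // composition only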


>
> > In that case, a simpler approach would be to put
> > GermanNormalizationFilterFactory in your chain:
> >
> http://lucene.apache.org/core/4_6_1/analyzers-common/org/apache/lucene/analysis/de/GermanNormalizationFilter.html
>
> I'll see how far I get with this, but from the description
> • 'ä', 'ö', 'ü' are replaced by 'a', 'o', 'u', respectively.
> • 'ae' and 'oe' are replaced by 'a', and 'o', respectively.
> this seems to be too far-reaching a reduction: while the identification
> "ä=ae" is not very serious and rarely misleading, "ä=a" might pack words
> together that shouldn't be, "Äsen" and "Asen" are quite different concepts,
>

I'm not sure that's a mainstream opinion: not only do the default German
collation rules conflate these two characters as equivalent at the primary
level, but so do many German stemming algorithms. Similar arguments could
be made for 'résumé' versus 'resume' and so on. Search isn't an exact
science.


Re: ANNOUNCE: Apache Solr Reference Guide for 4.7

2014-03-05 Thread Robert Muir
I debugged the PDF a little. FWIW, the following code (using iText)
takes it to 9MB:

  public static void main(String args[]) throws Exception {
    Document document = new Document();
    PdfSmartCopy copy = new PdfSmartCopy(document,
        new FileOutputStream("/home/rmuir/Downloads/test.pdf"));
    //copy.setCompressionLevel(9);
    //copy.setFullCompression();
    document.open();
    PdfReader reader = new PdfReader("/home/rmuir/Downloads/apache-solr-ref-guide-4.7.pdf");
    int pages = reader.getNumberOfPages();
    for (int i = 0; i < pages; i++) {
      PdfImportedPage page = copy.getImportedPage(reader, i+1);
      copy.addPage(page);
    }
    copy.freeReader(reader);
    reader.close();
    document.close();
  }


On Wed, Mar 5, 2014 at 10:17 AM, Steve Rowe  wrote:
> Not sure if it’s relevant anymore, but a few years ago Atlassian resolved as 
> "won’t fix” a request to configure exported PDF compression ratio: 
> .  Their suggestion: zip the 
> PDF.  I tried that - the resulting zip size is roughly 9MB, so it’s 
> definitely compressible.
>
> Steve
>
> On Mar 5, 2014, at 10:03 AM, Cassandra Targett  wrote:
>
>> You know, I didn't even notice that. It did go up to 30M.
>>
>> I've made a note to look into that before we release the 4.8 version to see
>> if it can be reduced at all. I suspect the screenshots are causing it to
>> balloon - we made some changes to the way they appear in the PDF for 4.7
>> which may be the cause, but also the software was upgraded and maybe the
>> newer version is handling them differently.
>>
>> Thanks for pointing that out.
>>
>>
>> On Tue, Mar 4, 2014 at 6:43 PM, Alexandre Rafalovitch 
>> wrote:
>>
>>> Has it really gone up in size from 5Mb for 4.6 version to 30Mb for 4.7
>>> version? Or some mirrors are playing tricks (mine is:
>>> http://www.trieuvan.com/apache/lucene/solr/ref-guide/ )
>>>
>>> Regards,
>>>   Alex.
>>> Personal website: http://www.outerthoughts.com/
>>> LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
>>> - Time is the quality of nature that keeps events from happening all
>>> at once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD
>>> book)
>>>
>>>
>>> On Wed, Mar 5, 2014 at 1:39 AM, Cassandra Targett 
>>> wrote:
 The Lucene PMC is pleased to announce that we have a new version of the
 Solr Reference Guide available for Solr 4.7.

 The 395 page PDF serves as the definitive user's manual for Solr 4.7. It
 can be downloaded from the Apache mirror network:

 https://www.apache.org/dyn/closer.cgi/lucene/solr/ref-guide/

 Cassandra
>>>
>


Re: Using per-segment FieldCache or DocValues in custom component?

2013-07-02 Thread Robert Muir
Where do you get the docid from? Usually it's best to just look at the whole
algorithm, e.g. docids come from per-segment readers by default anyway, so
ideally you want to access any per-document things from that same
segment reader.

As far as supporting docvalues, the FieldCache API "passes thru" to docvalues
transparently if it's enabled for the field.
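
A rough sketch of the per-segment version of your example (the field name is
taken from your mail, everything else is assumed):

  import java.util.List;
  import org.apache.lucene.index.AtomicReaderContext;
  import org.apache.lucene.index.ReaderUtil;
  import org.apache.lucene.search.FieldCache;

  List<AtomicReaderContext> leaves = topLevelReader.leaves();
  AtomicReaderContext leaf = leaves.get(ReaderUtil.subIndex(docId, leaves));
  FieldCache.Longs values =
      FieldCache.DEFAULT.getLongs(leaf.reader(), "foobar", false);
  long value = values.get(docId - leaf.docBase);   // convert to a per-segment docid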

On Mon, Jul 1, 2013 at 4:55 PM, Michael Ryan  wrote:

> I have some custom code that uses the top-level FieldCache (e.g.,
> FieldCache.DEFAULT.getLongs(reader, "foobar", false)). I'd like to redesign
> this to use the per-segment FieldCaches so that re-opening a Searcher is
> fast(er). In most cases, I've got a docId and I want to get the value for a
> particular single-valued field for that doc.
>
> Is there a good place to look to see example code of per-segment
> FieldCache use? I've been looking at PerSegmentSingleValuedFaceting, but
> hoping there might be something less confusing :)
>
> Also thinking DocValues might be a better way to go for me... is there any
> documentation or example code for that?
>
> -Michael
>


Re: WikipediaTokenizer for Removing Unnecesary Parts

2013-07-23 Thread Robert Muir
If you use WikipediaTokenizer it will tag different wiki elements with
different types (you can see it in the admin UI).

So then follow up with TypeTokenFilter to keep only the types you care
about, and I think it will do what you want.

On Tue, Jul 23, 2013 at 7:53 AM, Furkan KAMACI wrote:

> Hi;
>
> I have indexed wikipedia data with Solr DIH. However when I look data that
> is indexed at Solr I something like that as well:
>
> {| style="text-align: left; width: 50%; table-layout: fixed;" border="0"
> |- valign="top"
> | style="width: 50%"|
> :*[[Ubuntu]]
> :*[[Fedora]]
> :*[[Mandriva]]
> :*[[Linux Mint]]
> :*[[Debian]]
> :*[[OpenSUSE]]
> |
> *[[Red Hat]]
> *[[Mageia]]
> *[[Arch Linux]]
> *[[PCLinuxOS]]
> *[[Slackware]]
> |}
>
> However I want to remove them before indexing. I know that there is a
> WikipediaTokenizer in Lucene but how can I remove unnecessary parts ( as
> like links, style, etc..) with Solr?
>


Re: Invalid UTF-8 character 0xfffe during shard update

2013-08-05 Thread Robert Muir
On Mon, Aug 5, 2013 at 11:42 AM, Chris Hostetter
 wrote:
>
> : I agree with you, 0xfffe is a special character, that is why I was asking
> : how it's handled in solr.
> : In my document, 0xfffe does not appear at the beginning, it's in the
> : content.
>
> Unless i'm missunderstanding something (and it's very likely that i am)...
>
> 0xfffe is not a special character -- it is explicitly *not* a character in
> Unicode at all, it is set asside as "not a character." specifically so
> that the character 0xfeff can be used as a BOM, and if the BOM is read
> incorrectly, it will cause an error.

XML doesn't allow control characters like this; it defines Char as:

Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] |
[#x10000-#x10FFFF] /* any Unicode character, excluding the surrogate
blocks, FFFE, and FFFF. */
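
A hedged sketch of enforcing that production on the client side before sending
documents (the helper name is ours):

  static String stripInvalidXmlChars(String s) {
    StringBuilder sb = new StringBuilder(s.length());
    int i = 0;
    while (i < s.length()) {
      int cp = s.codePointAt(i);
      i += Character.charCount(cp);
      boolean legal = cp == 0x9 || cp == 0xA || cp == 0xD
          || (cp >= 0x20 && cp <= 0xD7FF)
          || (cp >= 0xE000 && cp <= 0xFFFD)
          || (cp >= 0x10000 && cp <= 0x10FFFF);
      if (legal) {
        sb.appendCodePoint(cp);
      }
    }
    return sb.toString();
  }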


Re: Invalid UTF-8 character 0xfffe during shard update

2013-08-05 Thread Robert Muir
On Mon, Aug 5, 2013 at 3:03 PM, Chris Hostetter
 wrote:
>
> : > 0xfffe is not a special character -- it is explicitly *not* a character in
> : > Unicode at all, it is set asside as "not a character." specifically so
> : > that the character 0xfeff can be used as a BOM, and if the BOM is read
> : > incorrectly, it will cause an error.
> :
> : XML doesnt allow control character like this, it defines character as:
>
> But is that even relevant?  I thought FFFE was *not* a control character?
> I thought it was completely invaid in Unicode.
>

It's totally relevant. FFFE is a Unicode codepoint, but it's a noncharacter.

It's just that XML disallows the FFFE and FFFF noncharacters, but allows
other noncharacters (like 9)
These are "allowed but discouraged": http://www.w3.org/TR/xml11/#charsets


Re: Purging unused segments.

2013-08-09 Thread Robert Muir
On Fri, Aug 9, 2013 at 7:48 PM, Erick Erickson  wrote:
>
> So is there a good way, without optimizing, to purge any segments not
> referenced in the segments file? Actually I doubt that optimizing would
> even do it if I _could_, any phantom segments aren't visible from the
> segments file anyway...
>

I don't know why you have these files (windows? deletion policy?) but
maybe you are interested in this:

http://lucene.apache.org/core/4_4_0/core/org/apache/lucene/index/IndexWriter.html#deleteUnusedFiles%28%29


Re: Is there a way to store binary data (byte[]) in DocValues?

2013-08-12 Thread Robert Muir
On Mon, Aug 12, 2013 at 8:38 AM, Mathias Lux  wrote:
> Hi!
>
> I'm basically searching for a method to put byte[] data into Lucene
> DocValues of type BINARY (see [1]). Currently only primitives and
> Strings are supported according to [1].
>
> I know that this can be done with a custom update handler, but I'd
> like to avoid that.
>

Can you describe a little bit what kind of operations you want to do with it?
I don't really know how BinaryField is typically used, but maybe it
could support this option. On the other hand adding it to BinaryField
might not "buy" you much without some additional stuff depending upon
what you need to do. Like if you really want to do sort/facet on the
thing, SORTED(SET) would probably be a better implementation: it
doesn't care that the values are binary.

BINARY, SORTED, and SORTED_SET actually all take byte[]: the difference is:
* SORTED: deduplicates/compresses the unique byte[]'s and gives each
document an ordinal number that reflects sort order (for
sorting/faceting/grouping/etc)
* SORTED_SET: similar, except each document has a "set" (which can be
empty), of ordinal numbers (e.g. for faceting multivalued fields)
* BINARY: just stores the byte[] for each document (no deduplication,
no compression, no ordinals, nothing).

So for sorting/faceting: BINARY is generally not very efficient unless
there is something custom going on: for example lucene's faceting
package stores the "values" elsewhere in a separate taxonomy index, so
it uses this type just to encode a delta-compressed ordinal list for
each document.

For scoring factors/function queries: encoding the values inside
NUMERIC(s) [up to 64 bits each] might still be best on average: the
compression applied here is surprisingly efficient.
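
For illustration, the Lucene-level fields look roughly like this (field names
made up); all three accept arbitrary byte[] via BytesRef:

  import org.apache.lucene.document.BinaryDocValuesField;
  import org.apache.lucene.document.Document;
  import org.apache.lucene.document.SortedDocValuesField;
  import org.apache.lucene.document.SortedSetDocValuesField;
  import org.apache.lucene.util.BytesRef;

  Document doc = new Document();
  byte[] sig = computeSignature();                                     // hypothetical
  doc.add(new BinaryDocValuesField("sigBinary", new BytesRef(sig)));   // BINARY: raw per-doc bytes
  doc.add(new SortedDocValuesField("sigSorted", new BytesRef(sig)));   // SORTED: dedup + ordinals
  doc.add(new SortedSetDocValuesField("sigSet", new BytesRef(sig)));   // SORTED_SET: ordinal set per doc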


Re: Is there a way to store binary data (byte[]) in DocValues?

2013-08-12 Thread Robert Muir
On Mon, Aug 12, 2013 at 12:25 PM, Mathias Lux  wrote:
>
> Another thing for not using the SORTED_SET and SORTED
> implementations is, that Solr currently works with Strings on that and
> I want to have a small memory footprint for millions of images ...
> which does not go well with immutables.

Just as a side note, again these work with byte[]. It happens to be
the case that solr uses these for its StringField (converting the
strings to bytes), but if you wanted to use these with BinaryField you
could (they just take BytesRef).


Re: Split Shard Error - maxValue must be non-negative

2013-08-13 Thread Robert Muir
Did you do a (real) commit before trying to use this?
I am not sure how this splitting works, but at least the merge option
requires that.

I can't see this happening unless you are somehow splitting a
0-document index (or the splitter is creating 0-document splits),
so this is likely just a symptom of
https://issues.apache.org/jira/browse/LUCENE-5116

On Tue, Aug 13, 2013 at 6:46 AM, Srivatsan  wrote:
> Hi,
>
> I am experimenting with solr 4.4.0 split shard feature. When i split the
> shard i am getting the following exception.
>
> /java.lang.IllegalArgumentException: maxValue must be non-negative (got: -1)
> at
> org.apache.lucene.util.packed.PackedInts.bitsRequired(PackedInts.java:1184)
> at
> org.apache.lucene.codecs.lucene42.Lucene42DocValuesConsumer.addNumericField(Lucene42DocValuesConsumer.java:140)
> at
> org.apache.lucene.codecs.lucene42.Lucene42DocValuesConsumer.addNumericField(Lucene42DocValuesConsumer.java:92)
> at
> org.apache.lucene.codecs.DocValuesConsumer.mergeNumericField(DocValuesConsumer.java:112)
> at 
> org.apache.lucene.index.SegmentMerger.mergeNorms(SegmentMerger.java:221)
> at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:119)
> at 
> org.apache.lucene.index.IndexWriter.addIndexes(IndexWriter.java:2488)
> at
> org.apache.solr.update.SolrIndexSplitter.split(SolrIndexSplitter.java:125)
> at
> org.apache.solr.update.DirectUpdateHandler2.split(DirectUpdateHandler2.java:766)
> at
> org.apache.solr.handler.admin.CoreAdminHandler.handleSplitAction(CoreAdminHandler.java:284)
> at
> org.apache.solr.handler.admin.CoreAdminHandler.handleRequestBody(CoreAdminHandler.java:186)
> at
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
> at
> org.apache.solr.servlet.SolrDispatchFilter.handleAdminRequest(SolrDispatchFilter.java:611)
> at
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:209)
> at
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:158)
> at
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1419)
> at
> org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:455)
> at
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
> at
> org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:557)
> at
> org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
> at
> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1075)
> at
> org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:384)
> at
> org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)
> at
> org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1009)
> at
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
> at
> org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
> at
> org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)
> at
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
> at org.eclipse.jetty.server.Server.handle(Server.java:368)
> at
> org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:489)
> at
> org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53)
> at
> org.eclipse.jetty.server.AbstractHttpConnection.content(AbstractHttpConnection.java:953)
> at
> org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content(AbstractHttpConnection.java:1014)
> at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:861)
> at 
> org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:240)
> at
> org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)
> at
> org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)
> at
> org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
> at
> org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
> at java.lang.Thread.run(Thread.java:679)/
>
>
> How to resolve this problem?
>
>
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Split-Shard-Error-maxValue-must-be-non-negative-tp4084220.html
> Sent from the Solr - User mailing list archive at Nabble.com.


Re: Split Shard Error - maxValue must be non-negative

2013-08-13 Thread Robert Muir
Well, I meant before, but I just took a look and this is implemented
differently than the "merge" one.

In any case, I think it's the same bug, because I think the only way
this can happen is if somehow this splitter is trying to create a
0-document "split" (or maybe a split containing all deletions).

On Tue, Aug 13, 2013 at 8:22 AM, Srivatsan  wrote:
> Ya i am performing commit after split request is submitted to server.
>
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Split-Shard-Error-maxValue-must-be-non-negative-tp4084220p4084256.html
> Sent from the Solr - User mailing list archive at Nabble.com.


Re: Split Shard Error - maxValue must be non-negative

2013-08-13 Thread Robert Muir
On Tue, Aug 13, 2013 at 11:39 AM, Shalin Shekhar Mangar
 wrote:
> The splitting code calls commit before it starts the splitting. It creates
> a LiveDocsReader using a bitset created by the split. This reader is merged
> to an index using addIndexes.
>
> Shouldn't the addIndexes code then ignore all such 0-document segments?
>
>

Not in 4.4: https://issues.apache.org/jira/browse/LUCENE-5116


Re: PostingsHighlighter returning fields which don't match

2013-08-14 Thread Robert Muir
On Wed, Aug 14, 2013 at 3:53 AM, ses  wrote:

> We are trying out the new PostingsHighlighter with Solr 4.2.1 and finding
> that the highlighting section of the response includes self-closing tags
> for
> all the fields in hl.fl (by default for edismax it is all fields in qf)
> where there are no highlighting matches. In contrast the same query on Solr
> 4.0.0 without PostingsHighlighter it returns only the fields containing
> highlighting matches.
>
> here is a simplified example of the highlighting response for a document
> with no matches in the fields specified by hl.fl:
> with PostingsHighlighter:
> 
>   ...
>   
> 
>   
>   
>   
>   ...
> 
>   
> 
>
> without PostingsHighlighter:
> 
>   ...
>   
> 
>   
> 
>

Do you want to open a JIRA issue to just change the behavior?


> This is a big problem for us as we have a large number of fields in a
> dynamic field and we believe every time a highlighted response comes back
> it
> is sending us a very large number of self-closing tags which bloats the
> response to an unreasonable size (in some cases 100MB+).
>

Unrelated: If your queries actually go against a large number of fields,
I'm not sure how efficient this highlighter will be. That's because at some
number N of fields, it will be much more efficient to use a
document-oriented term vector approach (e.g. the standard
highlighter/fast-vector-highlighter).


Re: Who's cleaning the Fieldcache?

2013-08-14 Thread Robert Muir
On Wed, Aug 14, 2013 at 5:29 PM, Chris Hostetter
 wrote:
>
> : why? Those are my sort fields and they are occupying a lot of space (doubled
> : in this case but I see that sometimes I have three or four "old" segment
> : references)
> :
> : Is there something I can do to remove those old references? I tried to 
> reload
> : the core and it seems the old references are discarded (i.e. garbage
> : collected) but I believe is not a good workaround, I would avoid to reload 
> the
> : core for every replication cycle.
>
> You don't need to reload the core to get rid of the old FieldCaches -- in
> fact, there is nothing about reloading the core that will garuntee old
> FieldCaches get removed.
>
> FieldCaches are managed using a WeakHashMap - so once the IndexReader's
> associated with those FieldCaches are no longer used, they will be garbage
> collected when and if the JVM's garbage collector gets around to it.
>
> if they sit around after you are done with them, they might look like
> they take up a lot of memory, but that just means your JVM Heap has that memory
> to spare and hasn't needed to clean them up yet.

I don't think this is correct.

When you register an entry in the fieldcache, it registers event
listeners on the segment's core so that when its close()d, any entries
are purged rather than waiting on GC.

See FieldCacheImpl.java


Re: Who's cleaning the Fieldcache?

2013-08-14 Thread Robert Muir
On Wed, Aug 14, 2013 at 5:58 PM, Chris Hostetter
 wrote:
>
> : > FieldCaches are managed using a WeakHashMap - so once the IndexReader's
> : > associated with those FieldCaches are no longer used, they will be garbage
> : > collected when and if the JVM's garbage collector gets around to it.
> : >
> : > if they sit around after you are done with them, they might look like
> : > they take up a lot of memory, but that just means your JVM Heap has that memory
> : > to spare and hasn't needed to clean them up yet.
> :
> : I don't think this is correct.
> :
> : When you register an entry in the fieldcache, it registers event
> : listeners on the segment's core so that when its close()d, any entries
> : are purged rather than waiting on GC.
> :
> : See FieldCacheImpl.java
>
> Ah ... sweet.  I didn't realize that got added.
>
> (In any case: it looks like a WeakHashMap is still used in case the
> listeners never get called, correct?)
>

I think it might be the other way around: I think it was a weak map
before always; the close listeners were then added sometime in the 3.x
series, so we registered purge events "as an optimization".

But one way to look at it is: readers should really get closed, so why
have the weak map and not just a regular hashmap.

Even if we want to keep the weak map (seriously I don't care, and I
don't want to be the guy fielding complaints on this), I'm going to
open an issue with a patch that removes it and fails tests in
@AfterClass if there are any entries. This way it's totally clear
if/when/where anything is "relying on GC" today here and we can at
least look at that.


Re: Problems installing Solr4 in Jetty9

2013-08-17 Thread Robert Muir
On Sat, Aug 17, 2013 at 3:59 AM, Chris Collins  wrote:
> I am using 4.4 in an embedded mode and found that it has a dependency on 
> hadoop 2.0.5-alpha that in turn depends on jetty 6.1.26 which I think 
> pre-dates electricity :-}
>

I think this is only a "test dependency" ?


Re: Solr using a ridiculous amount of memory

2013-03-24 Thread Robert Muir
On Sun, Mar 24, 2013 at 4:19 AM, John Nielsen  wrote:

> Schema with DocValues attempt at solving problem:
> http://pastebin.com/Ne23NnW4
> Config: http://pastebin.com/x1qykyXW
>

This schema isn't using docvalues, due to a typo in your config.
it should not be DocValues="true" but docValues="true".

Are you not getting an error? Solr needs to throw an exception if you
provide invalid attributes to the field. Nothing is more frustrating
than having a typo or something in your configuration and solr just
ignores it, reports no error, and "doesn't work the way you want".
I'll look into this (I already intend to add these checks to analysis
factories for the same reason).

Separately, if you really want the terms data and so on to remain on
disk, it is not enough to "just enable docvalues" for the field. The
default implementation uses the heap. So if you want that, you need to
set docValuesFormat="Disk" on the fieldtype. This will keep the
majority of the data on disk, and only some key datastructures in heap
memory. This might have significant performance impact depending upon
what you are doing so you need to test that.


Re: Requesting to add into a Contributor Group

2013-05-04 Thread Robert Muir
done. let us know if you have any problems.

On Sat, May 4, 2013 at 10:12 AM, Krunal  wrote:

> Dear Sir,
>
> Kindly add me to the contributor group to help me contribute to the Solr
> wiki.
>
> My Email id: jariwalakru...@gmail.com
> Login Name: Krunal
>
> Specific changes I would like to make to begin with are:
>
> - Correct Link of Ajax Solr here http://wiki.apache.org/solr/SolrJS which
> is wrong, the correct link should be
> https://github.com/evolvingweb/ajax-solr/wiki
>
> - Add our company data here http://wiki.apache.org/solr/Support
>
> We offer Solr integration service on Dot Net Platform at Xcellence-IT.
>
> And business division of ours, i.e. nopAccelerate - offers a Solr
> Integration Plugin for nopCommerce along with other nopCommerce performance
> optimization services.
>
>
> We have been working on Solr since last 1 years and will be happy to
> contribute back by helping community maintain & update Wiki. If this is not
> allowed, then kindly let us know so I will send you our Company details so
> you can make changes too.
>
> Thanks,
>
> Awaiting your response.
>
> Krunal
>
> *Krunal Jariwala*
>
>
> *Cell:* +91-98251-07747
>
> *Best time to Call:* 9am to 7pm (IST) GMT +5.30
>


Re: Are there any plans to change example directory layout?

2013-06-11 Thread Robert Muir
If you have a good idea... Just do it. Open an issue
On Jun 11, 2013 9:34 PM, "Alexandre Rafalovitch"  wrote:

> I think it is quite hard for beginners that basic solr example
> directory is competing for attention with other - nested - examples. I
> see quite a lot of questions on which directory inside 'example' to
> pay attention to and which to ignore, etc.
>
> Actually, this is so confusing, I am not even sure how to put this in
> writing.
>
> Basically, is anybody aware of people looking into example directory
> structure? A JIRA maybe?
>
> Regards,
>Alex.
> Personal blog: http://blog.outerthoughts.com/
> LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
> - Time is the quality of nature that keeps events from happening all
> at once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD
> book)
>


Re: Example setting TieredMergePolicy for Solr 3.3 or 3.4?

2011-09-16 Thread Robert Muir
On Fri, Sep 16, 2011 at 6:53 PM, Burton-West, Tom  wrote:
> Hello,
>
> The TieredMergePolicy has become the default with Solr 3.3, but the 
> configuration in the example uses the mergeFactor setting which applies to the 
> LogByteSizeMergePolicy.
>
> How is the mergeFactor interpreted by the TieredMergePolicy?
>
> Is there an example somewhere showing how to configure the Solr 
> TieredMergePolicy to set the parameters:
> setMaxMergeAtOnce, setSegmentsPerTier, and setMaxMergedSegmentMB?

an example is here:
http://svn.apache.org/repos/asf/lucene/dev/trunk/solr/core/src/test-files/solr/conf/solrconfig-mergepolicy.xml

>
> I tried setting setMaxMergedSegmentMB in Solr 3.3
> 
>      20
>      40
>     
> 2
>    
>
>
> and got this error message
> "SEVERE: java.lang.RuntimeException: no setter corrresponding to 
> 'setMaxMergedSegmentMB' in org.apache.lucene.index.TieredMergePolicy"

Right, I think it should be:

2
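
The underlying Lucene setters, for reference (the values here are made up);
the solrconfig property name is just the setter name with the leading "set"
dropped and the next letter lowercased:

  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.index.IndexWriterConfig;
  import org.apache.lucene.index.TieredMergePolicy;
  import org.apache.lucene.util.Version;

  TieredMergePolicy mp = new TieredMergePolicy();
  mp.setMaxMergeAtOnce(20);          // maxMergeAtOnce
  mp.setSegmentsPerTier(40);         // segmentsPerTier
  mp.setMaxMergedSegmentMB(5000);    // maxMergedSegmentMB
  IndexWriterConfig iwc =
      new IndexWriterConfig(Version.LUCENE_33, new StandardAnalyzer(Version.LUCENE_33));
  iwc.setMergePolicy(mp);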


-- 
lucidimagination.com


Re: Example setting TieredMergePolicy for Solr 3.3 or 3.4?

2011-09-20 Thread Robert Muir
On Mon, Sep 19, 2011 at 9:57 AM, Burton-West, Tom  wrote:
> Thanks Robert,
>
> Removing "set" from " setMaxMergedSegmentMB" and using "maxMergedSegmentMB" 
> fixed the problem.
> ( Sorry about the multiple posts.  Our mail server was being flaky and the 
> client lied to me about whether the message had been sent.)
>
> I'm still confused about the mergeFactor=10 setting in the example 
> configuration.  Took a quick look at the code, but I'm obviously looking in 
> the wrong place. Is mergeFactor=10 interpreted by TieredMergePolicy as
> segmentsPerTier=10 and maxMergeAtOnce=10?   If I specify values for these is 
> the mergeFactor setting ignored?

Sorry, I just now noticed you responded!

Yes, mergeFactor=10 is interpreted as both segmentsPerTier and maxMergeAtOnce.
Yes, specifying explicit TieredMP parameters will override whatever
you set in mergeFactor (which is basically only interpreted to be
backwards compatible).

This is why I created this confusing test configuration: to test this
exact case.

-- 
lucidimagination.com


Re: MMapDirectory failed to map a 23G compound index segment

2011-09-21 Thread Robert Muir
On Tue, Sep 20, 2011 at 12:32 PM, Michael McCandless
 wrote:
>
> Or: is it possible you reopened the reader several times against the
> index (ie, after committing from Solr)?  If so, I think 2.9.x never
> unmaps the mapped areas, and so this would "accumulate" against the
> system limit.

In order to unmap in Lucene 2.9.x you must specifically turn this
unmapping on with setUseUnmapHack(true)
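
Roughly (the path here is made up):

  import java.io.File;
  import org.apache.lucene.store.MMapDirectory;

  MMapDirectory dir = new MMapDirectory(new File("/path/to/index"));
  if (MMapDirectory.UNMAP_SUPPORTED) {
    dir.setUseUnmapHack(true);   // otherwise mapped regions linger until GC
  }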

-- 
lucidimagination.com


Re: payloads - Inconsistency between the document score and the explain score

2011-09-27 Thread Robert Muir
https://issues.apache.org/jira/browse/LUCENE-3421

Note: if you are using 'includeSpanScore=false' (which I think
you are, as that's where the bug applies), be aware this means the
score is *only* the result of your payload: boosts, tf, length
normalization, and idf are not incorporated into the score.
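
For reference, this is the case being described, roughly (the field and term
are taken from your explain output, the rest is assumed):

  import org.apache.lucene.index.Term;
  import org.apache.lucene.search.payloads.AveragePayloadFunction;
  import org.apache.lucene.search.payloads.PayloadTermQuery;

  PayloadTermQuery q = new PayloadTermQuery(
      new Term("autocomplete-field", "chef"),
      new AveragePayloadFunction(),
      false);   // includeSpanScore=false: the payload function alone produces the score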

On Tue, Sep 27, 2011 at 12:33 PM, Jean-Claude Dauphin
 wrote:
> Hello,
>
>
>
> I have implemented payloads at the index and query levels using  specific
> PayloadSimilarity  and  PayloadQparserPlugin classes.
>
> Now I wish to check that the payloads processing is correct and thus I
> inserted the following code to check the document scores of a Solr request:
>
>
>
>     // Display relevance score and explain for debugging and validation
>
>      Iterator dociterator = results.iterator();
>
>      Map explainmap = queryResponse.getExplainMap();
>
>
>
>        *while* (dociterator.hasNext()) {
>
>          SolrDocument doc = dociterator.next();
>
>          String id = (String) doc.getFirstValue("positionID");
>
>          Float relevance = (Float) doc.getFieldValue("score");
>
>          String explanation = explainmap.get(id);
>
>          *LOGGER*.debug("positionID [{}]", id);
>
>          *LOGGER*.debug("Score [{}]", relevance);
>
>          *LOGGER*.debug("explain: [{}]", explanation);
>
>        }
>
>
>
> Here is an extract from the output:
>
>
>
> positionID [441828]
>
> websearch.engine.solr.SolrEnginePosition (1291) Score [6.0416665]
>
> websearch.engine.solr.SolrEnginePosition (1292) explain: [
>
> 0.34901428 = (MATCH) product of:
>
>  0.41881716 = (MATCH) sum of:
>
>    0.08182812 = (MATCH) weight(autocomplete-field:chef de projet in 363),
> product of:
>
>      0.35501713 = queryWeight(autocomplete-field:chef de projet), product
> of:
>
>        3.4769385 = idf(autocomplete-field:  chef de projet=83)
>
>        0.10210624 = queryNorm
>
>      0.23049062 = (MATCH) fieldWeight(autocomplete-field:chef de projet in
> 363), product of:
>
>        0.70710677 = (MATCH) btq, product of:
>
>          0.70710677 = tf(phraseFreq=0.5)
>
>          1.0 = scorePayload(...)
>
>        3.4769385 = idf(autocomplete-field:  chef de projet=83)
>
>        0.09375 = fieldNorm(field=autocomplete-field, doc=363)
>
>
>
> THE EXPLAIN SCORE SEEMS CORRECT BUT I DON’T UNDERSTAND WHY THE DOCUMENT
> SCORE (6.0416665) IS DIFFERENT FROM THE EXPLAIN SCORE (0.34901428)
>
>
>
> I would appreciate any explanation on this issue or any ideas on what could
> be wrong in the code
>
> Best wishes,
>
> --
> Jean-Claude Dauphin
>



-- 
lucidimagination.com


Re: Indexing PDF

2011-10-04 Thread Robert Muir
Your Persian PDF problem is different, and is already taken care of in PDFBox trunk:

https://issues.apache.org/jira/browse/PDFBOX-1127

On Tue, Oct 4, 2011 at 2:04 PM, ahmad ajiloo  wrote:
> I have this problem too, in indexing some of persian pdf files.
>
> 2011/10/4 Héctor Trujillo 
>
>> Hi all, I'm indexing PDF files with SolrJ, and most of them work. But with
>> some files I've got problems because strange characters get stored. I got
>> this content stored:
>> +++
>>
>> Starting a Search Application
>>
>> 
>> Abstract
>>
>> Starting
>> a Search Application A Lucid Imagination White Paper ¥ April 2009 Page i
>>
>> 
>> Starting a Search Application A Lucid Imagination White Paper ¥ April 2009
>> Page ii Do You Need Full-text Search?
>>
>> ∞
>>
>> ∞
>> ∞
>>
>> Starting
>> a Search Application A Lucid Imagination White Paper ¥ April 2009 Page 1
>>
>> 

Re: New scoring models in LUCENE/SOLR (LUCENE-2959)

2011-10-05 Thread Robert Muir
On Wed, Oct 5, 2011 at 2:23 PM, David Ryan  wrote:
> Hi,
>
> According to the JIRA issue 2959,
> https://issues.apache.org/jira/browse/LUCENE-2959
>
> BM25 will be included in the next release of LUCENE.
>
> 1). Will BM25F be included in the next release as well as part
> of LUCENE-2959?

It should be in the 4.0 release, as the Solr integration has already been
added (https://issues.apache.org/jira/browse/SOLR-2754).
So if you check out trunk from svn, you can specify these factories in
your schema.xml to use these new models.

> 2).  What's the timeline of the next release that new scoring modules will
> be available?
>

unfortunately, nobody can answer this...

-- 
lucidimagination.com


Re: New scoring models in LUCENE/SOLR (LUCENE-2959)

2011-10-05 Thread Robert Muir
On Wed, Oct 5, 2011 at 3:03 PM, David Ryan  wrote:
> Do you mean both BM25 and BM25F?
>
>

No, BM25F and other "fielded" or structured models are somewhat different.

In these models, if you have two fields (body/title), you are saying
that "dogs" in body is actually the same term as "dogs" in title. This
is only appropriate in certain cases, not for all fields in the
document (e.g. many Solr users use copyField and analyze content in
different ways, so the terms are different, or put different languages
in different fields).

In my opinion, to support models like this we should add a "structured
query" (and ideally queryparser hooks) representing this intent, that
works for a term across multiple fields where you declare this is the
case. This would be a future improvement, and I'm not sure BM25F is
ever a good fit because it wants a document-level idf (not really a
practical thing for Lucene, unless we come up with some cool
approximation), but newer models like pl2f/mdl2 use corpus total term
frequency instead, which we can compute from the components
efficiently.


-- 
lucidimagination.com


Re: stemEnglishPossessive and contractions

2011-10-19 Thread Robert Muir
The word delimiter filter also does other things: it treats ' as
punctuation by default. So it normally splits on ', except if it's 's
(in that case it removes the 's completely if you use
stemEnglishPossessive).

There are a couple of approaches you can use:
1. You can keep WordDelimiterFilter with this option on, but disable
splitting on ' by customizing its type table. In this case specify
types=mycustomtypes.txt, and in that file specify ' to be treated as
ALPHANUM or similar. See
https://issues.apache.org/jira/browse/SOLR-2059 for some examples of
this. I would only do this if you want WordDelimiterFilter for other
purposes; if you just want to remove possessives and don't need
WordDelimiterFilter's other features, look below.
2. You can instead use EnglishPossessiveFilterFactory, which only does
this exact thing (removes 's) and nothing else; see the sketch below.
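
A minimal sketch of approach 2, assuming a whitespace-tokenized text field
(the type name and the surrounding tokenizer/filters are placeholders):

  <fieldType name="text_no_possessive" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.EnglishPossessiveFilterFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>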

On Wed, Oct 19, 2011 at 5:30 PM, Herman Kiefus  wrote:
> We utilize a comprehensive dictionary of English words, place names, 
> surnames, male and female first names, ... you get the point.  As such, the 
> possessive plural forms of these words are recognized as 'misspelled'.
>
> I simply thought that 'turning on' this option for the WordDelimiterFactory 
> would address my concerns; however, I also got an unintended consequence: 
> Contractions (isn't, wouldn't, shouldn't, he'll, we'll...) also seem to be 
> affected.  Is this intended behavior?  When I read 'English possessive' I 
> hear 'apostrophe s' and not 'apostrophe anything'.  Is there something I'm 
> missing here?
>



-- 
lucidimagination.com


Re: changing omitNorms on an already built index

2011-10-27 Thread Robert Muir
On Thu, Oct 27, 2011 at 6:00 PM, Simon Willnauer
 wrote:
> we are not actively removing norms. If you set omitNorms=true and
> index documents, they won't have norms for this field. Yet, other
> segments still have norms until they get merged with a segment that has
> no norms for that field, i.e. omits norms. omitNorms is anti-viral, so
> once you set it to true it will be true for other segments eventually.
> If you optimize your index you should see that norms go away.
>

This is only true in trunk (4.x!)
https://issues.apache.org/jira/browse/LUCENE-2846

-- 
lucidimagination.com


Re: large scale indexing issues / single threaded bottleneck

2011-10-28 Thread Robert Muir
On Fri, Oct 28, 2011 at 5:03 PM, Jason Rutherglen
 wrote:

> +1 I suggested it should be backported a while back.  Or that Lucene
> 4.x should be released.  I'm not sure what is holding up Lucene 4.x at
> this point; bulk postings is only really useful for PFOR.

This is not true: most modern index compression schemes, not just
PFOR-delta, read more than one integer at a time.

That's why it's important not only to abstract away the encoding of the
index, but also to ensure that the enumeration APIs aren't biased
towards one-at-a-time vInt.

Otherwise we have "flexible indexing" where "flexible" means "slower
if you do anything but the default".

-- 
lucidimagination.com


Re: large scale indexing issues / single threaded bottleneck

2011-10-28 Thread Robert Muir
On Fri, Oct 28, 2011 at 8:10 PM, Jason Rutherglen
 wrote:
>> Otherwise we have "flexible indexing" where "flexible" means "slower
>> if you do anything but the default".
>
> The other encodings should exist as modules since they are pluggable.
> 4.0 can ship with the existing codec.  4.1 with additional codecs and
> the bulk postings at a later time.

You don't know what you are talking about: go look at the source
code. The whole problem is that encodings aren't pluggable.

>
> Otherwise it will be 6 months before 4.0 ships, that's too long.

sucks for you.

>
> Also it is an amusing contradiction that your argument flies in the
> face of Lucid shipping 4.x today without said functionality.
>

No it doesn't. Trunk is open source. You can use it, too, if you want.

-- 
lucidimagination.com


Re: SolrCloud with large synonym files

2011-11-02 Thread Robert Muir
On Wed, Nov 2, 2011 at 8:53 AM, Phil Hoy  wrote:
> It is Solr 4.0 and uses the new FSTSynonymFilterFactory, I believe, but it defers 
> to ZkSolrResourceLoader to load the synonym file when in cloud mode.
> Phil
>

FYI: The synonyms implementation supports multiple formats (currently
"solr" and "wordnet") I think.

It's possible that it could have another format, "binary" or
"serialized", which is essentially the serialized byte[] of the FST.
This would be smaller on disk and faster to load, since it wouldn't
really have to 'build' itself (it would be pre-built).

In the initial implementation I didn't add this, as I wasn't sure if
this would have any real value, especially since the build time is a
lot faster now, but it is something that is possible.

-- 
lucidimagination.com


Re: [Solr-3.4] Norms file size is large in case of many unique indexed fields in index

2011-11-10 Thread Robert Muir
What is the point of a unique indexed field?

If for all of your fields there is only one possible document, you
don't need length normalization, scoring, or a search engine at all...
just use a HashMap?

On Thu, Nov 10, 2011 at 7:42 AM, Ivan Hrytsyuk
 wrote:
> Hello everyone,
>
> We have large index size in case norms are enabled.
>
> schema.xml:
>
> type declaration:
>  positionIncrementGap="100" omitNorms="false">
>     
>         
>     
> 
>
> fields declaration:
>  type="string" />
> 
>  type="simpleTokenizer" multiValued="false" />
>
> For 5000 documents (every document has 2 unique fields, 2*5000=10000
> unique fields in index), index size is 48.24 MB.
> But if we enable omitting norms (omitNorms="true"), index size is 0.56
> MB.
>
> Next, if we increase number of unique fields per document to 3
> (3*5000=15000 unique fields in index) we receive: 72.23 MB and 0.70 MB
> respectively.
> And if we increase the number of documents to 10000 (3*10000=30000 unique fields
> in index) we receive: 287.54 MB and 1.44 MB respectively.
>
> We've prepared test application to reproduce mentioned behavior. It can
> be downloaded here:
> https://bitbucket.org/coldserenity/solr-large-index-with-norms
>
> Could anyone point out if size of index is as expected in mentioned
> cases? And if it's, what configuration can be applied to reduce size of
> index.
>
> Thank you in advance, Ivan
>



-- 
lucidimagination.com


Re: trouble with CollationKeyFilter

2011-11-23 Thread Robert Muir
Hi,

Locale-sensitive range queries don't work with these filters, only sort,
although Erick Erickson has a patch that will enable this (the lowercasing
wildcards patch; then you could add this filter to your multiterm chain).

Separately, locale range queries and sort both work easily on trunk (with
binary terms): just use CollationField or ICUCollationField if you are
able to use trunk.

Otherwise, for 3.x, I think that patch is pretty close (any day now), so we
can add an example for localized range queries that makes use of it.
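
For the trunk route, a sketch of what that looks like in schema.xml (the type
name is a placeholder, the field name reuses the one from the question below,
and locale="" selects the root collator):

  <fieldType name="collated_root" class="solr.ICUCollationField"
             locale="" strength="primary"/>
  <field name="ifp_sortkey_ls" type="collated_root" indexed="true" stored="false"/>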

On Nov 23, 2011 4:39 PM, "Michael Sokolov"  wrote:
>
> I'm using CollationKeyFilter to sort my documents using the Unicode root
collation, and my documents do appear to be getting sorted correctly, but
I'm getting weird results when performing range filtering using the sort
key field.  For example:
>
> ifp_sortkey_ls:["youth culture" TO "youth culture"]
>
> and
>
> ifp_sortkey_ls:{"youth culture" TO "youth culture"}
>
> both return 0 hits
>
> but
>
> ifp_sortkey_ls:"youth culture"
>
> returns 1 hit
>
> It seems as if any query using the ifp_sortkey_ls:[A to B] syntax is
acting as if the terms A, B are greater than all documents whose sortkeys
start with an A-Z character, but less than a few documents that have greek
letters as their first characters of their sortkeys.
>
> the analysis chain for ifp_sortkey_ls is:
>
> <fieldtype name="..." class="solr.TextField">
>   <analyzer>
>     <tokenizer class="..."/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>     <filter class="solr.CollationKeyFilterFactory"
>             language=""
>             strength="primary"
>             />
>   </analyzer>
> </fieldtype>
>
> Does anyone have any idea what might be going on here?
>


Re: trouble with CollationKeyFilter

2011-11-25 Thread Robert Muir
On Wed, Nov 23, 2011 at 11:22 PM, Michael Sokolov  wrote:
> Thanks for confirming that, and laying out the options, Robert.
>

FYI: Erick committed the multiterm stuff, so I opened an issue for
this: https://issues.apache.org/jira/browse/SOLR-2919

-- 
lucidimagination.com


Re: trouble with CollationKeyFilter

2011-11-27 Thread Robert Muir
On Sat, Nov 26, 2011 at 8:43 PM, Michael Sokolov  wrote:
> That's great news!  We can't really track trunk, but it looks like this is
> targeted for 3.6, right? As a short-term alternative, I was considering
> using ICUFoldingFilter; this won't preserve some of the finer distinctions,
> but will at least sort the accented characters in with their unaccented kin,
> which is 90% of what we need. Does that make sense?  It should index regular
> characters then, and not ICU collation keys, I think?
>

Yes, it should be pretty easy to make the range queries work for these.

As far as doing things with filters as an alternative: it depends what
you need. Doing stuff with the analyzer is pretty inflexible because
it's just a tokenfilter and still binary order at the end of the day,
so the order might not make sense for some languages.

Because of this it's also difficult/impossible, if you are picky about
sorting, to do things like sort lowercase values first (for when you
care about case), ignore punctuation (so U.S.A. = USA), or sort numerics
correctly (so FOOBAR-10 sorts after FOOBAR-9)... etc.; though the
factory in Solr doesn't yet expose these options either :)

Also, looking at your configuration, the LowerCaseFilter is not
needed: you are using primary strength.

-- 
lucidimagination.com


Re: DirectSolrSpellChecker on request specified field.

2011-11-28 Thread Robert Muir
Technically it could? I'm just not sure if the current spellchecking
APIs allow for it. But maybe someone has a good idea on how to easily
expose this.

I think it's a good idea.

Care to open a JIRA issue?

On Mon, Nov 28, 2011 at 1:31 PM, Phil Hoy  wrote:
> Hi,
>
> Can the DirectSolrSpellChecker be used for autosuggest, but defer to request 
> time the name of the field used to create the dictionary? That way I don't 
> have to define spellcheckers specific to each field, which for me is not 
> really possible as the fields I wish to spell check are DynamicFields.
>
> I could copy all dynamic fields into a 'spellcheck' field but then I could 
> get false suggestions if I use it to get suggestions for a particular dynamic 
> field where a term returned derives from a different field.
>
> Phil
>
>
>



-- 
lucidimagination.com


Re: DirectSolrSpellChecker on request specified field.

2011-11-28 Thread Robert Muir
On Mon, Nov 28, 2011 at 4:36 PM, Phil Hoy  wrote:
> Added issue: https://issues.apache.org/jira/browse/SOLR-2926
> Please let me know if more information needs adding to JIRA.
>
> Phil
>

Thanks, I'll followup on the issue



-- 
lucidimagination.com


Re: Solr 4.0 Levenshtein distance algorithm for DirectSpellChecker

2011-11-29 Thread Robert Muir
On Tue, Nov 29, 2011 at 8:07 AM, elisabeth benoit
 wrote:
> Hello,
>
> I'd like to know if the Levensthein distance algorithm used by Solr 4.0
> DirectSpellChecker (working quite well I must say) is considering an
> inversion as distance = 1 or distance = 2?
>
> For instance, if I write Monteruil and I meant Montreuil, is the distance 1
> or 2?
>

The algorithm is just Levenshtein, so 2. It's possible to also support
a modified form where transpositions count as 1 (Damerau-Levenshtein),
but it's not implemented.
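
To make that concrete, here is a plain textbook Levenshtein distance in Java (a
standalone sketch, not the table-driven automata Lucene actually uses), which
shows why a single transposition costs 2:

  // levenshtein("Monteruil", "Montreuil") == 2: the swapped "er"/"re" needs two
  // substitutions. A Damerau-style variant counting transpositions as one edit
  // would return 1 instead.
  static int levenshtein(String a, String b) {
    int[][] d = new int[a.length() + 1][b.length() + 1];
    for (int i = 0; i <= a.length(); i++) d[i][0] = i;
    for (int j = 0; j <= b.length(); j++) d[0][j] = j;
    for (int i = 1; i <= a.length(); i++) {
      for (int j = 1; j <= b.length(); j++) {
        int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
        d[i][j] = Math.min(Math.min(d[i - 1][j] + 1,      // deletion
                                    d[i][j - 1] + 1),     // insertion
                           d[i - 1][j - 1] + cost);       // substitution
      }
    }
    return d[a.length()][b.length()];
  }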

-- 
lucidimagination.com


Re: Solr 4.0 Levenshtein distance algorithm for DirectSpellChecker

2011-11-29 Thread Robert Muir
On Tue, Nov 29, 2011 at 9:21 AM, elisabeth benoit
 wrote:
> ok, thanks.
>
> I think it would be a nice improvement to consider inversion as distance =
> 1, since it's such a common mistake. The distance = 2 makes it difficult to
> correct transpositions on small words (for instance, the DirectSpellChecker
> couldn't make the right suggestion "jolie" for the input "joile").
>

I agree with you, it would be a great improvement. The first step is
to get support for 'transpositions as a primitive edit operation' into
https://bitbucket.org/jpbarrette/moman/ . This is the library we use
to generate the tables.

-- 
lucidimagination.com


Re: RegexQuery performance

2011-12-08 Thread Robert Muir
On Thu, Dec 8, 2011 at 11:01 AM, Jay Luker  wrote:
> Hi,
>
> I am trying to provide a means to search our corpus of nearly 2
> million fulltext astronomy and physics articles using regular
> expressions. A small percentage of our users need to be able to
> locate, for example, certain types of identifiers that are present
> within the fulltext (grant numbers, dataset identifers, etc).
>
> My straightforward attempts to do this using RegexQuery have been
> successful only in the sense that I get the results I'm looking for.
> The performance, however, is pretty terrible, with most queries taking
> five minutes or longer. Is this the performance I should expect
> considering the size of my index and the massive number of terms? Are
> there any alternative approaches I could try?
>
> Things I've already tried:
>  * reducing the sheer number of terms by adding a LengthFilter,
> min=6, to my index analysis chain
>  * swapping in the JakartaRegexpCapabilities
>
> Things I intend to try if no one has any better suggestions:
>  * chunk up the index and search concurrently, either by sharding or
> using a RangeQuery based on document id
>
> Any suggestions appreciated.
>

This RegexQuery is not really scalable in my opinion: it is always
linear in the number of terms, except in super-rare circumstances where
it can compute a "common prefix" (and it is slow to boot).

You can instead try svn trunk's RegexpQuery (don't forget the "p") from
Lucene core; it works from the query parser: /[ab]foo/, myfield:/bar/,
etc.

The performance is faster, but keep in mind it is only as good as the
regular expression: if the regular expression is like /.*foo.*/, then
it is just as slow as a wildcard query for *foo*.
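
A sketch of the trunk API (the field name and pattern are illustrative; note
that the pattern has to match the whole term, it is not a substring search):

  import org.apache.lucene.index.Term;
  import org.apache.lucene.search.Query;
  import org.apache.lucene.search.RegexpQuery;

  // automaton-backed regexp over the term dictionary of the "body" field
  Query q = new RegexpQuery(new Term("body", "[a-z]+-[0-9]{4}-[0-9]+"));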

-- 
lucidimagination.com


Re: Solr Lucene Index Version

2011-12-08 Thread Robert Muir
On Thu, Dec 8, 2011 at 10:46 AM, Mark Miller  wrote:
>
> On Dec 8, 2011, at 8:50 AM, Jamie Johnson wrote:
>
>> Isn't the codec stuff merged with trunk now?
>
> Robert merged this recently AFAIK.
>

True, but that issue only moved the majority of the rest of the index
(stored fields, term vectors, fieldinfos, etc.) to the codec.

There is more work in progress/to be done before the format is really
extensible, particularly the long TODO list at the end of
https://issues.apache.org/jira/browse/LUCENE-2621

-- 
lucidimagination.com


Re: Solr Lucene Index Version

2011-12-08 Thread Robert Muir
On Thu, Dec 8, 2011 at 12:55 PM, Jamie Johnson  wrote:
> Thanks Andrzej.  I'll continue to follow the portable format JIRA
> along with 3622, are there any others that you're aware of that are
> blockers that would be useful to watch?
>

There is a lot to be done, particularly norms and deleted documents.
Some progress on norms was made in LUCENE-3606 (moved to the codec, with a
SimpleText implementation), but it's really a stop-gap measure until
LUCENE-3622 and LUCENE-3074 are finished; then norms can be implemented
via the IndexDocValues APIs.

I haven't really investigated deleted documents yet, but it should be
feasible after LUCENE-3606.

Then there still remain things like the fact that a codec cannot
control how the compound file format is encoded, and other minor
issues.

-- 
lucidimagination.com


Re: codec="Pulsing" per field broken?

2011-12-11 Thread Robert Muir
On Sun, Dec 11, 2011 at 11:34 AM, eks dev  wrote:
> on the latest trunk, my schema.xml with field type declaration
> containing //codec="Pulsing"// does not work any more (throws
> exception from FieldType). It used to work with an approximately month-old
> trunk version.
>
> I didn't dig deeper, can be that the old schema.xml  was broken and
> worked by accident.
>

Hi,

The short answer is, you should change this to //postingsFormat="Pulsing40"//
See 
http://svn.apache.org/repos/asf/lucene/dev/trunk/solr/core/src/test-files/solr/conf/schema_codec.xml

The longer answer is that the Codec API in lucene trunk was extended recently:
https://issues.apache.org/jira/browse/LUCENE-3490

Previously "Codec" only allowed you to customize the format of the
postings lists.
We are working to have it cover the entire index segment (at the
moment nearly everything except deletes and encoding of compound files
can be customized).

For example, look at SimpleText now:
http://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/src/java/org/apache/lucene/index/codecs/simpletext/
As you see, it now implements plain-text stored fields, term vectors,
norms, segments file, fieldinfos, etc.
See Codec.java 
(http://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/src/java/org/apache/lucene/index/codecs/Codec.java)
or LUCENE-3490 for more details.

Because of this, what you had before is now just "PostingsFormat", as
Pulsing is just a wrapper around a postings implementation that
inlines low frequency terms.
Lucene's default Codec uses a per-field postings setup, so you can
still configure the postings per-field, just differently.
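
Along the lines of the schema_codec.xml test file above, a sketch (the field
and type names are placeholders):

  <fieldType name="string_pulsing" class="solr.StrField" postingsFormat="Pulsing40"/>
  <field name="id" type="string_pulsing" indexed="true" stored="true"/>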

-- 
lucidimagination.com


Re: InvalidTokenOffsetsException in conjunction with highlighting and ICU folding and edgeNgrams

2011-12-12 Thread Robert Muir
On Mon, Dec 12, 2011 at 5:18 AM, Max  wrote:

> The end offset remains 11 even after folding and transforming "æ" to
> "ae", which seems wrong to me.

End offsets refer to the *original text*, so this is correct.

What is wrong is the EdgeNGramFilter. See how it turns that 11 into a 12?

>
> I also stumbled upon https://issues.apache.org/jira/browse/LUCENE-1500
> which seems like a similiar issue.
>
> Is there a workaround for that problem or is the field configuration wrong?

For now, don't use EdgeNGrams.

-- 
lucidimagination.com


Re: InvalidTokenOffsetsException in conjunction with highlighting and ICU folding and edgeNgrams

2011-12-12 Thread Robert Muir
On Mon, Dec 12, 2011 at 5:18 AM, Max  wrote:

> It seems like there is some weird stuff going on when folding the
> string, it can be seen in the analysis view, too:
>
> http://i.imgur.com/6B2Uh.png
>

I created a bug here, https://issues.apache.org/jira/browse/LUCENE-3642

Thanks for the screenshot, makes it easy to do a test case here.

-- 
lucidimagination.com


Re: feature of FST version of SynonymFilter affects Highlighter

2011-12-26 Thread Robert Muir
The old one didn't really handle this correctly either.

Koji, what is the highlighting problem? Can we have a test case?

2011/12/26 Koji Sekiguchi :
> I found that SynonymFilter javadoc says:
>
> "Matches single or multi word synonyms in a token stream.
> This token stream cannot properly handle position increments != 1"
>
> I think due to the feature, Highlighter doesn't work properly in some cases:
>
> http://www.lucidimagination.com/search/document/c3ed1e0a2b12ddfa#c3ed1e0a2b12ddfa
>
> https://issues.apache.org/jira/browse/SOLR-2845
>
> Can we remove the restriction in some future?
>
> If not, I'd propose we have an option to choose SlowSynonymFilterFactory 
> explicitly
> in schema.xml (we can choose it by setting luceneMatchVersion to 33,
> but it is global).
>
> koji
> --
> http://www.rondhuit.com/en/



-- 
lucidimagination.com


Re: feature of FST version of SynonymFilter affects Highlighter

2011-12-26 Thread Robert Muir
On Mon, Dec 26, 2011 at 10:54 AM, Koji Sekiguchi  wrote:

> I don't have JUnit test case. What I tried was:
>
> I have indexing time synonym definition:
>
> nhl, national hockey league
>
> and I indexed "I like national hockey league".
>
> Then I searched nhl with hl=on, I got an unwanted highlight snippet
> "I like national hockey league".
>
> But if I set luceneMatchVersion to LUCENE_33 and re-indexed,
> I got an expected result "I like national hockey league".
>

Thanks Koji, I'll see if I can create a test case for this later
today. SynonymFilter could have a bug with the offsets.

-- 
lucidimagination.com


Re: GermanAnalyzer

2012-01-14 Thread Robert Muir
On Sat, Jan 14, 2012 at 12:58 PM,   wrote:
> Hi,
>
> I'm switching from Lucene 2.3 to Solr 3.5. I want to reuse the existing
> indexes (huge...).

If you want to use a Lucene 2.3 index, then you should set this in
your solrconfig.xml:

<luceneMatchVersion>LUCENE_23</luceneMatchVersion>

>
> In Lucene I use an untweaked org.apache.lucene.analysis.de.GermanAnalyzer.
>
> What is an equivalent fieldType definition in Solr 3.5?


  


-- 
lucidimagination.com


Re: GermanAnalyzer

2012-01-14 Thread Robert Muir
On Sat, Jan 14, 2012 at 5:09 PM, Lance Norskog  wrote:
> Has the GermanAnalyzer behavior changed at all? This is another kind
> of mismatch, and it can cause very subtle problems.  If text is
> indexed and queried using different Analyzers, queries will not do
> what you think they should.

It acts the same as it did in 2.3 if you use
<luceneMatchVersion>LUCENE_23</luceneMatchVersion> in your solrconfig,
as I already recommended.

-- 
lucidimagination.com


Re: Trying to understand SOLR memory requirements

2012-01-16 Thread Robert Muir
looks like https://issues.apache.org/jira/browse/SOLR-2888.

Previously, FST would need to hold all the terms in RAM during
construction, but with the patch it uses offline sorts/temporary
files.
I'll reopen the issue to backport this to the 3.x branch.


On Mon, Jan 16, 2012 at 8:31 PM, Dave  wrote:
> I'm trying to figure out what my memory needs are for a rather large
> dataset. I'm trying to build an auto-complete system for every
> city/state/country in the world. I've got a geographic database, and have
> setup the DIH to pull the proper data in. There are 2,784,937 documents
> which I've formatted into JSON-like output, so there's a bit of data
> associated with each one. Here is an example record:
>
> Brooklyn, New York, United States?{ |id|: |2620829|,
> |timezone|:|America/New_York|,|type|: |3|, |country|: { |id| : |229| },
> |region|: { |id| : |3608| }, |city|: { |id|: |2616971|, |plainname|:
> |Brooklyn|, |name|: |Brooklyn, New York, United States| }, |hint|:
> |2300664|, |label|: |Brooklyn, New York, United States|, |value|:
> |Brooklyn, New York, United States|, |title|: |Brooklyn, New York, United
> States| }
>
> I've got the spellchecker / suggester module setup, and I can confirm that
> everything works properly with a smaller dataset (i.e. just a couple of
> countries worth of cities/states). However I'm running into a big problem
> when I try to index the entire dataset. The dataimport?command=full-import
> works and the system comes to an idle state. It generates the following
> data/index/ directory (I'm including it in case it gives any indication on
> memory requirements):
>
> -rw-rw 1 root   root   2.2G Jan 17 00:13 _2w.fdt
> -rw-rw 1 root   root    22M Jan 17 00:13 _2w.fdx
> -rw-rw 1 root   root    131 Jan 17 00:13 _2w.fnm
> -rw-rw 1 root   root   134M Jan 17 00:13 _2w.frq
> -rw-rw 1 root   root    16M Jan 17 00:13 _2w.nrm
> -rw-rw 1 root   root   130M Jan 17 00:13 _2w.prx
> -rw-rw 1 root   root   9.2M Jan 17 00:13 _2w.tii
> -rw-rw 1 root   root   1.1G Jan 17 00:13 _2w.tis
> -rw-rw 1 root   root     20 Jan 17 00:13 segments.gen
> -rw-rw 1 root   root    291 Jan 17 00:13 segments_2
>
> Next I try to run the suggest?spellcheck.build=true command, and I get the
> following error:
>
> Jan 16, 2012 4:01:47 PM org.apache.solr.spelling.suggest.Suggester build
> INFO: build()
> Jan 16, 2012 4:03:27 PM org.apache.solr.common.SolrException log
> SEVERE: java.lang.OutOfMemoryError: GC overhead limit exceeded
>  at java.util.Arrays.copyOfRange(Arrays.java:3209)
> at java.lang.String.(String.java:215)
>  at org.apache.lucene.index.TermBuffer.toTerm(TermBuffer.java:122)
> at org.apache.lucene.index.SegmentTermEnum.term(SegmentTermEnum.java:184)
>  at org.apache.lucene.index.TermInfosReader.get(TermInfosReader.java:203)
> at org.apache.lucene.index.TermInfosReader.get(TermInfosReader.java:172)
>  at org.apache.lucene.index.SegmentReader.docFreq(SegmentReader.java:509)
> at org.apache.lucene.index.DirectoryReader.docFreq(DirectoryReader.java:719)
>  at org.apache.solr.search.SolrIndexReader.docFreq(SolrIndexReader.java:309)
> at
> org.apache.lucene.search.spell.HighFrequencyDictionary$HighFrequencyIterator.isFrequent(HighFrequencyDictionary.java:75)
>  at
> org.apache.lucene.search.spell.HighFrequencyDictionary$HighFrequencyIterator.hasNext(HighFrequencyDictionary.java:125)
> at org.apache.lucene.search.suggest.fst.FSTLookup.build(FSTLookup.java:157)
>  at org.apache.lucene.search.suggest.Lookup.build(Lookup.java:70)
> at org.apache.solr.spelling.suggest.Suggester.build(Suggester.java:133)
>  at
> org.apache.solr.handler.component.SpellCheckComponent.prepare(SpellCheckComponent.java:109)
> at
> org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:173)
>  at
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
> at org.apache.solr.core.SolrCore.execute(SolrCore.java:1372)
>  at
> org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:356)
> at
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:252)
>  at
> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
> at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
>  at
> org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
> at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
>  at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
> at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
>  at
> org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
> at
> org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
>  at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
> at org.mortbay.jetty.Server.handle(Server.java:326)
>  at org.mortbay.jetty.HttpConnection.handleRequest(Htt

Re: Trying to understand SOLR memory requirements

2012-01-17 Thread Robert Muir
I committed it already: so you can try out branch_3x if you want.

you can either wait for a nightly build or compile from svn
(http://svn.apache.org/repos/asf/lucene/dev/branches/branch_3x/).

On Tue, Jan 17, 2012 at 8:35 AM, Dave  wrote:
> Thank you Robert, I'd appreciate that. Any idea how long it will take to
> get a fix? Would I be better switching to trunk? Is trunk stable enough for
> someone who's very much a SOLR novice?
>
> Thanks,
> Dave
>
> On Mon, Jan 16, 2012 at 10:08 PM, Robert Muir  wrote:
>
>> looks like https://issues.apache.org/jira/browse/SOLR-2888.
>>
>> Previously, FST would need to hold all the terms in RAM during
>> construction, but with the patch it uses offline sorts/temporary
>> files.
>> I'll reopen the issue to backport this to the 3.x branch.
>>
>>
>> On Mon, Jan 16, 2012 at 8:31 PM, Dave  wrote:
>> > I'm trying to figure out what my memory needs are for a rather large
>> > dataset. I'm trying to build an auto-complete system for every
>> > city/state/country in the world. I've got a geographic database, and have
>> > setup the DIH to pull the proper data in. There are 2,784,937 documents
>> > which I've formatted into JSON-like output, so there's a bit of data
>> > associated with each one. Here is an example record:
>> >
>> > Brooklyn, New York, United States?{ |id|: |2620829|,
>> > |timezone|:|America/New_York|,|type|: |3|, |country|: { |id| : |229| },
>> > |region|: { |id| : |3608| }, |city|: { |id|: |2616971|, |plainname|:
>> > |Brooklyn|, |name|: |Brooklyn, New York, United States| }, |hint|:
>> > |2300664|, |label|: |Brooklyn, New York, United States|, |value|:
>> > |Brooklyn, New York, United States|, |title|: |Brooklyn, New York, United
>> > States| }
>> >
>> > I've got the spellchecker / suggester module setup, and I can confirm
>> that
>> > everything works properly with a smaller dataset (i.e. just a couple of
>> > countries worth of cities/states). However I'm running into a big problem
>> > when I try to index the entire dataset. The
>> dataimport?command=full-import
>> > works and the system comes to an idle state. It generates the following
>> > data/index/ directory (I'm including it in case it gives any indication
>> on
>> > memory requirements):
>> >
>> > -rw-rw 1 root   root   2.2G Jan 17 00:13 _2w.fdt
>> > -rw-rw 1 root   root    22M Jan 17 00:13 _2w.fdx
>> > -rw-rw 1 root   root    131 Jan 17 00:13 _2w.fnm
>> > -rw-rw 1 root   root   134M Jan 17 00:13 _2w.frq
>> > -rw-rw 1 root   root    16M Jan 17 00:13 _2w.nrm
>> > -rw-rw 1 root   root   130M Jan 17 00:13 _2w.prx
>> > -rw-rw 1 root   root   9.2M Jan 17 00:13 _2w.tii
>> > -rw-rw 1 root   root   1.1G Jan 17 00:13 _2w.tis
>> > -rw-rw 1 root   root     20 Jan 17 00:13 segments.gen
>> > -rw-rw 1 root   root    291 Jan 17 00:13 segments_2
>> >
>> > Next I try to run the suggest?spellcheck.build=true command, and I get
>> the
>> > following error:
>> >
>> > Jan 16, 2012 4:01:47 PM org.apache.solr.spelling.suggest.Suggester build
>> > INFO: build()
>> > Jan 16, 2012 4:03:27 PM org.apache.solr.common.SolrException log
>> > SEVERE: java.lang.OutOfMemoryError: GC overhead limit exceeded
>> >  at java.util.Arrays.copyOfRange(Arrays.java:3209)
>> > at java.lang.String.(String.java:215)
>> >  at org.apache.lucene.index.TermBuffer.toTerm(TermBuffer.java:122)
>> > at org.apache.lucene.index.SegmentTermEnum.term(SegmentTermEnum.java:184)
>> >  at org.apache.lucene.index.TermInfosReader.get(TermInfosReader.java:203)
>> > at org.apache.lucene.index.TermInfosReader.get(TermInfosReader.java:172)
>> >  at org.apache.lucene.index.SegmentReader.docFreq(SegmentReader.java:509)
>> > at
>> org.apache.lucene.index.DirectoryReader.docFreq(DirectoryReader.java:719)
>> >  at
>> org.apache.solr.search.SolrIndexReader.docFreq(SolrIndexReader.java:309)
>> > at
>> >
>> org.apache.lucene.search.spell.HighFrequencyDictionary$HighFrequencyIterator.isFrequent(HighFrequencyDictionary.java:75)
>> >  at
>> >
>> org.apache.lucene.search.spell.HighFrequencyDictionary$HighFrequencyIterator.hasNext(HighFrequencyDictionary.java:125)
>> > at
>> org.apache.lucene.search.suggest.fst.FSTLookup.build(FSTLookup.java:157)
>> >  at org.apache.lucene.search.suggest.Lookup.build(Lookup.java:70)
>> > at org.apache.solr.spelli

Re: Trying to understand SOLR memory requirements

2012-01-19 Thread Robert Muir
ervletHandler.handle(ServletHandler.java:399)
>> at
>> org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
>>  at
>> org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
>> at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
>>  at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
>> at
>> org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
>>  at
>> org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
>> at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
>>  at org.mortbay.jetty.Server.handle(Server.java:326)
>> at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
>>  at
>> org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:928)
>> at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:549)
>>
>>
>> On Tue, Jan 17, 2012 at 8:59 AM, Robert Muir  wrote:
>>
>>> I committed it already: so you can try out branch_3x if you want.
>>>
>>> you can either wait for a nightly build or compile from svn
>>> (http://svn.apache.org/repos/asf/lucene/dev/branches/branch_3x/).
>>>
>>> On Tue, Jan 17, 2012 at 8:35 AM, Dave  wrote:
>>> > Thank you Robert, I'd appreciate that. Any idea how long it will take to
>>> > get a fix? Would I be better switching to trunk? Is trunk stable enough
>>> for
>>> > someone who's very much a SOLR novice?
>>> >
>>> > Thanks,
>>> > Dave
>>> >
>>> > On Mon, Jan 16, 2012 at 10:08 PM, Robert Muir  wrote:
>>> >
>>> >> looks like https://issues.apache.org/jira/browse/SOLR-2888.
>>> >>
>>> >> Previously, FST would need to hold all the terms in RAM during
>>> >> construction, but with the patch it uses offline sorts/temporary
>>> >> files.
>>> >> I'll reopen the issue to backport this to the 3.x branch.
>>> >>
>>> >>
>>> >> On Mon, Jan 16, 2012 at 8:31 PM, Dave  wrote:
>>> >> > I'm trying to figure out what my memory needs are for a rather large
>>> >> > dataset. I'm trying to build an auto-complete system for every
>>> >> > city/state/country in the world. I've got a geographic database, and
>>> have
>>> >> > setup the DIH to pull the proper data in. There are 2,784,937
>>> documents
>>> >> > which I've formatted into JSON-like output, so there's a bit of data
>>> >> > associated with each one. Here is an example record:
>>> >> >
>>> >> > Brooklyn, New York, United States?{ |id|: |2620829|,
>>> >> > |timezone|:|America/New_York|,|type|: |3|, |country|: { |id| : |229|
>>> },
>>> >> > |region|: { |id| : |3608| }, |city|: { |id|: |2616971|, |plainname|:
>>> >> > |Brooklyn|, |name|: |Brooklyn, New York, United States| }, |hint|:
>>> >> > |2300664|, |label|: |Brooklyn, New York, United States|, |value|:
>>> >> > |Brooklyn, New York, United States|, |title|: |Brooklyn, New York,
>>> United
>>> >> > States| }
>>> >> >
>>> >> > I've got the spellchecker / suggester module setup, and I can confirm
>>> >> that
>>> >> > everything works properly with a smaller dataset (i.e. just a couple
>>> of
>>> >> > countries worth of cities/states). However I'm running into a big
>>> problem
>>> >> > when I try to index the entire dataset. The
>>> >> dataimport?command=full-import
>>> >> > works and the system comes to an idle state. It generates the
>>> following
>>> >> > data/index/ directory (I'm including it in case it gives any
>>> indication
>>> >> on
>>> >> > memory requirements):
>>> >> >
>>> >> > -rw-rw 1 root   root   2.2G Jan 17 00:13 _2w.fdt
>>> >> > -rw-rw 1 root   root    22M Jan 17 00:13 _2w.fdx
>>> >> > -rw-rw 1 root   root    131 Jan 17 00:13 _2w.fnm
>>> >> > -rw-rw 1 root   root   134M Jan 17 00:13 _2w.frq
>>> >> > -rw-rw 1 root   root    16M Jan 17 00:13 _2w.nrm
>>> >> > -rw-rw 1 root   root   130M Jan 17 00:13 _2w.prx
>>> >> > -rw

Re: Trying to understand SOLR memory requirements

2012-01-19 Thread Robert Muir
I really don't think you should put a huge JSON document as a search term.

Just make "Brooklyn, New York, United States", or whatever you intend
the user to actually search on/type in, your search term.
Put the rest in different fields (e.g. stored-only, not even indexed
if you don't need that) and have Solr return it that way.
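
A sketch of that split in schema.xml (all names here are hypothetical):

  <!-- what the user types against: analyzed and indexed -->
  <field name="suggest_text" type="text_general" indexed="true" stored="true"/>
  <!-- the JSON payload: returned with the hit, never searched -->
  <field name="payload_json" type="string" indexed="false" stored="true"/>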

On Thu, Jan 19, 2012 at 12:31 PM, Dave  wrote:
> In my original post I included one of my terms:
>
> Brooklyn, New York, United States?{ |id|: |2620829|,
> |timezone|:|America/New_York|,|type|: |3|, |country|: { |id| : |229| },
> |region|: { |id| : |3608| }, |city|: { |id|: |2616971|, |plainname|:
> |Brooklyn|, |name|: |Brooklyn, New York, United States| }, |hint|:
> |2300664|, |label|: |Brooklyn, New York, United States|, |value|:
> |Brooklyn, New York, United States|, |title|: |Brooklyn, New York, United
> States| }
>
> I'm matching on the first part of the term (the part before the ?), and
> then the rest is being passed via JSON into Javascript, then converted to a
> JSON term itself. Here is my data-config.xml file, in case it sheds any
> light:
>
> 
>                driver="com.mysql.jdbc.Driver"
>              url=""
>              user=""
>              password=""
>              encoding="UTF-8"/>
>  
>                pk="id"
>            query="select p.id as placeid, c.id, c.plainname, c.name,
> p.timezone from countries c, places p where p.regionid = 1 AND p.cityid = 1
> AND c.id=p.countryid AND p.settingid=1"
>            transformer="TemplateTransformer">
>            
>            
>            
>            
>            
>            
>    
>                pk="id"
>            query="select p.id as placeid, p.countryid as countryid,
> c.plainname as countryname, p.timezone as timezone, r.id as regionid,
> r.plainname as regionname, r.population as regionpop from places p, regions
> r, countries c where r.id = p.regionid AND p.settingid = 1 AND p.regionid >
> 1 AND p.countryid=c.id AND p.cityid=1 AND r.population > 0"
>            transformer="TemplateTransformer">
>            
>            
>            
>            
>            
>            
>    
>                pk="id"
>            query="select c2.id as cityid, c2.plainname as cityname,
> c2.population as citypop, p.id as placeid, p.countryid as countryid,
> c.plainname as countryname, p.timezone as timezone, r.id as regionid,
> r.plainname as regionname from places p, regions r, countries c, cities c2
> where c2.id = p.cityid AND p.settingid = 1 AND p.regionid > 1 AND
> p.countryid=c.id AND r.id=p.regionid"
>            transformer="TemplateTransformer">
>            
>            
>            
>            
>            
>            
>            
>            
>            
>            
>    
>  
> 
>
>
>
>
> On Thu, Jan 19, 2012 at 11:52 AM, Robert Muir  wrote:
>
>> I don't think the problem is FST, since it sorts offline in your case.
>>
>> More importantly, what are you trying to put into the FST?
>>
>> it appears you are indexing terms from your term dictionary, but your
>> term dictionary is over 1GB, why is that?
>>
>> what do your terms look like? 1GB for 2,784,937 documents does not make
>> sense.
>> for example, all place names in geonames (7.2M documents) creates a
>> term dictionary of 22MB.
>>
>> So there is something wrong with your data importing and/or analysis
>> process, your terms are not what you think they are.
>>
>> On Thu, Jan 19, 2012 at 11:27 AM, Dave  wrote:
>> > I'm also seeing the error when I try to start up the SOLR instance:
>> >
>> > SEVERE: java.lang.OutOfMemoryError: Java heap space
>> > at org.apache.lucene.util.ArrayUtil.grow(ArrayUtil.java:344)
>> >  at org.apache.lucene.util.ArrayUtil.grow(ArrayUtil.java:352)
>> > at org.apache.lucene.util.fst.FST$BytesWriter.writeByte(FST.java:975)
>> >  at org.apache.lucene.util.fst.FST.writeLabel(FST.java:395)
>> > at org.apache.lucene.util.fst.FST.addNode(FST.java:499)
>> >  at org.apache.lucene.util.fst.Builder.compileNode(Builder.java:182)
>> > at org.apache.lucene.util.fst.Builder.freezeTail(Builder.java:270)
>> >  at org.apache.lucene.util.fst.Builder.add(Builder.java:365)
>> > at
>> >
>> org.apache.lucene.search.suggest.fst.FSTCompletionBuilder.buildAutomaton(FSTCompletionBuilder.java:228)
>> >  at
>> >
>> org.apache.lucene.search.suggest.fst.FSTCompletionBuil

Re: Plural only stemmer

2010-06-17 Thread Robert Muir
I created LUCENE-2503 to address this.

On Thu, Jun 17, 2010 at 12:56 PM, Rachel Arbit  wrote:

> Hi all,
> I'm having trouble finding a stemmer that's less aggressive than the
> porter-stemmer, ideally, one that does only plural stemming.
> I've been trying to get KStem to work by copying the lucid-kstem and
> lucid-solr-kstem jars from the lucid distribution into solr/lib, but I get
> a
> classNotFound Exception for CharArraySet when I do that.
>
> Does anyone know where I can get a stemmer that fits my needs, or have tips
> on how to make it work with the KStem jars?
>
> Thanks!
>



-- 
Robert Muir
rcm...@gmail.com


Re: MappingCharFilterFactory equivalent for use after tokenizer?

2010-06-18 Thread Robert Muir
On Fri, Jun 18, 2010 at 7:11 PM, Lance Norskog  wrote:

> Indeed. Also, it should be possible to output multiple synonyms based
> on the mapping: word_with_umlaut should become word_with_u and
> word_with_ue as synonyms. (Ok, maybe this example is wrong, but it
> illustrates the idea.)
>
>
I don't think we should do this. How many tokens would a word like that make?
(Such malformed input exists in the wild, e.g. someone spills beer on their
keyboard and the key gets sticky.)

-- 
Robert Muir
rcm...@gmail.com


Re: fuzzy query performance

2010-06-23 Thread Robert Muir
On Wed, Jun 23, 2010 at 3:34 PM, Peter Karich  wrote:

>
> So, you mean I should try it out her:
> http://svn.apache.org/viewvc/lucene/dev/trunk/solr/
>
>
yes, the speedups are only in trunk.

-- 
Robert Muir
rcm...@gmail.com


Re: Stemmed and/or unStemmed field

2010-06-23 Thread Robert Muir
On Wed, Jun 23, 2010 at 3:58 PM, Vishal A.
wrote:

>
> Here is what I am trying to do :  Someone clicks on  'Comforters & Pillows'
> , we would want the results to be filtered where title has keyword
> 'Comforter' or  'Pillows' but we have been getting results with word
> 'comfort' in the title. I assume it is because of stemming. What is the
> right way to handle this?
>

from your examples, it seems a more lightweight stemmer might be an easy
option: https://issues.apache.org/jira/browse/LUCENE-2503

-- 
Robert Muir
rcm...@gmail.com


Re: NGramFilterFactory usage

2010-06-26 Thread Robert Muir
Yes, you need to use the NGramFilter at query time too.
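
A sketch of the field type with the filter applied on both sides (the
minGramSize and the lowercase filter are assumptions; maxGramSize="15" is kept
from your configuration):

  <fieldType name="nGram_text" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.NGramFilterFactory" minGramSize="3" maxGramSize="15"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.NGramFilterFactory" minGramSize="3" maxGramSize="15"/>
    </analyzer>
  </fieldType>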

On Sat, Jun 26, 2010 at 3:55 PM, Indika Tantrigoda wrote:

> Hi all,
>
> I've been working with Solr for while and the search components work as
> expected.
> Recently I've had the requirement to do searching on partial words and I
> setup the NGramFilterFactory.
>
> My schema.xml is as follows :
>
> positionIncrementGap="100" stored="false" multiValued="true">
>
>
>
>  maxGramSize="15"/>
>
>
>
>
>
>
>
>  multiValued="false"/>
>  multiValued="true"/>
> 
>
> Furthermore I am using the dismax query hanlder and have set a boost on the
> nGram_text field.
>
> If I do a *:* on the Solr administration interface it shows the nGram_text
> field to be populated.
> However if I search for plan (Assume I indexed the word Plane) no results
> are shown.
> Is there any other configurations that needs to be done ?
>
> Thanks in advance,
>
> Regards,
> Indika
>



-- 
Robert Muir
rcm...@gmail.com


Re: Indexing slowdowns

2010-07-08 Thread Robert Muir
On Thu, Jul 8, 2010 at 7:44 PM, Mark Holland wrote:

>
> Can anyone suggest where I might start looking for answers? I have a
> yourkit
> snapshot if anyone would care to see it.
>
>
Doesn't sound good. I'd like to see whatever data you can provide (I worry
it might be something in analysis).


-- 
Robert Muir
rcm...@gmail.com


Re: Polish language support?

2010-07-09 Thread Robert Muir
Hi Peter,

this stemmer is integrated into trunk and 3x.

http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/stempel/
http://svn.apache.org/repos/asf/lucene/dev/branches/branch_3x/lucene/contrib/analyzers/stempel/


On Fri, Jul 9, 2010 at 2:38 PM, Peter Wolanin wrote:

> In IRC trying to help someone find Polish-language support for Solr.
>
> Seems lucene has nothing to offer?  Found one stemmer that looks to be
> compatibly licensed in case someone wants to take a shot at
> incorporating it:  http://www.getopt.org/stempel/
>
> -Peter
>
> --
> Peter M. Wolanin, Ph.D.
> Momentum Specialist,  Acquia. Inc.
> peter.wola...@acquia.com
>



-- 
Robert Muir
rcm...@gmail.com


Re: Foreign characters question

2010-07-14 Thread Robert Muir
is your synonyms file in UTF-8 encoding?

On Wed, Jul 14, 2010 at 11:11 AM, Blargy  wrote:

>
> Thanks for the reply but that didnt help.
>
> Tomcat is accepting foreign characters, but for some reason when it reads
> the
> synonyms file and it encounters the character ñ, it doesn't appear correctly
> in the Field Analysis admin. It shows up as �. If I query exactly for ñ it
> will work, but the synonyms file is screwy.
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Foreign-characters-question-tp964078p966740.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>



-- 
Robert Muir
rcm...@gmail.com


Re: Foreign characters question

2010-07-14 Thread Robert Muir
On Wed, Jul 14, 2010 at 12:59 PM, Blargy  wrote:

>
> Nevermind. Apparently my IDE (Netbeans) was set to "No encoding"... wtf.
> Changed it to UTF-8 and recreated the file and all is good now. Thanks!
>
>
FYI, I created an issue with your example here:
https://issues.apache.org/jira/browse/SOLR-2003

In this case, the wrong encoding could have been detected and saved you some
time...

-- 
Robert Muir
rcm...@gmail.com


Re: Error in building Solr-Cloud (ant example)

2010-07-15 Thread Robert Muir
ymbol - pasted below).
> >>
> >> I also get a bunch of warnings:
> >> [javac] Note: Some input files use or override a deprecated API.
> >> [javac] Note: Recompile with -Xlint:deprecation for details.
> >> I have tried both Java 1.5 and 1.6.
> >>
> >>
> >> Before I got to this point, I was having problems with the included
> >> ZooKeeper jar (java versioning issue) - so I had to download the source
> and
> >> build this. Now 'ant' gets a bit further, to the stage listed above.
> >>
> >> Any idea of the problem??? THANKS!
> >>
> >> [javac] Compiling 438 source files to
> >> /Volumes/newpart/solrcloud/cloud/build/solr
> >> [javac]
> >>
> /Volumes/newpart/solrcloud/cloud/src/java/org/apache/solr/cloud/ZkController.java:588:
> >> cannot find symbol
> >> [javac] symbol  : method stringPropertyNames()
> >> [javac] location: class java.util.Properties
> >> [javac] for (String sprop :
> >> System.getProperties().stringPropertyNames()) {
> >>
> >
> >
> >
>



-- 
Robert Muir
rcm...@gmail.com


Re: How to get search results taking into account ortographies errors ???

2010-07-15 Thread Robert Muir
I think you want to look at using solr.ASCIIFoldingFilterFactory:
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.ASCIIFoldingFilterFactory
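
A sketch of a field type using it (the names are placeholders); the important
part is that the same folding runs at both index and query time, so accented
and unaccented spellings meet on the same terms:

  <fieldType name="text_folded" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.ASCIIFoldingFilterFactory"/>
    </analyzer>
  </fieldType>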

On Thu, Jul 15, 2010 at 12:43 PM, Ariel  wrote:

> Hi everybody, I am working with Apache Solr and Django with Spanish
> documents, and I would like that when a user makes a search and forgets to
> accent the words, the search results show both possibilities: the results
> without the accent and the results with the accent.
>
> would you help me please ???
> Regards
> Ariel
>



-- 
Robert Muir
rcm...@gmail.com


Re: slovene language support

2010-07-19 Thread Robert Muir
Hello,

There is some information here (a prototype stemmer) about support in
Snowball, but Martin Porter had some unanswered questions/reservations, so
nothing ever got added to Snowball:

http://snowball.tartarus.org/archives/snowball-discuss/0725.html

Of course you could take that stemmer and generate Java code with the
Snowball code generator and use it, but it seems like it would be best for
those issues to get resolved and get it fixed/included in Snowball itself...

On Mon, Jul 19, 2010 at 10:42 AM, Markus Goldbach  wrote:

> Hi,
>
> I want to setup an solr with support for several languages.
> The language list includes slovene, unfortunately I found nothing about it
> in the wiki.
> Has some one experiences with solr 1.4 and slovene?
>
> thanks for help
> Markus




-- 
Robert Muir
rcm...@gmail.com


Re: Stemming

2010-07-20 Thread Robert Muir
http://wiki.apache.org/solr/LanguageAnalysis#solr.StemmerOverrideFilterFactory
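
For the "bags" example in the question below, a sketch of how this is wired in
(the dictionary file name is a placeholder); the filter goes in the analyzer
chain before the stemmer:

  <filter class="solr.StemmerOverrideFilterFactory" dictionary="stemdict.txt" ignoreCase="true"/>

and stemdict.txt contains one tab-separated word/stem pair per line, e.g.:

  bags	bag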

On Tue, Jul 20, 2010 at 5:53 PM, Blargy  wrote:

>
> I am using the LucidKStemmer and I noticed that it doesn't stem certain
> words... for example "bags". How could I create a list of explicit words to
> stem... i.e. sort of the opposite of protected words?
>
> I know this can be accomplished using the synonyms file, but I want to know
> how to just replace one word with another.
>
> "This is a bags test" => "This is a bag test"
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Stemming-tp982690p982690.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>



-- 
Robert Muir
rcm...@gmail.com


Re: Stemming

2010-07-20 Thread Robert Muir
https://issues.apache.org/jira/browse/LUCENE-2055

On Tue, Jul 20, 2010 at 7:01 PM, Blargy  wrote:

>
> Perfect!
>
> Is there an associated JIRA ticket/patch for this so I can patch my 4.1
> build?
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Stemming-tp982690p982786.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>



-- 
Robert Muir
rcm...@gmail.com


Re: Russian stemmer

2010-07-27 Thread Robert Muir
All of your examples stem to "ковров":

   assertAnalyzesTo(a, "Коврова Коврову Ковровом Коврове",
       new String[] { "ковров", "ковров", "ковров", "ковров" });

Are you sure you enabled this at *both* index and query time?

2010/7/27 Oleg Burlaca 

> Hello,
>
> I'm using SnowballPorterFilterFactory with language="Russian".
> The stemming works ok except people names, geographical places.
> Here are some examples:
>
> searching for Ковров should also find Коврова, Коврову, Ковровом, Коврове.
>
> Are there other stemming plugins for the russian language that can handle
> this?
> If not, what are the options. A simple solution may be to use the wildcard
> queries in Standard mode instead of the DisMaxQueryHandler:
> Ковров*
>
> but I'd like to avoid it.
>
> Thanks.
>



-- 
Robert Muir
rcm...@gmail.com


Re: Russian stemmer

2010-07-27 Thread Robert Muir
On another look, your problem is ковров itself... it's mapped to ковр.

A workaround might be to use the protected-words functionality to
keep ковров and any other problematic people/geo names as-is.

Separately, in trunk there is an alternative Russian stemmer
(RussianLightStemFilterFactory), which might give you fewer problems on
average, but I noticed it has this same problem with the example you gave.
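
A hedged sketch of that workaround (file name assumed): the Snowball factory
accepts a protected-words file, and any token listed there (compare against
the lowercased form) is passed through unstemmed:

   <filter class="solr.SnowballPorterFilterFactory" language="Russian" protected="protwords.txt"/>

   protwords.txt (one term per line):
   ковров
   немцов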

On Tue, Jul 27, 2010 at 4:25 AM, Robert Muir  wrote:

> All of your examples stem to "ковров":
>
>assertAnalyzesTo(a, "Коврова Коврову Ковровом Коврове",
>   new String[] { "ковров", "ковров", "ковров", "ковров" });
> }
>
> Are you sure you enabled this at *both* index and query time?
>
> 2010/7/27 Oleg Burlaca 
>
> Hello,
>>
>> I'm using SnowballPorterFilterFactory with language="Russian".
>> The stemming works ok except people names, geographical places.
>> Here are some examples:
>>
>> searching for Ковров should also find Коврова, Коврову, Ковровом, Коврове.
>>
>> Are there other stemming plugins for the russian language that can handle
>> this?
>> If not, what are the options. A simple solution may be to use the wildcard
>> queries in Standard mode instead of the DisMaxQueryHandler:
>> Ковров*
>>
>> but I'd like to avoid it.
>>
>> Thanks.
>>
>
>
>
> --
> Robert Muir
> rcm...@gmail.com
>



-- 
Robert Muir
rcm...@gmail.com


Re: Russian stemmer

2010-07-27 Thread Robert Muir
2010/7/27 Oleg Burlaca 

> Actually the situation with Немцов is OK,
> I've just checked how Yandex works with Немцов and Немцова:
> http://nano.yandex.ru/project/inflect/
>
> I think there are two solutions:
> a) manually search for both Немцов and then Немцова
> b) use wildcard query: Немцов*
>

Well, here is one idea for a more general solution.
The problem with "protected words" is that you must have a complete list.

One idea would be to add a filter that protects from stemming any words that
match a regular expression:
in English maybe someone wants to exempt any capitalized words to reduce
trouble: [A-Z].*
In your case some pattern like [А-Я].*ов might prevent problems.
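
Purely to illustrate the idea (the pattern-marking factory below is
hypothetical; nothing like it shipped at the time of this thread), such a
chain might look like:

   <!-- HYPOTHETICAL filter: mark terms matching the regex as keywords
        so the stemmer below leaves them untouched -->
   <filter class="solr.PatternKeywordMarkerFilterFactory" pattern="[А-Я].*ов"/>
   <filter class="solr.SnowballPorterFilterFactory" language="Russian"/>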


> Robert, thanks for the RussianLightStemFilterFactory info,
> I've found this page
> http://www.mail-archive.com/solr-comm...@lucene.apache.org/msg06857.html
> that somehow describes it. Where can I read more about
> RussianLightStemFilterFactory ?
>
>
Here is the link:
http://doc.rero.ch/lm.php?url=1000,43,4,20091209094227-CA/Dolamic_Ljiljana_-_Indexing_and_Searching_Strategies_for_the_Russian_20091209.pdf


> Regards,
> Oleg
>
> 2010/7/27 Oleg Burlaca 
>
> > A similar word is Немцов.
> > The strange thing is that searching for "Немцова" will not find documents
> > containing "Немцов"
> >
> > Немцова: 14 articles
> >
> >
> http://www.sova-center.ru/search/?lg=1&q=%D0%BD%D0%B5%D0%BC%D1%86%D0%BE%D0%B2%D0%B0
> >
> > Немцов: 74 articles
> >
> >
> http://www.sova-center.ru/search/?lg=1&q=%D0%BD%D0%B5%D0%BC%D1%86%D0%BE%D0%B2
> >
> >
> >
> >
>



-- 
Robert Muir
rcm...@gmail.com


Re: Russian stemmer

2010-07-27 Thread Robert Muir
Right, but your problem is that this is the current output:

Ковров -> Ковр
Коврову -> Ковров
Ковровом -> Ковров
Коврове -> Ковров

so, if Ковров was simply left alone, all your forms would match...

2010/7/27 Oleg Burlaca 

> Thanks Robert for all your help,
>
> The idea of [A-Z].* protected words is ideal for the English language,
> although in Russian nouns are inflected: Борис, Борису, Бориса, Борисом
>
> I'll try the RussianLightStemFilterFactory (the article in the PDF
> mentioned
> it's more accurate).
>
> Once again thanks,
> Oleg Burlaca
>
> On Tue, Jul 27, 2010 at 12:07 PM, Robert Muir  wrote:
>
> > 2010/7/27 Oleg Burlaca 
> >
> > > Actually the situation with Немцов из ок,
> > > I've just checked how Yandex works with Немцов and Немцова:
> > > http://nano.yandex.ru/project/inflect/
> > >
> > > I think there are two solutions:
> > > a) manually search for both Немцов and then Немцова
> > > b) use wildcard query: Немцов*
> > >
> >
> > Well, here is one idea of a more general solution.
> > The problem with "protected words" is you must have a complete list.
> >
> > One idea would be to add a filter that protects any words from stemming
> > that
> > match a regular expression:
> > In english maybe someone wants to avoid any capitalized words to reduce
> > trouble: [A-Z].*
> > in your case then some pattern like [A-Я].*ов might prevent problems.
> >
> >
> > > Robert, thanks for the RussianLightStemFilterFactory info,
> > > I've found this page
> > >
> http://www.mail-archive.com/solr-comm...@lucene.apache.org/msg06857.html
> > > that somehow describes it. Where can I read more about
> > > RussianLightStemFilterFactory ?
> > >
> > >
> > Here is the link:
> >
> >
> http://doc.rero.ch/lm.php?url=1000,43,4,20091209094227-CA/Dolamic_Ljiljana_-_Indexing_and_Searching_Strategies_for_the_Russian_20091209.pdf
> >
> >
> > > Regards,
> > > Oleg
> > >
> > > 2010/7/27 Oleg Burlaca 
> > >
> > > > A similar word is Немцов.
> > > > The strange thing is that searching for "Немцова" will not find
> > documents
> > > > containing "Немцов"
> > > >
> > > > Немцова: 14 articles
> > > >
> > > >
> > >
> >
> http://www.sova-center.ru/search/?lg=1&q=%D0%BD%D0%B5%D0%BC%D1%86%D0%BE%D0%B2%D0%B0
> > > >
> > > > Немцов: 74 articles
> > > >
> > > >
> > >
> >
> http://www.sova-center.ru/search/?lg=1&q=%D0%BD%D0%B5%D0%BC%D1%86%D0%BE%D0%B2
> > > >
> > > >
> > > >
> > > >
> > >
> >
> >
> >
> > --
> > Robert Muir
> > rcm...@gmail.com
> >
>



-- 
Robert Muir
rcm...@gmail.com


Re: Good list of English words that get "butchered" by Porter Stemmer

2010-07-30 Thread Robert Muir
Otis,

I think this is a great idea.

you could also go even further by making a better example for
StemmerOverrideFilter (stemdict.txt)
(
http://wiki.apache.org/solr/LanguageAnalysis#solr.StemmerOverrideFilterFactory
)

for example:
animated  animate
animation  animation
animations  animation

This might be a bit better (but more work!) than protected words, since then
you could let animation and animations conflate, rather than just forcing
them all to stay unchanged. I wouldn't go crazy and worry about animator
matching animation etc., but would at least let plural forms match the
singular, without screwing other things up.
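
A hedged sketch of that wiring (file name assumed): the override filter sits
in front of the Porter stemmer, so the listed forms are mapped explicitly
and skipped by the stemmer, while everything else is stemmed as before:

   <filter class="solr.StemmerOverrideFilterFactory" dictionary="stemdict.txt" ignoreCase="true"/>
   <filter class="solr.PorterStemFilterFactory"/>

   stemdict.txt (one word and its replacement per line, separated by a tab):
   anime	anime
   animated	animate
   animation	animation
   animations	animation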

On Fri, Jul 30, 2010 at 4:41 PM, Otis Gospodnetic <
otis_gospodne...@yahoo.com> wrote:

> Hello,
>
> I'm looking for a list of English words that, when stemmed by the Porter
> stemmer, end up with the same stem as some similar, but unrelated, words.
> Below are some examples:
>
> # this gets stemmed to "iron", so if you search for "ironic", you'll get
> # "iron" matches
> ironic
>
> # same stem as animal
> anime
> animated
> animation
> animations
>
> I imagine such a list could be added to the example protwords.txt
>
> Thanks,
> Otis
> 
> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> Lucene ecosystem search :: http://search-lucene.com/
>
>


-- 
Robert Muir
rcm...@gmail.com


Re: analysis tool vs. reality

2010-08-04 Thread Robert Muir
I think I agree with Justin here: the way the analysis tool highlights
'matches' is extremely misleading, especially considering it completely
ignores query parsing.

It would be better if it put your text in a MemoryIndex and actually parsed
the query with the query parser, ran it, and used the highlighter to try to
show any matches.

On Wed, Aug 4, 2010 at 10:14 AM, Justin Lolofie  wrote:

> Erik: Yes, I did re-index if that means adding the document again.
> Here are the exact steps I took:
>
> 1. analysis.jsp "ABC12" does NOT match title "ABC12" (however, ABC or 12
> does)
> 2. changed schema.xml WordDelimiterFilterFactory catenate-all
> 3. restarted tomcat
> 4. deleted the document with title "ABC12"
> 5. added the document with title "ABC12"
> 6. query "ABC12" does NOT result in the document with title "ABC12"
> 7. analysis.jsp "ABC12" DOES match that document now
>
> Is there any way to see, given an ID, how something is indexed internally?
>
> Lance: I understand the index/query sections of analysis.jsp. However,
> it operates on text that you enter into the form, not on actual index
> data. Since all my documents have a unique ID, I'd like to supply an
> ID and a query, and get back the same index/query sections- using
> whats actually in the index.
>
>
> -- Forwarded message --
> From: Erik Hatcher 
> To: solr-user@lucene.apache.org
> Date: Tue, 3 Aug 2010 22:43:17 -0400
> Subject: Re: analysis tool vs. reality
> Did you reindex after changing the schema?
>
>
> On Aug 3, 2010, at 7:35 PM, Justin Lolofie wrote:
>
>Hi Erik, thank you for replying. So, turning on debugQuery shows
>information about how the query is processed- is there a way to see
>how things are stored internally in the index?
>
>My query is "ABC12". There is a document who's "title" field is
>"ABC12". However, I can only get it to match if I search for "ABC" or
>"12". This was also true in the analysis tool up until recently.
>However, I changed schema.xml and turned on catenate-all in
>WordDelimterFilterFactory for title fieldtype. Now, in the analysis
>tool "ABC12" matches "ABC12". However, when doing an actual query, it
>does not match.
>
>Thank you for any help,
>Justin
>
>
>-- Forwarded message --
>From: Erik Hatcher 
>To: solr-user@lucene.apache.org
>Date: Tue, 3 Aug 2010 16:50:06 -0400
>Subject: Re: analysis tool vs. reality
>The analysis tool is merely that, but during querying there is also a
>query parser involved.  Adding debugQuery=true to your request will
>give you the parsed query in the response offering insight into what
>might be going on.   Could be lots of things, like not querying the
>fields you think you are to a misunderstanding about some text not
>being analyzed (like wildcard clauses).
>
> Erik
>
>On Aug 3, 2010, at 4:43 PM, Justin Lolofie wrote:
>
>  Hello,
>
>  I have found the analysis tool in the admin page to be very useful in
>  understanding my schema. I've made changes to my schema so that a
>  particular case I'm looking at matches properly. I restarted solr,
>  deleted the document from the index, and added it again. But still,
>  when I do a query, the document does not get returned in the results.
>
>  Does anyone have any tips for debugging this sort of issue? What is
>  different between what I see in analysis tool and new documents added
>  to the index?
>
>  Thanks,
>   Justin
>



-- 
Robert Muir
rcm...@gmail.com


Re: analysis tool vs. reality

2010-08-04 Thread Robert Muir
On Wed, Aug 4, 2010 at 1:45 PM, Chris Hostetter wrote:

>
> it really only attempts to identify when there is overlap between
> analysis at query time and at indexing time so you can easily spot when
> one analyzer or the other "breaks" things so that they no longer line up
> (or when it "fixes" things so they start to line up)
>

It attempts this badly, because it only "works" in the most trivial of cases
(e.g. it doesn't reflect the interaction of the query parser with multi-word
synonyms or WordDelimiterFilter).

Since Solr includes these non-trivial analysis components *in the example*,
this 'highlight matches' doesn't actually work at all in practice.

Someone is going to use this thing when they don't understand why analysis
isn't doing what they want, i.e. cases like the ones I outlined above.

For the trivial cases where it does "work", the 'highlight matches' isn't
useful anyway, so in its current state it's completely unnecessary.


> Even if we eliminated that highlighting as misleading, people would still
> do it in their minds, it would just be harder -- it doesn't change the
> underlying fact that analysis is only part of the picture.
>

I'm not suggesting that. I'm suggesting fixing the highlighting so it's not
misleading. There are really only two choices:
1. remove the current highlighting
2. fix it.

In its current state it's completely useless and misleading, except for very
trivial cases, in which you don't need it anyway.


>
> : it would be better if it put your text in a memoryindex and actually
> parsed
> : the query w/ queryparser, ran it, and used the highlighter to try to show
> : any matches.
>
> That level of "query explanation" really only works if the user gives us a
> full document (all fields, not just one) and a full query string, and all
> of the possible query params -- because the query parser (either implicit
> because of config, or explicitly specified by the user) might change its
> behavior based on those other params.
>

That's true, but I don't see why the user couldn't be allowed to provide
exactly that.
I'd bet money a lot of people are using this thing with a specific
query/document in mind anyway!


> people can infer from that page.  As I said, I don't think removing the
> "match" highlighting will actually reduce confusion, but perhaps there is
> verbiage/disclaimers that could be added to make it more clear?
>

As I said before, I think I disagree with you. For stuff like this the
technical details are less important; what's important is that this is a
misleading checkbox that really confuses users.

I suggest disabling it entirely; you are only going to remove confusion.


-- 
Robert Muir
rcm...@gmail.com


Re: analysis tool vs. reality

2010-08-04 Thread Robert Muir
Furthermore, I would like to add that it's not just the highlight-matches
functionality that is horribly broken here; the output of the analysis
itself is misleading.

Let's say I take 'textTight' from the example, and add the following synonym:

this is broken => broke

The query-time analysis is wrong: it clearly shows SynonymFilter collapsing
"this is broken" to broke, but in reality, with the query parser for that
field, you are going to get 3 separate token streams and this will never
actually happen (because the query parser will divide it up on whitespace
first).

So really the output from 'Query Analyzer' is completely bogus.

On Wed, Aug 4, 2010 at 1:57 PM, Robert Muir  wrote:

>
>
> On Wed, Aug 4, 2010 at 1:45 PM, Chris Hostetter 
> wrote:
>
>>
>> it really only attempts to identify when there is overlap between
>> analaysis at query time and at indexing time so you can easily spot when
>> one analyzer or the other "breaks" things so that they no longer line up
>> (or when it "fiexes" things so they start to line up)
>>
>
> It attempts badly, because it only "works" in the most trivial of cases
> (e.g. doesnt reflect the interaction of queryparser with multiword synonyms
> or worddelimiterfilter).
>
> Since Solr includes these non-trivial analysis components *in the example*
> it means that this 'highlight matches' doesnt actually even really work at
> all.
>
> Someone is gonna use this thing when they dont understand why analysis isnt
> doing what they want, i.e. the cases like I outlined above.
>
> For the trivial cases where it does "work" the 'highlight matches' isnt
> useful anyway, so in its current state its completely unnecessary.
>
>
>> Even if we eliminated that highlighting as missleading, people would still
>> do it in thier minds, it would just be harder -- it doesn't change the
>> underlying fact that analysis is only part of the picture.
>>
>
> I'm not suggesting that. I'm suggesting fixing the highlighting so its not
> misleading. There are really only two choices:
> 1. remove the current highlighting
> 2. fix it.
>
> in its current state its completely useless and misleading, except for very
> trivial cases, in which you dont need it anyway.
>
>
>>
>> : it would be better if it put your text in a memoryindex and actually
>> parsed
>> : the query w/ queryparser, ran it, and used the highlighter to try to
>> show
>> : any matches.
>>
>> Thta level of "query explanation" really only works if the user gives us a
>> full document (all fields, not just one) and a full query string, and all
>> of the possible query params -- because the query parser (either implicit
>> because of config, or explicitly specified by the user) might change it's
>> behavior based on those other params.
>>
>
> thats true, but I dont see why the user couldnt be allowed to provide just
> this.
> I'd bet money a lot of people are using this thing with a specific
> query/document in mind anyway!
>
>
>> people can infer from that page.  As i said, i don't think removing the
>> "match" highlighting will actaully reduce confusion, but perhaps there is
>> verbage/disclaimers that could be added to make it more clear?
>>
>
>  As i said before, I think i disagree with you. I think for stuff like this
> the technicals are less important, whats important is this is a misleading
> checkbox that really confuses users.
>
> I suggest disabling it entirely, you are only going to remove confusion.
>
>
> --
> Robert Muir
> rcm...@gmail.com
>



-- 
Robert Muir
rcm...@gmail.com


Re: Index compatibility 1.4 Vs 3.1 Trunk

2010-08-05 Thread Robert Muir
On Thu, Aug 5, 2010 at 9:07 PM, Chris Hostetter wrote:

>
> That should still be true in the the official 4.0 release (i really should
> have said "When 4.0 can no longer read SOlr 1.4 indexes"), ...
> i havne't been following the detials closely, but i suspect that tool
> hasn't been writen yet because there isn't much point until the full
> details of the trunk index format are nailed down.
>
>
This is news to me?

File formats are back-compatible between major versions. Version X.N should
be able to read indexes generated by any version after and including version
X-1.0, but may-or-may-not be able to read indexes generated by version
X-2.N.

(And personally I think there is stuff in 2.x, like modified UTF-8, that I
would object to adding support for now that terms are byte[].)

-- 
Robert Muir
rcm...@gmail.com


Re: Improve Query Time For Large Index

2010-08-11 Thread Robert Muir
On Wed, Aug 11, 2010 at 11:47 AM, Burton-West, Tom wrote:

> Hi Peter,
>
> Can you give a few more examples of slow queries?
> Are they phrase queries? Boolean queries? prefix or wildcard queries?
> If one-word queries are your slow queries, then CommonGrams won't help.
>  CommonGrams will only help with phrase queries.
>

Since the example given was "http" being slow, it's worth mentioning that if
queries are "one word" URLs [for example http://lucene.apache.org] these
will actually form slow phrase queries by default.

Because your content is very tiny documents, it's probably good to disable
this, since the phrases likely won't help the results at all but will make
things unbearably slow. In Solr 3.x and trunk, you can disable these
automatic phrase queries in schema.xml with autoGeneratePhraseQueries="false":

<fieldType name="text" class="solr.TextField" positionIncrementGap="100"
           autoGeneratePhraseQueries="false">

Then the system won't form phrase queries unless the user explicitly puts
double quotes around the terms.

-- 
Robert Muir
rcm...@gmail.com


Re: Improve Query Time For Large Index

2010-08-12 Thread Robert Muir
exactly!

On Thu, Aug 12, 2010 at 5:26 AM, Peter Karich  wrote:

> Hi Robert!
>
> >  Since the example given was "http" being slow, its worth mentioning that
> if
> > queries are "one word" urls [for example http://lucene.apache.org] these
> > will actually form slow phrase queries by default.
> >
>
> do you mean that http://lucene.apache.org will be split up into "http
> lucene apache org" and solr will perform a phrase query?
>
> Regards,
> Peter.
>



-- 
Robert Muir
rcm...@gmail.com


Re: analysis tool vs. reality

2010-08-12 Thread Robert Muir
On Thu, Aug 12, 2010 at 7:55 PM, Chris Hostetter
wrote:
>
>
> You say it's bogus because the qp will divide on whitespace first -- but
> you're assuming you know what query parser will be used ... the "field"
> query parser (to name one) doesn't split on whitespace first.  That's my
> point: analysis.jsp doesn't make any assumptions about what query parser
> *might* be used, it just tells you what your analyzers do with strings.
>

You're right: we should just fix the bug where the query parser tokenizes on
whitespace first. Then analysis.jsp will be significantly less confusing.


-- 
Robert Muir
rcm...@gmail.com


Re: analysis tool vs. reality

2010-08-12 Thread Robert Muir
On Thu, Aug 12, 2010 at 8:07 PM, Chris Hostetter
wrote:

>
> : > You say it's bogus because the qp will divide on whitespace first --
> but
> : > you're assuming you know what query parser will be used ... the "field"
> : > query parser (to name one) doesn't split on whitespace first.  That's
> my
> : > point: analysis.jsp doesn't make any assumptions about what query
> parser
> : > *might* be used, it just tells you what your analyzers do with strings.
> : >
> :
> : you're right, we should just fix the bug that the queryparser tokenizes
> on
> : whitespace first. then analysis.jsp will be significantly less confusing.
>
> dude .. not trying to get into a holy war here

Actually, I'm suggesting the practical solution: that we fix the primary
problem that makes it confusing.


> even if you change the Lucene QueryParser so that whitespace isn't a meta
> character it doesn't affect the underlying issue: analysis.jsp is agnostic
> about QueryParsers.


analysis.jsp isn't agnostic about query parsers, it's ignorant of them, and
your default query parser is actually a de facto whitespace tokenizer; don't
try to sugarcoat it.

-- 
Robert Muir
rcm...@gmail.com


Re: Index compatibility 1.4 Vs 3.1 Trunk

2010-08-12 Thread Robert Muir
On Thu, Aug 12, 2010 at 8:29 PM, Chris Hostetter
wrote:

>
> It was a big part of the proposal regarding the creation of the 3x
> branch ... that index format compatibility between major versions would
> no longer be supported by silently converting on first write -- instead
> there would be a tool for explicit conversion...
>
>
> http://search.lucidimagination.com/search/document/c10057266d3471c6/proposal_about_version_api_relaxation
>
> http://search.lucidimagination.com/search/document/c494a78f1ec1bfb5/lucene_3_x_branch_created
>
>
>
Hoss, did you actually *read* these documents?


"We will only provide a conversion tool that can convert indexes from
the last "branch_3x" up to this trunk (4.0) release, so they can be
read later, but may not contain terms with all current analyzers, so
people need mostly reindexing. Older indexes will not be able to be
read natively without conversion first (with maybe loss of analyzer
compatibility)."



The fact that 4.0 can read 3.x indexes *at all* without a converter tool is
only because Mike McCandless went the extra mile.


I don't see anything suggesting we should support any tools for 2.x indexes!

-- 
Robert Muir
rcm...@gmail.com


Re: analysis tool vs. reality

2010-08-16 Thread Robert Muir
On Mon, Aug 16, 2010 at 4:20 PM, Chris Hostetter
wrote:

>
> Even if you convince folks to make every change you think should be made
> to the Lucene QueryParser (again: please take that up in a separate
> thread) it won't change the fact that people using analysis.jsp should
> understand the distinction between Query Parsing and Analysis -- unless
> you plan on getting rid of every metacharacter that the Lucene QueryParser
> uses to decide what types of Query to build (ie: '"', '-', '"', '*', '?')
> and unless you plan on forcing Solr users to only ever use that one
> QueryParser, then no matter what the Lucene QueryParser does with
> whitespace, users still need to understand the distinction between Query
> Parsing and Analysis so they don't type 'Foo*' into analysis.jsp and then
> ask why it says that will match "food" but it doesn't actually match at
> query time. (surprise surprise: Query Parsing is not the same as analysis,
> and when the QueryParser sees wildcards it doesn't use the analyzer)
>
>
Maybe for once your argument isn't completely bogus: the surprise is
actually key here. There's really nothing documenting the various
hacks/limitations in the query parsers, such as auto-tokenizing on
whitespace.

I think the 'expanded terms' not being analyzed is similar; it's not really
documented well. That's probably why it seems to come up on the mailing list
at least every week [at this point you have to admit there is a problem].

If you want to say the analysis tool is agnostic about query parsers, that's
fine, you can keep saying that. I'm saying it shouldn't be.


-- 
Robert Muir
rcm...@gmail.com


Re: analysis tool vs. reality

2010-08-16 Thread Robert Muir
On Mon, Aug 16, 2010 at 5:23 PM, Steven A Rowe  wrote:

> Hi Robert,
>
> You wrote in response to Hoss:
> > Maybe for once your argument isn't completely bogus
>
> Attacking people here is really uncalled for.
>
>
actually, he asked for it:

> you're right, we should just fix the bug that the queryparser tokenizes on
> whitespace first. then analysis.jsp will be significantly less confusing.

>> dude .. not trying to get into a holy war here


> -1 from me.
>
>
well, that might be your opinion, but it doesn't change the facts.

-- 
Robert Muir
rcm...@gmail.com


Re: TurkishLowerCaseFilterFactory

2010-08-26 Thread Robert Muir
On Thu, Aug 26, 2010 at 7:28 AM, Yavuz Selim YILMAZ  wrote:

> I downloaded latest jars except snowball 3-1.jar. I can't find it any
> place?
> --
>
> Yavuz Selim YILMAZ
>
>
Hello,

In 3.1, contrib/snowball has been integrated into contrib/analyzers, so you
just need the analyzers jar!

This way, in a single jar you have the TurkishLowerCaseFilter, but also the
Turkish stemmer from snowball, a set of Turkish stopwords in resources/, and
a Lucene TurkishAnalyzer that puts it all together.
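
For reference, a minimal 3.1-style field type sketch (names assumed, not
from this thread) that pulls those pieces together from the analyzers jar:

   <fieldType name="text_tr" class="solr.TextField" positionIncrementGap="100">
     <analyzer>
       <tokenizer class="solr.StandardTokenizerFactory"/>
       <!-- Turkish-aware lowercasing handles dotted vs. dotless i correctly -->
       <filter class="solr.TurkishLowerCaseFilterFactory"/>
       <filter class="solr.SnowballPorterFilterFactory" language="Turkish"/>
     </analyzer>
   </fieldType>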

-- 
Robert Muir
rcm...@gmail.com


Re: shingles work in analyzer but not real data

2010-09-01 Thread Robert Muir
On Wed, Sep 1, 2010 at 8:21 AM, Jeff Rose  wrote:

> Hi,
>  We are using SOLR to match query strings with a keyword database, where
> some of the keywords are actually more than one word.  For example a
> keyword
> might be "apple pie" and we only want it to match for a query containing
> that word pair, but not one only containing "apple".  Here is the relevant
> piece of the schema.xml, defining the index and query pipelines:
>
> [schema.xml fieldType definition with index- and query-time analyzers not
> preserved in the archive]
>
> In the analysis tool this schema looks like it works correctly.  Our
> multi-word keywords are indexed as a single entry, and then when a search
> phrase contains one of these multi-word keywords it is shingled and
> matched.
>  Unfortunately, when we do the same queries on top of the actual index it
> responds with zero matches.  I can see in the index histogram that the
> terms
> are correctly indexed from our mysql datasource containing the keywords,
> but
> somehow the shingling doesn't appear to work on this live data.  Does
> anyone
> have experience with shingling that might have some tips for us, or
> otherwise advice for debugging the issue?
>

Query-time shingling probably isn't working with the query parser you are
using: the default Lucene one first splits on whitespace before sending the
text to the analyzer, e.g. a query of foo bar is processed as
TokenStream(foo) + TokenStream(bar).

So query-time shingling like this doesn't work the way you expect, for this
reason.
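
For context, an index-side shingle chain typically looks something like the
sketch below (attribute values assumed, not taken from the poster's schema).
The catch described above is that the default query parser hands the
analyzer one whitespace-separated chunk at a time, so an equivalent
query-side chain never sees "apple pie" as a single string and no shingle is
produced at query time:

   <analyzer type="index">
     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
     <filter class="solr.LowerCaseFilterFactory"/>
     <!-- "apple pie" is indexed as the single shingle token "apple pie" -->
     <filter class="solr.ShingleFilterFactory" maxShingleSize="2" outputUnigrams="true"/>
   </analyzer>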


-- 
Robert Muir
rcm...@gmail.com

