EdgeNGramTokenFilter, term position?

2007-09-16 Thread Ryan McKinley
Should the EdgeNGramTokenFilter use the same term position for all the 
n-grams generated from a single token?


As is, the EdgeNGramTokenFilter increments the term position for each 
n-gram it emits.  In analysis.jsp, with the input "hello", I get:


term position   1     2     3     4     5
term text       h     he    hel   hell  hello
term type       word  word  word  word  word
start,end       0,1   0,2   0,3   0,4   0,5


I would expect something more like what is generated from SOLR-357:

term position   1
term text       hello
                hell
                hel
                he
                h
term type       word
                prefix
                prefix
                prefix
                prefix
start,end       0,5
                0,4
                0,3
                0,2
                0,1

This seems like it would affect slop queries, but I don't really 
understand them yet.
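
To make the two layouts concrete, here is a minimal Python sketch. It is not the Lucene filter code; `edge_ngrams`, `positions` and the `same_position` flag are made-up names for illustration. It emits front-edge n-grams either with one position per gram (the current behaviour) or with all grams stacked on the position of the source token. The grams come out shortest first, unlike the SOLR-357 listing above.

```python
def edge_ngrams(token, min_gram=1, max_gram=None, same_position=False):
    """Yield (text, position_increment, start, end) for front-edge n-grams.

    Hypothetical helper: same_position=False mimics "one position per gram",
    same_position=True stacks every gram on the token's own position.
    """
    max_gram = len(token) if max_gram is None else min(max_gram, len(token))
    for i, size in enumerate(range(min_gram, max_gram + 1)):
        # The first gram always advances the position; later grams only
        # advance it when each gram gets its own position.
        increment = 1 if (i == 0 or not same_position) else 0
        yield token[:size], increment, 0, size


def positions(grams):
    """Turn position increments into absolute 1-based positions."""
    pos = 0
    out = []
    for text, increment, start, end in grams:
        pos += increment
        out.append((pos, text, start, end))
    return out


if __name__ == "__main__":
    for mode in (False, True):
        print("same_position =", mode)
        for row in positions(edge_ngrams("hello", same_position=mode)):
            print("  ", row)
```

Phrase and slop queries measure distance in term positions, so the two modes place the grams at different distances from neighbouring tokens.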


thanks
ryan


Control index/store at document level

2007-09-16 Thread Bharani

Hi,

Is it possible to turn off the "store" option based on a field value? I would
like to index and store the primary document, but for all revisions I only
need to index the text, not store it.

Thanks
Bharani
-- 
View this message in context: 
http://www.nabble.com/Control-index-store-at-document-level-tf4450446.html#a12697579
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Control index/store at document level

2007-09-16 Thread Ryan McKinley


Nope, the field options are created on startup -- you can't change them 
dynamically (I don't know all the details, but I think it is a file 
format issue, not just a configuration issue).


I'm not sure how your app is structured, but from what you describe, it 
sounds like you need two fields: one that is indexed and not stored, and 
another that is stored and not indexed.  For each revision, put the text 
into the indexed field.  For the primary document, put it in both.
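
A rough sketch of that two-field idea on the client side, in Python. The field names `text_indexed` and `text_stored` and the helper `build_doc` are assumptions for illustration only; the real names would come from your schema.xml, where one field would be declared indexed but not stored and the other stored but not indexed.

```python
def build_doc(doc_id, text, is_primary):
    """Build the field map for one document.

    Hypothetical field names:
      text_indexed -- assumed indexed="true" stored="false" in the schema
      text_stored  -- assumed indexed="false" stored="true" in the schema
    """
    doc = {"id": doc_id, "text_indexed": text}
    if is_primary:
        # Only the primary document keeps a retrievable copy of the text.
        doc["text_stored"] = text
    return doc


if __name__ == "__main__":
    print(build_doc("doc1", "current text", is_primary=True))
    print(build_doc("doc1-rev3", "older text", is_primary=False))
```

Revisions stay searchable through `text_indexed`, while only the primary document carries text that can be returned in results.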


ryan






Indexing Speed

2007-09-16 Thread erolagnab

Hi,

Just an FYI.

I've seen some posts mentioning that Solr can index 100-150 docs/s, and
comparisons between embedded Solr and HTTP. I tried indexing
1.7+ million docs; each doc has 30 fields, of which 10 are
indexed and stored and the rest are only stored. The result was pretty
impressive: it took approximately 1.4 hours to finish. Note that the docs were
sent synchronously, one after the other. The Solr server and client were
both running on a Pentium Dual Core 3.2, 2G RAM, Ubuntu Feisty.
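
As a rough check on those figures, the average throughput works out to a bit over 330 docs/s. A small Python sketch of the arithmetic, taking 1.7 million docs and 1.4 hours as exact even though both numbers are approximate in the post (`docs_per_second` is just an illustrative helper):

```python
def docs_per_second(total_docs, hours):
    """Average indexing rate over the whole run."""
    return total_docs / (hours * 3600)


if __name__ == "__main__":
    rate = docs_per_second(1_700_000, 1.4)
    print(f"{rate:.0f} docs/s")  # about 337 docs/s
```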

The only issue I noticed is that Solr does occupy some amount of memory. In
the first run, after indexing around 500 thousand docs, it threw an
OutOfMemory exception. In the second trial, I set -Xms and -Xmx so that the
JVM ran with 1G of memory, and Solr ran through to the finish.

Some questions:
1) Is it good practice to have Solr index docs in real time (millions of
docs per day)? What I'm worried about is that Solr may eat up memory as it
goes.
2) If docs are sent asynchronously, how fast can Solr index?

Any comments are highly appreciated.

Trung
-- 
View this message in context: 
http://www.nabble.com/Indexing-Speed-tf4464036.html#a12728679



Solr - rudimentary problems

2007-09-16 Thread Venkatraman S
We are using Lucene and are migrating to Solr 1.2 (we are using Embedded
Solr). During this process we are stumbling on certain problems :

1) If the same document is added again, it is getting added to the
index again (duplicated), in spite of the fact that the IDs are unique across
documents. The document should be updated in the index instead.
 The corresponding entry for this field in schema.xml is :
 

2) Also, when deleting a document by providing its ID (exactly
as in the deleteById proc in the Embedded Solr example), we find that
the document is not getting deleted (and we do not get any errors either).

3) While using facets, we are getting the stemmed versions of the
corresponding words in the faceted fields - how do we get the 'original'
word?
As in, 'intenti' for 'intentional' etc

As I am new to Solr and did not find anything in the documentation or on JIRA,
I have posted these here. Any help would be highly appreciated.

-Venkat

--


RE: Solr - rudimentary problems

2007-09-16 Thread Stu Hood
With regard to #3, it is recommended that for faceting you use a separate 
copy of the field with stemming/tokenizing disabled. See: 
http://wiki.apache.org/solr/SolrFacetingOverview#head-fc68926c8421055de872acc694a6a966fab705d6

Thanks,
Stu




Re: Solr - rudimentary problems

2007-09-16 Thread Ryan McKinley

Venkatraman S wrote:

> We are using Lucene and are migrating to Solr 1.2 (we are using Embedded
> Solr). During this process we are stumbling on certain problems :
>
> 1) IF the same document is added again, then it it getting added in the
> index again(duplicated); inspite of the fact that the IDs are unique across
> documents. This document should be updated in the Index.
>  The corresponding entry for this field in schema.xml is :



Do you have:
  <uniqueKey>id</uniqueKey>
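
Solr only replaces an existing document on add when the schema declares a unique key field. Here is a toy Python sketch of that overwrite rule. `ToyIndex` is a made-up in-memory stand-in, not Solr's API.

```python
class ToyIndex:
    """In-memory stand-in for an index (hypothetical, for illustration)."""

    def __init__(self, unique_key=None):
        self.unique_key = unique_key
        self.docs = []

    def add(self, doc):
        if self.unique_key is not None:
            key = doc[self.unique_key]
            # Drop any earlier document with the same key before adding.
            self.docs = [d for d in self.docs if d[self.unique_key] != key]
        self.docs.append(doc)


if __name__ == "__main__":
    with_key = ToyIndex(unique_key="id")
    without_key = ToyIndex()
    for index in (with_key, without_key):
        index.add({"id": "1", "title": "first version"})
        index.add({"id": "1", "title": "second version"})
    print(len(with_key.docs), len(without_key.docs))
```

With a unique key declared, the second add replaces the first. Without one, both copies stay in the index, which matches the duplication described above.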



> 2) Also, at the time of deleting a document, by providing its ID(exactly
> similar to the deleteById proc in the Embedded Solr example) , we find that
> the document is not getting deleted(and we also do not get any errors).



are you calling ?



> 3) While using facets, we are getting the stemmed versions of the
> corresponding words in the faceted fields - how do we get the 'original'
> word?
> As in, 'intenti' for 'intentional' etc



Faceting works on the indexed terms - if the field has stemming applied, 
the facets will be stemmed.
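
A small Python sketch of why that happens, under loud assumptions: `toy_stem` is a crude made-up suffix stripper (not the stemmer Solr uses) and `facet_counts` is a simplified stand-in for counting documents per indexed term.

```python
from collections import Counter


def toy_stem(word):
    """Hypothetical stemmer stand-in: strips a few suffixes."""
    for suffix in ("onal", "ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def facet_counts(values, analyze):
    """Count documents per indexed term, given an analysis function."""
    counts = Counter()
    for value in values:
        # Each document counts at most once per term.
        counts.update(set(analyze(value)))
    return counts


def stemmed(value):
    """Analysis for a stemmed text field."""
    return [toy_stem(w) for w in value.lower().split()]


def raw(value):
    """Analysis for an untokenized string field: the value is the term."""
    return [value]


if __name__ == "__main__":
    values = ["intentional", "intentional", "accidental"]
    print(facet_counts(values, stemmed))
    print(facet_counts(values, raw))
```

Faceting on the stemmed field yields terms like 'intenti', while the same values in an untokenized field facet as the original words.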


If you need to have stemming in some cases and the direct string in 
other cases, you can use 




Re: Solr - rudimentary problems

2007-09-16 Thread Venkatraman S
Kindly note again: we are using Embedded Solr.

On 9/17/07, Ryan McKinley <[EMAIL PROTECTED]> wrote:
>
> Venkatraman S wrote:
> > We are using Lucene and are migrating to Solr 1.2 (we are using Embedded
> > Solr). During this process we are stumbling on certain problems :
> >
> > 1) IF the same document is added again, then it it getting added in the
> > index again(duplicated); inspite of the fact that the IDs are unique across
> > documents. This document should be updated in the Index.
> >  The corresponding entry for this field in schema.xml is :
> >   > stored="true" multiValued="false"  required="true"/>
> >
>
> Do you have:
>   <uniqueKey>id</uniqueKey>


yes - i am using it

> > 2) Also, at the time of deleting a document, by providing its ID(exactly
> > similar to the deleteById proc in the Embedded Solr example) , we find that
> > the document is not getting deleted(and we also do not get any errors).
> >
>
> are you calling ?


Yes - exactly similar to the code mentioned in the embedded solr example in
the wiki .

> > 3) While using facets, we are getting the stemmed versions of the
> > corresponding words in the faceted fields - how do we get the 'original'
> > word?
> > As in, 'intenti' for 'intentional' etc
> >
>
> Faceting works on the indexed terms - if the field has stemming applied,
> the facets will be stemmed.
>
> If you need to have stemming in some cases and the direct string in
> other cases, you can use 
>
>
Yea -got this. i rather commented the
  in a

--


commit, concurrency, full text search

2007-09-16 Thread Dilip.TS
Hi,

1) How does commit work with multiple requests?
2) Does Solr handle concurrency during updates?
3) Does Solr support something like this: if I enclose the keywords in quotes,
the search looks for exactly those keywords together, as
Google does? For example, if I enter "java programming", it
should search for the phrase as a whole instead of breaking it apart.

Any help would be highly appreciated.

Thanks in advance.

Regards,
Dilip