Modelling Access Control

2010-10-23 Thread Paul Carey
Hi

My domain model is made of users that have access to projects which
are composed of items. I'm hoping to use Solr and would like to make
sure that searches only return results for items that users have
access to.

I've looked over some of the older posts on this mailing list about
access control and saw a suggestion along the lines of
acl:<user id> AND (actual query).

While this obviously works, there are a couple of niggles. Every item
must have a list of valid user ids (typically less than 100 in my
case). Every time a collaborator is added to or removed from a
project, I need to update every item in that project. This will
typically be fewer than 1000 items, so I guess it's no big deal.

I wondered if the following might be a reasonable alternative,
assuming the number of projects to which a user has access is lower
than a certain bound.
(acl:<project id> OR acl:<project id> OR ... ) AND (actual query)

When the numbers are small - e.g. each user has access to ~20 projects
and each project has ~20 collaborators - is one approach preferable
over another? And when outliers exist - e.g. a project with 2000
collaborators, or a user with access to 2000 projects - is one
approach more liable to fail than the other?

Many thanks

Paul


Re: A bug in ComplexPhraseQuery ?

2010-10-23 Thread jmr


iorixxx wrote:
> 
>> > class="org.apache.solr.search.ComplexPhraseQParserPlugin">
>>     > name="inOrder">false
>>   
>> 
> 
> I added this change to SOLR-1604; can you test it and give us feedback?
> 
> 

Many thanks. I'll test this quite soon and let you know.
J-Michel
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/A-bug-in-ComplexPhraseQuery-tp1744659p1757145.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: xpath processing

2010-10-23 Thread Ben Boggess
> processor="FileListEntityProcessor" fileName=".*xml" recursive="true" 

Shouldn't this be fileName="*.xml"?

Ben

On Oct 22, 2010, at 10:52 PM, pghorp...@ucla.edu wrote:

> <dataConfig>
>   <dataSource type="FileDataSource" />
>   <document>
>     <entity name="f" processor="FileListEntityProcessor" fileName=".*xml" recursive="true"
>             baseDir="C:\data\sample_records\mods\starr">
>       <entity name="x" processor="XPathEntityProcessor" dataSource="..."
>               url="${f.fileAbsolutePath}" stream="false" forEach="/mods"
>               transformer="DateFormatTransformer,RegexTransformer,TemplateTransformer">
>         <!-- field mappings lost in the archived message -->
>       </entity>
>     </entity>
>   </document>
> </dataConfig>
> 
> Quoting Ken Stanley :
> 
>> Parinita,
>> 
>> In its simplest form, what does your entity definition for DIH look like;
>> also, what does one record from your xml look like? We need more information
>> before we can really be of any help. :)
>> 
>> - Ken
>> 
>> It looked like something resembling white marble, which was
>> probably what it was: something resembling white marble.
>>-- Douglas Adams, "The Hitchhikers Guide to the Galaxy"
>> 
>> 
>> On Fri, Oct 22, 2010 at 8:00 PM,  wrote:
>> 
>>> Quoting pghorp...@ucla.edu:
>>> Can someone help me please?
>>> 
>>> 
 I am trying to import mods xml data in solr using  the xml/http datasource
 
 This does not work with XPathEntityProcessor of the data import handler
 xpath="/mods/name/namepa...@type = 'date']"
 
 I actually have 143 records with type attribute as 'date' for element
 namePart.
 
 Thank you
 Parinita
 
 
>>> 
>>> 
>> 
> 
> 


Re: Spatial

2010-10-23 Thread Grant Ingersoll

On Oct 20, 2010, at 12:14 PM, Pradeep Singh wrote:

> Thanks for your response Grant.
> 
> I already have the bounding box based implementation in place. And on a
> document base of around 350K it is super fast.
> 
> What about a document base of millions of documents? While a tier based
> approach will narrow down the document space significantly this concern
> might be misplaced because there are other numeric range queries I am going
> to run anyway which don't have anything to do with spatial query. But the
> keyword here is numeric range query based on NumericField, which is going to
> be significantly faster than regular number based queries. I see that the
> dynamic field type _latLon is of type double and not tdouble by default. Can
> I have your input about that decision?

It's just an example.  There shouldn't be any problem with using tdouble (or
tfloat if you don't need the precision).


> 
> -Pradeep
> 
> On Tue, Oct 19, 2010 at 6:10 PM, Grant Ingersoll wrote:
> 
>> 
>> On Oct 19, 2010, at 6:23 PM, Pradeep Singh wrote:
>> 
>>> https://issues.apache.org/jira/browse/LUCENE-2519
>>> 
>>> If I change my code as per 2519
>>> 
>>> to have this  -
>>> 
>>> public double[] coords(double latitude, double longitude) {
>>>   double rlat = Math.toRadians(latitude);
>>>   double rlong = Math.toRadians(longitude);
>>>   double nlat = rlong * Math.cos(rlat);
>>>   return new double[]{nlat, rlong};
>>> 
>>> }
>>> 
>>> 
>>> return this -
>>> 
>>> x = (gamma - gamma[0]) cos(phi)
>>> y = phi
>>> 
>>> would it make it give correct results? Correct projections, tier ids?
>> 
>> I'm not sure.  I have a lot of doubt around that code.  After making that
>> correction, I spent several days trying to get the tests to pass and
>> ultimately gave up.  Does that mean it is wrong?  I don't know.  I just
>> don't have enough confidence to recommend it given that the tests I was
>> asking it to do I could verify through other tools.  Personally, I would
>> recommend seeing if one of the non-tier based approaches suffices for your
>> situation and use that.
>> 
>> -Grant

--
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem docs using Solr/Lucene:
http://www.lucidimagination.com/search



Re: Import From MYSQL database

2010-10-23 Thread do3do3

What I know is: define your fields in the schema.xml file, build a
database_conf.xml file which contains the identification for your database,
and finally define the DataImportHandler in the solrconfig.xml file.
I put a sample of what you should do in the first post in this topic; you can
check it.
If I learn any additional information I will tell you.
Good luck
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Import-From-MYSQL-database-tp1738753p1756744.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Import From MYSQL database

2010-10-23 Thread do3do3

I found these files but I can't find any useful info inside them; what I found
is the GET command in the HTTP request.
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Import-From-MYSQL-database-tp1738753p1756778.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: How to index long words with StandardTokenizerFactory?

2010-10-23 Thread Sergey Bartunov
Here are all the files: http://rghost.net/3016862

1) StandardAnalyzer.java, StandardTokenizer.java - patched files from
lucene-2.9.3
2) I patch these files and build lucene by typing "ant"
3) I replace lucene-core-2.9.3.jar in solr/lib/ by my
lucene-core-2.9.3-dev.jar that I'd just compiled
4) then I do "ant compile" and "ant dist" in the solr folder
5) after that I recompile solr/example/webapps/solr.war with my new
solr and lucene-core jars
6) I put my schema.xml in solr/example/solr/conf/
7) then I do "java -jar start.jar" in solr/example
8) index big_post.xml
9) trying to find this document by "curl
http://localhost:8983/solr/select?q=body:big*" (big_post.xml contains
a long word biga...)
10) solr returns nothing

On 23 October 2010 02:43, Steven A Rowe  wrote:
> Hi Sergey,
>
> What does your ~34kb field value look like?  Does StandardTokenizer think 
> it's just one token?
>
> What doesn't work?  What happens?
>
> Steve
>
>> -Original Message-
>> From: Sergey Bartunov [mailto:sbos@gmail.com]
>> Sent: Friday, October 22, 2010 3:18 PM
>> To: solr-user@lucene.apache.org
>> Subject: Re: How to index long words with StandardTokenizerFactory?
>>
>> I'm using Solr 1.4.1. Now I'm successed with replacing lucene-core jar
>> but maxTokenValue seems to be used in very strange way. Currenty for
>> me it's set to 1024*1024, but I couldn't index a field with just size
>> of ~34kb. I understand that it's a little weird to index such a big
>> data, but I just want to know it doesn't work
>>
>> On 22 October 2010 20:36, Steven A Rowe  wrote:
>> > Hi Sergey,
>> >
>> > I've opened an issue to add a maxTokenLength param to the
>> StandardTokenizerFactory configuration:
>> >
>> >        https://issues.apache.org/jira/browse/SOLR-2188
>> >
>> > I'll work on it this weekend.
>> >
>> > Are you using Solr 1.4.1?  I ask because of your mention of Lucene
>> 2.9.3.  I'm not sure there will ever be a Solr 1.4.2 release.  I plan on
>> targeting Solr 3.1 and 4.0 for the SOLR-2188 fix.
>> >
>> > I'm not sure why you didn't get the results you wanted with your Lucene
>> hack - is it possible you have other Lucene jars in your Solr classpath?
>> >
>> > Steve
>> >
>> >> -Original Message-
>> >> From: Sergey Bartunov [mailto:sbos@gmail.com]
>> >> Sent: Friday, October 22, 2010 12:08 PM
>> >> To: solr-user@lucene.apache.org
>> >> Subject: How to index long words with StandardTokenizerFactory?
>> >>
>> >> I'm trying to force solr to index words which length is more than 255
>> >> symbols (this constant is DEFAULT_MAX_TOKEN_LENGTH in lucene
>> >> StandardAnalyzer.java) using StandardTokenizerFactory as 'filter' tag
>> >> in schema configuration XML. Specifying the maxTokenLength attribute
>> >> won't work.
>> >>
>> >> I'd tried to make the dirty hack: I downloaded lucene-core-2.9.3 src
>> >> and changed the DEFAULT_MAX_TOKEN_LENGTH to 100, built it to jar
>> >> and replaced original lucene-core jar in solr /lib. But seems like
>> >> that it had bring no effect.


Re: Solr Javascript+JSON not optimized for SEO

2010-10-23 Thread PeterKerk

Unfortunately it's not online yet, but is there anything I can clarify in more
detail?

Thanks!
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/Solr-Javascript-JSON-not-optimized-for-SEO-tp1751641p1758054.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: How to index long words with StandardTokenizerFactory?

2010-10-23 Thread Ahmet Arslan
Did you delete the folder Jetty_0_0_0_0_8983_solr.war_** under 
apache-solr-1.4.1\example\work?

--- On Sat, 10/23/10, Sergey Bartunov  wrote:

> From: Sergey Bartunov 
> Subject: Re: How to index long words with StandardTokenizerFactory?
> To: solr-user@lucene.apache.org
> Date: Saturday, October 23, 2010, 3:56 PM
> Here are all the files: http://rghost.net/3016862
> 
> 1) StandardAnalyzer.java, StandardTokenizer.java - patched
> files from
> lucene-2.9.3
> 2) I patch these files and build lucene by typing "ant"
> 3) I replace lucene-core-2.9.3.jar in solr/lib/ by my
> lucene-core-2.9.3-dev.jar that I'd just compiled
> 4) than I do "ant compile" and "ant dist" in solr folder
> 5) after that I recompile solr/example/webapps/solr.war
> with my new
> solr and lucene-core jars
> 6) I put my schema.xml in solr/example/solr/conf/
> 7) then I do "java -jar start.jar" in solr/example
> 8) index big_post.xml
> 9) trying to find this document by "curl
> http://localhost:8983/solr/select?q=body:big*"
> (big_post.xml contains
> a long word biga...)
> 10) solr returns nothing
> 
> On 23 October 2010 02:43, Steven A Rowe 
> wrote:
> > Hi Sergey,
> >
> > What does your ~34kb field value look like?  Does
> StandardTokenizer think it's just one token?
> >
> > What doesn't work?  What happens?
> >
> > Steve
> >
> >> -Original Message-
> >> From: Sergey Bartunov [mailto:sbos@gmail.com]
> >> Sent: Friday, October 22, 2010 3:18 PM
> >> To: solr-user@lucene.apache.org
> >> Subject: Re: How to index long words with
> StandardTokenizerFactory?
> >>
> >> I'm using Solr 1.4.1. Now I'm successed with
> replacing lucene-core jar
> >> but maxTokenValue seems to be used in very strange
> way. Currenty for
> >> me it's set to 1024*1024, but I couldn't index a
> field with just size
> >> of ~34kb. I understand that it's a little weird to
> index such a big
> >> data, but I just want to know it doesn't work
> >>
> >> On 22 October 2010 20:36, Steven A Rowe 
> wrote:
> >> > Hi Sergey,
> >> >
> >> > I've opened an issue to add a maxTokenLength
> param to the
> >> StandardTokenizerFactory configuration:
> >> >
> >> >        https://issues.apache.org/jira/browse/SOLR-2188
> >> >
> >> > I'll work on it this weekend.
> >> >
> >> > Are you using Solr 1.4.1?  I ask because of
> your mention of Lucene
> >> 2.9.3.  I'm not sure there will ever be a Solr
> 1.4.2 release.  I plan on
> >> targeting Solr 3.1 and 4.0 for the SOLR-2188 fix.
> >> >
> >> > I'm not sure why you didn't get the results
> you wanted with your Lucene
> >> hack - is it possible you have other Lucene jars
> in your Solr classpath?
> >> >
> >> > Steve
> >> >
> >> >> -Original Message-
> >> >> From: Sergey Bartunov [mailto:sbos@gmail.com]
> >> >> Sent: Friday, October 22, 2010 12:08 PM
> >> >> To: solr-user@lucene.apache.org
> >> >> Subject: How to index long words with
> StandardTokenizerFactory?
> >> >>
> >> >> I'm trying to force solr to index words
> which length is more than 255
> >> >> symbols (this constant is
> DEFAULT_MAX_TOKEN_LENGTH in lucene
> >> >> StandardAnalyzer.java) using
> StandardTokenizerFactory as 'filter' tag
> >> >> in schema configuration XML. Specifying
> the maxTokenLength attribute
> >> >> won't work.
> >> >>
> >> >> I'd tried to make the dirty hack: I
> downloaded lucene-core-2.9.3 src
> >> >> and changed the DEFAULT_MAX_TOKEN_LENGTH
> to 100, built it to jar
> >> >> and replaced original lucene-core jar in
> solr /lib. But seems like
> >> >> that it had bring no effect.
> 





Re: Modelling Access Control

2010-10-23 Thread Israel Ekpo
Hi Paul,

Regardless of how you implement it, I would recommend you use filter queries
for the permissions check rather than making it part of the main query.
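
For illustration, a minimal SolrJ sketch of that suggestion (the "acl" field
name and the ids here are invented, not taken from Paul's schema):

import org.apache.solr.client.solrj.SolrQuery;

public class AclFilterExample {
    public static void main(String[] args) {
        // The user's actual query, untouched by any permission logic.
        SolrQuery query = new SolrQuery("title:report");
        // The ACL check rides along as a filter query: it is cached
        // independently of q and is excluded from relevance scoring.
        query.addFilterQuery("acl:(101 OR 102 OR 103)");
        System.out.println(query); // q=...&fq=acl%3A(101+OR+102+OR+103)
    }
}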

On Sat, Oct 23, 2010 at 4:03 AM, Paul Carey  wrote:

> Hi
>
> My domain model is made of users that have access to projects which
> are composed of items. I'm hoping to use Solr and would like to make
> sure that searches only return results for items that users have
> access to.
>
> I've looked over some of the older posts on this mailing list about
> access control and saw a suggestion along the lines of
> acl:<user id> AND (actual query).
>
> While this obviously works, there are a couple of niggles. Every item
> must have a list of valid user ids (typically less than 100 in my
> case). Every time a collaborator is added to or removed from a
> project, I need to update every item in that project. This will
> typically be fewer than 1000 items, so I guess is no big deal.
>
> I wondered if the following might be a reasonable alternative,
> assuming the number of projects to which a user has access is lower
> than a certain bound.
> (acl:<project id> OR acl:<project id> OR ... ) AND (actual query)
>
> When the numbers are small - e.g. each user has access to ~20 projects
> and each project has ~20 collaborators - is one approach preferable
> over another? And when outliers exist - e.g. a project with 2000
> collaborators, or a user with access to 2000 projects - is one
> approach more liable to fail than the other?
>
> Many thanks
>
> Paul
>



-- 
°O°
"Good Enough" is not good enough.
To give anything less than your best is to sacrifice the gift.
Quality First. Measure Twice. Cut Once.
http://www.israelekpo.com/


Re: How to index long words with StandardTokenizerFactory?

2010-10-23 Thread Sergey Bartunov
Yes, I did. It didn't help.

On 23 October 2010 17:45, Ahmet Arslan  wrote:
> Did you delete the folder Jetty_0_0_0_0_8983_solr.war_** under 
> apache-solr-1.4.1\example\work?
>
> --- On Sat, 10/23/10, Sergey Bartunov  wrote:
>
>> From: Sergey Bartunov 
>> Subject: Re: How to index long words with StandardTokenizerFactory?
>> To: solr-user@lucene.apache.org
>> Date: Saturday, October 23, 2010, 3:56 PM
>> Here are all the files: http://rghost.net/3016862
>>
>> 1) StandardAnalyzer.java, StandardTokenizer.java - patched
>> files from
>> lucene-2.9.3
>> 2) I patch these files and build lucene by typing "ant"
>> 3) I replace lucene-core-2.9.3.jar in solr/lib/ by my
>> lucene-core-2.9.3-dev.jar that I'd just compiled
>> 4) than I do "ant compile" and "ant dist" in solr folder
>> 5) after that I recompile solr/example/webapps/solr.war
>> with my new
>> solr and lucene-core jars
>> 6) I put my schema.xml in solr/example/solr/conf/
>> 7) then I do "java -jar start.jar" in solr/example
>> 8) index big_post.xml
>> 9) trying to find this document by "curl
>> http://localhost:8983/solr/select?q=body:big*"
>> (big_post.xml contains
>> a long word biga...)
>> 10) solr returns nothing
>>
>> On 23 October 2010 02:43, Steven A Rowe 
>> wrote:
>> > Hi Sergey,
>> >
>> > What does your ~34kb field value look like?  Does
>> StandardTokenizer think it's just one token?
>> >
>> > What doesn't work?  What happens?
>> >
>> > Steve
>> >
>> >> -Original Message-
>> >> From: Sergey Bartunov [mailto:sbos@gmail.com]
>> >> Sent: Friday, October 22, 2010 3:18 PM
>> >> To: solr-user@lucene.apache.org
>> >> Subject: Re: How to index long words with
>> StandardTokenizerFactory?
>> >>
>> >> I'm using Solr 1.4.1. Now I'm successed with
>> replacing lucene-core jar
>> >> but maxTokenValue seems to be used in very strange
>> way. Currenty for
>> >> me it's set to 1024*1024, but I couldn't index a
>> field with just size
>> >> of ~34kb. I understand that it's a little weird to
>> index such a big
>> >> data, but I just want to know it doesn't work
>> >>
>> >> On 22 October 2010 20:36, Steven A Rowe 
>> wrote:
>> >> > Hi Sergey,
>> >> >
>> >> > I've opened an issue to add a maxTokenLength
>> param to the
>> >> StandardTokenizerFactory configuration:
>> >> >
>> >> >        https://issues.apache.org/jira/browse/SOLR-2188
>> >> >
>> >> > I'll work on it this weekend.
>> >> >
>> >> > Are you using Solr 1.4.1?  I ask because of
>> your mention of Lucene
>> >> 2.9.3.  I'm not sure there will ever be a Solr
>> 1.4.2 release.  I plan on
>> >> targeting Solr 3.1 and 4.0 for the SOLR-2188 fix.
>> >> >
>> >> > I'm not sure why you didn't get the results
>> you wanted with your Lucene
>> >> hack - is it possible you have other Lucene jars
>> in your Solr classpath?
>> >> >
>> >> > Steve
>> >> >
>> >> >> -Original Message-
>> >> >> From: Sergey Bartunov [mailto:sbos@gmail.com]
>> >> >> Sent: Friday, October 22, 2010 12:08 PM
>> >> >> To: solr-user@lucene.apache.org
>> >> >> Subject: How to index long words with
>> StandardTokenizerFactory?
>> >> >>
>> >> >> I'm trying to force solr to index words
>> which length is more than 255
>> >> >> symbols (this constant is
>> DEFAULT_MAX_TOKEN_LENGTH in lucene
>> >> >> StandardAnalyzer.java) using
>> StandardTokenizerFactory as 'filter' tag
>> >> >> in schema configuration XML. Specifying
>> the maxTokenLength attribute
>> >> >> won't work.
>> >> >>
>> >> >> I'd tried to make the dirty hack: I
>> downloaded lucene-core-2.9.3 src
>> >> >> and changed the DEFAULT_MAX_TOKEN_LENGTH
>> to 100, built it to jar
>> >> >> and replaced original lucene-core jar in
>> solr /lib. But seems like
>> >> >> that it had bring no effect.
>>
>
>
>
>


Re: How to index long words with StandardTokenizerFactory?

2010-10-23 Thread Ahmet Arslan
I think you should replace your new lucene-core-2.9.3-dev.jar in
\apache-solr-1.4.1\lib and then create a new solr.war under
\apache-solr-1.4.1\dist, and copy this new solr.war to
solr/example/webapps/solr.war.

--- On Sat, 10/23/10, Sergey Bartunov  wrote:

> From: Sergey Bartunov 
> Subject: Re: How to index long words with StandardTokenizerFactory?
> To: solr-user@lucene.apache.org
> Date: Saturday, October 23, 2010, 5:45 PM
> Yes. I did. Won't help.
> 
> On 23 October 2010 17:45, Ahmet Arslan 
> wrote:
> > Did you delete the folder
> Jetty_0_0_0_0_8983_solr.war_** under
> apache-solr-1.4.1\example\work?
> >
> > --- On Sat, 10/23/10, Sergey Bartunov 
> wrote:
> >
> >> From: Sergey Bartunov 
> >> Subject: Re: How to index long words with
> StandardTokenizerFactory?
> >> To: solr-user@lucene.apache.org
> >> Date: Saturday, October 23, 2010, 3:56 PM
> >> Here are all the files: http://rghost.net/3016862
> >>
> >> 1) StandardAnalyzer.java, StandardTokenizer.java -
> patched
> >> files from
> >> lucene-2.9.3
> >> 2) I patch these files and build lucene by typing
> "ant"
> >> 3) I replace lucene-core-2.9.3.jar in solr/lib/ by
> my
> >> lucene-core-2.9.3-dev.jar that I'd just compiled
> >> 4) than I do "ant compile" and "ant dist" in solr
> folder
> >> 5) after that I recompile
> solr/example/webapps/solr.war
> >> with my new
> >> solr and lucene-core jars
> >> 6) I put my schema.xml in solr/example/solr/conf/
> >> 7) then I do "java -jar start.jar" in
> solr/example
> >> 8) index big_post.xml
> >> 9) trying to find this document by "curl
> >> http://localhost:8983/solr/select?q=body:big*"
> >> (big_post.xml contains
> >> a long word biga...)
> >> 10) solr returns nothing
> >>
> >> On 23 October 2010 02:43, Steven A Rowe 
> >> wrote:
> >> > Hi Sergey,
> >> >
> >> > What does your ~34kb field value look like?
>  Does
> >> StandardTokenizer think it's just one token?
> >> >
> >> > What doesn't work?  What happens?
> >> >
> >> > Steve
> >> >
> >> >> -Original Message-
> >> >> From: Sergey Bartunov [mailto:sbos@gmail.com]
> >> >> Sent: Friday, October 22, 2010 3:18 PM
> >> >> To: solr-user@lucene.apache.org
> >> >> Subject: Re: How to index long words
> with
> >> StandardTokenizerFactory?
> >> >>
> >> >> I'm using Solr 1.4.1. Now I'm successed
> with
> >> replacing lucene-core jar
> >> >> but maxTokenValue seems to be used in
> very strange
> >> way. Currenty for
> >> >> me it's set to 1024*1024, but I couldn't
> index a
> >> field with just size
> >> >> of ~34kb. I understand that it's a little
> weird to
> >> index such a big
> >> >> data, but I just want to know it doesn't
> work
> >> >>
> >> >> On 22 October 2010 20:36, Steven A Rowe
> 
> >> wrote:
> >> >> > Hi Sergey,
> >> >> >
> >> >> > I've opened an issue to add a
> maxTokenLength
> >> param to the
> >> >> StandardTokenizerFactory configuration:
> >> >> >
> >> >> >        https://issues.apache.org/jira/browse/SOLR-2188
> >> >> >
> >> >> > I'll work on it this weekend.
> >> >> >
> >> >> > Are you using Solr 1.4.1?  I ask
> because of
> >> your mention of Lucene
> >> >> 2.9.3.  I'm not sure there will ever be
> a Solr
> >> 1.4.2 release.  I plan on
> >> >> targeting Solr 3.1 and 4.0 for the
> SOLR-2188 fix.
> >> >> >
> >> >> > I'm not sure why you didn't get the
> results
> >> you wanted with your Lucene
> >> >> hack - is it possible you have other
> Lucene jars
> >> in your Solr classpath?
> >> >> >
> >> >> > Steve
> >> >> >
> >> >> >> -Original Message-
> >> >> >> From: Sergey Bartunov [mailto:sbos@gmail.com]
> >> >> >> Sent: Friday, October 22, 2010
> 12:08 PM
> >> >> >> To: solr-user@lucene.apache.org
> >> >> >> Subject: How to index long words
> with
> >> StandardTokenizerFactory?
> >> >> >>
> >> >> >> I'm trying to force solr to
> index words
> >> which length is more than 255
> >> >> >> symbols (this constant is
> >> DEFAULT_MAX_TOKEN_LENGTH in lucene
> >> >> >> StandardAnalyzer.java) using
> >> StandardTokenizerFactory as 'filter' tag
> >> >> >> in schema configuration XML.
> Specifying
> >> the maxTokenLength attribute
> >> >> >> won't work.
> >> >> >>
> >> >> >> I'd tried to make the dirty
> hack: I
> >> downloaded lucene-core-2.9.3 src
> >> >> >> and changed the
> DEFAULT_MAX_TOKEN_LENGTH
> >> to 100, built it to jar
> >> >> >> and replaced original
> lucene-core jar in
> >> solr /lib. But seems like
> >> >> >> that it had bring no effect.
> >>
> >
> >
> >
> >
> 





Re: How to index long words with StandardTokenizerFactory?

2010-10-23 Thread Yonik Seeley
On Fri, Oct 22, 2010 at 12:07 PM, Sergey Bartunov  wrote:
> I'm trying to force solr to index words which length is more than 255

If the field is not a text field, Solr's default analyzer is used,
which currently limits the token to 256 bytes.
Out of curiosity, what's your use case that you really need a single 34KB token?

-Yonik
http://www.lucidimagination.com


Re: How to index long words with StandardTokenizerFactory?

2010-10-23 Thread Sergey Bartunov
Look at the schema.xml that I provided. I use my own "text_block" type
which is derived from "TextField", and I force the use of
StandardTokenizerFactory via the tokenizer tag.

If I use StrField type there are no problems with big data indexing.
The problem is in the tokenizer.
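
A quick way to check the tokenizer in isolation (a sketch against the Lucene
2.9 API used in this thread; the long word is generated rather than taken
from big_post.xml):

import java.io.StringReader;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

public class LongTokenCheck {
    public static void main(String[] args) throws Exception {
        // Build one ~34KB "word" like the one in this thread.
        StringBuilder word = new StringBuilder("big");
        for (int i = 0; i < 34000; i++) {
            word.append('a');
        }
        StandardTokenizer tok = new StandardTokenizer(new StringReader(word.toString()));
        // With the default limit (255) an over-long token is silently
        // discarded; raising the limit should let it through whole.
        tok.setMaxTokenLength(Integer.MAX_VALUE);
        TermAttribute term = tok.addAttribute(TermAttribute.class);
        while (tok.incrementToken()) {
            System.out.println("token length: " + term.termLength());
        }
        tok.close();
    }
}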

On 23 October 2010 18:55, Yonik Seeley  wrote:
> On Fri, Oct 22, 2010 at 12:07 PM, Sergey Bartunov  wrote:
>> I'm trying to force solr to index words which length is more than 255
>
> If the field is not a text field, the Solr's default analyzer is used,
> which currently limits the token to 256 bytes.
> Out of curiosity, what's your usecase that you really need a single 34KB 
> token?
>
> -Yonik
> http://www.lucidimagination.com
>


Re: How to index long words with StandardTokenizerFactory?

2010-10-23 Thread Sergey Bartunov
This is exactly what I did. Look:

>> >> 3) I replace lucene-core-2.9.3.jar in solr/lib/ by
>> my
>> >> lucene-core-2.9.3-dev.jar that I'd just compiled
>> >> 4) than I do "ant compile" and "ant dist" in solr
>> folder
>> >> 5) after that I recompile
>> solr/example/webapps/solr.war

On 23 October 2010 18:53, Ahmet Arslan  wrote:
> I think you should replace your new lucene-core-2.9.3-dev.jar in 
> \apache-solr-1.4.1\lib and then create a new solr.war under 
> \apache-solr-1.4.1\dist. And copy this new solr.war to 
> solr/example/webapps/solr.war
>
> --- On Sat, 10/23/10, Sergey Bartunov  wrote:
>
>> From: Sergey Bartunov 
>> Subject: Re: How to index long words with StandardTokenizerFactory?
>> To: solr-user@lucene.apache.org
>> Date: Saturday, October 23, 2010, 5:45 PM
>> Yes. I did. Won't help.
>>
>> On 23 October 2010 17:45, Ahmet Arslan 
>> wrote:
>> > Did you delete the folder
>> Jetty_0_0_0_0_8983_solr.war_** under
>> apache-solr-1.4.1\example\work?
>> >
>> > --- On Sat, 10/23/10, Sergey Bartunov 
>> wrote:
>> >
>> >> From: Sergey Bartunov 
>> >> Subject: Re: How to index long words with
>> StandardTokenizerFactory?
>> >> To: solr-user@lucene.apache.org
>> >> Date: Saturday, October 23, 2010, 3:56 PM
>> >> Here are all the files: http://rghost.net/3016862
>> >>
>> >> 1) StandardAnalyzer.java, StandardTokenizer.java -
>> patched
>> >> files from
>> >> lucene-2.9.3
>> >> 2) I patch these files and build lucene by typing
>> "ant"
>> >> 3) I replace lucene-core-2.9.3.jar in solr/lib/ by
>> my
>> >> lucene-core-2.9.3-dev.jar that I'd just compiled
>> >> 4) than I do "ant compile" and "ant dist" in solr
>> folder
>> >> 5) after that I recompile
>> solr/example/webapps/solr.war
>> >> with my new
>> >> solr and lucene-core jars
>> >> 6) I put my schema.xml in solr/example/solr/conf/
>> >> 7) then I do "java -jar start.jar" in
>> solr/example
>> >> 8) index big_post.xml
>> >> 9) trying to find this document by "curl
>> >> http://localhost:8983/solr/select?q=body:big*"
>> >> (big_post.xml contains
>> >> a long word biga...)
>> >> 10) solr returns nothing
>> >>
>> >> On 23 October 2010 02:43, Steven A Rowe 
>> >> wrote:
>> >> > Hi Sergey,
>> >> >
>> >> > What does your ~34kb field value look like?
>>  Does
>> >> StandardTokenizer think it's just one token?
>> >> >
>> >> > What doesn't work?  What happens?
>> >> >
>> >> > Steve
>> >> >
>> >> >> -Original Message-
>> >> >> From: Sergey Bartunov [mailto:sbos@gmail.com]
>> >> >> Sent: Friday, October 22, 2010 3:18 PM
>> >> >> To: solr-user@lucene.apache.org
>> >> >> Subject: Re: How to index long words
>> with
>> >> StandardTokenizerFactory?
>> >> >>
>> >> >> I'm using Solr 1.4.1. Now I'm successed
>> with
>> >> replacing lucene-core jar
>> >> >> but maxTokenValue seems to be used in
>> very strange
>> >> way. Currenty for
>> >> >> me it's set to 1024*1024, but I couldn't
>> index a
>> >> field with just size
>> >> >> of ~34kb. I understand that it's a little
>> weird to
>> >> index such a big
>> >> >> data, but I just want to know it doesn't
>> work
>> >> >>
>> >> >> On 22 October 2010 20:36, Steven A Rowe
>> 
>> >> wrote:
>> >> >> > Hi Sergey,
>> >> >> >
>> >> >> > I've opened an issue to add a
>> maxTokenLength
>> >> param to the
>> >> >> StandardTokenizerFactory configuration:
>> >> >> >
>> >> >> >        https://issues.apache.org/jira/browse/SOLR-2188
>> >> >> >
>> >> >> > I'll work on it this weekend.
>> >> >> >
>> >> >> > Are you using Solr 1.4.1?  I ask
>> because of
>> >> your mention of Lucene
>> >> >> 2.9.3.  I'm not sure there will ever be
>> a Solr
>> >> 1.4.2 release.  I plan on
>> >> >> targeting Solr 3.1 and 4.0 for the
>> SOLR-2188 fix.
>> >> >> >
>> >> >> > I'm not sure why you didn't get the
>> results
>> >> you wanted with your Lucene
>> >> >> hack - is it possible you have other
>> Lucene jars
>> >> in your Solr classpath?
>> >> >> >
>> >> >> > Steve
>> >> >> >
>> >> >> >> -Original Message-
>> >> >> >> From: Sergey Bartunov [mailto:sbos@gmail.com]
>> >> >> >> Sent: Friday, October 22, 2010
>> 12:08 PM
>> >> >> >> To: solr-user@lucene.apache.org
>> >> >> >> Subject: How to index long words
>> with
>> >> StandardTokenizerFactory?
>> >> >> >>
>> >> >> >> I'm trying to force solr to
>> index words
>> >> which length is more than 255
>> >> >> >> symbols (this constant is
>> >> DEFAULT_MAX_TOKEN_LENGTH in lucene
>> >> >> >> StandardAnalyzer.java) using
>> >> StandardTokenizerFactory as 'filter' tag
>> >> >> >> in schema configuration XML.
>> Specifying
>> >> the maxTokenLength attribute
>> >> >> >> won't work.
>> >> >> >>
>> >> >> >> I'd tried to make the dirty
>> hack: I
>> >> downloaded lucene-core-2.9.3 src
>> >> >> >> and changed the
>> DEFAULT_MAX_TOKEN_LENGTH
>> >> to 100, built it to jar
>> >> >> >> and replaced original
>> lucene-core jar in
>> >> solr /lib. But seems like
>> >> >> >> that it had bring no effect.
>> >>
>> >
>> >
>> >
>> >
>>
>
>
>
>


Re: xpath processing

2010-10-23 Thread Ken Stanley
On Fri, Oct 22, 2010 at 11:52 PM,  wrote:

>
> <dataConfig>
>   <dataSource type="FileDataSource" />
>   <document>
>     <entity name="f" processor="FileListEntityProcessor" fileName=".*xml" recursive="true"
>             baseDir="C:\data\sample_records\mods\starr">
>       <entity name="x" processor="XPathEntityProcessor" dataSource="..."
>               url="${f.fileAbsolutePath}" stream="false" forEach="/mods"
>               transformer="DateFormatTransformer,RegexTransformer,TemplateTransformer">
>         <!-- field mappings lost in the archived message -->
>       </entity>
>     </entity>
>   </document>
> </dataConfig>
>


The documentation says you don't need a dataSource for your
XPathEntityProcessor entity; in my configuration, I have mine set to the
name of the top-level FileListEntityProcessor. Everything else looks fine.
Can you provide one record from your data? Also, are you getting any errors
in your log?

- Ken


Re: Modelling Access Control

2010-10-23 Thread Dennis Gearon
Two things will lessen the Solr administrative load:

1/ Follow examples of databases and *nix OSs. Give each user their own group, 
or set up groups that don't have regular users as OWNERS, but can have users 
assigned to the group to give them particular permissions. I.E. Roles, like 
publishers, reviewers, friends, etc.

2/ Put your ACL outside of Solr, using your server-side/command line language's 
object oriented properties. Force all searches to come from a single location 
in code (not sure how to do that), and make the piece of code check 
authentication and authorization.

This is what my research shows about how others do it, and how I plan to do
it. ANY insight others have on this, I really want to hear.
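
A hedged sketch of that "single location in code" idea from point 2 (all
names are invented, not from any particular framework):

import java.util.List;
import org.apache.solr.client.solrj.SolrQuery;

public class SecureSearchGateway {
    /**
     * The only code path allowed to build Solr queries, so the
     * authorization filter can never be bypassed by a caller.
     */
    public SolrQuery buildQuery(String userQuery, List<String> allowedGroups) {
        StringBuilder fq = new StringBuilder("group:(");
        for (int i = 0; i < allowedGroups.size(); i++) {
            if (i > 0) {
                fq.append(" OR ");
            }
            fq.append(allowedGroups.get(i));
        }
        fq.append(')');
        SolrQuery query = new SolrQuery(userQuery);
        query.addFilterQuery(fq.toString());
        return query;
    }
}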

Dennis Gearon

Signature Warning

It is always a good idea to learn from your own mistakes. It is usually a 
better idea to learn from others’ mistakes, so you do not have to make them 
yourself. from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036'

EARTH has a Right To Life,
  otherwise we all die.


--- On Sat, 10/23/10, Paul Carey  wrote:

> From: Paul Carey 
> Subject: Modelling Access Control
> To: solr-user@lucene.apache.org
> Date: Saturday, October 23, 2010, 1:03 AM
> Hi
> 
> My domain model is made of users that have access to
> projects which
> are composed of items. I'm hoping to use Solr and would
> like to make
> sure that searches only return results for items that users
> have
> access to.
> 
> I've looked over some of the older posts on this mailing
> list about
> access control and saw a suggestion along the lines of
> acl:<user id> AND (actual query).
> 
> While this obviously works, there are a couple of niggles.
> Every item
> must have a list of valid user ids (typically less than 100
> in my
> case). Every time a collaborator is added to or removed
> from a
> project, I need to update every item in that project. This
> will
> typically be fewer than 1000 items, so I guess is no big
> deal.
> 
> I wondered if the following might be a reasonable
> alternative,
> assuming the number of projects to which a user has access
> is lower
> than a certain bound.
> (acl:<project id> OR acl:<project id> OR ... )
> AND (actual query)
> 
> When the numbers are small - e.g. each user has access to
> ~20 projects
> and each project has ~20 collaborators - is one approach
> preferable
> over another? And when outliers exist - e.g. a project with
> 2000
> collaborators, or a user with access to 2000 projects - is
> one
> approach more liable to fail than the other?
> 
> Many thanks
> 
> Paul
>


Re: Modelling Access Control

2010-10-23 Thread Dennis Gearon
Why use filter queries?

Wouldn't reducing the set headed into the filters by putting it in the main 
query be faster? (A question to learn, since I do NOT know :-)

Dennis Gearon

Signature Warning

It is always a good idea to learn from your own mistakes. It is usually a 
better idea to learn from others’ mistakes, so you do not have to make them 
yourself. from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036'

EARTH has a Right To Life,
  otherwise we all die.


--- On Sat, 10/23/10, Israel Ekpo  wrote:

> From: Israel Ekpo 
> Subject: Re: Modelling Access Control
> To: solr-user@lucene.apache.org
> Date: Saturday, October 23, 2010, 7:01 AM
> Hi Paul,
> 
> Regardless of how you implement it, I would recommend you
> use filter queries
> for the permissions check rather than making it part of the
> main query.
> 
> On Sat, Oct 23, 2010 at 4:03 AM, Paul Carey 
> wrote:
> 
> > Hi
> >
> > My domain model is made of users that have access to
> projects which
> > are composed of items. I'm hoping to use Solr and
> would like to make
> > sure that searches only return results for items that
> users have
> > access to.
> >
> > I've looked over some of the older posts on this
> mailing list about
> > access control and saw a suggestion along the lines
> of
> > acl:<user id> AND (actual query).
> >
> > While this obviously works, there are a couple of
> niggles. Every item
> > must have a list of valid user ids (typically less
> than 100 in my
> > case). Every time a collaborator is added to or
> removed from a
> > project, I need to update every item in that project.
> This will
> > typically be fewer than 1000 items, so I guess is no
> big deal.
> >
> > I wondered if the following might be a reasonable
> alternative,
> > assuming the number of projects to which a user has
> access is lower
> > than a certain bound.
> > (acl:<project id> OR acl:<project id> OR
> > ... ) AND (actual query)
> >
> > When the numbers are small - e.g. each user has access
> to ~20 projects
> > and each project has ~20 collaborators - is one
> approach preferable
> > over another? And when outliers exist - e.g. a project
> with 2000
> > collaborators, or a user with access to 2000 projects
> - is one
> > approach more liable to fail than the other?
> >
> > Many thanks
> >
> > Paul
> >
> 
> 
> 
> -- 
> °O°
> "Good Enough" is not good enough.
> To give anything less than your best is to sacrifice the
> gift.
> Quality First. Measure Twice. Cut Once.
> http://www.israelekpo.com/
>


Re: Modelling Access Control

2010-10-23 Thread Dennis Gearon
Forgot to add:
3/ The external application code selects the GROUPS that the user has
permission to read (Solr will only serve up what is to be read?) and then
searches on those groups.


Dennis Gearon

Signature Warning

It is always a good idea to learn from your own mistakes. It is usually a 
better idea to learn from others’ mistakes, so you do not have to make them 
yourself. from 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036'

EARTH has a Right To Life,
  otherwise we all die.


--- On Sat, 10/23/10, Dennis Gearon  wrote:

> From: Dennis Gearon 
> Subject: Re: Modelling Access Control
> To: solr-user@lucene.apache.org
> Date: Saturday, October 23, 2010, 11:49 AM
> Two things will lessen the solr
> admininstrative load :
> 
> 1/ Follow examples of databases and *nix OSs. Give each
> user their own group, or set up groups that don't have
> regular users as OWNERS, but can have users assigned to the
> group to give them particular permissions. I.E. Roles, like
> publishers, reviewers, friends, etc.
> 
> 2/ Put your ACL outside of Solr, using your
> server-side/command line language's object oriented
> properties. Force all searches to come from a single
> location in code (not sure how to do that), and make the
> piece of code check authentication and authorization.
> 
> This is what my research shows how others do it, and how I
> plan to do it. ANY insight others have on this, I really
> want to hear.
> 
> Dennis Gearon
> 
> Signature Warning
> 
> It is always a good idea to learn from your own mistakes.
> It is usually a better idea to learn from others’
> mistakes, so you do not have to make them yourself. from 
> 'http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036'
> 
> EARTH has a Right To Life,
>   otherwise we all die.
> 
> 
> --- On Sat, 10/23/10, Paul Carey 
> wrote:
> 
> > From: Paul Carey 
> > Subject: Modelling Access Control
> > To: solr-user@lucene.apache.org
> > Date: Saturday, October 23, 2010, 1:03 AM
> > Hi
> > 
> > My domain model is made of users that have access to
> > projects which
> > are composed of items. I'm hoping to use Solr and
> would
> > like to make
> > sure that searches only return results for items that
> users
> > have
> > access to.
> > 
> > I've looked over some of the older posts on this
> mailing
> > list about
> > access control and saw a suggestion along the lines
> of
> > acl:<user id> AND (actual query).
> > 
> > While this obviously works, there are a couple of
> niggles.
> > Every item
> > must have a list of valid user ids (typically less
> than 100
> > in my
> > case). Every time a collaborator is added to or
> removed
> > from a
> > project, I need to update every item in that project.
> This
> > will
> > typically be fewer than 1000 items, so I guess is no
> big
> > deal.
> > 
> > I wondered if the following might be a reasonable
> > alternative,
> > assuming the number of projects to which a user has
> access
> > is lower
> > than a certain bound.
> > (acl:<project id> OR acl:<project id> OR
> > ... )
> > AND (actual query)
> > 
> > When the numbers are small - e.g. each user has access
> to
> > ~20 projects
> > and each project has ~20 collaborators - is one
> approach
> > preferable
> > over another? And when outliers exist - e.g. a project
> with
> > 2000
> > collaborators, or a user with access to 2000 projects
> - is
> > one
> > approach more liable to fail than the other?
> > 
> > Many thanks
> > 
> > Paul
> >
>


Re: Multiple indexes inside a single core

2010-10-23 Thread Erick Erickson
Ah, I should have read more carefully...

I remember this being discussed on the dev list, and I thought there might be
a Jira attached but I sure can't find it.

If you're willing to work on it, you might hop over to the solr dev list and
start a discussion, maybe ask for a place to start. I'm sure some of the devs
have thought about this...

If nobody on the dev list says "There's already a JIRA on it", then you should
open one. The Jira issues are generally preferred when you start getting into
design because the comments are preserved for the next person who tries the
idea or makes changes, etc.

Best
Erick

On Wed, Oct 20, 2010 at 9:52 PM, Ben Boggess  wrote:

> Thanks Erick.  The problem with multiple cores is that the documents are
> scored independently in each core.  I would like to be able to search across
> both cores and have the scores 'normalized' in a way that's similar to what
> Lucene's MultiSearcher would do.  As far a I understand, multiple cores
> would likely result in seriously skewed scores in my case since the
> documents are not distributed evenly or randomly.  I could have one
> core/index with 20 million docs and another with 200.
>
> I've poked around in the code and this feature doesn't seem to exist.  I
> would be happy with finding a decent place to try to add it.  I'm not sure
> if there is a clean place for it.
>
> Ben
>
> On Oct 20, 2010, at 8:36 PM, Erick Erickson 
> wrote:
>
> > It seems to me that multiple cores are along the lines you
> > need, a single instance of Solr that can search across multiple
> > sub-indexes that do not necessarily share schemas, and are
> > independently maintainable..
> >
> > This might be a good place to start:
> http://wiki.apache.org/solr/CoreAdmin
> >
> > HTH
> > Erick
> >
> > On Wed, Oct 20, 2010 at 3:23 PM, ben boggess 
> wrote:
> >
> >> We are trying to convert a Lucene-based search solution to a
> >> Solr/Lucene-based solution.  The problem we have is that we currently
> have
> >> our data split into many indexes and Solr expects things to be in a
> single
> >> index unless you're sharding.  In addition to this, our indexes wouldn't
> >> work well using the distributed search functionality in Solr because the
> >> documents are not evenly or randomly distributed.  We are currently
> using
> >> Lucene's MultiSearcher to search over subsets of these indexes.
> >>
> >> I know this has been brought up a number of times in previous posts and
> the
> >> typical response is that the best thing to do is to convert everything
> into
> >> a single index.  One of the major reasons for having the indexes split
> up
> >> the way we do is because different types of data need to be indexed at
> >> different intervals.  You may need one index to be updated every 20
> minutes
> >> and another is only updated every week.  If we move to a single index,
> then
> >> we will constantly be warming and replacing searchers for the entire
> >> dataset, and will essentially render the searcher caches useless.  If we
> >> were able to have multiple indexes, they would each have a searcher and
> >> updates would be isolated to a subset of the data.
> >>
> >> The other problem is that we will likely need to shard this large single
> >> index and there isn't a clean way to shard randomly and evenly across
> the
> >> of
> >> the data.  We would, however like to shard a single data type.  If we
> could
> >> use multiple indexes, we would likely be also sharding a small sub-set
> of
> >> them.
> >>
> >> Thanks in advance,
> >>
> >> Ben
> >>
>


Re: FieldCache

2010-10-23 Thread Erick Erickson
Why do you want to? Basically, the caches are there to improve
#searching#. To search something, you must index it. Retrieving
it is usually a rare enough operation that caching is irrelevant.

This smells like an XY problem, see:
http://people.apache.org/~hossman/#xyproblem

If this seems like gibberish, could you explain your problem
a little more?

Best
Erick

On Thu, Oct 21, 2010 at 10:20 AM, Mathias Walter wrote:

> Hi,
>
> does a field which should be cached needs to be indexed?
>
> I have a binary field which is just stored. Retrieving it via
> FieldCache.DEFAULT.getTerms returns empty ByteRefs.
>
> Then I found the following post:
> http://www.mail-archive.com/d...@lucene.apache.org/msg05403.html
>
> How can I use the FieldCache with a binary field?
>
> --
> Kind regards,
> Mathias
>
>


Re: How to index long words with StandardTokenizerFactory?

2010-10-23 Thread Ahmet Arslan
Oops, I am sorry, I thought that solr/lib referred to solrhome/lib.

I just tested this and it seems that you have successfully increased the max
token length. You can verify this via the analysis.jsp page.

Despite analysis.jsp's output, it seems that some other mechanism is
preventing this huge token from being indexed. The response of
http://localhost:8983/solr/terms?terms.fl=body
does not contain that huge token.

If you are interested in only prefix queries, as a workaround, you can use
<filter class="solr.EdgeNGramFilterFactory" minGramSize="..." maxGramSize="..." />
at index time.  So the query (without the star)
solr/select?q=body:big will return that document.

By the way, for this particular task you don't need to edit the Lucene/Solr
distribution. You can use the class below with the standard pre-compiled
solr.war, by putting its jar into the SolrHome/lib directory.

package foo.solr.analysis;

import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.solr.analysis.BaseTokenizerFactory;
import java.io.Reader;

public class CustomStandardTokenizerFactory extends BaseTokenizerFactory {
  public StandardTokenizer create(Reader input) {
    // Same tokenizer as StandardTokenizerFactory, but with the
    // 255-character default token length limit lifted.
    final StandardTokenizer tokenizer = new StandardTokenizer(input);
    tokenizer.setMaxTokenLength(Integer.MAX_VALUE);
    return tokenizer;
  }
}


<fieldType name="text_block" class="solr.TextField">
  <analyzer>
    <tokenizer class="foo.solr.analysis.CustomStandardTokenizerFactory" />
  </analyzer>
</fieldType>

--- On Sat, 10/23/10, Sergey Bartunov  wrote:

> From: Sergey Bartunov 
> Subject: Re: How to index long words with StandardTokenizerFactory?
> To: solr-user@lucene.apache.org
> Date: Saturday, October 23, 2010, 6:01 PM
> This is exactly what I did. Look:
> 
> >> >> 3) I replace lucene-core-2.9.3.jar in
> solr/lib/ by
> >> my
> >> >> lucene-core-2.9.3-dev.jar that I'd just
> compiled
> >> >> 4) than I do "ant compile" and "ant dist"
> in solr
> >> folder
> >> >> 5) after that I recompile
> >> solr/example/webapps/solr.war
> 
> On 23 October 2010 18:53, Ahmet Arslan 
> wrote:
> > I think you should replace your new
> lucene-core-2.9.3-dev.jar in \apache-solr-1.4.1\lib and then
> create a new solr.war under \apache-solr-1.4.1\dist. And
> copy this new solr.war to solr/example/webapps/solr.war
> >
> > --- On Sat, 10/23/10, Sergey Bartunov 
> wrote:
> >
> >> From: Sergey Bartunov 
> >> Subject: Re: How to index long words with
> StandardTokenizerFactory?
> >> To: solr-user@lucene.apache.org
> >> Date: Saturday, October 23, 2010, 5:45 PM
> >> Yes. I did. Won't help.
> >>
> >> On 23 October 2010 17:45, Ahmet Arslan 
> >> wrote:
> >> > Did you delete the folder
> >> Jetty_0_0_0_0_8983_solr.war_** under
> >> apache-solr-1.4.1\example\work?
> >> >
> >> > --- On Sat, 10/23/10, Sergey Bartunov 
> >> wrote:
> >> >
> >> >> From: Sergey Bartunov 
> >> >> Subject: Re: How to index long words
> with
> >> StandardTokenizerFactory?
> >> >> To: solr-user@lucene.apache.org
> >> >> Date: Saturday, October 23, 2010, 3:56
> PM
> >> >> Here are all the files: http://rghost.net/3016862
> >> >>
> >> >> 1) StandardAnalyzer.java,
> StandardTokenizer.java -
> >> patched
> >> >> files from
> >> >> lucene-2.9.3
> >> >> 2) I patch these files and build lucene
> by typing
> >> "ant"
> >> >> 3) I replace lucene-core-2.9.3.jar in
> solr/lib/ by
> >> my
> >> >> lucene-core-2.9.3-dev.jar that I'd just
> compiled
> >> >> 4) than I do "ant compile" and "ant dist"
> in solr
> >> folder
> >> >> 5) after that I recompile
> >> solr/example/webapps/solr.war
> >> >> with my new
> >> >> solr and lucene-core jars
> >> >> 6) I put my schema.xml in
> solr/example/solr/conf/
> >> >> 7) then I do "java -jar start.jar" in
> >> solr/example
> >> >> 8) index big_post.xml
> >> >> 9) trying to find this document by "curl
> >> >> http://localhost:8983/solr/select?q=body:big*"
> >> >> (big_post.xml contains
> >> >> a long word biga...)
> >> >> 10) solr returns nothing
> >> >>
> >> >> On 23 October 2010 02:43, Steven A Rowe
> 
> >> >> wrote:
> >> >> > Hi Sergey,
> >> >> >
> >> >> > What does your ~34kb field value
> look like?
> >>  Does
> >> >> StandardTokenizer think it's just one
> token?
> >> >> >
> >> >> > What doesn't work?  What happens?
> >> >> >
> >> >> > Steve
> >> >> >
> >> >> >> -Original Message-
> >> >> >> From: Sergey Bartunov [mailto:sbos@gmail.com]
> >> >> >> Sent: Friday, October 22, 2010
> 3:18 PM
> >> >> >> To: solr-user@lucene.apache.org
> >> >> >> Subject: Re: How to index long
> words
> >> with
> >> >> StandardTokenizerFactory?
> >> >> >>
> >> >> >> I'm using Solr 1.4.1. Now I'm
> successed
> >> with
> >> >> replacing lucene-core jar
> >> >> >> but maxTokenValue seems to be
> used in
> >> very strange
> >> >> way. Currenty for
> >> >> >> me it's set to 1024*1024, but I
> couldn't
> >> index a
> >> >> field with just size
> >> >> >> of ~34kb. I understand that it's
> a little
> >> weird to
> >> >> index such a big
> >> >> >> data, but I just want to know it
> doesn't
> >> work
> >> >> >>
> >> >> >> On 22 October 2010 20:36, Steven
> A Rowe
> >> 
> >> >> wrote:
> >> >> >> > Hi Sergey,
> >> >> >> >
> >> >> >> > I've opened an issue to add
> a
> >> maxTokenLength
> >> >> param to the
> >> >>

Re: Solr sorting problem

2010-10-23 Thread Erick Erickson
In general, the behavior when sorting is not predictable when
sorting on a tokenized field, which "text" is. What would
it mean to sort on a field with "erick" "Moazzam" as tokens
in a single document? Should it be in the "e"s or the "m"s?
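
For what it's worth, a small SolrJ sketch of the usual fix (the untokenized
field name here is invented): sort on a string copy of the field rather than
the tokenized text field.

import org.apache.solr.client.solrj.SolrQuery;

public class SortExample {
    public static void main(String[] args) {
        SolrQuery q = new SolrQuery("*:*");
        // Sort on an untokenized (string) copy of the field; sorting on
        // a tokenized text field is unpredictable, as described above.
        q.addSortField("first_name_s", SolrQuery.ORDER.asc);
    }
}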

That said, you probably want to watch out for case.

Best
Erick

On Fri, Oct 22, 2010 at 10:02 AM, Moazzam Khan  wrote:

> For anyone who faced the same problem, changing the field to string
> from text worked!
>
> -Moazzam
>
> On Fri, Oct 22, 2010 at 8:50 AM, Moazzam Khan  wrote:
> > The field type of the first name and last name is text. Could that be
> > why it's not sorting properly? I just changed it to string and started
> > a full-import. Hopefully that will work.
> >
> > Thanks,
> > Moazzam
> >
> > On Thu, Oct 21, 2010 at 7:42 PM, Jayendra Patil
> >  wrote:
> >> need additional information .
> >> Sorting is easy in Solr just by passing the sort parameter
> >>
> >> However, when it comes to text sorting it depends on how you analyse
> >> and tokenize your fields
> >> Sorting does not work on fields with multiple tokens.
> >>
> http://wiki.apache.org/solr/FAQ#Why_Isn.27t_Sorting_Working_on_my_Text_Fields.3F
> >>
> >> On Thu, Oct 21, 2010 at 7:24 PM, Moazzam Khan 
> wrote:
> >>
> >>> Hey guys,
> >>>
> >>> I have a list of people indexed in Solr. I am trying to sort by their
> >>> first names but I keep getting results that are not alphabetically
> >>> sorted (I see the names starting with W before the names starting with
> >>> A). I have a feeling that the results are first being sorted by
> >>> relevancy then sorted by first name.
> >>>
> >>> Is there a way I can get the results to be sorted alphabetically?
> >>>
> >>> Thanks,
> >>> Moazzam
> >>>
> >>
> >
>


Re: MoreLikeThis explanation?

2010-10-23 Thread Koji Sekiguchi

Hi Darren,

Usually patches are written for the latest trunk branch at the time.

I've just updated the patch. Try it for the current trunk if you prefer.

Koji
--
http://www.rondhuit.com/en/

(10/10/22 19:10), Darren Govoni wrote:

Hi Koji,
I tried to apply your patch to the 1.4.0 tagged branch, but it didn't
take completely.
What branch does it work for?

Darren

On Thu, 2010-10-21 at 23:03 +0900, Koji Sekiguchi wrote:


(10/10/21 20:33), dar...@ontrenet.com wrote:

Hi,
Does the latest Solr provide an explanation for results returned by MLT?


No, but there is an open issue:

https://issues.apache.org/jira/browse/SOLR-860

Koji







Re: Modelling Access Control

2010-10-23 Thread Savvas-Andreas Moysidis
Pushing ACL logic outside Solr sounds like a prudent choice indeed as, in my
opinion, all of the business rules/conceptual logic should reside only
within the code boundaries. This way your domain will be easier to model and
your code easier to read, understand and maintain.

More information on Filter Queries, when they should be used and how they
affect performance can be found here:
http://wiki.apache.org/solr/FilterQueryGuidance

On 23 October 2010 20:00, Dennis Gearon  wrote:

> Forgot to add,
> 3/ The external, application code selects the GROUPS that the user has
> permission to read (Solr will only serve up what is to be read?) then search
> on those groups.
>
>
> Dennis Gearon
>
> Signature Warning
> 
> It is always a good idea to learn from your own mistakes. It is usually a
> better idea to learn from others’ mistakes, so you do not have to make them
> yourself. from '
> http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036'
>
> EARTH has a Right To Life,
>  otherwise we all die.
>
>
> --- On Sat, 10/23/10, Dennis Gearon  wrote:
>
> > From: Dennis Gearon 
> > Subject: Re: Modelling Access Control
> > To: solr-user@lucene.apache.org
> > Date: Saturday, October 23, 2010, 11:49 AM
> > Two things will lessen the solr
> > admininstrative load :
> >
> > 1/ Follow examples of databases and *nix OSs. Give each
> > user their own group, or set up groups that don't have
> > regular users as OWNERS, but can have users assigned to the
> > group to give them particular permissions. I.E. Roles, like
> > publishers, reviewers, friends, etc.
> >
> > 2/ Put your ACL outside of Solr, using your
> > server-side/command line language's object oriented
> > properties. Force all searches to come from a single
> > location in code (not sure how to do that), and make the
> > piece of code check authentication and authorization.
> >
> > This is what my research shows how others do it, and how I
> > plan to do it. ANY insight others have on this, I really
> > want to hear.
> >
> > Dennis Gearon
> >
> > Signature Warning
> > 
> > It is always a good idea to learn from your own mistakes.
> > It is usually a better idea to learn from others’
> > mistakes, so you do not have to make them yourself. from '
> http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036'
> >
> > EARTH has a Right To Life,
> >   otherwise we all die.
> >
> >
> > --- On Sat, 10/23/10, Paul Carey 
> > wrote:
> >
> > > From: Paul Carey 
> > > Subject: Modelling Access Control
> > > To: solr-user@lucene.apache.org
> > > Date: Saturday, October 23, 2010, 1:03 AM
> > > Hi
> > >
> > > My domain model is made of users that have access to
> > > projects which
> > > are composed of items. I'm hoping to use Solr and
> > would
> > > like to make
> > > sure that searches only return results for items that
> > users
> > > have
> > > access to.
> > >
> > > I've looked over some of the older posts on this
> > mailing
> > > list about
> > > access control and saw a suggestion along the lines
> > of
> > > acl:<user id> AND (actual query).
> > >
> > > While this obviously works, there are a couple of
> > niggles.
> > > Every item
> > > must have a list of valid user ids (typically less
> > than 100
> > > in my
> > > case). Every time a collaborator is added to or
> > removed
> > > from a
> > > project, I need to update every item in that project.
> > This
> > > will
> > > typically be fewer than 1000 items, so I guess is no
> > big
> > > deal.
> > >
> > > I wondered if the following might be a reasonable
> > > alternative,
> > > assuming the number of projects to which a user has
> > access
> > > is lower
> > > than a certain bound.
> > > (acl:<project id> OR acl:<project id> OR
> > > ... )
> > > AND (actual query)
> > >
> > > When the numbers are small - e.g. each user has access
> > to
> > > ~20 projects
> > > and each project has ~20 collaborators - is one
> > approach
> > > preferable
> > > over another? And when outliers exist - e.g. a project
> > with
> > > 2000
> > > collaborators, or a user with access to 2000 projects
> > - is
> > > one
> > > approach more liable to fail than the other?
> > >
> > > Many thanks
> > >
> > > Paul
> > >
> >
>


Re: pf parameter in edismax (SOLR-1553)

2010-10-23 Thread Jan Høydahl / Cominvent
Answering my own question:
The "pf" feature only kicks in with multi term "q" param. In my case I used a 
field tokenized by KeywordTokenizer, hence pf never kicked in.
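
For reference, a SolrJ sketch of a request where pf does kick in (weights and
field names mirror the example quoted below, not a recommendation):

import org.apache.solr.client.solrj.SolrQuery;

public class PfExample {
    public static void main(String[] args) {
        SolrQuery q = new SolrQuery("foo bar"); // multi-term q, so pf applies
        q.set("defType", "edismax");
        q.set("qf", "title^2.0 body^0.5");
        q.set("pf", "title^50.0");       // phrase boost on title
        q.set("debugQuery", "true");     // shows the phrase boost in the parsed query
    }
}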

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com

On 14. okt. 2010, at 13.29, Jan Høydahl / Cominvent wrote:

> Hi,
> 
> Have applied SOLR-1553 to 1.4.2 and it works great.
> However, I can't get the pf param to work. Example:
>   q=foo bar&qf=title^2.0 body^0.5&pf=title^50.0
> 
> Shouldn't I see the phrase query boost in debugQuery? Currently I see no 
> trace of pf being used.
> 
> --
> Jan Høydahl, search solution architect
> Cominvent AS - www.cominvent.com
> 



Re: Modelling Access Control

2010-10-23 Thread Israel Ekpo
Hi All,

I think using filter queries will be a good option to consider because of
the following reasons

* The filter query does not affect the score of the items in the result set.
If the ACL logic is part of the main query, it could influence the scores of
the items in the result set.

* Using a filter query could lead to better performance in complex queries
because the results of the query specified with fq are cached independently
of those of the main query. Since the result of a filter query is cached, it
can be used to filter the primary query result via set intersection, without
having to fetch the ids of the matching documents a second time.

I think this will be useful because we can assume that the ACL portion of
the fq is relatively constant, since the permissions for each user are not
something that changes frequently.

http://wiki.apache.org/solr/FilterQueryGuidance
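
As a small illustration of the caching point (a SolrJ sketch; the field name
and ids are invented): two different user queries sharing one ACL filter
reuse the same cached fq result set.

import org.apache.solr.client.solrj.SolrQuery;

public class SharedFilterExample {
    public static void main(String[] args) {
        String acl = "acl:(101 OR 102)"; // constant per user session
        SolrQuery first = new SolrQuery("title:report");
        SolrQuery second = new SolrQuery("body:budget");
        // Both requests carry the same fq, so Solr computes the ACL
        // document set once, caches it, and intersects it with each q.
        first.addFilterQuery(acl);
        second.addFilterQuery(acl);
    }
}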


On Sat, Oct 23, 2010 at 2:58 PM, Dennis Gearon wrote:

> why use filter queries?
>
> Wouldn't reducing the set headed into the filters by putting it in the main
> query be faster? (A question to learn, since I do NOT know :-)
>
> Dennis Gearon
>
> Signature Warning
> 
> It is always a good idea to learn from your own mistakes. It is usually a
> better idea to learn from others’ mistakes, so you do not have to make them
> yourself. from '
> http://blogs.techrepublic.com.com/security/?p=4501&tag=nl.e036'
>
> EARTH has a Right To Life,
>  otherwise we all die.
>
>
> --- On Sat, 10/23/10, Israel Ekpo  wrote:
>
> > From: Israel Ekpo 
> > Subject: Re: Modelling Access Control
> > To: solr-user@lucene.apache.org
> > Date: Saturday, October 23, 2010, 7:01 AM
> > Hi Paul,
> >
> > Regardless of how you implement it, I would recommend you
> > use filter queries
> > for the permissions check rather than making it part of the
> > main query.
> >
> > On Sat, Oct 23, 2010 at 4:03 AM, Paul Carey 
> > wrote:
> >
> > > Hi
> > >
> > > My domain model is made of users that have access to
> > projects which
> > > are composed of items. I'm hoping to use Solr and
> > would like to make
> > > sure that searches only return results for items that
> > users have
> > > access to.
> > >
> > > I've looked over some of the older posts on this
> > mailing list about
> > > access control and saw a suggestion along the lines
> > of
> > > acl:<user id> AND (actual query).
> > >
> > > While this obviously works, there are a couple of
> > niggles. Every item
> > > must have a list of valid user ids (typically less
> > than 100 in my
> > > case). Every time a collaborator is added to or
> > removed from a
> > > project, I need to update every item in that project.
> > This will
> > > typically be fewer than 1000 items, so I guess is no
> > big deal.
> > >
> > > I wondered if the following might be a reasonable
> > alternative,
> > > assuming the number of projects to which a user has
> > access is lower
> > > than a certain bound.
> > > (acl:<project id> OR acl:<project id> OR
> > > ... ) AND (actual query)
> > >
> > > When the numbers are small - e.g. each user has access
> > to ~20 projects
> > > and each project has ~20 collaborators - is one
> > approach preferable
> > > over another? And when outliers exist - e.g. a project
> > with 2000
> > > collaborators, or a user with access to 2000 projects
> > - is one
> > > approach more liable to fail than the other?
> > >
> > > Many thanks
> > >
> > > Paul
> > >
> >
> >
> >
> > --
> > °O°
> > "Good Enough" is not good enough.
> > To give anything less than your best is to sacrifice the
> > gift.
> > Quality First. Measure Twice. Cut Once.
> > http://www.israelekpo.com/
> >
>



-- 
°O°
"Good Enough" is not good enough.
To give anything less than your best is to sacrifice the gift.
Quality First. Measure Twice. Cut Once.
http://www.israelekpo.com/


Re: How to delete a SOLR document if that particular data doesnt exist in DB?

2010-10-23 Thread bbarani

Thanks a lot for all your replies.

I finally wrote a program which fetches and stores all the UIDs from the
source (DB) in one list, and fetches and stores all the UIDs from the SOLR
documents in another list.

Next, using the binarySearch method of Collections, I was able to filter out
the UIDs that are present in the SOLR UID list but not in the DB UID list,
and passed those UIDs for deletion using deleteByQuery.

It took under 7 minutes to compare the 2 lists with over 3 million records
(in each list) and delete the orphan documents from the SOLR index.
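
A sketch of that comparison (names invented; assumes both UID lists fit in
memory, as described above):

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class OrphanFinder {
    /** Returns the UIDs present in the Solr index but missing from the DB. */
    public static List<String> findOrphans(List<String> dbIds, List<String> solrIds) {
        // binarySearch requires a sorted list.
        Collections.sort(dbIds);
        List<String> orphans = new ArrayList<String>();
        for (String id : solrIds) {
            if (Collections.binarySearch(dbIds, id) < 0) {
                orphans.add(id); // in Solr but not in the DB -> delete
            }
        }
        return orphans;
    }
}

Each orphan UID (or a batch of them) can then go to deleteByQuery, e.g.
"uid:(id1 OR id2 OR ...)".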

Again thanks a lot for all your replies. 

Thanks,
Barani
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/How-to-delete-a-SOLR-document-if-that-particular-data-doesnt-exist-in-DB-tp1739222p1761093.html
Sent from the Solr - User mailing list archive at Nabble.com.