DIH import "out of memory" problem (batchSize and autoCommit not working)

2009-09-22 Thread Steve Sun
Hi,

I spent a whole day trying to make "batchSize" work for JdbcDataSource with
"org.postgresql.Driver", but got frustrated.  At last I took a look into
DIH's source code and found that there's actually a bug in there.  When the
JDBC driver is placed in <solr-home>/lib (as instructed by the DIHQuickStart
page of the Solr wiki), but not in Tomcat's lib directory, the JDBC connection
will not be configured as specified in the DIH configuration at all.  Attributes
like autoCommit, readOnly and batchSize will be ignored.  The fix is simple;
I have attached my patch.
(contrib/dataimporthandler/src/main/java/org/apache/solr/handler/dataimport/JdbcDataSource.java
r817524)

One work-around is: place your JDBC driver jar under Tomcat's application
lib directory, e.g. tomcat/webapps/solr/WEB-INF/lib/

Have only tested with PostgreSQL drivers, but it seems the problem is generic
to all drivers placed in <solr-home>/lib.
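
For reference, these are the attributes in question, set on the dataSource
element of DIH's data-config.xml. A minimal sketch (the driver is the one
from this thread; URL, credentials and values are placeholders):

<dataSource type="JdbcDataSource"
            driver="org.postgresql.Driver"
            url="jdbc:postgresql://localhost:5432/mydb"
            user="solr"
            password="secret"
            batchSize="500"
            autoCommit="false"
            readOnly="true"/>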

Regards,
Steve


Re: DIH import "out of memory" problem (batchSize and autoCommit not working)

2009-09-22 Thread Shalin Shekhar Mangar
On Tue, Sep 22, 2009 at 2:29 PM, Steve Sun  wrote:

> Hi,
>
> I spent a whole day trying to make "batchSize" work for JdbcDataSource with
> "org.postgresql.Driver", but got frustrated.  At last I took a look into
> DIH's source code and found that there's actually a bug in there.  When JDBC
> driver is placed in /lib (as instructed by DIHQuickStart page of
> Solr wiki), but not in tomcat's lib directory, the JDBC connection will not
> be configured as specified in the DIH configuration at all.  Attributes like
> autoCommit, readOnly and batchSize will be ignored.  The fix is simple, have
> attached my patch.
> (contrib/dataimporthandler/src/main/java/org/apache/solr/handler/dataimport/JdbcDataSource.java
> r817524)
>
> One work-around is: place your JDBC driver jar under tomcat's application
> lib directory.  e.g., tomcat/webapps/solr/WEB-INF/lib/
>
> Have only tested with PostgreSQL drivers, but it seems the problem is generic
> to all drivers placed in <solr-home>/lib.
>

Thanks Steve. The mailing list removed your attachment. Can you please open
a jira issue and attach a patch there?

-- 
Regards,
Shalin Shekhar Mangar.


Re: DIH import "out of memory" problem (batchSize and autoCommit not working)

2009-09-22 Thread Steve Sun
2009/9/22 Shalin Shekhar Mangar 

> On Tue, Sep 22, 2009 at 2:29 PM, Steve Sun  wrote:
>
> > Hi,
> >
> > I spent a whole day trying to make "batchSize" work for JdbcDataSource
> > with "org.postgresql.Driver", but got frustrated.  At last I took a look
> > into DIH's source code and found that there's actually a bug in there.
> > When the JDBC driver is placed in <solr-home>/lib (as instructed by the
> > DIHQuickStart page of the Solr wiki), but not in Tomcat's lib directory,
> > the JDBC connection will not be configured as specified in the DIH
> > configuration at all.  Attributes like autoCommit, readOnly and batchSize
> > will be ignored.  The fix is simple; I have attached my patch.
> > (contrib/dataimporthandler/src/main/java/org/apache/solr/handler/dataimport/JdbcDataSource.java
> > r817524)
> >
> > One work-around is: place your JDBC driver jar under Tomcat's application
> > lib directory, e.g. tomcat/webapps/solr/WEB-INF/lib/
> >
> > Have only tested with PostgreSQL drivers, but it seems the problem is
> > generic to all drivers placed in <solr-home>/lib.
> >
>
> Thanks Steve. The mailing list removed your attachment. Can you please open
> a jira issue and attach a patch there?
>
>
Done.
http://issues.apache.org/jira/browse/SOLR-1450

> --
> Regards,
> Shalin Shekhar Mangar.
>


Apache Hadoop Get Together: Next week Tuesday, newthinking store Berlin Germany

2009-09-22 Thread Isabel Drost

This is a friendly reminder that the next Apache Hadoop Get Together
takes place next week on Tuesday, 29th of September* at newthinking
store (Tucholskystr. 48, Berlin):

http://upcoming.yahoo.com/event/4314020/

   * Thorsten Schuett, Solving Puzzles with MapReduce.
   * Thilo Götz, Text analytics on jaql.
   * Uwe Schindler, Lucene 2.9 Developments.

Big thanks goes to newthinking store for providing the venue for free
and to Cloudera for sponsoring videos of the talks. Links to the videos
will be posted on , on the upcoming page
linked above, as well as on the Cloudera Blog soon after the event.

The 7th Get Together is scheduled for December, 16th. If you would like
to submit a talk or sponsor the event, please contact me.


Hope to see you in Berlin next week,

Isabel



* The event is scheduled right before the UIMA workshop in Potsdam,
which may be of interest to you if you are a UIMA user:

http://docs.google.com/View?id=dft23bqs_3c7qnzg6x


Query performance

2009-09-22 Thread Gargate, Siddharth
Hi all,

Does the first query below have any performance impact compared to the
second query?

+title:lucene +(title:lucene -name:sid)

+(title:lucene -name:sid)



Re: DIH import "out of memory" problem (batchSize and autoCommit not working)

2009-09-22 Thread Shalin Shekhar Mangar
On Tue, Sep 22, 2009 at 3:00 PM, Steve Sun  wrote:

> Done.
> http://issues.apache.org/jira/browse/SOLR-1450
>
>
This is fixed in trunk now. Thanks Steve!

-- 
Regards,
Shalin Shekhar Mangar.


solr caching problem

2009-09-22 Thread satyasundar jena
I configured the filter cache in solrconfig.xml as here under:

<filterCache
  class="solr.FastLRUCache"
  size="16384"
  initialSize="4096"
  autowarmCount="4096"/>

<useFilterForSortedQuery>true</useFilterForSortedQuery>

as per
http://wiki.apache.org/solr/SolrCaching#head-b6a7d51521d55fa0c89f2b576b2659f297f9

And executed a query as:
http://localhost:8080/solr/select/?q=*:*&fq=id:(172704 TO 2079813)&sort=id asc

But when I deleted the doc id:172704 and executed the query again, I didn't
find the same doc (172704) in my result.


Re: what is too large for an indexed field

2009-09-22 Thread Erick Erickson
You might also want to get a copy of Luke and examine your index to see
what's actually in there. Could you be being misled by, say, punctuation?

Erick

On Mon, Sep 21, 2009 at 4:28 PM, Yonik Seeley wrote:

> On Mon, Sep 21, 2009 at 4:22 PM, Park, Michael 
> wrote:
> > I get no results back on a search.  But I can see the actual word or
> phrase in the stored doc.
>
> Ok cool - that should make it much easier to debug.
> #1) verify that you changed the maxFieldLength property in both places
> in solrconfig.xml, and that you restarted and reindexed.
> #2) if still broken, could you show the output after adding
> debugQuery=true to the request, along with a snippet from the document
> that should match?
>
> -Yonik
> http://www.lucidimagination.com
>


Function query result as a filter query

2009-09-22 Thread Pete Smith
Hi,

Is it possible to constrain a resultset using a filter query to only
return the top 100 documents for a particular field?

Say I have a field called 'hits' that has the total number of hits for
that item. I want to return only the documents that have the top 100
highest hits.

I want something like this:

fq=ord(hits):[* TO 100]

But that does not appear to work - I don't think I can use a function
query for the source of a query. I want it as a filter query so I can
also use it as a facet query.

Cheers,
Pete



Re: Function query result as a filter query

2009-09-22 Thread Yonik Seeley
It's probably not exactly what you're looking for, but you can do
ranges over functions in Solr 1.4
http://www.lucidimagination.com/blog/2009/07/06/ranges-over-functions-in-solr-14/
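
For example, the frange parser described in that post can turn a function
result into a filter, along these lines (using the field from the original
question; note that ord() yields index ordinals counted from the lowest
value, so this illustrates the syntax rather than a literal "top 100"):

fq={!frange l=0 u=100}ord(hits)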

-Yonik
http://www.lucidimagination.com



On Tue, Sep 22, 2009 at 10:26 AM, Pete Smith  wrote:
> Hi,
>
> Is it possible to constrain a resultset using a filter query to only
> return the top 100 documents for a particular field?
>
> Say I have a field called 'hits' that has the total number of hits for
> that item. I want to return only the documents that have the top 100
> highest hits.
>
> I want something like this:
>
> fq=ord(hits):[* TO 100]
>
> But that does not appear to work - I don't think I can use a function
> query for the source of a query. I want it as a filter query so I can
> also use it as a facet query.
>
> Cheers,
> Pete
>
>


Re: solr caching problem

2009-09-22 Thread Yonik Seeley
Solr's caches should be transparent - they should only speed up
queries, not change the result of queries.

-Yonik
http://www.lucidimagination.com

On Tue, Sep 22, 2009 at 9:45 AM, satyasundar jena  wrote:
> I configured the filter cache in solrconfig.xml as here under:
> <filterCache
>   class="solr.FastLRUCache"
>   size="16384"
>   initialSize="4096"
>   autowarmCount="4096"/>
>
> <useFilterForSortedQuery>true</useFilterForSortedQuery>
>
> as per
> http://wiki.apache.org/solr/SolrCaching#head-b6a7d51521d55fa0c89f2b576b2659f297f9
>
> And executed a query as:
> http://localhost:8080/solr/select/?q=*:*&fq=id:(172704 TO 2079813)&sort=id asc
>
> But when I deleted the doc id:172704 and executed the query again, I didn't
> find the same doc (172704) in my result.
>


Re: solr caching problem

2009-09-22 Thread satyasundar jena
1) Then do you mean, if we delete a particular doc, then that is going to be
deleted from the cache also?
2) In Solr, is the cache storing the entire document in memory, or only
references to documents in memory?
And how to test this caching, after all?
I'll be thankful upon getting an elaboration.

On Tue, Sep 22, 2009 at 8:46 PM, Yonik Seeley wrote:

> Solr's caches should be transparent - they should only speed up
> queries, not change the result of queries.
>
> -Yonik
> http://www.lucidimagination.com
>
> On Tue, Sep 22, 2009 at 9:45 AM, satyasundar jena 
> wrote:
> > I configured the filter cache in solrconfig.xml as here under:
> > <filterCache
> >   class="solr.FastLRUCache"
> >   size="16384"
> >   initialSize="4096"
> >   autowarmCount="4096"/>
> >
> > <useFilterForSortedQuery>true</useFilterForSortedQuery>
> >
> > as per
> > http://wiki.apache.org/solr/SolrCaching#head-b6a7d51521d55fa0c89f2b576b2659f297f9
> >
> > And executed a query as:
> > http://localhost:8080/solr/select/?q=*:*&fq=id:(172704 TO 2079813)&sort=id asc
> >
> > But when I deleted the doc id:172704 and executed the query again, I didn't
> > find the same doc (172704) in my result.
> >
>


Oracle incomplete DataImport results

2009-09-22 Thread Daniel Bradley
I appear to be getting only a small number of items imported into Solr
when doing a full-import against an Oracle data provider. The query I'm
running is approximately similar to:

SELECT "ID", dbms_lob.substr("Text", 4000, 1) "Text", "Date",
"LastModified", "Type", "Created", "Available", "Parent", "Title" from
"TheTableName" where "Available" < CURRENT_DATE and "Available" >
add_months(current_date, -1)

This retrieves the last month's items from the database (the
dbms_lob.substr function is used to avoid Solr simply indexing the
object name, as "Text" is the Oracle CLOB type). When running this in
Oracle SQL Developer, approximately 5600 rows are returned; however,
running a full import only imports approximately 550 items.

There's no visible memory use and no exceptions suggesting any problems
with lack of memory. Is there any limiting of the number of items you
can import in a single request? Any other thoughts on this problem would
be much appreciated.

Thanks



Other Information:

Running the command:
http://xxx.xxx.xxx.xxx:8080/solr/dataimport?command=full-import

Produces the output:


  
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">0</int>
  </lst>
  <lst name="initArgs">
    <lst name="defaults">
      <str name="config">data-config.xml</str>
    </lst>
  </lst>
  <str name="command">full-import</str>
  <str name="status">idle</str>
  <str name="importResponse"/>
  <lst name="statusMessages">
    <str name="Time Elapsed">0:5:43.58</str>
    <str name="Total Requests made to DataSource">559</str>
    <str name="Total Rows Fetched">4726</str>
    <str name="Total Documents Processed">557</str>
    <str name="Total Documents Skipped">0</str>
    <str name="Full Dump Started">2009-09-22 16:58:46</str>
  </lst>
  <str name="WARNING">This response format is experimental.  It is
likely to change in the future.</str>
</response>


Running the command:
http://xxx.xxx.xxx.xxx:8080/solr/dataimport?command=full-import&debug=on&verbose=true

Produces the following output (dots added where content is not
relevant):



  
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">40906</int>
  </lst>
  <lst name="initArgs">
    <lst name="defaults">
      <str name="config">data-config.xml</str>
    </lst>
  </lst>
  <str name="command">full-import</str>
  <str name="mode">debug</str>
  <arr name="documents">
    ...
  </arr>
  <lst name="verbose-output">
    <lst name="entity:...">
      <str name="query">SELECT "ID", dbms_lob.substr("Text", 4000, 1)
"Text", "Date", "LastModified", "Type", "Created", "Available",
"Parent", "Title" from "TheTableName" where "Available" <
CURRENT_DATE and "Available" > add_months(current_date, -1)</str>
      <str name="time-taken">0:0:7.766</str>
      <str>--- row #1-------------</str>
      <date name="...">2009-08-22T16:04:04Z</date>
      <str name="...">java.math.BigDecimal:0</str>
      ...
      <date name="...">2009-08-22T16:04:04Z</date>
      <date name="...">2009-08-22T16:04:04Z</date>
      <str name="...">java.math.BigDecimal:235</str>
      <str name="...">java.math.BigDecimal:1320541</str>
      <date name="...">2009-08-22T16:04:58Z</date>
      ...
      <str>---------------------</str>
      <lst name="entity:...">
        <str name="query">SELECT
CONCAT(CONCAT(CONCAT(CONCAT(CONCAT(CONCAT("Level1",' '),"Level2"), ' '),
"Level3"), ' '), "Level4") "Levels", TO_NCHAR("TheCategories"."Value")
"Value" FROM "TheCategories" WHERE "TheCategories".ID='1320541'</str>
        <str name="time-taken">0:0:5.485</str>
        <str>--- row #1-------------</str>
        <str name="...">12 235 1848 ...</str>
        <str>---------------------</str>
        ...
      </lst>
      ...
    </lst>
    ...
  </lst>
  <str name="status">idle</str>
  <str name="importResponse">Configuration Re-loaded sucessfully</str>
  <lst name="statusMessages">
    <str name="Total Requests made to DataSource">11</str>
    <str name="Total Rows Fetched">93</str>
    <str name="Total Documents Skipped">0</str>
    <str name="Full Dump Started">2009-09-22 16:47:28</str>
    <str name="Time taken">0:0:39.47</str>
  </lst>
  <str name="WARNING">This response format is experimental.  It is
likely to change in the future.</str>
</response>






RE: solr caching problem

2009-09-22 Thread Fuad Efendi
> 1) Then do you mean, if we delete a particular doc, then that is going to
> be deleted from the cache also?

When you delete a document, and then COMMIT your changes, new caches will be
warmed up (and prepopulated by some key-value pairs from the old instances),
etc:

<documentCache
  class="solr.LRUCache"
  size="512"
  initialSize="512"
  autowarmCount="0"/>

- this one won't be 'prepopulated'.




> 2) In Solr, is the cache storing the entire document in memory, or only
> references to documents in memory?

There are many different cache instances; DocumentCache should store
<docId, Document> pairs, etc.




Code sync between Lucene and Solr, crossing Apache project boundaries, etc.

2009-09-22 Thread Mark Bennett
To do any serious Solr debugging (or filter development) you also need the
Solr source code tree.  And you'd like them to be in sync, so that the
Lucene code you see is exactly the same as what was used for the Solr
version you're working with.

I did find this link on sync'ing the two source trees, for a specific Solr
version.  But this seems a bit convoluted; is this really the right answer?
http://happygiraffe.net/blog/2009/07/16/solrs-lucene-source/

Generally:

1: Given that many threads on the Solr mailing list seem to end with "yeah,
you could write some code to do that", it seems like most developers would
face this problem, and yet I haven't seen it discussed much (at least not
with the search terms I'm using, I keep finding stuff about SolrJ or just
getting the Solr code to build)

2: Do most people pick a version of Solr, then try carefully to sync it with
the exact version of Lucene code, or do they just not worry about the
version stuff unless they get an error?

3: Or do most Solr/Lucene developers just live out of the Apache nightly
snapshots for both source code trees?  And if so, I have a question about
even doing that.  (see below)

4: Getting both projects to live together under Eclipse has seemed a bit
awkward, though again I'd think it's a common task.  Any good links on that?

5: So do most of you:
a: Bother with both code trees?  (Solr and Lucene)
b: Live from the command line with ant?  or
c: Get them both living under Eclipse?  With dependencies back and forth?  I
haven't found a good resource for this yet.
d: Do you also think maven, git and clover are required?

6: Since advanced developers using Nutch and Solr would typically want
Lucene's source code as well, wouldn't it be good to have a separate
distribution that includes all of those?  Or perhaps there's some "apache
thing" that makes this so trivial it's not worth bothering with?

Back to question 3, just using the nightly trees for both Solr and Lucene,
which presumably are in sync.  If you don't need a specific release, this
might be a reasonable workaround, but still a few of the details bother me.

Get the code:
svn co http://svn.apache.org/repos/asf/lucene/java/trunk lucene-nightly
Checked out revision 817722.
svn co http://svn.apache.org/repos/asf/lucene/solr/trunk solr-nightly
Checked out revision 817722.

* Note that both are revision 817722

Now do builds:
cd lucene-nightly / ant / cd ..
cd solr-nightly / ant / cd ..

* OK, fine, no clover, I guess I'll live...

Now check the two Lucene core jar files:
proj $ ls -l lucene-nightly/build/lucene-core-2.9.jar
-rw-r--r--  1 xyz  staff  1104037 Sep 22 09:44
lucene-nightly/build/lucene-core-2.9.jar
proj $ ls -l solr-nightly/lib/lucene-core-2.9.0.jar
-rw-r--r--  1 xyz  staff  1104049 Sep 22 09:43
solr-nightly/lib/lucene-core-2.9.0.jar

* Notice the sizes.  They're pretty close, but not identical.  The jar in
lucene-nightly was created by my tools, whereas the lucene jar in Solr was
created on another machine.

To prove this, use Subversion's status command:
proj $ svn status lucene-nightly/build/lucene-core-2.9.jar
svn: warning: 'lucene-nightly/build/lucene-core-2.9.jar' is not a working
copy
So... not under version control, because it's built when I run ant.
proj $ svn status solr-nightly/lib/lucene-core-2.9.0.jar
(no output)
No complaints from Subversion, so we know that it IS under version control
and therefore came from Apache, and I haven't messed with it.

So I'd chalk up the difference exclusively to that then?

If folks work with both code trees a lot, maybe having a parent build file
could copy over the fresh Lucene jar over to Solr.  Also curious if there's
an automated way to get this working in Eclipse.
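
Something as small as this ant fragment would do it (paths match the layout
above; the target name is arbitrary):

<target name="sync-lucene-jar">
  <copy file="lucene-nightly/build/lucene-core-2.9.jar"
        tofile="solr-nightly/lib/lucene-core-2.9.0.jar"
        overwrite="true"/>
</target>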


--
Mark Bennett / New Idea Engineering, Inc. / mbenn...@ideaeng.com
Direct: 408-733-0387 / Main: 866-IDEA-ENG / Cell: 408-829-6513


Re: Oracle incomplete DataImport results

2009-09-22 Thread Shalin Shekhar Mangar
On Tue, Sep 22, 2009 at 10:53 PM, Daniel Bradley <
daniel.brad...@adfero.co.uk> wrote:

> I appear to be getting only a small number of items imported into Solr
> when doing a full-import against an Oracle data provider. The query I'm
> running is approximately similar to:
>
> SELECT "ID", dbms_lob.substr("Text", 4000, 1) "Text", "Date",
> "LastModified", "Type", "Created", "Available", "Parent", "Title" from
> "TheTableName" where "Available" < CURRENT_DATE and "Available" >
> add_months(current_date, -1)
>
> This retrieves the last month's items from the database (the
> dbms_lob.substr function is used to avoid Solr simply indexing the
> object name, as "Text" is the Oracle CLOB type). When running this in
> Oracle SQL Developer, approximately 5600 rows are returned; however,
> running a full import only imports approximately 550 items.
>
> There's no visible memory use and no exceptions suggesting any problems
> with lack of memory. Is there any limiting of the number of items you
> can import in a single request? Any other thoughts on this problem would
> be much appreciated.
>
>
What is the uniqueKey in schema.xml? Is it possible that many of those 5600
rows share the same value for Solr's uniqueKey field?

There are no limits on the number of items you can import. The number of
documents created should correspond to the number of rows returned by the
root level entity's query (assuming the uniqueKey for each of those
documents is actually unique).
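
As a quick sanity check on the database side (assuming "ID" is the column
mapped to your uniqueKey), a duplicate count over the same date window
should tell you, e.g.:

SELECT "ID", COUNT(*) FROM "TheTableName"
WHERE "Available" < CURRENT_DATE AND "Available" > add_months(current_date, -1)
GROUP BY "ID" HAVING COUNT(*) > 1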

-- 
Regards,
Shalin Shekhar Mangar.


Re: Code sync between Lucene and Solr, crossing Apache project boundaries, etc.

2009-09-22 Thread Shalin Shekhar Mangar
On Tue, Sep 22, 2009 at 11:24 PM, Mark Bennett  wrote:

> To do any serious Solr debugging (or filter development) you also need the
> Solr source code tree.  And you'd like them to be in sync, so that the
> Lucene code you see is exactly the same as what was used for the Solr
> version you're working with.
>
> I did find this link on sync'ing the two source trees, for a specific Solr
> version.  But this seems a bit convoluted; is this really the right answer?
> http://happygiraffe.net/blog/2009/07/16/solrs-lucene-source/
>
> Generally:
>
> 1: Given that many threads on the Solr mailing list seem to end with "yeah,
> you could write some code to do that", it seems like most developers would
> face this problem, and yet I haven't seen it discussed much (at least not
> with the search terms I'm using, I keep finding stuff about SolrJ or just
> getting the Solr code to build)
>
> 2: Do most people pick a version of Solr, then try carefully to sync it
> with
> the exact version of Lucene code, or do they just not worry about the
> version stuff unless they get an error?
>

> 3: Or do most Solr/Lucene developers just live out of the Apache nightly
> snapshots for both source code trees?  And if so, I have a question about
> even doing that.  (see below)
>
>
Actually, I have had to debug through the Lucene sources only a couple of
times till now. I keep Lucene trunk handy for reference. I rarely need
it if I'm writing a Solr plugin. In my IDE I set up Solr trunk with a
dependency on Lucene trunk so that I can refer to both sources by ctrl+click
on a Lucene class name. It works well because Lucene has excellent
back-compat. Before creating a patch, I run tests through ant, which uses the
bundled Lucene jars.


> 4: Getting both projects to live together under Eclipse has seemed a bit
> awkward, though again I'd think it's a common task.  Any good links on
> that?
>
>
I use IDEA now but I did use Eclipse earlier and I had set it up the same
way. Keep both projects; from Solr, remove the dependency on the bundled
Lucene jars and add a dependency on the Lucene source for ease in
cross-referencing sources.


> 5: So do most of you:
> a: Bother with both code trees?  (Solr and Lucene)
>

Yes!


> b: Live from the command line with ant?  or
>

Yes!


> c: Get them both living under Eclipse?  With dependencies back and forth?
> I haven't found a good resource for this yet.
> d: Do you also think maven, git and clover are required?
>

Not really.


> 6: Since advanced developers using Nutch and Solr would typically want
> Lucene's source code as well, wouldn't it be good to have a separate
> distribution that includes all of those?  Or perhaps there's some "apache
> thing" that makes this so trivial it's not worth bothering with?
>

For most such developers I assume this would be a one-time setup which is
used for a long time. Therefore, there is very little advantage in providing
a separate distribution I guess.


> Back to question 3, just using the nightly trees for both Solr and Lucene,
> which presumably are in sync.  If you don't need a specific release, this
> might be a reasonable workaround, but still a few of the details bother me.
>
> Get the code:
> svn co http://svn.apache.org/repos/asf/lucene/java/trunk lucene-nightly
> Checked out revision 817722.
> svn co http://svn.apache.org/repos/asf/lucene/solr/trunk solr-nightly
> Checked out revision 817722.
>
> * Note that both are revision 817722
>
> Now do builds:
> cd lucene-nightly / ant / cd ..
> cd solr-nightly / ant / cd ..
>
> * OK, fine, no clover, I guess I'll live...
>
> Now check the two Lucene core jar files:
> proj $ ls -l lucene-nightly/build/lucene-core-2.9.jar
> -rw-r--r--  1 xyz  staff  1104037 Sep 22 09:44
> lucene-nightly/build/lucene-core-2.9.jar
> proj $ ls -l solr-nightly/lib/lucene-core-2.9.0.jar
> -rw-r--r--  1 xyz  staff  1104049 Sep 22 09:43
> solr-nightly/lib/lucene-core-2.9.0.jar
>
> * Notice the sizes.  They're pretty close, but not identical.  The jar in
> lucene-nightly was created by my tools, whereas the lucene jar in Solr was
> created on another machine.
>
>
Well, trunk changes all the time. If you want exactly the same code, it is
best to check out the Lucene revision mentioned in Solr's CHANGES.txt.
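
For example (using the revision from the listing above purely as an
illustration):

svn co -r 817722 http://svn.apache.org/repos/asf/lucene/java/trunk lucene-r817722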

-- 
Regards,
Shalin Shekhar Mangar.


No-op query for :q parameter?

2009-09-22 Thread Mat Brown
Hi all,

If I have a set of filter queries that I'd like to apply but nothing
that I particularly would like to put into the :q parameter (since I'd
like all of the scopes to be cached), is there any problem with just
passing "[* TO *]" for the :q param? Any performance implications?

Thanks!
Mat


Re: No-op query for :q parameter?

2009-09-22 Thread Shalin Shekhar Mangar
On Wed, Sep 23, 2009 at 12:19 AM, Mat Brown  wrote:

>
> If I have a set of filter queries that I'd like to apply but nothing
> that I particularly would like to put into the :q parameter (since I'd
> like all of the scopes to be cached), is there any problem with just
> passing "[* TO *]" for the :q param? Any performance implications?
>

You can use q=*:* to match all documents. [* TO *] will work too but it is
applied on the default search field and I'm not sure of its performance
characteristics.

I'm not sure about what you mean by "I'd like all of the scopes to be
cached".

-- 
Regards,
Shalin Shekhar Mangar.


Re: No-op query for :q parameter?

2009-09-22 Thread Mat Brown
Thanks, Shalin. The "*:*" sounds good - so that'll definitely have no
effect on query performance?

What I meant was, I'd like all of the queries that I'm using to
restrict search results to be cached (as filter queries are) - which
is why I don't have anything I'd particularly like to put into the :q
parameter.

Mat

On Tue, Sep 22, 2009 at 15:00, Shalin Shekhar Mangar
 wrote:
> On Wed, Sep 23, 2009 at 12:19 AM, Mat Brown  wrote:
>
>>
>> If I have a set of filter queries that I'd like to apply but nothing
>> that I particularly would like to put into the :q parameter (since I'd
>> like all of the scopes to be cached), is there any problem with just
>> passing "[* TO *]" for the :q param? Any performance implications?
>>
>
> You can use q=*:* to match all documents. [* TO *] will work too but it is
> applied on the default search field and I'm not sure of its performance
> characteristics.
>
> I'm not sure about what you mean by "I'd like all of the scopes to be
> cached".
>
> --
> Regards,
> Shalin Shekhar Mangar.
>


Re: No-op query for :q parameter?

2009-09-22 Thread Shalin Shekhar Mangar
On Wed, Sep 23, 2009 at 12:33 AM, Mat Brown  wrote:

> Thanks, Shalin. The "*:*" sounds good - so that'll definitely have no
> effect on query performance?
>
>
All query results are added to the query result cache so any cost is
one-time only (until a commit happens or eviction happens). In any case, if
you do not have anything to put in the query field, you have no choice :)


> What I meant was, I'd like all of the queries that I'm using to
> restrict search results to be cached (as filter queries are) - which
> is why I don't have anything I'd particularly like to put into the :q
> parameter.
>
>
OK, thanks for clearing that up. Note that filters and queries are cached
separately. A good article on this is on the wiki:

http://wiki.apache.org/solr/FilterQueryGuidance
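
For example, in a request shaped like the one below (field names are made
up), each fq clause is cached individually in the filter cache, while the
q=*:* result goes into the query result cache:

http://localhost:8983/solr/select?q=*:*&fq=type:book&fq=inStock:true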

-- 
Regards,
Shalin Shekhar Mangar.


Re: No-op query for :q parameter?

2009-09-22 Thread Mat Brown
Hey Shalin,

Thanks for the help. The particular attraction of filter queries is
that they are cached separately, and our application takes advantage
of that fact, since we often employ several filters in one search -
while the combinations of filters are numerous, the individual filters
comprise a small enough set that they can be effectively cached.

We actually do have a choice about whether to put something in :q - I
could always just arbitrarily pick one filter and put it in the :q
parameter instead of an :fq parameter. That's a good point about the
query result cache and '*:*' though - thanks.

Mat

On Tue, Sep 22, 2009 at 15:12, Shalin Shekhar Mangar
 wrote:
> On Wed, Sep 23, 2009 at 12:33 AM, Mat Brown  wrote:
>
>> Thanks, Shalin. The "*:*" sounds good - so that'll definitely have no
>> effect on query performance?
>>
>>
> All query results are added to the query result cache so any cost is
> one-time only (until a commit happens or eviction happens). In any case, if
> you do not have anything to put in the query field, you have no choice :)
>
>
>> What I meant was, I'd like all of the queries that I'm using to
>> restrict search results to be cached (as filter queries are) - which
>> is why I don't have anything I'd particularly like to put into the :q
>> parameter.
>>
>>
> OK, thanks for clearing that up. Note that filters and queries are cached
> separately. A good article on this is on the wiki:
>
> http://wiki.apache.org/solr/FilterQueryGuidance
>
> --
> Regards,
> Shalin Shekhar Mangar.
>


RE: No-op query for :q parameter?

2009-09-22 Thread Fuad Efendi
> is there any problem with just
> passing "[* TO *]" for the :q param? Any performance implications?


Only if you are using faceting on a field with high cardinality (such as a
tokenized, multivalued field).
Additional parameters: how many docs do you retrieve in a single query (100?
1? ...)? Lazy field loading? Sorting? etc.




Parallel requests to Tomcat

2009-09-22 Thread Michael
Hi,
I have a Solr+Tomcat installation on an 8 CPU Linux box, and I just tried
sending parallel requests to it and measuring response time.  I would expect
that it could handle up to 8 parallel requests without significant slowdown
of any individual request.

Instead, I found that Tomcat is serializing the requests.

For example, the response time for each of 2 parallel requests is nearly 2
times that for a single request, and the time for each of 8 parallel
requests is about 4 times that of a single request.

I am pretty sure this is a Tomcat issue, for when I started 8 identical
instances of Solr+Tomcat on the machine (on 8 different ports), I could send
one request to each in parallel with only a 20% slowdown (compared to 300%
in a single Tomcat.)

I'm using the stock Tomcat download with minimal configuration changes,
except that I disabled all logging (in case the logger was blocking for each
request, serializing them.)  I'm giving 2G RAM to each JVM.

Does anyone more familiar with Tomcat know what's wrong?  I can't imagine
that Tomcat really can't handle parallel requests.
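
For reference, the request thread pool is set on the connector in Tomcat's
conf/server.xml; in a stock download it looks roughly like this (attribute
values are illustrative and vary by Tomcat version):

<Connector port="8080" protocol="HTTP/1.1"
           maxThreads="200"
           acceptCount="100"
           connectionTimeout="20000"/>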


Re: Batching requests using SolrCell with SolrJ

2009-09-22 Thread Grant Ingersoll


On Sep 19, 2009, at 1:22 PM, Jay Hill wrote:


> When working with SolrJ I have typically batched a Collection of
> SolrInputDocument objects before sending them to the Solr server. I'm
> working with the latest nightly build and using the ExtractingRequestHandler
> to index documents, and everything is working fine. Except I haven't been
> able to figure out how to batch documents when also including literals.
>
> Here's what I've got:
>
> //Looping over a List of Files
>   ContentStreamUpdateRequest req = new
>       ContentStreamUpdateRequest("/update/extract");
>   req.addFile(fileToIndex);
>   req.setParam("literal.id", fileToIndex.getCanonicalPath());
>
>   try {
>     getSolrServer().request(req);
>   } catch (SolrServerException e) {
>     e.printStackTrace();
>   }
>
> Which works great, except that each document processed in the loop is
> sending a separate request. Previously I built a collection of SolrInput
> docs and had SolrJ send them in batches of 100 or whatever.
>
> It seems like I could batch documents by continuing to add them to the
> request (req.addFile(eachFileUpToACount)), but the literals seem to present
> a problem. By sending one at a time, the contents and the literals all wind
> up in the same document. But in a batch there will just be an array of
> params for literal.id (in this example) not matched to the contents.



It might be nice to be able to specify literals on a per-stream-name
basis, such as literal.site_pdf.id=site_pdf, but there isn't currently
support for this.  Then, you could combine that with the
ContentStreamUpdateRequest to do what is needed, I believe.


-Grant


A little discovery about the solr classpath and jetty

2009-09-22 Thread Benson Margulies
On (at least) two occasions, I've opened JIRAs due to my getting tangled up
with eclipse, jetty, and solr/lib.

Well, it occurs to me that a recent idea might be of general use to others
in this regard.

This fragment is offered for illustration. The idea here is that you can
configure the jetty WebAppClassLoader to add an additional lib directory
and/or directory of open class files, such as a directory that Eclipse is
writing into. This is 'as good as' solr/lib, and allows seamless Eclipse
development.

import java.io.File;

// Jetty 6 packages; adjust for other Jetty versions.
import org.mortbay.jetty.Server;
import org.mortbay.jetty.webapp.WebAppClassLoader;
import org.mortbay.jetty.webapp.WebAppContext;
import org.mortbay.resource.Resource;

// Point the webapp context at the exploded webapp source directory.
File sourceDirFile = new File(webapp.getWebappSourceDirectory());
WebAppContext wac = new WebAppContext(sourceDirFile.getCanonicalPath(),
        webapp.getContextPath());

// Give the webapp a classloader that also sees an extra lib directory
// and/or directories of loose class files (e.g. Eclipse's output folder).
WebAppClassLoader loader = new WebAppClassLoader(wac);
if (webapp.getLibDirectory() != null) {
    // Add every jar found in the extra lib directory (like solr/lib).
    Resource r = Resource.newResource(webapp.getLibDirectory());
    loader.addJars(r);
}
if (webapp.getClasspathEntries() != null) {
    // Add each directory of compiled classes to the webapp classpath.
    for (String dir : webapp.getClasspathEntries()) {
        loader.addClassPath(dir);
    }
}
wac.setClassLoader(loader);
server.addHandler(wac);


Re: Parallel requests to Tomcat

2009-09-22 Thread Yonik Seeley
What version of Solr are you using?
Solr 1.3 and Lucene 2.4 defaulted to an index reader implementation
that had to synchronize, so search operations that are IO "heavy"
can't proceed in parallel.  You shouldn't see this with 1.4.

-Yonik
http://www.lucidimagination.com



On Tue, Sep 22, 2009 at 4:03 PM, Michael  wrote:
> Hi,
> I have a Solr+Tomcat installation on an 8 CPU Linux box, and I just tried
> sending parallel requests to it and measuring response time.  I would expect
> that it could handle up to 8 parallel requests without significant slowdown
> of any individual request.
>
> Instead, I found that Tomcat is serializing the requests.
>
> For example, the response time for each of 2 parallel requests is nearly 2
> times that for a single request, and the time for each of 8 parallel
> requests is about 4 times that of a single request.
>
> I am pretty sure this is a Tomcat issue, for when I started 8 identical
> instances of Solr+Tomcat on the machine (on 8 different ports), I could send
> one request to each in parallel with only a 20% slowdown (compared to 300%
> in a single Tomcat.)
>
> I'm using the stock Tomcat download with minimal configuration changes,
> except that I disabled all logging (in case the logger was blocking for each
> request, serializing them.)  I'm giving 2G RAM to each JVM.
>
> Does anyone more familiar with Tomcat know what's wrong?  I can't imagine
> that Tomcat really can't handle parallel requests.
>


returning stored fields

2009-09-22 Thread Eric Lease Morgan


Is there any way to configure in solrconfig.xml (or anywhere else) what
fields to return by default?

I am indexing sets of full-text books. My fields include metadata
(author, title, publisher, etc.) as well as the full text of the book.
Since I want to enable highlighting against the full text, I need to
store the full-text field, but I don't want to return it in queries.

Is it all or nothing? Is the only way to specify what fields get
returned by default through the fl parameter?


--
Eric Lease Morgan



Re: returning stored fields

2009-09-22 Thread Mark A. Matienzo
Hi Eric,

On Tue, Sep 22, 2009 at 8:41 PM, Eric Lease Morgan
 wrote:
>
> Is there any way to configure in solrconf.xml (or anywhere else) what fields
> to return by default?

Yes - in one of the requestHandler sections of solrconfig.xml, you can
specify defaults for specific query parameters. For example, you could
modify the configuration for the stock search handler for the type of data
you mentioned as follows:

  <requestHandler name="standard" class="solr.SearchHandler" default="true">
    <lst name="defaults">
      <str name="fl">author,title,publisher,date,subject</str>
    </lst>
  </requestHandler>

Mark A. Matienzo
Applications Developer, Digital Experience Group
The New York Public Library


Re: returning stored fields [resolved]

2009-09-22 Thread Eric Lease Morgan


On Sep 22, 2009, at 8:51 PM, Mark A. Matienzo wrote:

> > Is there any way to configure in solrconfig.xml (or anywhere else)
> > what fields to return by default?
>
> Yes - in one of the requestHandler sections of solrconfig.xml, you can
> specify defaults for specific query parameters. For example, you could
> modify the configuration for the stock search handler for the type of data
> you mentioned as follows:
>
>   <requestHandler name="standard" class="solr.SearchHandler" default="true">
>     <lst name="defaults">
>       <str name="fl">author,title,publisher,date,subject</str>
>     </lst>
>   </requestHandler>
>
> --
> Mark A. Matienzo
> Applications Developer, Digital Experience Group
> The New York Public Library



Resolved. Tastes great; less filling, and it is a small world. Thanks!

--
Eric Morgan



Re: Code sync between Lucene and Solr, crossing Apache project boundaries, etc.

2009-09-22 Thread Grant Ingersoll


On Sep 22, 2009, at 1:54 PM, Mark Bennett wrote:

> To do any serious Solr debugging (or filter development) you also need the
> Solr source code tree.  And you'd like them to be in sync, so that the
> Lucene code you see is exactly the same as what was used for the Solr
> version you're working with.
>
> I did find this link on sync'ing the two source trees, for a specific Solr
> version.  But this seems a bit convoluted; is this really the right answer?
> http://happygiraffe.net/blog/2009/07/16/solrs-lucene-source/
>
> Generally:
>
> 1: Given that many threads on the Solr mailing list seem to end with "yeah,
> you could write some code to do that", it seems like most developers would
> face this problem, and yet I haven't seen it discussed much (at least not
> with the search terms I'm using, I keep finding stuff about SolrJ or just
> getting the Solr code to build)
>
> 2: Do most people pick a version of Solr, then try carefully to sync it with
> the exact version of Lucene code, or do they just not worry about the
> version stuff unless they get an error?


I'm probably atypical, but I check out the exact revision number of  
Lucene and associate it as source in IntelliJ.  Other times, I simply  
rely on the nightly Maven artifacts to get me all the correct versions  
of Lucene and Solr.




> 3: Or do most Solr/Lucene developers just live out of the Apache nightly
> snapshots for both source code trees?  And if so, I have a question about
> even doing that.  (see below)
>
> 4: Getting both projects to live together under Eclipse has seemed a bit
> awkward, though again I'd think it's a common task.  Any good links on that?
>
> 5: So do most of you:
> a: Bother with both code trees?  (Solr and Lucene)


Yes


> b: Live from the command line with ant?  or


Yes

> c: Get them both living under Eclipse?  With dependencies back and forth?
> I haven't found a good resource for this yet.
> d: Do you also think maven, git and clover are required?


No, they are not, but I sometimes find it easier to get up and running  
in both Solr and Lucene by using the Maven artifacts.




> 6: Since advanced developers using Nutch and Solr would typically want
> Lucene's source code as well, wouldn't it be good to have a separate
> distribution that includes all of those?  Or perhaps there's some "apache
> thing" that makes this so trivial it's not worth bothering with?


Not sure, there hasn't been too much cross fertilization between the  
projects.




> Back to question 3, just using the nightly trees for both Solr and Lucene,
> which presumably are in sync.  If you don't need a specific release, this
> might be a reasonable workaround, but still a few of the details bother me.
>
> Get the code:
> svn co http://svn.apache.org/repos/asf/lucene/java/trunk lucene-nightly
> Checked out revision 817722.
> svn co http://svn.apache.org/repos/asf/lucene/solr/trunk solr-nightly
> Checked out revision 817722.
>
> * Note that both are revision 817722




I don't think this is correct.  For the Lucene version, you need to
get the exact rev that Solr is expecting.  This can be seen on the
Admin info page, by looking in the MANIFEST.MF in the jars, or in the
CHANGES.txt file.



> Now do builds:
> cd lucene-nightly / ant / cd ..
> cd solr-nightly / ant / cd ..
>
> * OK, fine, no clover, I guess I'll live...
>
> Now check the two Lucene core jar files:
> proj $ ls -l lucene-nightly/build/lucene-core-2.9.jar
> -rw-r--r--  1 xyz  staff  1104037 Sep 22 09:44
> lucene-nightly/build/lucene-core-2.9.jar
> proj $ ls -l solr-nightly/lib/lucene-core-2.9.0.jar
> -rw-r--r--  1 xyz  staff  1104049 Sep 22 09:43
> solr-nightly/lib/lucene-core-2.9.0.jar
>
> * Notice the sizes.  They're pretty close, but not identical.  The jar in
> lucene-nightly was created by my tools, whereas the lucene jar in Solr was
> created on another machine.
>
> To prove this, use Subversion's status command:
> proj $ svn status lucene-nightly/build/lucene-core-2.9.jar
> svn: warning: 'lucene-nightly/build/lucene-core-2.9.jar' is not a working
> copy
> So... not under version control, because it's built when I run ant.
> proj $ svn status solr-nightly/lib/lucene-core-2.9.0.jar
> (no output)
> No complaints from Subversion, so we know that it IS under version control
> and therefore came from Apache, and I haven't messed with it.
>
> So I'd chalk up the difference exclusively to that then?
>
> If folks work with both code trees a lot, maybe having a parent build file
> could copy over the fresh Lucene jar over to Solr.  Also curious if there's
> an automated way to get this working in Eclipse.


I have a little script to copy over the correct things from the Lucene  
build somewhere in JIRA, but it's not really a big deal to do it by  
hand.


--
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
using Solr/Lucene:

http://www.lucidimagination.com/search



Re: solr caching problem

2009-09-22 Thread satya
First of all, thanks a lot for the clarification. Is there any way to see
how this cache is working internally, what objects are being stored, and
how much memory it is consuming, so that we can get a clear picture in
mind? And how to test the performance through the cache?
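
(One place to look: the cache sections on Solr's admin statistics page show
lookups, hits, hit ratio, inserts and evictions for each cache, e.g.
http://localhost:8080/solr/admin/stats.jsp in a default Tomcat setup like
the one above.)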

On Tue, Sep 22, 2009 at 11:19 PM, Fuad Efendi  wrote:

> > 1) Then do you mean, if we delete a particular doc, then that is going to
> > be deleted from the cache also?
>
> When you delete a document, and then COMMIT your changes, new caches will be
> warmed up (and prepopulated by some key-value pairs from the old instances),
> etc:
>
> <documentCache
>   class="solr.LRUCache"
>   size="512"
>   initialSize="512"
>   autowarmCount="0"/>
>
> - this one won't be 'prepopulated'.
>
> > 2) In Solr, is the cache storing the entire document in memory, or only
> > references to documents in memory?
>
> There are many different cache instances; DocumentCache should store
> <docId, Document> pairs, etc.
>
>
>


How to configure Solr 1.3 on Websphere 6.1

2009-09-22 Thread adnanqureshi

Hi all,

Solr 1.3
Websphere 6.1

I have been looking for some documentation on how to configure Solr on
WebSphere, but no luck yet.
Can someone suggest a document which gives an overview of how to get
started with Solr on WebSphere and how to integrate it with websites
(Java/JSP)?

I have been trying to deploy Solr on WebSphere, but no luck yet.
I was trying to deploy the war file under the "dist" folder, but I kept
getting errors (the most recent one is that it couldn't find the
configuration file). When I deploy this war file, it only creates an admin
folder, a root index file and a WEB-INF folder; it doesn't have any
configuration file in it.

My understanding is that the war file under the dist folder is the Solr
server, and also what I'll need to integrate my website with Solr search.

The biggest challenge for me is that I don't have much experience with
Websphere.
-- 
View this message in context: 
http://www.nabble.com/How-to-configure-Solr-1.3-on-Websphere-6.1-tp25530967p25530967.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Solr with Auto-suggest

2009-09-22 Thread dharhsana

Hi Ryan,

I have gone through your post
https://issues.apache.org/jira/browse/SOLR-357

where you mention a prefix filter. Can you tell me how to use that
patch? You mentioned to use the code as below:

<fieldType name="..." class="solr.TextField" ...>
  <analyzer>
    ...
    <filter class="solr.EdgeNGramFilterFactory" ... />
    ...
  </analyzer>
</fieldType>

...

<field name="..." type="..." ... />

...

<copyField source="name" dest="..." />
For using the above code, is it EdgeNGramFilterFactory or
PrefixingFilterFactory that you are using?

Or does the above code work for EdgeNGramFilterFactory? I am not clear about
it. Without using the PrefixingFilterFactory patch, can I write the
above code?

And next: is "name" in the copyField of "text" type or "string" type?


waiting for your reply,

Regards,

Rekha

-- 
View this message in context: 
http://www.nabble.com/Solr-with-Auto-suggest-tp16880894p25530993.html
Sent from the Solr - User mailing list archive at Nabble.com.



Highlighting not working on a prefix_token field

2009-09-22 Thread Avlesh Singh
I have a "prefix_token" field defined as underneath in my schema.xml













Searches on the field work fine and as expected.
However, attempts to highlight on this field do not yield any results.
Highlighting on other fields works fine.

Any clues? I am using Solr 1.3

Cheers
Avlesh


Re: Highlighting not working on a prefix_token field

2009-09-22 Thread Shalin Shekhar Mangar
On Wed, Sep 23, 2009 at 12:23 PM, Avlesh Singh  wrote:

> I have a "prefix_token" field defined as underneath in my schema.xml
>
>  positionIncrementGap="1">
>
>
>
> maxGramSize="20"/>
>
>
>
>
>
> 
>
> Searches on the field work fine and as expected.
> However, attempts to highlight on this field does not yield any results.
> Highlighting on other fields work fine.
>


Won't work until SOLR-1268 comes along.

http://www.lucidimagination.com/search/document/4da480fe3eb0e7e4/highlighting_in_stemmed_or_n_grammed_fields_possible

-- 
Regards,
Shalin Shekhar Mangar.