Solr1.3 / MySql / Tomcat55 multiple delta-import inside a big full-import

2008-10-31 Thread sunnyfr

Hi,

I would like to know whether it is much slower to run a limited full-import plus
multiple delta-imports in order to index the whole database.
If I fire a full-import without a LIMIT of 4M in my request, it will run me OOM
because I have 8.5M documents.
If I fire a full-import without a limit and with batchSize=-1, I will hog the
database just for myself and stall other requests for 10 hours, but it will work.

Do you have any advice?
Thanks a lot
-- 
View this message in context: 
http://www.nabble.com/Solr1.3---MySql---Tomcat55--multiple-delta-import-inside-a-big-full-import-tp20262801p20262801.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Performance Lucene / Solr

2008-10-31 Thread Kraus, Ralf | pixelhouse GmbH

Hey,

I think it will have the disadvantage of being a lot slower though...

How were you handling things with Lucene? You must have used Java 
then? If you even want to get close to that performance I think you 
need to use non http embedded solr.

I am using this :

- I wrote a Java JSP page that gets an EmbeddedSolrServer
- Now I call this JSP page from my PHP script and the JSP makes my
search request to SOLR

- after that I generate a CSV file out of the JSP and read it from PHP

It's the same way I did it with the prior LUCENE engine I used.
But now the performance is only 10% of the prior LUCENE speed :-(

Greets -Ralf-


Re: DataImportHandler running out of memory

2008-10-31 Thread sunnyfr

Hi Grant,

How did you finally manage it?
I have the same problem with less data (8.5M documents): if I set batchSize to -1, I
slow down the database a lot, which is not good for the website, and requests pile up.
What did you do?

Thanks,


Grant Ingersoll-6 wrote:
> 
> I think it's a bit different.  I ran into this exact problem about two
> weeks ago on a 13 million record DB.  MySQL doesn't honor the fetch
> size for its v5 JDBC driver.
> 
> See
> http://www.databasesandlife.com/reading-row-by-row-into-java-from-mysql/ 
>   or do a search for MySQL fetch size.
> 
> You actually have to do setFetchSize(Integer.MIN_VALUE) (-1 doesn't  
> work) in order to get streaming in MySQL.
> 
> -Grant
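
(For reference, Grant's streaming tip boils down to something like the minimal
JDBC sketch below; the connection URL, credentials and query are placeholders.)

    import java.sql.*;

    public class StreamingQueryDemo {
        public static void main(String[] args) throws Exception {
            Class.forName("com.mysql.jdbc.Driver");
            // Placeholder connection details -- adjust for your own database.
            Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/mydb", "user", "pass");
            Statement stmt = conn.createStatement(
                ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY);
            // Integer.MIN_VALUE (not -1) tells Connector/J to stream rows
            // one at a time instead of buffering the whole result set.
            stmt.setFetchSize(Integer.MIN_VALUE);
            ResultSet rs = stmt.executeQuery("SELECT id, title FROM documents");
            while (rs.next()) {
                // hand rs.getString("title") etc. off to the indexer here
            }
            rs.close();
            stmt.close();
            conn.close();
        }
    }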
> 
> 
> On Jun 24, 2008, at 10:35 PM, Shalin Shekhar Mangar wrote:
> 
>> Setting the batchSize to 1 would mean that the Jdbc driver will  
>> keep
>> 1 rows in memory *for each entity* which uses that data source (if
>> correctly implemented by the driver). Not sure how well the Sql Server
>> driver implements this. Also keep in mind that Solr also needs  
>> memory to
>> index documents. You can probably try setting the batch size to a  
>> lower
>> value.
>>
>> The regular memory tuning stuff should apply here too -- try disabling
>> autoCommit and turn-off autowarming and see if it helps.
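
(In solrconfig.xml terms, the tuning Shalin mentions looks roughly like the
sketch below; the cache entries exist in the stock example config, the sizes
are only illustrative.)

    <!-- solrconfig.xml: keep autoCommit absent/commented out during bulk imports -->
    <updateHandler class="solr.DirectUpdateHandler2">
      <!-- <autoCommit> ... </autoCommit> -->
    </updateHandler>

    <!-- autowarmCount="0" turns off cache autowarming -->
    <query>
      <filterCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="0"/>
      <queryResultCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="0"/>
    </query>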
>>
>> On Wed, Jun 25, 2008 at 5:53 AM, wojtekpia <[EMAIL PROTECTED]>  
>> wrote:
>>
>>>
>>> I'm trying to load ~10 million records into Solr using the
>>> DataImportHandler.
>>> I'm running out of memory (java.lang.OutOfMemoryError: Java heap  
>>> space) as
>>> soon as I try loading more than about 5 million records.
>>>
>>> Here's my configuration:
>>> I'm connecting to a SQL Server database using the sqljdbc driver.  
>>> I've
>>> given
>>> my Solr instance 1.5 GB of memory. I have set the dataSource  
>>> batchSize to
>>> 1. My SQL query is "select top XXX field1, ... from table1". I  
>>> have
>>> about 40 fields in my Solr schema.
>>>
>>> I thought the DataImportHandler would stream data from the DB  
>>> rather than
>>> loading it all into memory at once. Is that not the case? Any  
>>> thoughts on
>>> how to get around this (aside from getting a machine with more  
>>> memory)?
>>>
>>> --
>>> View this message in context:
>>> http://www.nabble.com/DataImportHandler-running-out-of-memory-tp18102644p18102644.html
>>> Sent from the Solr - User mailing list archive at Nabble.com.
>>>
>>>
>>
>>
>> -- 
>> Regards,
>> Shalin Shekhar Mangar.
> 
> --
> Grant Ingersoll
> http://www.lucidimagination.com
> 
> Lucene Helpful Hints:
> http://wiki.apache.org/lucene-java/BasicsOfPerformance
> http://wiki.apache.org/lucene-java/LuceneFAQ
> 
> 
> 
> 
> 
> 
> 
> 
> 

-- 
View this message in context: 
http://www.nabble.com/DataImportHandler-running-out-of-memory-tp18102644p20263146.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Changing mergeFactor in mid-stream?

2008-10-31 Thread Mark Miller

Otis Gospodnetic wrote:

Yes, you can change the mergeFactor.  More important than the mergeFactor is 
this:

<ramBufferSizeMB>32</ramBufferSizeMB>

Pump it up as much as your hardware/JVM allows.  And use appropriate -Xmx, of 
course.
  
Is that true? I thought there was a sweet spot for the RAM buffer (and 
not as high as you'd think)? You might want to test that out a bit before
riding it too high...




Re: Performance Lucene / Solr

2008-10-31 Thread Kraus, Ralf | pixelhouse GmbH

Hi,

Thx a lot for the tip !

But when I try it I get:

> HTTP/1.1 500 null java.lang.NullPointerException at 
org.apache.solr.common.util.StrUtils.splitSmart(StrUtils.java:37)


My Request is :
INFO: [core_de] webapp=/solr path=/select/ 
params={wt=phps&query=Tools&records=30&start_record=0} status=500 QTime=1


Exception in SOLR:
SCHWERWIEGEND: java.lang.NullPointerException
   at org.apache.solr.common.util.StrUtils.splitSmart(StrUtils.java:37)
   at 
org.apache.solr.search.OldLuceneQParser.parse(LuceneQParserPlugin.java:104)

   at org.apache.solr.search.QParser.getQuery(QParser.java:88)
   at 
org.apache.solr.handler.component.QueryComponent.prepare(QueryComponent.java:82)
   at 
org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:148)
   at 
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)

   at org.apache.solr.core.SolrCore.execute(SolrCore.java:1204)
   at 
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:303)
   at 
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:232)
   at 
org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:202)
   at 
org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:173)
   at 
org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:213)
   at 
org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:178)
   at 
org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:126)
   at 
org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:105)
   at 
org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:107)
   at 
org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:148)
   at 
org.apache.coyote.http11.Http11AprProcessor.process(Http11AprProcessor.java:833)
   at 
org.apache.coyote.http11.Http11AprProtocol$Http11ConnectionHandler.process(Http11AprProtocol.java:639)
   at 
org.apache.tomcat.util.net.AprEndpoint$Worker.run(AprEndpoint.java:1285)

   at java.lang.Thread.run(Thread.java:595)

Greets -Ralf-


Re: Performance Lucene / Solr

2008-10-31 Thread Shalin Shekhar Mangar
On Fri, Oct 31, 2008 at 5:10 PM, Kraus, Ralf | pixelhouse GmbH <
[EMAIL PROTECTED]> wrote:

>
> > HTTP/1.1 500 null java.lang.NullPointerException at
> org.apache.solr.common.util.StrUtils.splitSmart(StrUtils.java:37)
>
> My Request is :
> INFO: [core_de] webapp=/solr path=/select/
> params={wt=phps&query=Tools&records=30&start_record=0} status=500 QTime=1
>
>
The parameter name should be "q" instead of "query".

-- 
Regards,
Shalin Shekhar Mangar.


Re: Using Solrj

2008-10-31 Thread Shalin Shekhar Mangar
On Fri, Oct 31, 2008 at 4:32 PM, Raghunandan Rao <
[EMAIL PROTECTED]> wrote:

> I am doing that but the API is in experimental stage. Not sure whether to use it
> or not. BTW, can you also let me know how clustering works on Windows OS, because
> I saw clustering scripts for Unix bundled with the Solr release.
>
>
Which API is in experimental stage?

By clustering, I think you mean replication. Until the last release (1.3),
we had support only for *nix platforms for replication. In the next release
we have a Java based replication coming which works for Windows also. If you
want, you can try it with one of the nightly (un-released) builds.

http://wiki.apache.org/solr/SolrReplication
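
(A rough sketch of the Java-based replication config described on that wiki
page -- the host name, conf files and poll interval below are placeholders:)

    <!-- on the master, in solrconfig.xml -->
    <requestHandler name="/replication" class="solr.ReplicationHandler">
      <lst name="master">
        <str name="replicateAfter">commit</str>
        <str name="confFiles">schema.xml,stopwords.txt</str>
      </lst>
    </requestHandler>

    <!-- on each slave -->
    <requestHandler name="/replication" class="solr.ReplicationHandler">
      <lst name="slave">
        <str name="masterUrl">http://master-host:8983/solr/replication</str>
        <str name="pollInterval">00:00:60</str>
      </lst>
    </requestHandler>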

-- 
Regards,
Shalin Shekhar Mangar.


RE: Using Solrj

2008-10-31 Thread Raghunandan Rao
Thank you. 
I was talking about DataImportHandler API. 

-Original Message-
From: Shalin Shekhar Mangar [mailto:[EMAIL PROTECTED] 
Sent: Friday, October 31, 2008 5:19 PM
To: solr-user@lucene.apache.org
Subject: Re: Using Solrj

On Fri, Oct 31, 2008 at 4:32 PM, Raghunandan Rao <
[EMAIL PROTECTED]> wrote:

> I am doing that but the API is in experimental stage. Not sure whether to use
> it or not. BTW, can you also let me know how clustering works on Windows OS,
> because I saw clustering scripts for Unix bundled with the Solr release.
>
>
Which API is in experimental stage?

By clustering, I think you mean replication. Until the last release
(1.3),
we had support only for *nix platforms for replication. In the next
release
we have a Java based replication coming which works for Windows also. If
you
want, you can try it with one of the nightly (un-released) builds.

http://wiki.apache.org/solr/SolrReplication

-- 
Regards,
Shalin Shekhar Mangar.


Re: Solr1.3 / MySql / Tomcat55 multiple delta-import inside a big full-import

2008-10-31 Thread sunnyfr

Sorry, I wasn't clear.
The pile-up is not on the Solr database or index queries; the queued requests are on
our main MySQL database.
When I do a full-import to create the Solr index, MySQL honors it and won't be driven
OOM, but with batchSize=-1 it uses MySQL memory, which leaves less memory for the rest
of the requests on the database, like insert, update, delete ...

:) thanks for your answer,


Shalin Shekhar Mangar wrote:
> 
> On Fri, Oct 31, 2008 at 3:27 PM, sunnyfr <[EMAIL PROTECTED]> wrote:
> 
>>
>> I would like to know whether it is much slower to run a limited full-import
>> plus multiple delta-imports in order to index the whole database.
>> If I fire a full-import without a LIMIT of 4M in my request, it will run me
>> OOM
>> because I have 8.5M documents.
>> If I fire a full-import without a limit and with batchSize=-1, I will hog the
>> database just for myself and stall other requests for 10 hours, but it will
>> work.
>>
> 
> A full-import should not block other requests, though response time will be
> higher because of the heavy processing.
> 
> Most users have a dedicated Master instance used only for indexing and
> many
> slaves dedicated for serving search requests. Maybe you can try a
> master-slave architecture.
> 
> -- 
> Regards,
> Shalin Shekhar Mangar.
> 
> 

-- 
View this message in context: 
http://www.nabble.com/Solr1.3---MySql---Tomcat55--multiple-delta-import-inside-a-big-full-import-tp20262801p20264431.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Performance Lucene / Solr

2008-10-31 Thread Erik Hatcher


On Oct 31, 2008, at 6:14 AM, Kraus, Ralf | pixelhouse GmbH wrote:

Hey,

I think it will have the disadvantage of being a lot slower though...

How were you handling things with Lucene? You must have used Java  
then? If you even want to get close to that performance I think you  
need to use non http embedded solr.

I am using this :

- I wrote a Java JSP page that gets an EmbeddedSolrServer
- Now I call this JSP page from my PHP script and the JSP makes my
search request to SOLR

- after that I generate a CSV file out of the JSP and read it from PHP

It's the same way I did it with the prior LUCENE engine I used.
But now the performance is only 10% of the prior LUCENE speed :-(


No need to involve JSP at all to get Solr results in PHP.

Rather than those hoops, simply use one of the PHP response writers.   
Look in the example Solr config, uncomment this:


  class="org.apache.solr.request.PHPSerializedResponseWriter"/>


Then in PHP, hit Solr directly like this:

$response = unserialize(file_get_contents($url));

Where $url is something like http://localhost:8983/solr/select?q=*:*
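
(A slightly fuller sketch of that PHP side -- the host, core and query below
are hypothetical:)

    <?php
    // wt=phps asks Solr for PHP-serialized output
    $url = 'http://localhost:8983/solr/select?q=Tools&start=0&rows=30&wt=phps';
    $response = unserialize(file_get_contents($url));
    foreach ($response['response']['docs'] as $doc) {
        echo $doc['id'], "\n";
    }
    ?>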

  Erik




Re: Using Solrj

2008-10-31 Thread Shalin Shekhar Mangar
On Fri, Oct 31, 2008 at 5:21 PM, Raghunandan Rao <
[EMAIL PROTECTED]> wrote:

> Thank you.
> I was talking about DataImportHandler API.
>
>
Most likely, you will not need to use the API. DataImportHandler will let
you index your database without writing code -- you just need an XML
configuration file. Only if you need to do some custom tasks will you need to
touch the API.
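
(A minimal data-config.xml sketch of that no-code approach -- the JDBC URL,
table and column names are hypothetical:)

    <dataConfig>
      <dataSource driver="com.mysql.jdbc.Driver"
                  url="jdbc:mysql://localhost/mydb" user="user" password="pass"/>
      <document>
        <entity name="item" query="SELECT id, title, description FROM item">
          <field column="id" name="id"/>
          <field column="title" name="title"/>
          <field column="description" name="description"/>
        </entity>
      </document>
    </dataConfig>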

The API is marked as experimental just because it is new and we're not sure
of the use-cases that might come up in the near future and therefore we'd
like the freedom of modifying it. As always, we strive to maintain
backwards-compatibility as much as possible. I know that DataImportHandler
is being used in production in a few high traffic websites.

-- 
Regards,
Shalin Shekhar Mangar.


RE: Using Solrj

2008-10-31 Thread Raghunandan Rao
I am doing that but the API is in experimental stage. Not sure whether to use it or
not. BTW, can you also let me know how clustering works on Windows OS, because I saw
clustering scripts for Unix bundled with the Solr release.

-Original Message-
From: Noble Paul നോബിള്‍ नोब्ळ् [mailto:[EMAIL PROTECTED] 
Sent: Friday, October 31, 2008 11:37 AM
To: solr-user@lucene.apache.org
Subject: Re: Using Solrj

First of all you need to index your data in Solr. I suggest
DataImportHandler because it can help you join multiple tables and
index data

On Fri, Oct 31, 2008 at 10:20 AM, Raghunandan Rao
<[EMAIL PROTECTED]> wrote:
> Thank you so much.
>
> Here goes my Use case:
>
> I need to search the database for a collection of input parameters, which touches
> 'n' tables. The data is very large. The search query itself is very dynamic. I use
> a lot of views for the same search. How do I make use of Solr in this case?
>
> -Original Message-
> From: Erick Erickson [mailto:[EMAIL PROTECTED]
> Sent: Thursday, October 30, 2008 7:01 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Using Solrj
>
> Generally, you need to get your head out of the database world and into
> the search world to be successful with Lucene. For instance, one
> of the cardinal tenets of database design is to normalize your
> data. It goes against every instinct to *denormalize* your data when
> creating a Lucene index explicitly so you do NOT have to think
> in terms of joins or sub-queries. Whenever I start thinking this
> way, I try to back up and think again.
>
> Both your posts indicate to me that you're thinking in database
> terms. There are no views in Lucene, for instance. You refer
> to tables. There are no tables in Lucene, there are only documents
> with various numbers of fields. You could conceivably make your index
> look like a database by creatively naming your document fields. But
> that doesn't play to the strengths of Lucene *or* the database.
>
> In fact, there is NO requirement that documents have the *same* fields.
> Which is really difficult to get into when thinking like a DBA.
>
> Lucene is designed to search text. Fast and well. It is NOT intended to
> efficiently manipulate relationships *between* documents. There
> are various hybrid solutions that people have used. That is, put the
> data you really need to do text searching on in a Lucene index,
> along with enough data to be able to get the *rest* of what you need
> from your database. But it all depends upon the problem you're trying to
> solve.
>
> But as Noble says, all this is too general to be really useful, you need
> to provide quite a bit more detail about the problem you're trying to
> solve to get useful recommendations.
>
> Best
> Erick
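
(A small SolrJ sketch of that denormalized-document idea -- the URL, field
names and values below are made up:)

    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class DenormalizedIndexer {
        public static void main(String[] args) throws Exception {
            SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
            SolrInputDocument doc = new SolrInputDocument();
            // One flat document per employee: department and manager flag are
            // copied onto the document, so no join is needed at query time.
            doc.addField("id", "emp-42");
            doc.addField("name", "Jane Smith");
            doc.addField("department", "Finance");
            doc.addField("is_manager", true);
            server.add(doc);
            server.commit();
        }
    }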
>
> On Thu, Oct 30, 2008 at 8:50 AM, Raghunandan Rao <
> [EMAIL PROTECTED]> wrote:
>
>> Thanks Noble.
>>
>> So you mean to say that I need to create a view according to my query and
>> then index on the view and fetch?
>>
>> -Original Message-
>> From: Noble Paul നോബിള്‍ नोब्ळ् [mailto:[EMAIL PROTECTED]
>> Sent: Thursday, October 30, 2008 6:16 PM
>> To: solr-user@lucene.apache.org
>> Subject: Re: Using Solrj
>>
>> hi ,
>> There are two sides to this .
>> 1. indexing (getting data into Solr) SolrJ or DataImportHandler can be
>> used for this
>> 2.querying . getting data out of solr. Here you do not have the choice
>> of joining multiple tables. There only one index for Solr
>>
>>
>>
>> On Thu, Oct 30, 2008 at 5:34 PM, Raghunandan Rao
>> <[EMAIL PROTECTED]> wrote:
>> > Hi,
>> >
>> > I am trying to use Solrj for my web application. I am indexing a table
>> > using the @Field annotation tag. Now I need to index or query multiple
>> > tables. Like, get all the employees who are managers in Finance
>> > department (interacting with 3 entities). How do I do that?
>> >
>> >
>> >
>> > Does anyone have any idea?
>> >
>> >
>> >
>> > Thanks
>> >
>> >
>>
>>
>>
>> --
>> --Noble Paul
>>
>



-- 
--Noble Paul


Re: Performance Lucene / Solr

2008-10-31 Thread Erik Hatcher


On Oct 31, 2008, at 7:42 AM, Shalin Shekhar Mangar wrote:


On Fri, Oct 31, 2008 at 5:10 PM, Kraus, Ralf | pixelhouse GmbH <
[EMAIL PROTECTED]> wrote:




HTTP/1.1 500 null java.lang.NullPointerException at

org.apache.solr.common.util.StrUtils.splitSmart(StrUtils.java:37)

My Request is :
INFO: [core_de] webapp=/solr path=/select/
params={wt=phps&query=Tools&records=30&start_record=0} status=500  
QTime=1




The parameter name should be "q" instead of "query".


And rows instead of records, and start instead of start_record.  :)
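
(So the corrected request, assuming the same setup, would look something like
the following:)

    http://localhost:8983/solr/select?q=Tools&start=0&rows=30&wt=phps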

Erik



Re: Solr1.3 / MySql / Tomcat55 multiple delta-import inside a big full-import

2008-10-31 Thread Shalin Shekhar Mangar
On Fri, Oct 31, 2008 at 3:27 PM, sunnyfr <[EMAIL PROTECTED]> wrote:

>
> I would like to know if it's very longer to make a limited full import and
> multi delta-import to index all the database.
> If I fire a full-import without limit 4M in my request that will run me OOM
> because I've 8,5M of document.
> If I fire a full-import without limit and a batchsize=-1 I will stock the
> database just for me, and stack other request for 10hours, but it will
> work.
>

A full-import should not stack other requests though response time will be
more because of the heavy processing.

Most users have a dedicated Master instance used only for indexing and many
slaves dedicated for serving search requests. Maybe you can try a
master-slave architecture.

-- 
Regards,
Shalin Shekhar Mangar.


Re: Performance Lucene / Solr

2008-10-31 Thread Kraus, Ralf | pixelhouse GmbH

Hi,


And rows instead of records, and start instead of start_record.  :)

Erik



You're my man :-)

Greets -Ralf-


Re: Performance Lucene / Solr

2008-10-31 Thread Kraus, Ralf | pixelhouse GmbH

Hi,

  class="org.apache.solr.request.PHPSerializedResponseWriter"/>


Then in PHP, hit Solr directly like this:

$response = unserialize(file_get_contents($url));

Where $url is something like http://localhost:8983/solr/select?q=*:*

Now SOLR is 2 times faster than LUCENE => Strike!
Hello weekend, here I come :-)

Greets -Ralf-


Re: DataImportHandler running out of memory

2008-10-31 Thread Noble Paul നോബിള്‍ नोब्ळ्
I've moved the FAQ to a new Page
http://wiki.apache.org/solr/DataImportHandlerFaq
The DIH page is too big and editing has become harder

On Thu, Jun 26, 2008 at 6:07 PM, Shalin Shekhar Mangar
<[EMAIL PROTECTED]> wrote:
> I've added a FAQ section to DataImportHandler wiki page which captures
> question on out of memory exception with both MySQL and MS SQL Server
> drivers.
>
> http://wiki.apache.org/solr/DataImportHandler#faq
>
> On Thu, Jun 26, 2008 at 9:36 AM, Noble Paul നോബിള്‍ नोब्ळ्
> <[EMAIL PROTECTED]> wrote:
>> We must document this information in the wiki.  We never had a chance
>> to play w/ ms sql server
>> --Noble
>>
>> On Thu, Jun 26, 2008 at 12:38 AM, wojtekpia <[EMAIL PROTECTED]> wrote:
>>>
>>> It looks like that was the problem. With responseBuffering=adaptive, I'm 
>>> able
>>> to load all my data using the sqljdbc driver.
>>> --
>>> View this message in context: 
>>> http://www.nabble.com/DataImportHandler-running-out-of-memory-tp18102644p18119732.html
>>> Sent from the Solr - User mailing list archive at Nabble.com.
>>>
>>>
>>
>>
>>
>> --
>> --Noble Paul
>>
>
>
>
> --
> Regards,
> Shalin Shekhar Mangar.
>



-- 
--Noble Paul


Re: DIH and rss feeds

2008-10-31 Thread Jon Baer
Is that right?  I find the wording of "clean" a little confusing.  I
would have thought this is what I had needed earlier, but the topic
came up regarding the fact that you cannot deleteByQuery for an
entity you want to flush with delta-import.


I just noticed that the original JIRA request says it was implemented  
recently ...


https://issues.apache.org/jira/browse/SOLR-801

I'm assuming this means your war needs to come from a trunk copy?  Does
this patch affect that param at all?


- Jon

On Oct 31, 2008, at 2:05 AM, Noble Paul നോബിള്‍  
नोब्ळ् wrote:



run full-import with clean=false

for full-import clean is set to true by default and for delta-import
clean is false by default.

On Fri, Oct 31, 2008 at 9:16 AM, Lance Norskog <[EMAIL PROTECTED]>  
wrote:
I have a DataImportHandler configured to index from an RSS feed. It is a
"latest stuff" feed. It reads the feed and indexes the 100 documents
harvested from the feed. So far, works great.

Now: a few hours later there are a different 100 "latest" documents. How do
I add those to the index so I will have 200 documents?  'full-import' throws
away the first 100. 'delta-import' is not implemented. What is the special
trick here?  I'm using the Solr-1.3.0 release.

Thanks,

Lance Norskog





--
--Noble Paul




What are the ways to update / delete Solr data?

2008-10-31 Thread Vincent Pérès

Hello,

I'm trying to find the best way to update / delete data according to my
project (developed with JavaScript and Rails).
I would like to do something like this:
http://localhost:8983/solr/update/?q=id:1&rate=4
and
http://localhost:8983/solr/delete/?q=id:1
Is that possible?

But I found only these two ways :
http://localhost:8983/solr/update/csv?commit=true&separator=%09&escape=\&stream.file=/tmp/result.text
or using an XML file like this:
<delete><query>load_id:20070424150841</query></delete>

The last possibility is to use the solr-ruby library.

Is there any way I have forgotten?

Thanks,
Vincent
-- 
View this message in context: 
http://www.nabble.com/What-are-the-way-to-update---delete-solr-datas--tp20268507p20268507.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: DIH and rss feeds

2008-10-31 Thread Shalin Shekhar Mangar
The "clean" parameter is there in the 1.3 release. The full-import is by
definition "full" so we delete all existing documents at the start. If you
don't want to clean the index, you can pass clean=false and DIH will just
add them.
On Fri, Oct 31, 2008 at 8:58 PM, Jon Baer <[EMAIL PROTECTED]> wrote:

> Is that right?  I find the wording of "clean" a little confusing.  I would
> have thought this is what I had needed earlier but the topic came up
> regarding the fact that you can not deleteByQuery for an entity you want to
> flush w/ delta-import.
>
> I just noticed that the original JIRA request says it was implemented
> recently ...
>
> https://issues.apache.org/jira/browse/SOLR-801
>
> Im assuming this means your war needs to come from trunk copy?  Does this
> patch affect that param @ all?
>
> - Jon
>
>
> On Oct 31, 2008, at 2:05 AM, Noble Paul നോബിള്‍ नोब्ळ् wrote:
>
>  run full-import with clean=false
>>
>> for full-import clean is set to true by default and for delta-import
>> clean is false by default.
>>
>> On Fri, Oct 31, 2008 at 9:16 AM, Lance Norskog <[EMAIL PROTECTED]> wrote:
>>
>>> I have a DataImportHandler configured to index from an RSS feed. It is a
>>> "latest stuff" feed. It reads the feed and indexes the 100 documents
>>> harvested from the feed. So far, works great.
>>>
>>> Now: a few hours later there are a different 100 "latest" documents. How
>>> do
>>> I add those to the index so I will have 200 documents?  'full-import'
>>> throws
>>> away the first 100. 'delta-import' is not implemented. What is the
>>> special
>>> trick here?  I'm using the Solr-1.3.0 release.
>>>
>>> Thanks,
>>>
>>> Lance Norskog
>>>
>>>
>>
>>
>> --
>> --Noble Paul
>>
>
>


-- 
Regards,
Shalin Shekhar Mangar.


Re: What are the ways to update / delete Solr data?

2008-10-31 Thread Erik Hatcher


On Oct 31, 2008, at 11:40 AM, Vincent Pérès wrote:

The last possibility is to use the solr-ruby library.


If you're using Ruby, that's what I'd use.  Were your other proposals  
to still do those calls from Ruby, but with the HTTP library directly?


Erik



Re: What are the ways to update / delete Solr data?

2008-10-31 Thread Vincent Pérès

Thanks for your quick answer.

I'm using only HTTP to display my results; that's why I would like to
continue this way.
If I can use HTTP instead of solr-ruby, that will be better for me.
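
(For what it's worth, a plain-HTTP sketch against the update handler would look
something like the commands below; the id and field values are made up, and
since a single field can't be updated in place you re-post the whole document:)

    # delete by query (or by id) -- POST XML to the update handler
    curl http://localhost:8983/solr/update -H 'Content-Type: text/xml' \
         --data-binary '<delete><query>id:1</query></delete>'

    # "update" = re-add the full document with the new field value
    curl http://localhost:8983/solr/update -H 'Content-Type: text/xml' \
         --data-binary '<add><doc><field name="id">1</field><field name="rate">4</field></doc></add>'

    # make the changes visible
    curl http://localhost:8983/solr/update -H 'Content-Type: text/xml' \
         --data-binary '<commit/>'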





Erik Hatcher wrote:
> 
> 
> On Oct 31, 2008, at 11:40 AM, Vincent Pérès wrote:
>> The last possibility is to use the solr-ruby library.
> 
> If you're using Ruby, that's what I'd use.  Were your other proposals  
> to still do those calls from Ruby, but with the HTTP library directly?
> 
>   Erik
> 
> 
> 

-- 
View this message in context: 
http://www.nabble.com/What-are-the-way-to-update---delete-solr-datas--tp20268507p20268773.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: date range query performance

2008-10-31 Thread Chris Hostetter
: Concrete example, this query just took 18s:
: 
:   instance:client\-csm.symplicity.com AND dt:[2008-10-01T04:00:00Z TO
: 2008-10-30T03:59:59Z] AND label_facet:"Added to Position"

: I saw a thread from Apr 2008 which explains the problem being due to too much
: precision on the DateField type, and the range expansion leading to far too
: many elements being checked.  Proposed solution appears to be a hack where you
: index date fields as strings and hacking together date functions to generate
: proper queries/format results.

for the record, you don't need to index as a "StrField" to get this
benefit; you can still index using DateField, you just need to round your
dates to some less granular level .. if you always want to round down, you
don't even need to do the rounding yourself, just add "/SECOND"
or "/MINUTE" or "/HOUR" to each of your dates before sending them to solr.
(SOLR-741 proposes adding a config option to DateField to let this be done
server side)

your example query seems to be happy with hour resolution, but in theory
if sometimes you needed hour resolution when doing "big ranges" but more
precise resolution when doing "small ranges", you could have
a "coarse" date field that you round to an hour, and redundantly index
the same data in a "fine" date field with minute or second resolution.


Also: if you frequently reuse the same ranges over and over (ie: you have
a form widget people pick from, so on any given day there are only N
discrete ranges being used), putting them in an "fq" param will let them be
cached separately from the main query string
(instance:client\-csm.symplicity.com), so different searches using the same
date ranges will be faster.  ditto for your label_facet:"Added to
Position" clause.

-Hoss



Re: date range query performance

2008-10-31 Thread Alok Dhir
We have implemented the suggested reduction in granularity by dropping
time altogether and simply disallowing time filtering.  This, in light
of other search filters we have provided, should prove sufficient
for our user base.


We did keep the fine granularity field not for filtering, but for  
sorting.  We definitely need the log entries to be presented in  
chronological order, so the finer resolution date field is useful for  
that at least.


Thanks for the detailed response.

Alok

On Oct 31, 2008, at 2:16 PM, Chris Hostetter wrote:


: Concrete example, this query just took 18s:
:
:   instance:client\-csm.symplicity.com AND dt:[2008-10-01T04:00:00Z TO
: 2008-10-30T03:59:59Z] AND label_facet:"Added to Position"

: I saw a thread from Apr 2008 which explains the problem being due to too much
: precision on the DateField type, and the range expansion leading to far too
: many elements being checked.  Proposed solution appears to be a hack where you
: index date fields as strings and hacking together date functions to generate
: proper queries/format results.

for the record, you don't need to index as a "StrField" to get this
benefit; you can still index using DateField, you just need to round your
dates to some less granular level .. if you always want to round down, you
don't even need to do the rounding yourself, just add "/SECOND"
or "/MINUTE" or "/HOUR" to each of your dates before sending them to solr.
(SOLR-741 proposes adding a config option to DateField to let this be done
server side)

your example query seems to be happy with hour resolution, but in theory
if sometimes you needed hour resolution when doing "big ranges" but more
precise resolution when doing "small ranges", you could have a "coarse"
date field that you round to an hour, and redundantly index the same data
in a "fine" date field with minute or second resolution.

Also: if you frequently reuse the same ranges over and over (ie: you have
a form widget people pick from, so on any given day there are only N
discrete ranges being used), putting them in an "fq" param will let them be
cached separately from the main query string
(instance:client\-csm.symplicity.com), so different searches using the same
date ranges will be faster.  ditto for your label_facet:"Added to
Position" clause.

-Hoss





Re: corrupt solr index on ec2

2008-10-31 Thread Michael McCandless


Bill Graham wrote:


Then it seemed to run well for about an hour and I saw this:

Oct 28, 2008 10:38:51 PM org.apache.solr.update.DirectUpdateHandler2  
commit

INFO: start commit(optimize=false,waitFlush=true,waitSearcher=true)
Oct 28, 2008 10:38:51 PM org.apache.solr.common.SolrException log
SEVERE: java.lang.RuntimeException: after flush: fdx size mismatch:  
1156 docs vs 0 length in bytes of _2rv.fdx
   at org.apache.lucene.index.StoredFieldsWriter.closeDocStore(StoredFieldsWriter.java:94)
   at org.apache.lucene.index.DocFieldConsumers.closeDocStore(DocFieldConsumers.java:83)
   at org.apache.lucene.index.DocFieldProcessor.closeDocStore(DocFieldProcessor.java:47)
   at org.apache.lucene.index.DocumentsWriter.closeDocStore(DocumentsWriter.java:367)
   at org.apache.lucene.index.IndexWriter.flushDocStores(IndexWriter.java:1774)


This particular exception is very spooky -- it really looks like  
something is removing the index files (such as accidentally opening a  
2nd writer on the index).


Mike


TermVectorComponent for tag generation?

2008-10-31 Thread Jon Baer

Hi,

So I'm looking to either use this or build a component which might do
what I'm looking for.  I'd like to figure out if it's possible to use a
single doc to get tag generation based on the matches within that
document, for example:


1 News Doc -> contains 5 Players and 8 Teams (show them as possible  
tags for this article)


In this case Players and Teams are also docs.  It's almost like I want  
to use MoreLikeThis w/ a different filter query than what Im using.


Is there any easy hack to get this going?

Thanks.

- Jon 


Re: TermVectorComponent for tag generation?

2008-10-31 Thread Grant Ingersoll

Hey Jon,

Not following how the TVC (TermVectorComp) would help here. I
suppose you could use the "most important" terms, as defined by TF- 
IDF, as suggested tags.  The MLT (MoreLikeThis) uses this to generate  
query terms.


However, I'm not following the different filter query piece.  Can you  
provide a bit more details?


One thing you did make me think, though, is it might be interesting to  
extend TermVectorMapper so that it can output a NamedList and then  
allow people to implement their own SolrTermVectorMapper and have it  
customize the TV output...


Thanks,
Grant

On Oct 31, 2008, at 5:20 PM, Jon Baer wrote:


Hi,

So Im looking to either use this or build a component which might do  
what Im looking for.  Id like to figure out if its possible use a  
single doc to get tag generation based on the matches within that  
document for example:


1 News Doc -> contains 5 Players and 8 Teams (show them as possible  
tags for this article)


In this case Players and Teams are also docs.  It's almost like I  
want to use MoreLikeThis w/ a different filter query than what Im  
using.


Is there any easy hack to get this going?

Thanks.

- Jon


--
Grant Ingersoll
Lucene Boot Camp Training Nov. 3-4, 2008, ApacheCon US New Orleans.
http://www.lucenebootcamp.com


Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ











Re: TermVectorComponent for tag generation?

2008-10-31 Thread Jon Baer

Well for example in any given text (which is field on a document);

"While suitable for any application which requires full text indexing  
and searching capability, Lucene has been widely recognized for its  
utility in the implementation of Internet search engines and local,  
single-site searching.


At the core of Lucene's logical architecture is the idea of a document  
containing fields of text. This flexibility allows Lucene's API to be  
independent of file format. Text from PDFs, HTML, Microsoft Word  
documents, as well as many others can all be indexed so long as their  
textual information can be extracted."


I'd like to be able to say the tags for this article should be [Lucene,
PDF, HTML, Microsoft Word] because they are in field values from other  
documents.  Basically how to generate tags from just a single document  
based on other document field values.


- Jon


On Oct 31, 2008, at 6:17 PM, Grant Ingersoll wrote:


Hey Jon,

Not following how the TVC (TermVectorComp) would help here.I  
suppose you could use the "most important" terms, as defined by TF- 
IDF, as suggested tags.  The MLT (MoreLikeThis) uses this to  
generate query terms.


However, I'm not following the different filter query piece.  Can  
you provide a bit more details?


One thing you did make me think, though, is it might be interesting  
to extend TermVectorMapper so that it can output a NamedList and  
then allow people to implement their own SolrTermVectorMapper and  
have it customize the TV output...


Thanks,
Grant

On Oct 31, 2008, at 5:20 PM, Jon Baer wrote:


Hi,

So Im looking to either use this or build a component which might  
do what Im looking for.  Id like to figure out if its possible use  
a single doc to get tag generation based on the matches within that  
document for example:


1 News Doc -> contains 5 Players and 8 Teams (show them as possible  
tags for this article)


In this case Players and Teams are also docs.  It's almost like I  
want to use MoreLikeThis w/ a different filter query than what Im  
using.


Is there any easy hack to get this going?

Thanks.

- Jon


--
Grant Ingersoll
Lucene Boot Camp Training Nov. 3-4, 2008, ApacheCon US New Orleans.
http://www.lucenebootcamp.com


Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ













RE: DIH and rss feeds

2008-10-31 Thread Lance Norskog
Thanks all. I knew there had to be something :) Perhaps I should read the 
complete wiki page over and over again some more. It is a complex tool.

Lance 

-Original Message-
From: Shalin Shekhar Mangar [mailto:[EMAIL PROTECTED] 
Sent: Friday, October 31, 2008 8:42 AM
To: solr-user@lucene.apache.org
Subject: Re: DIH and rss feeds

The "clean" parameter is there in the 1.3 release. The full-import is by 
definition "full" so we delete all existing documents at the start. If you 
don't want to clean the index, you can pass clean=false and DIH will just add 
them.
On Fri, Oct 31, 2008 at 8:58 PM, Jon Baer <[EMAIL PROTECTED]> wrote:

> Is that right?  I find the wording of "clean" a little confusing.  I 
> would have thought this is what I had needed earlier but the topic 
> came up regarding the fact that you can not deleteByQuery for an 
> entity you want to flush w/ delta-import.
>
> I just noticed that the original JIRA request says it was implemented 
> recently ...
>
> https://issues.apache.org/jira/browse/SOLR-801
>
> Im assuming this means your war needs to come from trunk copy?  Does 
> this patch affect that param @ all?
>
> - Jon
>
>
> On Oct 31, 2008, at 2:05 AM, Noble Paul നോബിള്‍ नोब्ळ् wrote:
>
>  run full-import with clean=false
>>
>> for full-import clean is set to true by default and for delta-import 
>> clean is false by default.
>>
>> On Fri, Oct 31, 2008 at 9:16 AM, Lance Norskog <[EMAIL PROTECTED]> wrote:
>>
>>> I have a DataImportHandler configured to index from an RSS feed. It 
>>> is a "latest stuff" feed. It reads the feed and indexes the 100 
>>> documents harvested from the feed. So far, works great.
>>>
>>> Now: a few hours later there are a different 100 "latest"
>>> documents. How do I add those to the index so I will have 200 
>>> documents?  'full-import'
>>> throws
>>> away the first 100. 'delta-import' is not implemented. What is the 
>>> special trick here?  I'm using the Solr-1.3.0 release.
>>>
>>> Thanks,
>>>
>>> Lance Norskog
>>>
>>>
>>
>>
>> --
>> --Noble Paul
>>
>
>


--
Regards,
Shalin Shekhar Mangar.



DIH Http input bug - problem with two-level RSS walker

2008-10-31 Thread Lance Norskog
I wrote a nested HttpDataSource RSS poller. The outer loop reads an rss feed
which contains N links to other rss feeds. The nested loop then reads each
one of those to create documents. (Yes, this is an obnoxious thing to do.)
Let's say the outer RSS feed gives 10 items. Both feeds use the same
structure: /rss/channel with a <title> node and then N <item> nodes inside
the channel. This should create two separate XML streams with two separate
Xpath iterators, right?
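
(The config itself was stripped by the archive; a hypothetical sketch of such a
nested setup, with placeholder names and URLs, would look like:)

    <dataConfig>
      <dataSource type="HttpDataSource"/>
      <document>
        <entity name="outer" pk="link" url="http://example.com/outer.rss"
                processor="XPathEntityProcessor" forEach="/rss/channel/item">
          <field column="link" xpath="/rss/channel/item/link"/>
          <entity name="inner" pk="title" url="${outer.link}"
                  processor="XPathEntityProcessor" forEach="/rss/channel/item">
            <field column="title" xpath="/rss/channel/item/title"/>
          </entity>
        </entity>
      </document>
    </dataConfig>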










This does indeed walk each url from the outer feed and then fetch the inner
rss feed. Bravo! 

However, I found two separate problems in xpath iteration. They may be
related. The first problem is that it only stores the first document from
each "inner" feed. Each feed has several documents with different title
fields but it only grabs the first.

The other is an off-by-one bug. The outer loop iterates through the 10 items
and then tries to pull an 11th.  It then gives this exception trace:

INFO: Created URL to:  [inner url]
Oct 31, 2008 11:21:20 PM org.apache.solr.handler.dataimport.HttpDataSource getData
SEVERE: Exception thrown while getting data
java.net.MalformedURLException: no protocol: null/account.rss
        at java.net.URL.<init>(URL.java:567)
        at java.net.URL.<init>(URL.java:464)
        at java.net.URL.<init>(URL.java:413)
        at org.apache.solr.handler.dataimport.HttpDataSource.getData(HttpDataSource.java:90)
        at org.apache.solr.handler.dataimport.HttpDataSource.getData(HttpDataSource.java:47)
        at org.apache.solr.handler.dataimport.DebugLogger$2.getData(DebugLogger.java:183)
        at org.apache.solr.handler.dataimport.XPathEntityProcessor.initQuery(XPathEntityProcessor.java:210)
        at org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow(XPathEntityProcessor.java:180)
        at org.apache.solr.handler.dataimport.XPathEntityProcessor.nextRow(XPathEntityProcessor.java:160)
        at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:285)
  ...
Oct 31, 2008 11:21:20 PM org.apache.solr.handler.dataimport.DocBuilder buildDocument
SEVERE: Exception while processing: album document :
SolrInputDocumnt[{name=name(1.0)={Groups of stuff}}]
org.apache.solr.handler.dataimport.DataImportHandlerException: Exception in
invoking url null Processing Document # 11
        at org.apache.solr.handler.dataimport.HttpDataSource.getData(HttpDataSource.java:115)
        at org.apache.solr.handler.dataimport.HttpDataSource.getData(HttpDataSource.java:47)







Re: date range query performance

2008-10-31 Thread Michael Lackhoff
On 31.10.2008 19:16 Chris Hostetter wrote:

> for the record, you don't need to index as a "StrField" to get this
> benefit; you can still index using DateField, you just need to round your
> dates to some less granular level .. if you always want to round down, you
> don't even need to do the rounding yourself, just add "/SECOND" 
> or "/MINUTE" or "/HOUR" to each of your dates before sending them to solr.  
> (SOLR-741 proposes adding a config option to DateField to let this be done 
> server side)

Is this also possible for the timestamp that is automatically added to
all new/updated docs? I would like to be able to search (quickly) for
everything that was added within the last week or month or whatever. And
because I update the index only once a day, a granularity of /DAY (if that
exists) would be fine.

- Michael


Re: date range query performance

2008-10-31 Thread Erik Hatcher


On Nov 1, 2008, at 1:07 AM, Michael Lackhoff wrote:


On 31.10.2008 19:16 Chris Hostetter wrote:


for the record, you don't need to index as a "StrField" to get this
benefit; you can still index using DateField, you just need to round your
dates to some less granular level .. if you always want to round down, you
don't even need to do the rounding yourself, just add "/SECOND"
or "/MINUTE" or "/HOUR" to each of your dates before sending them to solr.
(SOLR-741 proposes adding a config option to DateField to let this be done
server side)


Is this also possible for the timestamp that is automatically added to
all new/updated docs? I would like to be able to search (quickly) for
everything that was added within the last week or month or whatever. And
because I update the index only once a day, a granularity of /DAY (if that
exists) would be fine.


Yeah, this should work fine:

   default="NOW/DAY" multiValued="false"/>


Erik



Re: DIH Http input bug - problem with two-level RSS walker

2008-10-31 Thread Shalin Shekhar Mangar
On Sat, Nov 1, 2008 at 10:30 AM, Lance Norskog <[EMAIL PROTECTED]> wrote:

> I wrote a nested HttpDataSource RSS poller. The outer loop reads an rss
> feed
> which contains N links to other rss feeds. The nested loop then reads each
> one of those to create documents. (Yes, this is an obnoxious thing to do.)
> Let's say the outer RSS feed gives 10 items. Both feeds use the same
> structure: /rss/channel with a <title> node and then N <item> nodes inside
> the channel. This should create two separate XML streams with two separate
> Xpath iterators, right?
>
> 
>
>
>
>
>
>
> 
>
> This does indeed walk each url from the outer feed and then fetch the inner
> rss feed. Bravo!
>
> However, I found two separate problems in xpath iteration. They may be
> related. The first problem is that it only stores the first document from
> each "inner" feed. Each feed has several documents with different title
> fields but it only grabs the first.
>

The idea behind nested entities is to join them together so that one Solr
document is created for each root entity and the child entities provide more
fields which are added to the parent document.

I guess you want to create separate Solr documents from the root entity as
well as the child entities. I don't think that is possible with nested
entities. Essentially, you are trying to crawl feeds, not join them.

Probably an integration with Apache Droids can be thought about.
http://incubator.apache.org/projects/droids.html
http://people.apache.org/~thorsten/droids/

If you are going to crawl only one level, there may be a workaround.
However, it may be easier to implement all this with your own Java program
and just post results to Solr as usual.



> The other is an off-by-one bug. The outer loop iterates through the 10
> items
> and then tries to pull an 11th.  It then gives this exception trace:
>
> INFO: Created URL to:  [inner url]
> Oct 31, 2008 11:21:20 PM org.apache.solr.handler.dataimport.HttpDataSource getData
> SEVERE: Exception thrown while getting data
> java.net.MalformedURLException: no protocol: null/account.rss
>         at java.net.URL.<init>(URL.java:567)
>         at java.net.URL.<init>(URL.java:464)
>         at java.net.URL.<init>(URL.java:413)
>         at org.apache.solr.handler.dataimport.HttpDataSource.getData(HttpDataSource.java:90)
>         at org.apache.solr.handler.dataimport.HttpDataSource.getData(HttpDataSource.java:47)
>         at org.apache.solr.handler.dataimport.DebugLogger$2.getData(DebugLogger.java:183)
>         at org.apache.solr.handler.dataimport.XPathEntityProcessor.initQuery(XPathEntityProcessor.java:210)
>         at org.apache.solr.handler.dataimport.XPathEntityProcessor.fetchNextRow(XPathEntityProcessor.java:180)
>         at org.apache.solr.handler.dataimport.XPathEntityProcessor.nextRow(XPathEntityProcessor.java:160)
>         at org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:285)
>   ...
> Oct 31, 2008 11:21:20 PM org.apache.solr.handler.dataimport.DocBuilder buildDocument
> SEVERE: Exception while processing: album document :
> SolrInputDocumnt[{name=name(1.0)={Groups of stuff}}]
> org.apache.solr.handler.dataimport.DataImportHandlerException: Exception in
> invoking url null Processing Document # 11
>         at org.apache.solr.handler.dataimport.HttpDataSource.getData(HttpDataSource.java:115)
>         at org.apache.solr.handler.dataimport.HttpDataSource.getData(HttpDataSource.java:47)
>
>
>
>
>
>


-- 
Regards,
Shalin Shekhar Mangar.


Re: date range query performance

2008-10-31 Thread Michael Lackhoff
On 01.11.2008 06:10 Erik Hatcher wrote:

> Yeah, this should work fine:
> 
>  default="NOW/DAY" multiValued="false"/>

Wow, that was fast, thanks!

-Michael


Re: TermVectorComponent for tag generation?

2008-10-31 Thread Vaijanath N. Rao

Hi Jon,

Isn't it similar to what Grant just said -- the top-most terms (after
removing the stop words)?


You would need to get how many terms there are and their related
frequencies, and any term which is beyond a certain threshold you would
mark as a member of the tag set.


One can also build a set of related entities or terms which follow
the current term, and then decide which of them can become
part of the tag set.


Is that the requirement, or am I missing something here?

-- Thanks and Regards
Vaijanath N. Rao

Jon Baer wrote:

Well for example in any given text (which is field on a document);

"While suitable for any application which requires full text indexing 
and searching capability, Lucene has been widely recognized for its 
utility in the implementation of Internet search engines and local, 
single-site searching.


At the core of Lucene's logical architecture is the idea of a document 
containing fields of text. This flexibility allows Lucene's API to be 
independent of file format. Text from PDFs, HTML, Microsoft Word 
documents, as well as many others can all be indexed so long as their 
textual information can be extracted."


Id like to be able to say the tags for this article should be [Lucene, 
PDF, HTML, Microsoft Word] because they are in field values from other 
documents.  Basically how to generate tags from just a single document 
based on other document field values.


- Jon


On Oct 31, 2008, at 6:17 PM, Grant Ingersoll wrote:


Hey Jon,

Not following how the TVC (TermVectorComp) would help here.I 
suppose you could use the "most important" terms, as defined by 
TF-IDF, as suggested tags.  The MLT (MoreLikeThis) uses this to 
generate query terms.


However, I'm not following the different filter query piece.  Can you 
provide a bit more details?


One thing you did make me think, though, is it might be interesting 
to extend TermVectorMapper so that it can output a NamedList and then 
allow people to implement their own SolrTermVectorMapper and have it 
customize the TV output...


Thanks,
Grant

On Oct 31, 2008, at 5:20 PM, Jon Baer wrote:


Hi,

So Im looking to either use this or build a component which might do 
what Im looking for.  Id like to figure out if its possible use a 
single doc to get tag generation based on the matches within that 
document for example:


1 News Doc -> contains 5 Players and 8 Teams (show them as possible 
tags for this article)


In this case Players and Teams are also docs.  It's almost like I 
want to use MoreLikeThis w/ a different filter query than what Im 
using.


Is there any easy hack to get this going?

Thanks.

- Jon


--
Grant Ingersoll
Lucene Boot Camp Training Nov. 3-4, 2008, ApacheCon US New Orleans.
http://www.lucenebootcamp.com


Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ