Re: nutch and solr

2012-02-25 Thread alessio crisantemi
This is the problem!
Because in my root there is a url!

I write you my step-by-step configuration of nutch:
(I use cygwin because I work on windows)

*1. Extract the Nutch package*

*2. Configure Solr*
*a. Copy the provided Nutch schema from directory apache-nutch-1.0/conf to
directory apache-solr-1.3.0/example/solr/conf (override the existing file).*

We want to allow Solr to create the snippets for search results, so we need to
store the content in addition to indexing it:

*b. Change schema.xml so that the stored attribute of field “content” is
true.*
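
(For reference, the changed field definition in schema.xml ends up looking
roughly like the line below - the field type and the other attributes come from
the schema shipped with Nutch, only the stored attribute changes to true:)

   <field name="content" type="text" stored="true" indexed="true"/>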


We want to be able to tweak the relevancy of queries easily, so we’ll create a
new dismax request handler configuration for our use case:

*d. Open apache-solr-1.3.0/example/solr/conf/solrconfig.xml and paste the
following fragment into it:*





<requestHandler name="/nutch" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">dismax</str>
    <str name="echoParams">explicit</str>
    <float name="tie">0.01</float>
    <str name="qf">
      content^0.5 anchor^1.0 title^1.2
    </str>
    <str name="pf">
      content^0.5 anchor^1.5 title^1.2 site^1.5
    </str>
    <str name="fl">
      url
    </str>
    <str name="mm">
      2&lt;-1 5&lt;-2 6&lt;90%
    </str>
    <int name="ps">100</int>
    <str name="q.alt">*:*</str>
    <str name="hl.fl">title url content</str>
    <str name="f.title.hl.fragsize">0</str>
    <str name="f.title.hl.alternateField">title</str>
    <str name="f.url.hl.fragsize">0</str>
    <str name="f.url.hl.alternateField">url</str>
    <str name="f.content.hl.fragmenter">regex</str>
  </lst>
</requestHandler>





*3. Start Solr*

cd apache-solr-1.3.0/example

java -jar start.jar
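
(Assuming the default example setup, Jetty listens on port 8983, so you can
check that Solr came up cleanly by opening http://localhost:8983/solr/admin/
in a browser before moving on.)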

*4. Configure Nutch*

*a. Open nutch-site.xml in directory apache-nutch-1.0/conf and replace its
contents with the following (we specify our crawler name, the active plugins,
and limit the maximum url count for a single host per run to 100):*







<?xml version="1.0"?>
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>nutch-solr-integration</value>
  </property>
  <property>
    <name>generate.max.per.host</name>
    <value>100</value>
  </property>
  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-regex|parse-html|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  </property>
</configuration>





*b. Open regex-urlfilter.txt in directory apache-nutch-1.0/conf and replace
its content with the following:*

-^(https|telnet|file|ftp|mailto):



# skip some suffixes

-\.(swf|SWF|doc|DOC|mp3|MP3|WMV|wmv|txt|TXT|rtf|RTF|avi|AVI|m3u|M3U|flv|FLV|WAV|wav|mp4|MP4|avi|AVI|rss|RSS|xml|XML|pdf|PDF|js|JS|gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$



# skip URLs containing certain characters as probable queries, etc.

-[?*!@=]



# accept urls in the google.it domain

+^http:*//([a-z0-9\-A-Z]*\.)*google.it/*



# deny anything else

-.

*5. Create a seed list (the initial urls to fetch)*

mkdir urls *(creates a folder ‘urls’)*

echo "http://www.google.it/"; > urls/seed.txt

*6. Inject seed url(s) to nutch crawldb (execute in nutch directory)*

bin/nutch inject crawl/crawldb urls
AND HERE IS THE ERROR MESSAGE about the empty path. Why, in your opinion?
thank you
alessio

On 24 February 2012 at 17:51, tamanjit.bin...@yahoo.co.in <
tamanjit.bin...@yahoo.co.in> wrote:

> The empty path message is because nutch is unable to find a url in the url
> location that you provide.
>
> Kindly ensure there is a url there.
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/nutch-and-solr-tp3765166p3773089.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: SIREn integration with SOLR

2012-02-25 Thread Anuj Kumar
Hi Chitra,

You can download the distribution using the details given here-
http://siren.sindice.com/download.html
The license has been changed to AGPL 3.0.

Source code is available here- https://github.com/rdelbru/SIREn/

- Anuj

On Wed, Feb 22, 2012 at 3:45 PM, chitra  wrote:

> Hi,
>
>   We would like to implement semantic search on our websites. We
> already have a full-text search service using SOLR. We heard that the SIREn
> plug-in for SOLR makes it possible to index & query semi-structured
> data.
>
> Could any one of you provide me with more details about SIREn, its integration
> with SOLR, and how to use it with PHP?
>
> Thanks in advance...
>
> Regards
> Chitra
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/SIREn-integration-with-SOLR-tp3766056p3766056.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


upgrading Solr - org.apache.lucene.search.Filter and acceptDocs

2012-02-25 Thread Jamie Johnson
I'm trying to upgrade an application I have from an old snapshot of
solr to the latest stable trunk and see that the constructor for
Filter has changed, specifically there is another parameter named
acceptDocs, the API says the following

acceptDocs - Bits that represent the allowable docs to match
(typically deleted docs but possibly filtering other documents)

but I'm not sure what specifically this means to my filter.  How
should this be used when trying to upgrade a filter?


Re: upgrading Solr - org.apache.lucene.search.Filter and acceptDocs

2012-02-25 Thread Yonik Seeley
On Sat, Feb 25, 2012 at 3:16 PM, Jamie Johnson  wrote:
> I'm trying to upgrade an application I have from an old snapshot of
> solr to the latest stable trunk and see that the constructor for
> Filter has changed, specifically there is another parameter named
> acceptDocs, the API says the following
>
> acceptDocs - Bits that represent the allowable docs to match
> (typically deleted docs but possibly filtering other documents)
>
> but I'm not sure what specifically this means to my filter.  How
> should this be used when trying to upgrade a filter?

If a document doesn't match acceptDocs, it should be returned by the filter.
Lucene is basically asking "what documents match your filter AND match
acceptDocs"

-Yonik
lucenerevolution.com - Lucene/Solr Open Source Search Conference.
Boston May 7-10


Re: upgrading Solr - org.apache.lucene.search.Filter and acceptDocs

2012-02-25 Thread Jamie Johnson
I am assuming you meant should not be returned right?  I basically
return a filtered doc id set and do the following


return new FilteredDocIdSet(startingFilter.getDocIdSet(readerCtx, acceptDocs)) {
@Override
public boolean match(int doc) {

//do custom stuff
}
};


does the FilteredDocIdSet give me only the ones that match, or is
there something additional I need to do beyond my custom match
logic here?  I.e. just do if(!acceptDocs.get(doc)) return false; at
the top?
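
For anyone hitting the same migration question, here is a minimal sketch of the
pattern discussed in this thread - the class name, the wrapped filter and the
custom logic are placeholders, and exact signatures moved around a bit between
trunk snapshots:

import java.io.IOException;
import org.apache.lucene.index.AtomicReaderContext;
import org.apache.lucene.search.DocIdSet;
import org.apache.lucene.search.Filter;
import org.apache.lucene.search.FilteredDocIdSet;
import org.apache.lucene.util.Bits;

public class MyCustomFilter extends Filter {      // hypothetical name
    private final Filter startingFilter;          // hypothetical wrapped filter

    public MyCustomFilter(Filter startingFilter) {
        this.startingFilter = startingFilter;
    }

    @Override
    public DocIdSet getDocIdSet(AtomicReaderContext readerCtx, final Bits acceptDocs)
            throws IOException {
        return new FilteredDocIdSet(startingFilter.getDocIdSet(readerCtx, acceptDocs)) {
            @Override
            protected boolean match(int doc) {
                // honor acceptDocs first (typically the live, undeleted docs)
                if (acceptDocs != null && !acceptDocs.get(doc)) {
                    return false;
                }
                // ... custom per-document logic goes here ...
                return true;
            }
        };
    }
}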

On Sat, Feb 25, 2012 at 3:23 PM, Yonik Seeley
 wrote:
> On Sat, Feb 25, 2012 at 3:16 PM, Jamie Johnson  wrote:
>> I'm trying to upgrade an application I have from an old snapshot of
>> solr to the latest stable trunk and see that the constructor for
>> Filter has changed, specifically there is another parameter named
>> acceptDocs, the API says the following
>>
>> acceptDocs - Bits that represent the allowable docs to match
>> (typically deleted docs but possibly filtering other documents)
>>
>> but I'm not sure what specifically this means to my filter.  How
>> should this be used when trying to upgrade a filter?
>
> If a document doesn't match acceptDocs, it should be returned by the filter.
> Lucene is basically asking "what documents match your filter AND match
> acceptDocs"
>
> -Yonik
> lucenerevolution.com - Lucene/Solr Open Source Search Conference.
> Boston May 7-10


Solr 4.0 Question

2012-02-25 Thread Jamie Johnson
I just got done reading
http://www.searchworkings.org/blog/-/blogs/uwe-says%3A-is-your-reader-atomic
and was specifically interested in the following line

"Unfortunately, Apache Solr still uses this horrible code in a lot of
places, leaving us with a major piece of work undone. Major parts of
Solr’s facetting and filter caching need to be rewritten to work per
atomic segment! For those implementing plugins or other components for
Solr, SolrIndexSearcher exposes a “atomic view” of its underlying
reader via SolrIndexSearcher.getAtomicReader()."

Can someone give more details around this?  Is there a JIRA to address
this in Solr?  I'm assuming that this is not something new, just
something that can be improved?
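
As a rough sketch of what the quoted advice means for plugin code (assuming a
SolrQueryRequest named req; the getAtomicReader name comes from the quote, and
the exact return type may differ across trunk snapshots):

SolrIndexSearcher searcher = req.getSearcher();
AtomicReader atomicView = searcher.getAtomicReader(); // "atomic view" of the whole index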


Re: upgrading Solr - org.apache.lucene.search.Filter and acceptDocs

2012-02-25 Thread Yonik Seeley
On Sat, Feb 25, 2012 at 3:37 PM, Jamie Johnson  wrote:
>  I.e. just do if(!acceptDocs.get(doc)) return false; at
> the top?

Yep, that should do it.

-Yonik
lucenerevolution.com - Lucene/Solr Open Source Search Conference.
Boston May 7-10


Re: Solr 4.0 Question

2012-02-25 Thread Yonik Seeley
On Sat, Feb 25, 2012 at 3:39 PM, Jamie Johnson  wrote:
> "Unfortunately, Apache Solr still uses this horrible code in a lot of
> places, leaving us with a major piece of work undone. Major parts of
> Solr’s facetting and filter caching need to be rewritten to work per
> atomic segment! For those implementing plugins or other components for
> Solr, SolrIndexSearcher exposes a “atomic view” of its underlying
> reader via SolrIndexSearcher.getAtomicReader()."

Some of this is just a misunderstanding, and some of it is a
difference of opinion.

Solr uses a top-level FieldCache entry for certain types of faceting,
but it's optional. Solr can also use per-segment FieldCache entries
when faceting.  The reason we haven't removed the top-level FieldCache
faceting is that it's faster unless you are doing near-realtime (NRT)
search (due to the cost of merging terms across segments).  Top level
fieldcache entries are also more memory efficient for Strings as
string values are not repeated across each segment.  The right
approach depends on the specific use-case, and Solr will continue to
strive to have faceting algorithms optimized for both NRT and non-NRT.

-Yonik
lucenerevolution.com - Lucene/Solr Open Source Search Conference.
Boston May 7-10
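
For single-valued fields, the per-segment path described above can be chosen
per request. A sketch only, assuming facet.method is the relevant knob and
using a placeholder field name:

  ...&facet=true&facet.field=category&facet.method=fcs

Leaving facet.method at its default (fc) keeps the top-level FieldCache
behaviour for string fields.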


RE: Problem with SolrCloud + Zookeeper + DataImportHandler

2012-02-25 Thread Agnieszka Kukałowicz
Hi,

As you've asked.
https://issues.apache.org/jira/browse/SOLR-3165

If you have any questions or need more details I can debug this problem
more.

Agnieszka

> -Original Message-
> From: Mark Miller [mailto:markrmil...@gmail.com]
> Sent: Friday, February 24, 2012 10:11 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Problem with SolrCloud + Zookeeper + DataImportHandler
>
> The key piece is "ZkSolrResourceLoader does not support getConfigDir()
> "
>
> Apparently DIH is doing something that requires getting the local
> config dir path - but this is on ZK in SolrCloud mode, not the local
> filesystem.
>
> Could you make a JIRA issue for this? I could look into a work around
> depending on why DIH needs to do this.
>
> - Mark
>
> On Feb 20, 2012, at 7:28 AM, Agnieszka Kukałowicz wrote:
>
> > Hi All,
> >
> > I've recently downloaded latest solr trunk to configure solrcloud
> with
> > zookeeper
> > using standard configuration from wiki:
> > http://wiki.apache.org/solr/SolrCloud.
> >
> > The problem occurred when I tried to configure DataImportHandler in
> > solrconfig.xml:
> >
> > <requestHandler name="/dataimport"
> >     class="org.apache.solr.handler.dataimport.DataImportHandler">
> >   <lst name="defaults">
> >     <str name="config">db-data-config.xml</str>
> >   </lst>
> > </requestHandler>
> >
> >
> > After starting solr with zookeeper I've got errors:
> >
> > Feb 20, 2012 11:30:12 AM org.apache.solr.common.SolrException log
> > SEVERE: null:org.apache.solr.common.SolrException
> >     at org.apache.solr.core.SolrCore.<init>(SolrCore.java:606)
> >     at org.apache.solr.core.SolrCore.<init>(SolrCore.java:490)
> >     at org.apache.solr.core.CoreContainer.create(CoreContainer.java:705)
> >     at org.apache.solr.core.CoreContainer.load(CoreContainer.java:442)
> >     at org.apache.solr.core.CoreContainer.load(CoreContainer.java:313)
> >     at org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:262)
> >     at org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:98)
> >     at org.mortbay.jetty.servlet.FilterHolder.doStart(FilterHolder.java:97)
> >     at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
> >     at org.mortbay.jetty.servlet.ServletHandler.initialize(ServletHandler.java:713)
> >     at org.mortbay.jetty.servlet.Context.startContext(Context.java:140)
> >     at org.mortbay.jetty.webapp.WebAppContext.startContext(WebAppContext.java:1282)
> >     at org.mortbay.jetty.handler.ContextHandler.doStart(ContextHandler.java:518)
> >     at org.mortbay.jetty.webapp.WebAppContext.doStart(WebAppContext.java:499)
> >     at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
> >     at org.mortbay.jetty.handler.HandlerCollection.doStart(HandlerCollection.java:152)
> >     at org.mortbay.jetty.handler.ContextHandlerCollection.doStart(ContextHandlerCollection.java:156)
> >     at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
> >     at org.mortbay.jetty.handler.HandlerCollection.doStart(HandlerCollection.java:152)
> >     at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
> >     at org.mortbay.jetty.handler.HandlerWrapper.doStart(HandlerWrapper.java:130)
> >     at org.mortbay.jetty.Server.doStart(Server.java:224)
> >     at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
> >     at org.mortbay.xml.XmlConfiguration.main(XmlConfiguration.java:985)
> >     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> >     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> >     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> >     at java.lang.reflect.Method.invoke(Method.java:597)
> >     at org.mortbay.start.Main.invokeMain(Main.java:194)
> >     at org.mortbay.start.Main.start(Main.java:534)
> >     at org.mortbay.start.Main.start(Main.java:441)
> >     at org.mortbay.start.Main.main(Main.java:119)
> > Caused by: org.apache.solr.common.SolrException: FATAL: Could not create importer. DataImporter config invalid
> >     at org.apache.solr.handler.dataimport.DataImportHandler.inform(DataImportHandler.java:120)
> >     at org.apache.solr.core.SolrResourceLoader.inform(SolrResourceLoader.java:542)
> >     at org.apache.solr.core.SolrCore.<init>(SolrCore.java:601)
> >     ... 31 more
> > Caused by: org.apache.solr.common.cloud.ZooKeeperException: ZkSolrResourceLoader does not support getConfigDir() - likely, w
> >     at org.apache.solr.cloud.ZkSolrResourceLoader.getConfigDir(ZkSolrResourceLoader.java:99)
> >     at org.apache.solr.handler.dataimport.SimplePropertiesWriter.init(SimplePrope

Re: lucene operators interfearing in edismax

2012-02-25 Thread William Bell
Please backport to 3x.

On Mon, Feb 20, 2012 at 2:22 PM, Yonik Seeley
 wrote:
> This should be fixed in trunk by LUCENE-2566
>
> QueryParser: Unary operators +,-,! will not be treated as operators if
> they are followed by whitespace.
>
> -Yonik
> lucidimagination.com
>
>
>
> On Mon, Feb 20, 2012 at 2:09 PM, jmlucjav  wrote:
>> Hi,
>>
>> I am using edismax with end user entered strings. One search was not finding
>> what appeared to be the best match. The search was:
>>
>> Sage Creek Organics - Enchanted
>>
>> If I remove the -, the doc I want is found as best score. Turns out (I
>> think) the - is the culprit as the best match has 'enchanted' and this makes
>> it 'NOT enchanted'
>>
>> Is my analysis correct? I tried looking at the debug output but saw no NOT
>> entries there...
>>
>> If so, is there a standard way (any filter) to remove lucene operators from
>> user entered queries? I thought this must be something usual.
>>
>> thanks
>> javi
>>
>> --
>> View this message in context: 
>> http://lucene.472066.n3.nabble.com/lucene-operators-interfearing-in-edismax-tp3761577p3761577.html
>> Sent from the Solr - User mailing list archive at Nabble.com.



-- 
Bill Bell
billnb...@gmail.com
cell 720-256-8076
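
On the side question in this thread - neutralizing Lucene operators in
user-entered queries - one common approach is SolrJ's ClientUtils. A sketch
only (it escapes all special characters, not just the unary operators, and the
sample string is just the one from the thread):

import org.apache.solr.client.solrj.util.ClientUtils;

String userInput = "Sage Creek Organics - Enchanted";   // raw user query
String safe = ClientUtils.escapeQueryChars(userInput);  // backslash-escapes -, +, !, etc.
// pass "safe" as q (or within a field query) so '-' is treated as a literal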


Re: TikaLanguageIdentifierUpdateProcessorFactory(since Solr3.5.0) to be used in Solr3.3.0?

2012-02-25 Thread Erick Erickson
Well, you can give it a try, I don't know if anyone's done that
before. And you're on your own, I haven't a clue what
the results would be...

Sorry I can't be more help here...
Erick

On Thu, Feb 23, 2012 at 10:44 PM, bing  wrote:
> Hi, all,
>
> I am using
> org.apache.solr.update.processor.TikaLanguageIdentifierUpdateProcessorFactory
> (since Solr3.5.0) to do language detection, and it's cool.
>
> An issue: if I deploy Solr3.3.0, is it possible to import that factory from
> Solr3.5.0 to be used in Solr3.3.0?
>
> Why I stick on Solr3.3.0 is because I am working on Dspace (discovery) to
> call solr, and for now the highest version that Solr can be upgraded to is
> 3.3.0.
>
> I would hope to do this while keeping Dspace + Solr mostly as they are. Say, import
> that factory into Solr3.3.0 - is it possible? Does anyone happen to know
> a certain way to solve this?
>
> Best Regards,
> Bing
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/TikaLanguageIdentifierUpdateProcessorFactory-since-Solr3-5-0-to-be-used-in-Solr3-3-0-tp3771620p3771620.html
> Sent from the Solr - User mailing list archive at Nabble.com.
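
For reference, in 3.5 the factory is wired into an update chain in
solrconfig.xml roughly like the sketch below (the field names are placeholders;
whether the 3.5 jars load cleanly under a 3.3 core is the open question here):

<updateRequestProcessorChain name="langid">
  <processor class="org.apache.solr.update.processor.TikaLanguageIdentifierUpdateProcessorFactory">
    <str name="langid.fl">title,text</str>
    <str name="langid.langField">language_s</str>
    <str name="langid.fallback">en</str>
  </processor>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>

The chain then has to be selected on the update request (e.g. update.chain=langid)
for it to run.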


Re: Indexing taking so much time to complete.

2012-02-25 Thread Erick Erickson
You have to tell us a lot more about what you're trying to do. I can
import 32G in about 20 minutes, so obviously you're doing
something different than I am...

Perhaps you might review:
http://wiki.apache.org/solr/UsingMailingLists

Best
Erick

On Sat, Feb 25, 2012 at 12:00 AM, Suneel  wrote:
> Hi All,
>
> I am using Apache solr 3.1 and trying to cache 50 gb of records, but it is
> taking more than 20 hours; this is very painful when updating records.
>
> 1. Is there any way to reduce the caching time, or is this time ok for 50 gb
> of records?
>
> 2. What is delta-import? Will it help me cache only updated
> records rather than caching all records?
>
>
>
> Please help me with the above questions.
>
>
> Thanks & Regards,
>
> -
> Suneel Pandey
> Sr. Software Developer
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Indexing-taking-so-much-time-to-complete-tp3774464p3774464.html
> Sent from the Solr - User mailing list archive at Nabble.com.


Re: TermsComponent show only terms that matched query?

2012-02-25 Thread Erick Erickson
Jay:

I've seen this question go 'round before, but don't remember
a satisfactory solution. Are you talking on a per-document basis
here? If so, I vaguely remember it being possible to do something
with highlighting, just counting the tags returned after highlighting.

Best
Erick

On Fri, Feb 24, 2012 at 3:31 PM, Jay Hill  wrote:
> I have a situation where I want to show the term counts as is done in the
> TermsComponent, but *only* for terms that are *matched* in a query, so I
> get something returned like this (pseudo code):
>
> q=title:(golf swing)
>
> 
> title: golf legends show how to improve your golf swing on the golf course
> ...other fields
> 
>
> 
> golf (3)
> swing (1)
> 
>
> rather than getting back all of the terms in the doc.
>
> Thanks,
> -Jay


Re: TermsComponent show only terms that matched query?

2012-02-25 Thread Lance Norskog
I think you have to walk the term positions and offsets, look in the
stored field, and find the terms that matched. Which is exactly what
highlighting does. And this will only find the actual terms in the
text, no synonyms. So if you search for Sempranillo and find
Sempranillo in some wines and Tempranillo in others, you have to know
yourself that they are synonyms.

On Sat, Feb 25, 2012 at 2:54 PM, Erick Erickson  wrote:
> Jay:
>
> I've seen this question go 'round before, but don't remember
> a satisfactory solution. Are you talking on a per-document basis
> here? If so, I vaguely remember it being possible to do something
> with highlighting, just counting the tags returned after highlighting.
>
> Best
> Erick
>
> On Fri, Feb 24, 2012 at 3:31 PM, Jay Hill  wrote:
>> I have a situation where I want to show the term counts as is done in the
>> TermsComponent, but *only* for terms that are *matched* in a query, so I
>> get something returned like this (pseudo code):
>>
>> q=title:(golf swing)
>>
>> 
>> title: golf legends show how to improve your golf swing on the golf course
>> ...other fields
>> 
>>
>> 
>> golf (3)
>> swing (1)
>> 
>>
>> rather than getting back all of the terms in the doc.
>>
>> Thanks,
>> -Jay



-- 
Lance Norskog
goks...@gmail.com


RE: Indexing taking so much time to complete.

2012-02-25 Thread Mike O'Leary
What's your secret?

OK, that question is not the kind recommended in the UsingMailingLists 
suggestions, so I will write again soon with a description of my data and what 
I am trying to do, and ask more specific questions. And I don't mean to hijack 
the thread, but I am in the same boat as the poster.

I just started working with Solr less than two months ago, and after beginning 
with a completely naïve approach to indexing database contents with 
DataImportHandler and then making small adjustments to improve performance as I 
learned about them, I have gotten some smaller datasets to import in a 
reasonable amount of time, but the 60GB data set that I will need to index for 
the project I am working on would take over three days to import using the 
configuration that I have now. Obviously you're doing something different than 
I am...

What things would you say have made the biggest improvement in indexing 
performance with the 32GB data set that you mentioned? How long do you think it 
would take to index that same data set if you used Solr more or less out of the 
box with no attempts to improve its performance?
Thanks,
Mike

-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com] 
Sent: Saturday, February 25, 2012 2:51 PM
To: solr-user@lucene.apache.org
Subject: Re: Indexing taking so much time to complete.

You have to tell us a lot more about what you're trying to do. I can import 32G 
in about 20 minutes, so obviously you're doing something different than I am...

Perhaps you might review:
http://wiki.apache.org/solr/UsingMailingLists

Best
Erick

On Sat, Feb 25, 2012 at 12:00 AM, Suneel  wrote:
> Hi All,
>
> I am using Apache solr 3.1 and trying to cache 50 gb of records, but it
> is taking more than 20 hours; this is very painful when updating records.
>
> 1. Is there any way to reduce the caching time, or is this time ok for 50
> gb of records?
>
> 2. What is delta-import? Will it help me cache only
> updated records rather than caching all records?
>
>
>
> Please help me with the above questions.
>
>
> Thanks & Regards,
>
> -
> Suneel Pandey
> Sr. Software Developer
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Indexing-taking-so-much-time-to-com
> plete-tp3774464p3774464.html Sent from the Solr - User mailing list 
> archive at Nabble.com.


Re: Indexing taking so much time to complete.

2012-02-25 Thread Erick Erickson
Right. My situation is simple: I have a 32G dump of
Wikipedia data in a big XML file. I can parse it and
dump it into a (local) Solr instance at 5-7K
records/second. But it's stupid-simple, just a few
fields and no database involved. Much of the 32G
is XML. But that serves to illustrate
that the size of the data to be imported isn't much
information to go on...

bq: 60GB data set that I will need to index for the
project I am working on would take over three days
to import using the configuration that I have now.

OK, first thing I'd do is figure out what's taking the
time. Consider switching to SolrJ for your indexing
process, it can make debugging things much
easier. Here's a blog post:
http://www.lucidimagination.com/blog/2012/02/14/indexing-with-solrj/
When you start getting to 60G of data to import,
you might want finer control over what you're
doing, better error reporting, etc. as well as
being better able to pinpoint where your problems
are.

And, you can do things like just spin through the
data-retrieval part to answer the first question you
need to answer, "what's taking the time?" Is it
fetching the data? Sending it to Solr? Do you
have Tika in here somewhere? Network latency?
If you set up the SolrJ process, you can just selectively
remove steps in the process to determine what
the bottleneck is and go from there.

Hope that helps
Erick
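
To make the SolrJ suggestion concrete, the core indexing loop is roughly the
sketch below (CommonsHttpSolrServer is the 3.x client class; the URL, field
names and the stand-in data source are placeholders):

import java.util.ArrayList;
import java.util.List;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class SimpleIndexer {
    public static void main(String[] args) throws Exception {
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
        List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
        for (int i = 0; i < 1000; i++) {          // stand-in for the real data source
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", Integer.toString(i));
            doc.addField("title", "row " + i);
            batch.add(doc);
            if (batch.size() == 500) {            // send in batches, not one doc at a time
                server.add(batch);
                batch.clear();
            }
        }
        if (!batch.isEmpty()) server.add(batch);
        server.commit();                          // commit once at the end
    }
}

Timing the data-retrieval part and the server.add() calls separately is an easy
way to see which side is the bottleneck.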


On Sat, Feb 25, 2012 at 8:55 PM, Mike O'Leary  wrote:
> What's your secret?
>
> OK, that question is not the kind recommended in the UsingMailingLists 
> suggestions, so I will write again soon with a description of my data and 
> what I am trying to do, and ask more specific questions. And I don't mean to 
> hijack the thread, but I am in the same boat as the poster.
>
> I just started working with Solr less than two months ago, and after 
> beginning with a completely naïve approach to indexing database contents with 
> DataImportHandler and then making small adjustments to improve performance as 
> I learned about them, I have gotten some smaller datasets to import in a 
> reasonable amount of time, but the 60GB data set that I will need to index 
> for the project I am working on would take over three days to import using 
> the configuration that I have now. Obviously you're doing something different 
> than I am...
>
> What things would you say have made the biggest improvement in indexing 
> performance with the 32GB data set that you mentioned? How long do you think 
> it would take to index that same data set if you used Solr more or less out 
> of the box with no attempts to improve its performance?
> Thanks,
> Mike
>
> -Original Message-
> From: Erick Erickson [mailto:erickerick...@gmail.com]
> Sent: Saturday, February 25, 2012 2:51 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Indexing taking so much time to complete.
>
> You have to tell us a lot more about what you're trying to do. I can import 
> 32G in about 20 minutes, so obviously you're doing something different than I 
> am...
>
> Perhaps you might review:
> http://wiki.apache.org/solr/UsingMailingLists
>
> Best
> Erick
>
> On Sat, Feb 25, 2012 at 12:00 AM, Suneel  wrote:
>> Hi All,
>>
>> I am using Apache solr 3.1 and trying to cache 50 gb of records, but it
>> is taking more than 20 hours; this is very painful when updating records.
>>
>> 1. Is there any way to reduce the caching time, or is this time ok for 50
>> gb of records?
>>
>> 2. What is delta-import? Will it help me cache only
>> updated records rather than caching all records?
>>
>>
>>
>> Please help me with the above questions.
>>
>>
>> Thanks & Regards,
>>
>> -
>> Suneel Pandey
>> Sr. Software Developer
>> --
>> View this message in context:
>> http://lucene.472066.n3.nabble.com/Indexing-taking-so-much-time-to-com
>> plete-tp3774464p3774464.html Sent from the Solr - User mailing list
>> archive at Nabble.com.


Re: Solr Transaction Log Question

2012-02-25 Thread Yonik Seeley
On Sat, Feb 25, 2012 at 11:30 PM, Jamie Johnson  wrote:
> How large will the transaction log grow, and how long should it be kept 
> around?

We keep around enough logs to satisfy a minimum of 100 updates
lookback.  Unneeded log files are deleted automatically.
When a hard commit is done, we create a new log file (since we know
the normal index files have been sync'd and hence we no longer need
the update log for durability).

-Yonik
lucenerevolution.com - Lucene/Solr Open Source Search Conference.
Boston May 7-10
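
Since a new log file is only started at a hard commit, the usual way to keep
the current log from growing without bound is a hard autoCommit in
solrconfig.xml. A sketch only - the interval is an arbitrary example, and
openSearcher=false keeps the hard commit from affecting search visibility:

<updateHandler class="solr.DirectUpdateHandler2">
  <updateLog>
    <str name="dir">${solr.data.dir:}</str>
  </updateLog>
  <autoCommit>
    <maxTime>60000</maxTime>            <!-- hard commit every 60s, starts a new tlog -->
    <openSearcher>false</openSearcher>
  </autoCommit>
</updateHandler>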