Hi all,
I want to ask about the best way to implement a solution for indexing a
large number of PDF documents, 10-60 MB each, with 100 to 1000 users
connected simultaneously.
I currently have one Solr 3.3.0 core and it works fine for a small number
of PDF docs, but I'm afraid about the mome
thanks Karsten
I was able to use your suggestion
--
View this message in context:
http://lucene.472066.n3.nabble.com/xpath-expression-not-working-tp3218133p3251481.html
Sent from the Solr - User mailing list archive at Nabble.com.
The issue was located in a 31 million doc index and I have already reduced it
to a reproducible 4-document index. It is stock Solr 3.3.0.
Yes, the documents are also in the wrong order relative to the field sort values.
I just added the field sort values to the email to keep it short.
I will produce a
Jame:
You control the number via settings in solrconfig.xml, so it's
up to you.
Jonathan:
Hmmm, that seems right; after all, the "deep paging" penalty is really
about keeping a large sorted array in memory, but at least you only
pay it once per 10,000 rather than 100 times (assuming page siz
Do keep in mind, though, that any documents that have not yet been replicated
will have to be added to the promoted slave. The easy way to do this
is just to re-index documents from a "safe" point. If you're using time-based
deltas, this is just some time interval "far enough in the past to gua
I don't see an easy way to do that with the standard set of
filters. You'll probably need to write something custom (note,
this is actually pretty easy). I suspect you'll
need to do something like Synonyms, where when you
get a token like #ipod, you essentially make it a synonym
for ipod and insert
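As a rough sketch of that expansion in plain Python (illustrative only, not an actual Lucene TokenFilter; the function name is made up):

```python
# Sketch of the synonym-style expansion: when a hashtag token is seen,
# also emit the bare word as an extra token at the same position
# (what Lucene calls a position increment of 0).
def expand_hashtags(tokens):
    out = []
    for pos, tok in enumerate(tokens):
        out.append((pos, tok))
        if tok.startswith("#") and len(tok) > 1:
            out.append((pos, tok[1:]))  # inject "ipod" alongside "#ipod"
    return out
```

In a real TokenFilter you'd emit the stripped token with its PositionIncrementAttribute set to 0, the same trick SynonymFilter uses, so both forms match at query time.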
You can't; sorting only works with indexed data and only
really makes sense for fields that have a single value.
Sometimes using KeywordTokenizer helps if your
field has more than one word, perhaps with
copyField.
Best
Erick
On Thu, Aug 11, 2011 at 3:34 AM, Anshum wrote:
> How can I have my s
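A minimal schema.xml sketch of the KeywordTokenizer-plus-copyField approach described above (all field and type names are made up):

```xml
<!-- Treat the whole field value as a single, lowercased, sortable token -->
<fieldType name="sortable" class="solr.TextField" sortMissingLast="true">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<field name="title" type="text" indexed="true" stored="true"/>
<field name="title_sort" type="sortable" indexed="true" stored="false"/>

<!-- Populate the sort field from the searchable one -->
<copyField source="title" dest="title_sort"/>
```

Sorting on title_sort then treats each title as one token, sidestepping the multi-word problem.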
Hmmm, it almost seems like you're better off turning off replication
entirely. Your "master" becomes a machine used as a source for rapidly
spinning up a new slave or resetting a slave.
I have no hard data to back up my misgivings about committing to the
slaves then having replication overwrite th
On 8/12/2011 4:18 PM, Shawn Heisey wrote:
On 8/12/2011 1:49 PM, Shawn Heisey wrote:
I am sure that I have more questions, but I may be able to answer a
lot of them myself if I can see better examples.
Thought of another question. My Perl build system uses DIH for all
indexing, but with the J
Right, this is expected behavior, it trips a lot of people up.
When you specify ' indexed="true" ' in your field definitions, the
contents of the input stream are put into the inverted index etc, *after*
all the transformations you specify via tokenizers, filters, charFilters,
etc are applied. In
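To make that concrete, here is a small hypothetical field type; the stored value stays verbatim, while the indexed terms are whatever comes out of the analysis chain.

```xml
<fieldType name="text_lc" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
<field name="body" type="text_lc" indexed="true" stored="true"/>
<!-- Input "Solr Rocks" is stored as-is, but indexed as the
     terms "solr" and "rocks". -->
```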
Here's a very useful page for looking at what "index size" means.
http://lucene.apache.org/java/3_0_2/fileformats.html#file-names
Note that the files having to do with stored data (e.g. *.fdt) have very
little impact on searching, they don't consume very many valuable
resources.
The "stored=true"-
I'm puzzled by what this means:
"Is there a way to achieve the customized sort as well as the relevant
content on top in this scenario."
You say you remove the sorting part, which means your results
are returned by relevance calculations. So I'm guessing that
a &debugQuery=on would show you that
If you mean just throw the new document on the floor if
the index already contains a document with that key, I don't
think you can do that. You could write a custom updateHandler
that checks first to see whether the particular uniqueKey is
in the index I suppose...
Best
Erick
On Fri, Aug 12, 2011
I don't think this is really do-able. The only thing that
comes to my mind is that you could (and this is assuming
you're using Tika to handle the file eventually) send the
document through Tika on the client and construct a
SolrJ document on the parts you care about. This would
give you substantia
Shawn, my experience with SolrJ in that configuration (no autoCommit)
is that you have control over commits: if you don't issue an explicit
commit, it won't happen. Re lifecycle: we don't use a static
instance; rather our app maintains a small pool of
CommonsHttpSolrServer instances that we
The problem I've always had is that I don't quite know what
"sorting on multivalued fields" means. If your field had tokens
a and z, would sorting on that field put the doc
at the beginning or end of the list? Sure, you can define
rules (first token, last token, average of all tokens (whate
Yeah, parsing PDF files can be pretty resource-intensive, so one solution
is to offload it somewhere else. You can use the Tika libraries in SolrJ
to parse the PDFs on as many clients as you want, just transmitting the
results to Solr for indexing.
How are all these docs being submitted? Is this s
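The client-side split described above might look roughly like this; the field names and Solr URL are assumptions, and the extracted text is presumed to come from a Tika call on the client:

```python
import urllib.request
import xml.etree.ElementTree as ET

def build_add_doc(doc_id, text):
    """Build a Solr XML update message from text already extracted
    (e.g. by Tika running on the client, not on the Solr server)."""
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    for name, value in (("id", doc_id), ("content", text)):
        field = ET.SubElement(doc, "field", {"name": name})
        field.text = value
    return ET.tostring(add, encoding="unicode")

def index_extracted(doc_id, text, solr_url="http://localhost:8983/solr/update"):
    """POST the update message; only the extracted text crosses the wire."""
    payload = build_add_doc(doc_id, text).encode("utf-8")
    req = urllib.request.Request(
        solr_url, data=payload, headers={"Content-Type": "text/xml"})
    return urllib.request.urlopen(req)
```
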
On 13.08.2011 18:03 Erick Erickson wrote:
> The problem I've always had is that I don't quite know what
> "sorting on multivalued fields" means. If your field had tokens
> a and z, would sorting on that field put the doc
> at the beginning or end of the list? Sure, you can define
> rules (
Hi Erick,
Our app inserts the PDFs from a back-office site, and people can
search/consult them through a front-end site. Both are written in PHP. I've
installed a Tomcat for Solr exclusively.
The PDF docs are indexed and not stored, using the standard
solr.extraction.ExtractingRequestHandler (solr-ce
The first solution would make sense to me. Some kind of strategy mechanism
for this would allow anyone to define their own rules. Duplicating results
would be confusing to me.
On 13 August 2011 18:39, Michael Lackhoff wrote:
> On 13.08.2011 18:03 Erick Erickson wrote:
>
> > The problem I've al
I have a different use case. Consider a spatial multivalued field with latlong
values for addresses. I would want sorting by geodist() to return the closest
distance in each group. For example, find me the closest restaurant, with each
doc being a chain name like Pizza Hut. Or doctors with multiple
On 13.08.2011 20:31 Martijn v Groningen wrote:
> The first solution would make sense to me. Some kind of a strategy
> mechanism
> for this would allow anyone to define their own rules. Duplicating results
> would be confusing to me.
That is why I would only activate it on request (setting a speci
You could send PDFs for processing using a queue solution like Amazon SQS,
kicking off Amazon instances to process the queue.
Once you process with Tika to text, just send the update to Solr.
Bill Bell
Sent from mobile
On Aug 13, 2011, at 10:13 AM, Erick Erickson wrote:
> Yeah, parsing PDF files
What was it?
Bill Bell
Sent from mobile
On Aug 10, 2011, at 2:21 PM, Way Cool wrote:
> Sorry for the spam. I just figured it out. Thanks.
>
> On Wed, Aug 10, 2011 at 2:17 PM, Way Cool wrote:
>
>> Hi, Guys,
>>
>> Based on the document below, I should be able to include a file under the
>> s
Fair enough, but what's "first value in the list"?
There's nothing special about "multiValued" fields,
that is, where the schema has multiValued="true".
Under the covers, this is no different than just
concatenating all the values together and putting them
in at one go, except for some games with th
Ahhh, ok, my reply was irrelevant ...
Here's a good write-up on this problem:
http://www.lucidimagination.com/content/scaling-lucene-and-solr
But Solr handles millions of documents on a single server in many cases,
so waiting until the search app falls over is actually feasible.
In general, if y
Thanks Erick, Bill. Your answers tell me that we're on the right track ;) I
will study the master/slave architecture with many slaves. Perhaps we will
need it in the future =)
Best regards,
Rode.
-Original Message-
From: Erick Erickson
To: solr-user@lucene.apache.org
Date: Sat, 13 Aug
On 13.08.2011 21:28 Erick Erickson wrote:
> Fair enough, but what's "first value in the list"?
> There's nothing special about "multiValued" fields,
> that is, where the schema has multiValued="true".
> Under the covers, this is no different than just
> concatenating all the values together and put
Hi Mark,
I guess the "commit=true" when doing a "delta-import" is the solution for
the JIRA issue I just submitted, SOLR-2711.
Can you explain to me where you configured this "commit=true"?
thanks,
Alex
On Thu, Jul 7, 2011 at 6:44 PM, Mark juszczec wrote:
> First thanks for all the help.
>
> I think
Actually I requested .../dataimport?command=delta-import&commit=true
and DIH in delta-import mode does not commit. Do you have any guess?
INFO: Starting Delta Import
Aug 14, 2011 1:42:02 AM org.apache.solr.core.SolrCore execute
INFO: [] webapp=/apache-solr-3.3.0 path=/dataimport
params=
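If the commit parameter really is being ignored, one common server-side fallback is autoCommit in solrconfig.xml; the thresholds below are purely illustrative:

```xml
<updateHandler class="solr.DirectUpdateHandler2">
  <autoCommit>
    <maxDocs>10000</maxDocs>  <!-- commit after this many added docs -->
    <maxTime>60000</maxTime>  <!-- or after this many milliseconds -->
  </autoCommit>
</updateHandler>
```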
Hi,
Most of the settings are default.
We have single node (Memory 1 GB, Index Size 4GB)
We have a requirement where we are doing very fast commits. This is a kind of
real-time requirement where we are polling many threads from a third party and
indexing into our system.
We want these results to be av
When doing Date faceting I've noticed that if the query is something like:
start: NOW-1YEAR
end: NOW
GAP: +1MONTH
when the response comes back the facet names are
2010-08-14T01:50:58.813Z
2010-09-14T01:50:58.813Z
2010-10-14T01:50:58.813Z
2010-11-14T01:50:58.813Z
2010-12-14T01:50:58.813Z
etc
ins
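One common way to get cleaner boundaries is to round the endpoints with date math; the field name below is an assumption:

```text
facet=true
facet.date=timestamp
facet.date.start=NOW/DAY-1YEAR
facet.date.end=NOW/DAY
facet.date.gap=+1MONTH
```

With NOW/DAY the returned facet names fall on midnight (e.g. 2010-08-14T00:00:00Z) rather than on the wall-clock time of the query.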
On Aug 11, 2011, at 9:53 AM, eks dev wrote:
> Thinking aloud and grateful for sparing ..
>
> I need to support high commit rate (low update latency) in a master
> slave setup and I have a bad feelings about it, even with disabling
> warmup and stripping everything down that slows down refresh.