Hi all,
I want to ask about the best way to implement a solution for indexing a
large number of PDF documents, 10-60 MB each, with 100 to 1000 users
connected simultaneously.
I currently have one Solr 3.3.0 core and it works fine for a small number
of PDF docs, but I'm afraid about the mome
thanks Karsten
I was able to use your suggestion
--
View this message in context:
http://lucene.472066.n3.nabble.com/xpath-expression-not-working-tp3218133p3251481.html
Sent from the Solr - User mailing list archive at Nabble.com.
The issue was located in a 31 million doc index and I have already reduced it
to a reproducible 4-document index. It is stock Solr 3.3.0.
Yes, the documents are also in the wrong order relative to the field sort values.
I just added the field sort values to the email to keep it short.
I will produce a
Jame:
You control the number via settings in solrconfig.xml, so it's
up to you.
Jonathan:
Hmmm, that seems right; after all, the "deep paging" penalty is really
about keeping a large sorted array in memory, but at least you only
pay it once per 10,000 rather than 100 times (assuming page siz
Do keep in mind, though, that any documents that have not yet been replicated
will have to be added to the promoted slave. The easy way to do this
is just to re-index documents from a "safe" point. If you're using time-based
deltas, this is just some time interval "far enough in the past to gua
I don't see an easy way to do that with the standard set of
filters. You'll probably need to write something custom (note,
this is actually pretty easy). I suspect you'll
need to do something like Synonyms, where when you
get a token like #ipod, you essentially make it a synonym
for ipod and insert
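As a rough sketch of that expansion in plain Python (illustrative only, not an actual Lucene TokenFilter; the function name is made up):

```python
# Sketch of the synonym-style expansion: when a hashtag token is seen,
# also emit the bare word as an extra token at the same position
# (what Lucene calls a position increment of 0).
def expand_hashtags(tokens):
    out = []
    for pos, tok in enumerate(tokens):
        out.append((pos, tok))
        if tok.startswith("#") and len(tok) > 1:
            out.append((pos, tok[1:]))  # inject "ipod" alongside "#ipod"
    return out
```

In a real TokenFilter you'd emit the stripped token with its PositionIncrementAttribute set to 0, the same trick SynonymFilter uses, so both forms match at query time.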
You can't; sorting only works with indexed data and only
really makes sense for fields that have a single value.
Sometimes using KeywordTokenizer helps if your
field has more than one word, perhaps with
copyField.
Best
Erick
On Thu, Aug 11, 2011 at 3:34 AM, Anshum wrote:
> How can I have my s
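A minimal schema.xml sketch of the KeywordTokenizer-plus-copyField approach described above (all field and type names are made up):

```xml
<!-- Treat the whole field value as a single, lowercased, sortable token -->
<fieldType name="sortable" class="solr.TextField" sortMissingLast="true">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<field name="title" type="text" indexed="true" stored="true"/>
<field name="title_sort" type="sortable" indexed="true" stored="false"/>

<!-- Populate the sort field from the searchable one -->
<copyField source="title" dest="title_sort"/>
```

Sorting on title_sort then treats each title as one token, sidestepping the multi-word problem.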
Hmmm, it almost seems like you're better off turning off replication
entirely. Your "master" becomes a machine used as a source for rapidly
spinning up a new slave or resetting a slave.
I have no hard data to back up my misgivings about committing to the
slaves then having replication overwrite th
On 8/12/2011 4:18 PM, Shawn Heisey wrote:
On 8/12/2011 1:49 PM, Shawn Heisey wrote:
I am sure that I have more questions, but I may be able to answer a
lot of them myself if I can see better examples.
Thought of another question. My Perl build system uses DIH for all
indexing, but with the J
Right, this is expected behavior, it trips a lot of people up.
When you specify ' indexed="true" ' in your field definitions, the
contents of the input stream are put into the inverted index etc, *after*
all the transformations you specify via tokenizers, filters, charFilters,
etc are applied. In
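To make that concrete, here is a small hypothetical field type; the stored value stays verbatim, while the indexed terms are whatever comes out of the analysis chain.

```xml
<fieldType name="text_lc" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
<field name="body" type="text_lc" indexed="true" stored="true"/>
<!-- Input "Solr Rocks" is stored as-is, but indexed as the
     terms "solr" and "rocks". -->
```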
Here's a very useful page for looking at what "index size" means.
http://lucene.apache.org/java/3_0_2/fileformats.html#file-names
Note that the files having to do with stored data (e.g. *.fdt) have very
little impact on searching, they don't consume very many valuable
resources.
The "stored=true"-
I'm puzzled by what this means:
"Is there a way to achieve the customized sort as well as the relevant
content on top in this scenario."
You say you remove the sorting part, which means your results
are returned by relevance calculations. So I'm guessing that
a &debugQuery=on would show you that
If you mean just throw the new document on the floor if
the index already contains a document with that key, I don't
think you can do that. You could write a custom updateHandler
that checks first to see whether the particular uniqueKey is
in the index I suppose...
Best
Erick
On Fri, Aug 12, 2011
I don't think this is really do-able. The only thing that
comes to my mind is that you could (and this is assuming
you're using Tika to handle the file eventually) send the
document through Tika on the client and construct a
SolrJ document on the parts you care about. This would
give you substantia
Shawn, my experience with SolrJ in that configuration (no autoCommit)
is that you have control over commits: if you don't issue an explicit
commit, it won't happen. Re lifecycle: we don't use a static
instance; rather our app maintains a small pool of
CommonsHttpSolrServer instances that we
The problem I've always had is that I don't quite know what
"sorting on multivalued fields" means. If your field had tokens
a and z, would sorting on that field put the doc
at the beginning or end of the list? Sure, you can define
rules (first token, last token, average of all tokens (whate
Yeah, parsing PDF files can be pretty resource-intensive, so one solution
is to offload it somewhere else. You can use the Tika libraries in SolrJ
to parse the PDFs on as many clients as you want, just transmitting the
results to Solr for indexing.
How are all these docs being submitted? Is this s
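The client-side split described above might look roughly like this; the field names and Solr URL are assumptions, and the extracted text is presumed to come from a Tika call on the client:

```python
import urllib.request
import xml.etree.ElementTree as ET

def build_add_doc(doc_id, text):
    """Build a Solr XML update message from text already extracted
    (e.g. by Tika running on the client, not on the Solr server)."""
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    for name, value in (("id", doc_id), ("content", text)):
        field = ET.SubElement(doc, "field", {"name": name})
        field.text = value
    return ET.tostring(add, encoding="unicode")

def index_extracted(doc_id, text, solr_url="http://localhost:8983/solr/update"):
    """POST the update message; only the extracted text crosses the wire."""
    payload = build_add_doc(doc_id, text).encode("utf-8")
    req = urllib.request.Request(
        solr_url, data=payload, headers={"Content-Type": "text/xml"})
    return urllib.request.urlopen(req)
```
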
On 13.08.2011 18:03 Erick Erickson wrote:
> The problem I've always had is that I don't quite know what
> "sorting on multivalued fields" means. If your field had tokens
> a and z, would sorting on that field put the doc
> at the beginning or end of the list? Sure, you can define
> rules (
Hi Erick,
Our app inserts the PDFs from a back-office site, and people can
search/consult them through a front-end site. Both are written in PHP. I've
installed a Tomcat for Solr exclusively.
The PDF docs are indexed and not stored, using the standard
solr.extraction.ExtractingRequestHandler (solr-ce
The first solution would make sense to me. Some kind of strategy mechanism
for this would allow anyone to define their own rules. Duplicating results
would be confusing to me.
On 13 August 2011 18:39, Michael Lackhoff wrote:
> On 13.08.2011 18:03 Erick Erickson wrote:
>
> > The problem I've al
I have a different use case. Consider a spatial multivalued field with latlong
values for addresses. I would want sorting by geodist() to return the closest
distance in each group. For example, find me the closest restaurant, with each
doc being a chain name like Pizza Hut. Or doctors with multiple
On 13.08.2011 20:31 Martijn v Groningen wrote:
> The first solution would make sense to me. Some kind of a strategy
> mechanism
> for this would allow anyone to define their own rules. Duplicating results
> would be confusing to me.
That is why I would only activate it on request (setting a speci
You could send PDFs for processing using a queue solution like Amazon SQS,
kicking off Amazon instances to process the queue.
Once you process with Tika to text, just send the update to Solr.
Bill Bell
Sent from mobile
On Aug 13, 2011, at 10:13 AM, Erick Erickson wrote:
> Yeah, parsing PDF files
What was it?
Bill Bell
Sent from mobile
On Aug 10, 2011, at 2:21 PM, Way Cool wrote:
> Sorry for the spam. I just figured it out. Thanks.
>
> On Wed, Aug 10, 2011 at 2:17 PM, Way Cool wrote:
>
>> Hi, Guys,
>>
>> Based on the document below, I should be able to include a file under the
>> s
Fair enough, but what's "first value in the list"?
There's nothing special about "multiValued" fields,
that is, where the schema has multiValued="true".
Under the covers, this is no different than just
concatenating all the values together and putting them
in at one go, except for some games with th
Ahhh, ok, my reply was irrelevant ...
Here's a good write-up on this problem:
http://www.lucidimagination.com/content/scaling-lucene-and-solr
But Solr handles millions of documents on a single server in many cases,
so waiting until the search app falls over is actually feasible.
In general, if y
Thanks Erick, Bill. Your answers tell me that we're on the right track ;) I
will study the master/slave architecture with many slaves. Perhaps we will
need it in the future =)
Best regards,
Rode.
-Original Message-
From: Erick Erickson
To: solr-user@lucene.apache.org
Date: Sat, 13 Aug
On 13.08.2011 21:28 Erick Erickson wrote:
> Fair enough, but what's "first value in the list"?
> There's nothing special about "multiValued" fields,
> that is, where the schema has multiValued="true".
> Under the covers, this is no different than just
> concatenating all the values together and put
Hi Mark,
I guess the "commit=true" when doing a "delta-import" is the solution for
the JIRA issue I just submitted, SOLR-2711.
Can you explain to me where you configured this "commit=true"?
thanks,
Alex
On Thu, Jul 7, 2011 at 6:44 PM, Mark juszczec wrote:
> First thanks for all the help.
>
> I think
Actually I requested .../dataimport?command=delta-import&commit=true
and DIH in delta-import mode does not commit. Do you have any guess?
INFO: Starting Delta Import
Aug 14, 2011 1:42:02 AM org.apache.solr.core.SolrCore execute
INFO: [] webapp=/apache-solr-3.3.0 path=/dataimport
params=
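If the commit parameter really is being ignored, one common server-side fallback is autoCommit in solrconfig.xml; the thresholds below are purely illustrative:

```xml
<updateHandler class="solr.DirectUpdateHandler2">
  <autoCommit>
    <maxDocs>10000</maxDocs>  <!-- commit after this many added docs -->
    <maxTime>60000</maxTime>  <!-- or after this many milliseconds -->
  </autoCommit>
</updateHandler>
```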
Hi,
Most of the settings are default.
We have single node (Memory 1 GB, Index Size 4GB)
We have a requirement where we are doing very fast commits. This is a kind of
real-time requirement where we are polling many threads from a third party and
indexing into our system.
We want these results to be av
When doing Date faceting I've noticed that if the query is something like:
start: NOW-1YEAR
end: NOW
GAP: +1MONTH
when the response comes back the facet names are
2010-08-14T01:50:58.813Z
2010-09-14T01:50:58.813Z
2010-10-14T01:50:58.813Z
2010-11-14T01:50:58.813Z
2010-12-14T01:50:58.813Z
etc
ins
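One common way to get cleaner boundaries is to round the endpoints with date math; the field name below is an assumption:

```text
facet=true
facet.date=timestamp
facet.date.start=NOW/DAY-1YEAR
facet.date.end=NOW/DAY
facet.date.gap=+1MONTH
```

With NOW/DAY the returned facet names fall on midnight (e.g. 2010-08-14T00:00:00Z) rather than on the wall-clock time of the query.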
On Aug 11, 2011, at 9:53 AM, eks dev wrote:
> Thinking aloud and grateful for sparing ..
>
> I need to support high commit rate (low update latency) in a master
> slave setup and I have a bad feelings about it, even with disabling
> warmup and stripping everything down that slows down refresh.