Hello All,
I am trying to find a better approach (performance-wise) to index documents. Document count is approximately a million+.
First, I thought of writing multiple threads using
CommonsHttpSolrServer to submit documents. But later I found out
StreamingUpdateSolrServer, which sa
2) Also, is CommonsHttpSolrServer thread-safe?
It is only if you initialize it with the MultiThreadedHttpConnectionManager:
http://hc.apache.org/httpclient-3.x/apidocs/org/apache/commons/httpclient/MultiThreadedHttpConnectionManager.html
Cheers,
Chantal
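For reference, a minimal sketch of that initialization (SolrJ 1.4-era API; the URL and connection limit are illustrative, and solr-solrj plus commons-httpclient must be on the classpath):

```java
import java.net.URL;
import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.MultiThreadedHttpConnectionManager;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

public class SharedSolrServer {
    public static CommonsHttpSolrServer create() throws Exception {
        // A pooling connection manager makes the underlying HttpClient
        // safe to share across indexing threads.
        MultiThreadedHttpConnectionManager mgr = new MultiThreadedHttpConnectionManager();
        mgr.getParams().setDefaultMaxConnectionsPerHost(20); // illustrative limit
        HttpClient client = new HttpClient(mgr);
        return new CommonsHttpSolrServer(new URL("http://localhost:8983/solr"), client);
    }
}
```

This is a sketch against a hypothetical local Solr instance, not a drop-in implementation.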
Hello,
I now believe that I really did misunderstand the problem and,
unfortunately, I don't believe I can be of much assistance, as I have not
had to solve a similar problem.
Cheers,
-
Markus Jelsma - Technisch Architect
Buyways B.V. - Friesestraatweg 215c
http://
Hi Kelly,
"...the criteria for this hypothetical search involves multi-valued fields,
where the index of one matching criteria needs to correspond to the same
value in another multi-valued field in the same index. You can't do that..."
Just my two cents:
By storing values in two different multi-
Hi All
I am trying to do a data import but I am getting the following error.
INFO: [] webapp=/solr path=/dataimport params={command=status} status=0
QTime=405
2010-01-12 03:08:08.576::WARN: Error for /solr/dataimport
java.lang.OutOfMemoryError: Java heap space
Jan 12, 2010 3:08:05 AM org.apach
Hi, ALL!
I have two tables in database.
t_article {
title,
content,
author
}
t_friend {
person_A,
person_B
}
Note that t_friend is a many-to-many relation.
When a logged-in user searches articles with a query word, 3 factors should
be considered:
factor 1. relevance score
facto
I have considered building lucene index like:
Document: { title, content, author, friends }
Thus, author and friends are two separate fields, so I can boost them
separately.
The problem is, if a document's author is the logged-in user, it's unnecessary
to search the friends field, because it would n
2010/1/12 Wangsheng Mei
> I have considered building lucene index like:
> Document: { title, content, author, friends }
> Thus, author and friends are two separate fields, so I can boost them
> separately.
> The problem is, if a document's author is the logged-in user, it's
> unnecessary to search
You need more memory to run dataimport.
On Tue, Jan 12, 2010 at 4:46 PM, Lee Smith wrote:
> Hi All
>
> I am trying to do a data import but I am getting the following error.
>
> INFO: [] webapp=/solr path=/dataimport params={command=status} status=0
> QTime=405
> 2010-01-12 03:08:08.576::WARN:
Thank you for your response.
Will I just need to adjust the allowed memory in a config file, or is this a
server issue?
Sorry, I know nothing about Java.
Hope you can advise !
On 12 Jan 2010, at 12:26, Noble Paul നോബിള് नोब्ळ् wrote:
> You need more memory to run dataimport.
>
>
> On Tue,
It is the way you start your Solr server (the -Xmx option).
On Tue, Jan 12, 2010 at 6:00 PM, Lee Smith wrote:
> Thank you for your response.
>
> Will I just need to adjust the allowed memory in a config file, or is this a
> server issue?
>
> Sorry, I know nothing about Java.
>
> Hope you can advise !
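For completeness, a hedged sketch of what that looks like, assuming the example Jetty distribution shipped with Solr (the heap size is illustrative, not a recommendation):

```shell
# Example Jetty distribution shipped with Solr:
java -Xmx1024m -jar start.jar

# Under Tomcat, set the option via the environment instead, e.g.:
# export CATALINA_OPTS="-Xmx1024m"
```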
Hi,
I'm interested in near-dupe removal as mentioned (briefly) here:
http://wiki.apache.org/solr/Deduplication
However the link for TextProfileSignature hasn't been filled in yet.
Does anyone have an example of using TextProfileSignature that demonstrates
the tunable parameters mentioned in th
Hi all,
I started working with Solr about 1 month ago, and everything was
running well, both indexing and searching documents.
I have a 40GB index with about 10,000,000 documents available. I index
3k docs for each 10m and commit after each insert.
Since yesterday, I can't commit any articles to in
Do you have a stack trace?
On Jan 12, 2010, at 2:54 AM, Ellery Leung wrote:
> When I am building the index for around 2 ~ 25000 records, sometimes I
> come across this error:
>
>
>
> Uncaught exception "Exception" with message '0' Status: Communication Error
>
>
>
> I search Goog
There are several possibilities:
1> you have some process holding open your indexes, probably
other searchers. You *probably* are OK just committing
new changes if there is exactly *one* searcher keeping
your index open. If you have some process whereby
you periodically open a
I have 2 ways to update the index: either I use SolrJ with the
EmbeddedSolrServer, or I do it with an HTTP query. If I do it with an
HTTP query I indeed don't stop Tomcat, but I have to do some operations
(mainly taking the instance out of the cluster) and I can't automate this
process when I can automate u
On Jan 12, 2010, at 7:56 AM, Andrew Clegg wrote:
I'm interested in near-dupe removal as mentioned (briefly) here:
http://wiki.apache.org/solr/Deduplication
However the link for TextProfileSignature hasn't been filled in yet.
Does anyone have an example of using TextProfileSignature that
dem
Thanks Erik, but I'm still a little confused as to exactly where in the Solr
config I set these parameters.
The example on the wiki page uses Lookup3Signature which (presumably) takes
no parameters, so there's no indication in the XML examples of where you
would set them. Unless I'm missing some
Hi Erik,
I'm a newbie to Solr... By IR, you mean searcher? Is there a place where I can
check the open searchers? And shouldn't rebooting the machine close those
searchers?
Thanks,
-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: Tuesday, January 12
On Jan 12, 2010, at 9:15 AM, Andrew Clegg wrote:
Thanks Erik, but I'm still a little confused as to exactly where in
the Solr
config I set these parameters.
You'd configure them within the element, something like
this:
5
The example on the wiki page uses Lookup3Signature which
(p
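For what it's worth, a hedged sketch of the kind of chain this would be, based on the SignatureUpdateProcessorFactory pattern on the Deduplication wiki page (the field list, signature field name, and parameter values are illustrative; quantRate and minTokenLen are the TextProfileSignature tunables as I understand them):

```xml
<updateRequestProcessorChain name="dedupe">
  <processor class="solr.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="signatureField">signature</str>
    <bool name="overwriteDupes">true</bool>
    <str name="fields">name,features</str>
    <str name="signatureClass">org.apache.solr.update.processor.TextProfileSignature</str>
    <!-- TextProfileSignature tunables (values are illustrative): -->
    <str name="quantRate">0.01</str>
    <str name="minTokenLen">3</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
```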
Am I doing this right?
I have made changes to my schema, so as per the guide I did the following:
Stopped the application
Updated the Schema
Re-Started
Deleted the index folder
Then ran a full import & optimize command ie:
/dataimport?command=full-import&optimize=true
In the status it shows Indexi
What does a search of *:* give you?
As far as your steps, delete the index folder *before* restarting
Solr, not after. That might be the issue.
Erik
On Jan 12, 2010, at 9:23 AM, Lee Smith wrote:
Am I doing this right.
I have made changes to my schema so as per guide I done the f
I have a document that has a multi-valued field where each value in
the field is itself composed of two values. Think of an invoice doc
with multi-valued line items, each line item having a quantity and product name.
One option I see is to have a line item multi-value field and when produc
Hi Erik
Done as suggested, and still only showing 1 document.
Doing a *:* gives me 1 document.
Can't understand why?
On 12 Jan 2010, at 14:25, Erik Hatcher wrote:
> What does a search of *:* give you?
>
> As far as your steps, delete the index folder *before* restarting Solr, not
> after. That
Kelly,
This is a good question you have posed and illustrates a challenge with Solr's
limited schema. I don't see how the dedup will help. I would continue with
the SKU based approach and use this patch:
https://issues.apache.org/jira/browse/SOLR-236
You'll collapse on the product id. My book,
Erik Hatcher-4 wrote:
>
>
> On Jan 12, 2010, at 9:15 AM, Andrew Clegg wrote:
>> Thanks Erik, but I'm still a little confused as to exactly where in
>> the Solr
>> config I set these parameters.
>
> You'd configure them within the element, something like
> this:
>
> 5
>
>
OK, thank
Don't worry, my bad.
I made a mistake in my dataimport that gave all documents the same ID!
All working now thank you
On 12 Jan 2010, at 14:33, Lee Smith wrote:
> Hi Erik
>
> Done as suggested, and still only showing 1 document
>
> Doing a *:* gives me 1 document
>
> Can't understand why?
>
> On 12 Jan
Rebooting the machine certainly closes the searchers, but
depending upon how you shut it down there may be stale files.
After reboot (but before you start Solr), how much space
is on your disk? If it's 40G, you have no stale files.
Yes, IR is IndexReader, which is a searcher.
I'll have to l
On Tue, Jan 12, 2010 at 3:48 AM, Smith G wrote:
> Hello All,
> I am trying to find a better approach (performance-wise)
> to index documents. Document count is approximately a million+.
> First, I thought of writing multiple threads using
> CommonsHttpSolrServer to submit documents.
Hi,
I found that there's no explicit option to run DataImportHandler in a
synchronous mode. I need that option to run DIH from SolrJ (
EmbeddedSolrServer ) in the same thread. Currently I pass dummy stream
to DIH as a workaround for this, but I think it makes sense to add a
specific option for that.
I restarted Solr and stopped all searches. After that, the commit() was
normal (2 secs) and it's been working for 3h without problems (indexing and a
few searches too)... I haven't done any optimize yet, mainly because I had no
deletes on the index and the performance is OK, so no need to op
Hello,
I am using the add() method which receives a Collection of
SolrInputDocuments, instead of the add() which receives a single document.
Is sending a group of documents what is called
"batching" in Solr terminology? If yes, then I am doing it (by
including additional logic i
The beauty of StreamingUpdateSolrServer is that you don't have to worry about
batch sizes; it streams them all. Just keep calling add() with one document
and it'll get enqueued. You can pass a collection but there's no performance
benefit.
StreamingUpdateSolrServer can be configured to use mu
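A minimal sketch of that pattern (SolrJ 1.4; the URL, queue size of 100, thread count of 4, and field names are all illustrative, and the code assumes solr-solrj on the classpath plus a running Solr):

```java
import org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class StreamingIndexer {
    public static void main(String[] args) throws Exception {
        // queueSize=100 buffered docs, 4 background streaming threads
        StreamingUpdateSolrServer server =
            new StreamingUpdateSolrServer("http://localhost:8983/solr", 100, 4);
        for (int i = 0; i < 1000; i++) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", Integer.toString(i));
            doc.addField("text", "document body " + i); // hypothetical field
            server.add(doc); // enqueued; streamed by the background threads
        }
        server.commit();
    }
}
```

The point of the design is that add() returns immediately once the document is queued, so the caller never has to assemble batches itself.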
On Tue, Jan 12, 2010 at 1:09 PM, Smiley, David W. wrote:
> The beauty of StreamingUpdateSolrServer is that you don't have to worry about
> batch sizes; it streams them all. Just keep calling add() with one document
> and it'll get enqueued. You can pass a collection but there's no performance
I have schema.xml that uses a Tokenizer that I wrote.
I understand the standard way of deploying Solr is
to place solr.war in webapps directory, have a separate
directory that has conf files under its conf subdirectory,
and specify that directory as Solr home dir via either
JVM property or JNDI.
For your first question, wouldn't it be possible to achieve that with some
simple boolean logic? I mean, if you have a requirement to match any of the
other fields AND description2, but not if it ONLY matches description 2:
say matching x against field A, B, and description 2:
((A:x OR B:x) AND de
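A hedged sketch of the kind of query meant here, using the hypothetical field names A, B, and description2 (standard Lucene/Solr query syntax):

```
+(A:x OR B:x) description2:x
```

The leading + marks the first clause as required, so a document matching only description2 is excluded, while a description2 match on an otherwise-qualifying document still contributes to its score.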
They have probably added the logic for that server-side. Solr does not
support these types of features, but they are easy to implement.
Saving a search could be as easy as storing the selected query parameters.
Then creating an alert (or RSS feed) for that would be a process on the
server that exec
There will be a San Francisco/Bay Area meetup on Jan. 21st at 7:15 PM at the
"Hacker Dojo" (don't ask me...) location.
RSVP and all the details are at http://www.meetup.com/SFBay-Lucene-Solr-Meetup/
Hope to see you there,
Grant
You'll be able to get some valuable info by monitoring your free space on
disk.
If this occurs again, it'd help if you posted your Solr
configuration and told us about any warmups you're doing...
Of course, there are always gremlins...
On Tue, Jan 12, 2010 at 12:36 PM, Frederico Azeiteiro <
There's a connect exception on the client, however I'd expect this to
show up in the slave replication console (it's not). Is this correct
behavior (i.e. not showing replication errors)?
On Mon, Jan 11, 2010 at 9:50 AM, Jason Rutherglen
wrote:
> Yonik,
>
> I added startup to replicateAfter, howe
: Subject: Problem comitting on 40GB index
: In-Reply-To: <7a9c48b51001120345h5a57dbd4o8a8a39fc4a98a...@mail.gmail.com>
http://people.apache.org/~hossman/#threadhijack
Thread Hijacking on Mailing Lists
When starting a new discussion on a mailing list, please do not reply to
an existing message,
Hmm...Even with the IP address in the master URL on the slave, the
indexversion command to the master mysteriously doesn't show the
latest commit... Totally freakin' bizarre!
On Tue, Jan 12, 2010 at 10:53 AM, Jason Rutherglen
wrote:
> There's a connect exception on the client, however I'd expect
Hello,
If "Search Engine Integration, Deployment and Scaling in the Cloud" sounds
interesting to you, and you are going to be in or near New York next Wednesday
(Jan 20) evening:
http://www.meetup.com/NYC-Search-and-Discovery/calendar/12238220/
Sorry for dupes to those of you subscribed to mul
Huh?
On Tue, Jan 12, 2010 at 2:00 PM, Chris Hostetter
wrote:
>
> : Subject: Problem comitting on 40GB index
> : In-Reply-To: <
> 7a9c48b51001120345h5a57dbd4o8a8a39fc4a98a...@mail.gmail.com>
>
> http://people.apache.org/~hossman/#threadhijack
> Thread Hijacking on Mailing Lists
>
> When starting a
It was having multiple replicateAfter values... Perhaps a bug, though
I probably won't spend time investigating the why right now, nor
reproducing in the test cases.
On Tue, Jan 12, 2010 at 11:10 AM, Jason Rutherglen
wrote:
> Hmm...Even with the IP address in the master URL on the slave, the
> in
Hi Solr List,
We're trying to set up java-based replication with Solr 1.4 (dist
tarball). We are running this to start with on a pair of test servers
just to see how things go.
There's one major problem we can't seem to get past. When we
replicate manually (via the admin page) things se
On Tue, Jan 12, 2010 at 2:17 PM, Jason Rutherglen
wrote:
> It was having multiple replicateAfter values... Perhaps a bug, though
> I probably won't spend time investigating the why right now, nor
> reproducing in the test cases.
Do you mean that you changed the config and now it's working?
We hav
David,
Thanks, and yes, I decided to travel that path last night (applying SOLR-236
patch) and plan to have some results by the end of the day; I'll post a
summary.
I read about field collapsing in your book last night. The book is an
excellent resource by the way (shameless commendation plug!),
: Subject: Meaning of this error: Failure to meet condition(s) of
: required/prohibited clause(s)???
First of all: it's not an error -- it's a debugging statement generated when
you asked for an explanation of a document's score...
: 0.0 = (NON-MATCH) Failure to meet condition(s) of required/
Multiple replicateAfter values, for example startup,optimize. I believe this
was causing the issue. I limited it to commit, and it started to work
(with no other changes to solrconfig.xml).
On Tue, Jan 12, 2010 at 11:24 AM, Yonik Seeley
wrote:
> On Tue, Jan 12, 2010 at 2:17 PM, Jason Rutherglen
> wrote:
I wouldn't use the patches of the sub-issues right now as they are
under active development (they are currently a POC). I also think
that the latest patch in SOLR-236 is currently the best option. There
are some memory-related problems with the patch that have to do with
caching. The fieldCollaps
> I can't put the extra JARs in the Solr home dir's lib subdir, can I?
Why, this is indeed what you should do, Kuro.
Otis
--
Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch
- Original Message
> From: Teruhiko Kurosaka
> To: "solr-user@lucene.apache.org"
> Sent: Tue, Januar
Hello,
Yeah, to be brief: I wanted to read documents and update them
simultaneously with different threads. The main issue I considered is how
many documents to call add/commit for, because I can not keep
adding millions of documents one after another to
StreamingUpdateSolrServer by just
We've got Localsolr (2.9.1 lucene-spatial library) running on Solr 1.4 with
Tomcat 1.6. Everything's looking good, except for a couple little issues.
If we specify fl=id (or fl= anything) and wt=json it seems that the fl
parameter is ignored (thus we get a lot more detail in our results than we'd
On Tue, Jan 12, 2010 at 2:53 PM, Smith G wrote:
> 4) queueSize parameter of the Streaming constructor: what could be a
> rough value when it comes
> to a real-time application having a million+ documents to be indexed? ..
> So what exactly is "queueSize" for, if we can go on
> addin
This is in Solr 1.3.
I have some text in our database in the form 0088698183939. The leading zeros
are useless, but I want to be able to search it with no leading zeros or several
leading zeros. So, I decided to index this as a long, expecting it to just
store it as a number. But, instead, I see t
: I have some text in our database in the form 0088698183939. The leading
: zeros are useless, but I want to be able to search it with no leading zeros
: or several leading zeros. So, I decided to index this as a long,
: expecting it to just store it as a number. But, instead, I see this in
: the
Thanks. Is there any performance penalty vs. LongField? I don't need to do any
range queries on these values. I am basically treating them as numerical
strings. I thought it would just be a shortcut to strip leading zeros, which I
can easily do on my own.
From
: Thanks. Is there any performance penalty vs. LongField? I don't need to
The other ones do normalization by converting to a Long internally -- I
have no idea if you would see some micro performance benefit in doing
the 0-stripping yourself.
Sorting a LongField should take less RAM than a Sor
You can do this stripping in the DataImportHandler. You would have to
write your own stripping code using regular expressions. Also, the
ExtractingRequestHandler strips out the HTML markup when you use it to
index an HTML file:
http://wiki.apache.org/solr/ExtractingRequestHandler
On Mon, Jan 11,
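The regex-based stripping mentioned above can be sketched in plain Java (a hypothetical helper, not DIH-specific; the same pattern could be dropped into a DIH transformer):

```java
public class StripZeros {
    // Strip leading zeros so "0088698183939" is indexed as "88698183939";
    // the (?!$) lookahead keeps a single "0" when the value is all zeros.
    public static String stripLeadingZeros(String s) {
        return s.replaceFirst("^0+(?!$)", "");
    }

    public static void main(String[] args) {
        System.out.println(stripLeadingZeros("0088698183939")); // 88698183939
        System.out.println(stripLeadingZeros("0000"));          // 0
    }
}
```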
There are a lot of projects that don't use stopwords any more. You
might consider dropping them altogether.
On Mon, Jan 11, 2010 at 2:25 PM, Don Werve wrote:
> This is the way I've implemented multilingual search as well.
>
> 2010/1/11 Markus Jelsma
>
>> Hello,
>>
>>
>> We have implemented langu
The index files are corrupted. You have to create the index again from scratch.
This should have reported a CorruptIndexException. The code handling
index files does not catch all exceptions and wrap them as it should.
On Mon, Jan 11, 2010 at 3:10 PM, Osborn Chan wrote:
> Hi all,
>
> I got followin
Hi, here is the stack trace:
Fatal error: Uncaught exception 'Exception' with message '"0"
Status: Communication Error' in
C:\nginx\html\lib\SolrPhpClient\Apache\Solr\Service.php:385
Stack trace:
#0 C:\nginx\html\lib\SolrPhpClient\Apache\Solr\Service.php(652):
Apache_Solr_Service->_sendRawPos
I don't think this is something to consider across the board for all
languages. The same grammatical units that are part of a word in one
language (and removed by stemmers) are independent morphemes in others
(and should be stopwords),
so please take this advice on a case-by-case basis for each lan
sorry, i forgot to include this 2009 paper comparing what stopwords do
across 3 languages:
http://doc.rero.ch/lm.php?url=1000,43,4,20091218142456-GY/Dolamic_Ljiljana_-_When_Stopword_Lists_Make_the_Difference_20091218.pdf
in my opinion, if stopwords annoy your users for very special cases
like 'th
Field Collapsing is what you want - this is a classic problem with
retail store product indexing and everyone uses field collapsing.
(That is, everyone who is willing to apply the patch on their own
code.)
Dedupe is completely the wrong word. Deduping is something else
entirely - it is about tryin
Hello,
I'm trying to boost results based on date using the first example here:
http://wiki.apache.org/solr/SolrRelevancyFAQ#How_can_I_boost_the_score_of_newer_documents
However, I'm getting an error that reads, "Can't use ms() function on
non-numeric legacy date field"
The date field uses so
Ellery,
A preliminary look at the source code indicates that the error is happening
because the Solr server is taking longer than expected to respond to the
client:
http://code.google.com/p/solr-php-client/source/browse/trunk/Apache/Solr/Service.php
The default time out handed down to Apache_Solr
Hi
I am new to the Solr technology. We have been using Lucene for handling
searching in our web application www.toostep.com, which is a knowledge
sharing platform developed in Java using the Spring MVC architecture and
iBatis as the persistence framework. Now that the application is getting
very comple
I think you need to use the new trieDateField
On 01/12/2010 07:06 PM, Daniel Higginbotham wrote:
Hello,
I'm trying to boost results based on date using the first example here:
http://wiki.apache.org/solr/SolrRelevancyFAQ#How_can_I_boost_the_score_of_newer_documents
However, I'm getting an er
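A hedged sketch of the suggested fix (the field and type names are illustrative; the boost function is the one from the SolrRelevancyFAQ page linked above):

```xml
<!-- schema.xml: a Trie-based date type, so ms() can be used on the field -->
<fieldType name="tdate" class="solr.TrieDateField" omitNorms="true"
           precisionStep="6" positionIncrementGap="0"/>
<field name="pubdate" type="tdate" indexed="true" stored="true"/>
```

With that in place, the FAQ-style boost would look like `bf=recip(ms(NOW,pubdate),3.16e-11,1,1)` on a dismax request handler.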
There is a band named "The The". And a producer named "Don Was". For a list of
all-stopword movie titles at Netflix, see this post:
http://wunderwood.org/most_casual_observer/2007/05/invisible_titles.html
My favorite is "To Be and To Have (Être et Avoir)", which is all stopwords in
two language
It can be added.
On Tue, Jan 12, 2010 at 10:18 PM, Alexey Serba wrote:
> Hi,
>
> I found that there's no explicit option to run DataImportHandler in a
> synchronous mode. I need that option to run DIH from SolrJ (
> EmbeddedSolrServer ) in the same thread. Currently I pass dummy stream
> to DIH as