Solr 4? You mean the Solr 'trunk' source or the Solr 1.4.1 release?
The 1.4.1 release does not have the TikaEntityProcessor, only the /extract code.
The Solr 3.x branch and the trunk have the TikaEP. I use the 3.x
branch and, well, the TikaEP has a few problems but can be hacked
around.
Whatever
Wow, would you put a diagram somewhere up on the Solr site?
Or, here, and I will put it somewhere there.
And, what is a VIP?
Dennis Gearon
Signature Warning
It is always a good idea to learn from your own mistakes. It is usually a
better idea to learn from others' mistakes,
On Wed, Dec 1, 2010 at 10:56 AM, Jerry Li wrote:
> Hi team
>
> My solr version is 1.4
> There is an ArrayIndexOutOfBoundsException when I sort on one field; the
> following is my code and log info.
> Any help will be appreciated.
>
> Code:
>
> SolrQuery query = new SolrQuery();
> que
Greetings,
The Seattle Scalability Meetup isn't slacking for the holidays. We've
got an awesome lineup for Wed, December 8 at 7pm:
http://www.meetup.com/Seattle-Hadoop-HBase-NoSQL-Meetup/
-Jake Mannix from Twitter will talk about the Twitter Search
infrastructure (with distributed Lucene)
-Chris
Hi team
My solr version is 1.4
There is an ArrayIndexOutOfBoundsException when I sort on one field; the
following is my code and log info.
Any help will be appreciated.
Code:
SolrQuery query = new SolrQuery();
query.setSortField("author", ORDER.desc);
query.addFilterQuery
Hi Upayavira,
this is a good start for solving my problem. Can you please tell me what
such a replication URL looks like?
Thanks,
Tommaso
2010/12/1 Upayavira
> Hi Tommaso,
>
> I believe you can tell each server to act as a master (which means it
> can have its indexes pulled from it).
>
> You ca
Hi, A diagram will be very much appreciated.
Thanks,
Jayant
> From: u...@odoko.co.uk
> To: solr-user@lucene.apache.org
> Subject: Re: distributed architecture
> Date: Wed, 1 Dec 2010 00:39:40 +
>
> I cannot say how mature the code for B) is, but it is not yet included
> in a release.
>
> I
On 11/30/2010 3:49 PM, Robert Petersen wrote:
That raises another question: top shows only 20 GB free out of 64,
but the tomcat/solr process shows it's using only half of that. What is
using the rest? The numbers don't add up...
Chances are that it's your operating system disk cache. Below
You may implement your own MergePolicy to keep one large index and
merge all the other small ones,
or simply set the merge factor to 2 and keep the largest index from being
merged by setting maxMergeDocs to less than the number of docs in the
largest one.
So there is one large index and a small one. When adding a few
docs, they wi
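For reference, a minimal solrconfig.xml sketch of that second suggestion
(the values here are illustrative assumptions, not from the original
message):

  <mergeFactor>2</mergeFactor>
  <maxMergeDocs>1000000</maxMergeDocs>

With maxMergeDocs set below the document count of the big segment, merges
leave it untouched.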
On Tue, Nov 30, 2010 at 6:04 PM, Robert Petersen wrote:
> My question is this. Why in the world would all of my slaves, after
> running fine for some days, suddenly all at the exact same minute
> experience OOM heap errors and go dead?
If there is no change in query traffic when this happens, th
1. Make sure the port is not in use.
2. Run ./bin/shutdown.sh && tail -f logs/xxx to see what the server is doing.
If you have just fed data or modified the index without flushing/committing,
it will do some work while shutting down.
2010/12/1 Robert Petersen :
> Greetings, we're wondering why we can issue th
What would I do with the heap dump though? Run one of those Java heap
analyzers looking for memory leaks or something? I have no experience
with those. I saw there was a bug fix in Solr 1.4.1 for a 100-byte memory
leak occurring on each commit, but it would take thousands of commits to
make that ad
On 11/30/2010 2:27 PM, Cinquini, Luca (3880) wrote:
Hi,
I'd like to know if anybody has suggestions/opinions on what is
currently the best architecture for a distributed search system using Solr. The
use case is that of a system composed
of N indexes, each hosted on a separate machine,
I cannot say how mature the code for B) is, but it is not yet included
in a release.
If you want the ability to distribute content across multiple nodes (due
to volume) and want resilience, then use both.
I've had one setup where we have two master servers, each with four
cores. Then we have two
Hi Tommaso,
I believe you can tell each server to act as a master (which means it
can have its indexes pulled from it).
You can then include the master hostname in the URL that triggers a
replication process. Thus, if you triggered replication from outside
solr, you'd have control over which mast
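As a sketch (host names and ports are placeholders), the 1.4
ReplicationHandler lets you trigger such a pull on a slave with:

  http://slave-host:8983/solr/replication?command=fetchindex&masterUrl=http://master-host:8983/solr/replication

Whichever masterUrl you pass is the server whose index gets pulled.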
After a recent Windows 7 crash (:-\), upon restart, Solr starts giving
LockObtainFailedException errors: (excerpt)
30-Nov-2010 23:10:51 org.apache.solr.common.SolrException log
SEVERE: org.apache.lucene.store.LockObtainFailedException: Lock
obtain timed out:
NativeFSLock@solr\.\.\data0\index
I don't know who you are replying to here, but...
There's nothing to stop you doing:
* import 2m docs
* sleep 2 days
* import 2m docs
* sleep 2 days
* repeat above until done
* commit
There's no reason why you should commit regularly. If you need to slow
down for your DB, do, but that does
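A minimal SolrJ sketch of that pattern (assuming SolrJ 1.4's
CommonsHttpSolrServer; the URL, field names, batch sizes and sleep are
placeholder assumptions):

import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class BatchImport {
    public static void main(String[] args) throws Exception {
        CommonsHttpSolrServer server =
            new CommonsHttpSolrServer("http://localhost:8983/solr");
        for (int batch = 0; batch < 4; batch++) {
            for (int i = 0; i < 2000000; i++) {
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", batch + "-" + i);
                server.add(doc);        // sent to Solr, but not committed
            }
            Thread.sleep(1000L * 60);   // throttle here if the DB needs a rest
        }
        server.commit();                // one commit at the very end
    }
}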
Hi Robert,
I'd recommend launching Tomcat with -XX:+HeapDumpOnOutOfMemoryError
and -XX:HeapDumpPath=, so then
you have something to look at versus a Gedankenexperiment :)
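For Tomcat that might look like the following in setenv.sh or wherever
CATALINA_OPTS is set (the dump path is a placeholder):

  CATALINA_OPTS="$CATALINA_OPTS -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/tmp/tomcat-dumps"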
-- Ken
On Nov 30, 2010, at 3:04pm, Robert Petersen wrote:
Greetings, we are running one master and four slaves of our
Greetings, we are running one master and four slaves of our multicore
solr setup. We just served searches for our catalog of 8 million
products with this farm during black Friday and cyber Monday, our
busiest days of the year, and the servers did not break a sweat! Index
size is about 28GB.
H
Greetings, we're wondering why we can issue the command to shutdown
tomcat/solr but the process remains visible in memory (by using the top
command) and we have to manually kill the PID for it to release its
memory before we can (re)start tomcat/solr? Anybody have any ideas?
The process is using 1
Hi,
I'd like to know if anybody has suggestions/opinions on what is
currently the best architecture for a distributed search system using Solr. The
use case is that of a system composed
of N indexes, each hosted on a separate machine, each index containing unique
content.
Options that I
I set maxFieldLength to 2147483647, restarted Tomcat and re-indexed the PDF
files again. I also commented out the one in the section. Unfortunately
the files are still truncated if the file size is more than 20MB.
Any suggestions? I really appreciate your help!
Xiaohui
-Original Message-
Bump. Anyone?
-J
On Nov 29, 2010, at 3:17 PM, John Williams wrote:
> Recently, we have started to get "Bad file descriptor" errors in one of our
> Solr instances. This instance is a searcher and its index is stored on a
> local SSD. The master however has its index stored on NFS, which seems
We've got a largish corpus (~94 million documents). We'd like to be able
to sort on one of the string fields. However this takes an incredibly
long time. A warming query for that field takes about ~20 minutes.
However most of the time the result sets are small since we use filters
heavily - typ
Thanks so much for your help!
Xiaohui
-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: Tuesday, November 30, 2010 2:01 PM
To: solr-user@lucene.apache.org
Subject: Re: how to set maxFieldLength to unlimitd
Set the value in solrconfig.xml to, say, 2147483647
Hi,
I am using the cached SQL entity processor in my data config; please find
below the structure of my data config file.
Object property and relationship need to be matched against each object.
Whenever the data is being returned by all the 3 entities (all 3 sel
Hi,
Thanks Jacob and Ken for your replies.
I am not able to change project architecture to add Lucandra even if it
looks like a nice solution.
Going the VIP way can definitely an option even if I'd be more keen to solve
that "inside" Solr.
I am thinking to try and play with Collection Distribution
Set the value in solrconfig.xml to, say, 2147483647
Also, see this thread for a common gotcha:
http://lucene.472066.n3.nabble.com/Solr-ignoring-maxFieldLength-td473263.html
It appears you can just comment out the one in the section.
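For reference, a sketch of the setting itself (in the 1.4 example
solrconfig.xml the element appears twice, in the indexDefaults and mainIndex
sections, and the second silently overrides the first; that is presumably
the gotcha in the thread above):

  <maxFieldLength>2147483647</maxFieldLength>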
Best
Erick
On Tue, Nov 30, 2010 at 1:48 PM, Ma, Xiaohui (NI
On Tue, Nov 30, 2010 at 3:09 PM, Yonik Seeley
wrote:
> On Tue, Nov 30, 2010 at 8:24 AM, Martin Grotzke
> wrote:
>> Still I'm wondering, why this issue does not occur with the plain
>> example solr setup with 2 indexed docs. Any explanation?
>
> It's an old option you have in your solrconfig.xml t
I need index and search some pdf files which are very big (around 1000 pages
each). How can I set maxFieldLength to unlimited?
Thanks so much for your help in advance,
Xiaohui
Hi Tommaso,
On Nov 30, 2010, at 7:41am, Tommaso Teofili wrote:
Hi all,
in a replication environment if the host where the master is running
goes
down for some reason, is there a way to communicate to the slaves to
point
to a different (backup) master without manually changing
configurati
Okay.
The query kills the database because no index on modified is set ...
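A hedged illustration of the fix (table and column names are placeholder
assumptions for whatever the deltaQuery actually reads):

  CREATE INDEX idx_modified ON sessions (modified);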
I don't know, you'll have to debug it to see if it's the thing that takes so
long. Solr
should be able to handle 1,200 updates in a very short time unless there's
something
else going on, like you're committing after every update or something.
This may help you track down performance with DIH
htt
Hi,
I am running multiple Solr cores (solr-tomcat 1.4.0+ds1-1ubuntu1) under
Tomcat (6.0.24-2ubuntu1.4) on Ubuntu 10.04.1. I have a master server where
all Solr writes go, and a slave server that replicates all cores from the
master, and accepts all read-only queries.
After maxing out PermGen spac
Rather, have a master and multiple-slave combination, with the master only
being used for writes and the slaves used for reads.
Master-to-slave replication is easily configurable.
Two Solr instances sharing the same index is not at all a good idea with
both writing to the same index.
Regards,
Jayendra
On T
fieldNorm is the combination of the length of the field with index- and
query-time boosts.
1. lengthNorm = measure of the importance of a term according to the
total number of terms in the field
   1. Implementation: 1/sqrt(numTerms)
   2. Implication: a term matched in fields with
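A quick worked example of that implementation: a field with 4 terms gets
lengthNorm = 1/sqrt(4) = 0.5, while a field with 100 terms gets
1/sqrt(100) = 0.1, so a match in the shorter field contributes five times
the norm.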
Your best bet might be to look into Lucandra:
https://github.com/tjake/Lucandra
On Tue, Nov 30, 2010 at 10:41 AM, Tommaso Teofili wrote:
> Hi all,
>
> in a replication environment if the host where the master is running goes
> down for some reason, is there a way to communicate to the slaves to
Hi all,
in a replication environment if the host where the master is running goes
down for some reason, is there a way to communicate to the slaves to point
to a different (backup) master without manually changing configuration (and
restarting the slaves or their cores)?
Basically I'd like to be
On Tue, Nov 30, 2010 at 9:45 AM, Jacob Elder wrote:
> Right. CJK doesn't tend to have a lot of whitespace to begin with. In the
> past, we were using a patched version of StandardTokenizer which treated
> @twitteruser and #hashtag better, but this became a release engineering
> nightmare so we swi
Hmm, I found some similar queries on stackoverflow and they did not recommend
exposing the lucene docId.
So, I guess my question becomes: What is the best way, from within my custom
QParser, to take a list of solr primary keys (that were retrieved from
elsewhere) and turn them into docIds? I a
Right. CJK doesn't tend to have a lot of whitespace to begin with. In the
past, we were using a patched version of StandardTokenizer which treated
@twitteruser and #hashtag better, but this became a release engineering
nightmare so we switched to Whitespace.
Perhaps I could rephrase the question a
Hello,
Can someone explain the difference between queryNorm and fieldNorm in
debugQuery?
Why, if I push one bf boost up, does the queryNorm go down?
I made some modifications.. before, the situation was different. Why?
Thanks
--
Gastone Penzo
+1
That's exactly what we need, too.
On Mon, Nov 29, 2010 at 5:28 PM, Shawn Heisey wrote:
> On 11/29/2010 3:15 PM, Jacob Elder wrote:
>
>> I am looking for a clear example of using more than one tokenizer for a
>> single source field. My application has a single "body" field which until
>> rece
I copied the wrong query, hence the 10 hours ;)
I didn't test the query with 28 million records, but with a few million, and
it works fine. ...
Before I used DIH, I used PHP and imported documents directly into Solr, but
I want to use DIH because of the better performance, I think ... grml ...
On Tue, Nov 30, 2010 at 8:24 AM, Martin Grotzke
wrote:
> Still I'm wondering, why this issue does not occur with the plain
> example solr setup with 2 indexed docs. Any explanation?
It's an old option you have in your solrconfig.xml that causes a
different code path to be followed in Solr:
How do you think the deltaQuery could be better? XD
Every day ~30,000 documents and every hour ~1,200.
Multiple threads with DIH? How does that work?
Please provide more data. Specifically:
- How many documents are updated?
- Have you tried running this query without Solr? In other words,
have you investigated whether the speed issue is simply your
SQL executing slowly?
- Why are you selecting the last 10 hours' data when all you want
i
Hello.
The index is about 28 million documents. When I start a delta-import it
looks at modified, but the delta import takes too long: Solr needs over an
hour for the delta.
That's my query: all sessions from the last hour should be updated, plus
everything changed. I think it's normal that Solr needs a long time for t
See below. If this still doesn't make sense, could you show us some
examples?
Best
Erick
On Tue, Nov 30, 2010 at 8:33 AM, Greg Smith wrote:
> Bernd,
>
> Looking at the results returned in the search results the field is
> populated
> with all of the information regardless of whether there was an
Bernd,
Looking at the results returned in the search results, the field is
populated with all of the information regardless of whether there was an
email contained in the contents.
Would the analysers and tokens be handled differently if using a copy
field?
Thanks
On 30 November 2010 10:54
On Tue, Nov 30, 2010 at 10:29 AM, Michael McCandless
wrote:
> Hmm this is in fact a regression.
>
> TopFieldCollector expects (but does not verify) that numHits is > 0.
>
> I guess to fix this we could fix TopFieldCollector.create to return a
> NullCollector when numHits is 0.
Fixing this in luce
Solr doesn't lock anything as far as I know, it just executes the
query you specify. The query you specify may well do bad things
to your database, but that's not Solr's fault. What happens if you
simply try executing the query outside Solr? Do you see the
same "locking" behavior?
You might want t
We do a lot of precisely this sort of thing. Ours is a commercial
product (Honeycomb Lexicon) that extracts behavioural information from
logs, events and network data (don't worry, I'm not pushing this on
you!) - only to say that there are a lot of considerations beyond base
Solr when it comes to h
I know it's not Solr, but perhaps you should have a look at it:
http://www.cloudera.com/blog/2010/09/using-flume-to-collect-apache-2-web-server-logs/
On Tue, Nov 30, 2010 at 12:58 PM, Peter Karich wrote:
> take a look into this:
> http://vimeo.com/16102543
>
> for that amount of data it isn'
Hi,
I have a windows cluster that I would like to install Solr onto, there
are two nodes that provide basic failover. I was thinking of this setup:
Tomcat installed as win service
Two solr instances sharing the same index
The second instance would take over when the first fails, so you should
take a look into this:
http://vimeo.com/16102543
for that amount of data it isn't that easy :-)
We are looking into building a reporting feature and investigating solutions
which will allow us to search through our logs for downloads, searches and
view history.
Each log item is relatively smal
Hi,
I was wondering how I would go about getting the lucene docid included in the
results from a solr query?
I've built a QueryParser to query another Solr instance and join the
results of the two instances through the use of a Filter. The Filter needs the
lucene docid to work. This is th
On Nov 29, 2010, at 5:17 PM, Shawn Heisey wrote:
> I was just in a meeting where we discussed customer feedback on our website.
> One thing that the users would like to see is "galleries" where photos that
> are part of a set are grouped together under a single result. This is
> basically fi
Hi,
I found the problem:
The class name changed in 1.4.1:
From: import org.apache.solr.response.SolrQueryResponse;
To: import org.apache.solr.request.SolrQueryResponse;
Best,
---
Hong-Thai
-Original Message-
De : Hong-Thai Nguyen [mailto:hong-thai.ngu...@polyspot
On 30.11.2010 10:56, Greg Smith wrote:
> Hi,
>
> I have written a plugin to filter on email types and keep those tokens;
> however, when I run it in the analysis page in the admin UI, it all works fine.
>
> But when I use the data import handler to import the data and set the field
> type it doesn't rem
Hi,
I have written a plugin to filter on email types and keep those tokens;
however, when I run it in the analysis page in the admin UI, it all works
fine.
But when I use the data import handler to import the data and set the field
type it doesn't remove the other tokens and keeps the field in the original
Ahhh I see.. good point.. yes, for a high number of unique scores the
secondary sort won't have any effect..
On 30 November 2010 09:32, Jason Brown wrote:
> Hi - you do understand my case - we tried what you suggested but as the
> relevancy is very precise we couldn't get it to do a dual-sort.
Hi - you do understand my case - we tried what you suggested but as the
relevancy is very precise we couldn't get it to do a dual-sort.
I like the idea of using one of the dismax parameters (bf) to in-effect
increase the boost on a newer document.
Thanks for all replies, most useful.
---
Hmm this is in fact a regression.
TopFieldCollector expects (but does not verify) that numHits is > 0.
I guess to fix this we could fix TopFieldCollector.create to return a
NullCollector when numHits is 0.
But: why is your app doing this? Ie, if numHits (rows) is 0, the only
useful thing you ca
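If an application really must ask for rows=0, one workaround (a sketch
against the Lucene 2.9/3.x Collector API, not committed code) is a collector
that only counts:

import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.Collector;
import org.apache.lucene.search.Scorer;

// Counts hits without allocating a priority queue, so a numHits of 0 never
// reaches TopFieldCollector.create.
public class CountingCollector extends Collector {
    private int count = 0;
    public void setScorer(Scorer scorer) {}
    public void collect(int doc) throws IOException { count++; }
    public void setNextReader(IndexReader reader, int docBase) {}
    public boolean acceptsDocsOutOfOrder() { return true; }
    public int getCount() { return count; }
}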
hi,
I might not understand your case right but can you not add an extra
publishedDate field and then specify a secondary (after relevance) sort by
that?
On 30 November 2010 08:05, wrote:
> You could also put a short representation of the date (I suggest days since
> 01.01.2010) as payload and c
The index itself isn't corrupt - just one of the segment files. This
means you can read the index (less the offending segment(s)), but once
this happens it's no longer possible to
access the documents that were in that segment (they're gone forever),
nor write/commit to the index (depending on the
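If you end up in that state, one recovery option (jar name and index path
are placeholders; note it permanently drops the unreadable segments) is
Lucene's CheckIndex tool:

  java -cp lucene-core-2.9.1.jar org.apache.lucene.index.CheckIndex /path/to/index -fix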
> As mentioned, in the typical case it's important that the field names be
> included in the signature, but i imagine there would be cases where you
> wouldn't want them included (like a simple concat Signature for building
> basic composite keys)
>
> I think the Signature API could definitely
Aha aha :D
Hmm, I don't know. We import in 2-million steps because we think that Solr
locks our database and we want better control of the import ...
We had the same problem for our fields and we wrote a Tokenizer using the
icu4j library, breaking tokens at script changes and dealing with them
according to the script and the configured BreakIterators.
This works out very well, as we also add the "script" information to the
token so later filter
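A tiny sketch of the script-change detection underlying that approach
(assuming icu4j on the classpath; this is not the poster's actual Tokenizer):

import com.ibm.icu.lang.UScript;

public class ScriptBoundaries {
    public static void main(String[] args) {
        String text = "Tokyo\u6771\u4eacabc";  // Latin, then CJK, then Latin
        int prevScript = -1;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            int script = UScript.getScript(cp);   // ICU script code
            if (prevScript != -1 && script != prevScript) {
                System.out.println("script change at offset " + i);
            }
            prevScript = script;
            i += Character.charCount(cp);
        }
    }
}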
I found the problem: solr.EnglishPorterFilterFactory, as shown in the
parsedquery.
Here is the result with &debugQuery.
For term annual:
  rawquerystring: annual
  querystring: annual
  parsedquery: text:year text:twelve-month text:onceayear text:yearbook
  parsedquery_toString: text:year text:twelve-month text:onceayear text:yearbook
  QParser: LuceneQParser
  time: 63.0
For term welcome:
  rawquerystring: welcome
  querystring: welcome
  parsedquery: text:welcom
  parsedquery_toString: text:welcom
You could also put a short representation of the date (I suggest days since
01.01.2010) as payload and calculate the boost with the payload function of
the similarity.
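As a sketch of that idea (assuming the Lucene 2.9 API bundled with Solr 1.4;
check the exact scorePayload signature for your version, and the ramp factor
below is an arbitrary assumption):

import org.apache.lucene.analysis.payloads.PayloadHelper;
import org.apache.lucene.search.DefaultSimilarity;

// The payload holds days since 01.01.2010, written at index time with
// PayloadHelper.encodeInt; newer documents get a slightly larger multiplier.
public class RecencyPayloadSimilarity extends DefaultSimilarity {
    public float scorePayload(int docId, String fieldName, int start, int end,
                              byte[] payload, int offset, int length) {
        if (payload == null || length == 0) return 1.0f;
        int days = PayloadHelper.decodeInt(payload, offset);
        return 1.0f + days * 0.001f;  // linear recency ramp (assumed)
    }
}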
>-Original Message-
>From: ext Jason Brown [mailto:jason.br...@sjp.co.uk]
>Sent: Montag, 29. November 2010 17:28
>To: solr-user@