Is there a way it can be done with a plug-in using the lower-level Lucene SDK?
Maybe some custom implementation of TermQuery where the value of "?" always
matches any term in the query?
Thanks!
Robert Stewart
From: Toke Eskildsen [t...@statsbiblioteket.dk]
Sent: Wednesday, August 07, 2013 7:45 AM
To: solr-user@lucene.apache.org
Subject: Re: poor facet search performance
On Tue, 2013-07-30 at 21:48 +0200, Robert Stewart wrote:
[Custom facet structure
We have a lot of cores on our servers so
it works well.
A little bit of history:
We built a Solr-like solution on Lucene.NET and C# about 5 years ago, which
included faceted search. In order to get really good facet performance, we
pre-cached all the facet fields in RAM as efficient compressed data
structures (either a variable byte en
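Roughly the kind of structure I mean, as a minimal illustrative sketch in Java (our real implementation was C#/Lucene.NET; this is not that code): the sorted doc IDs for one facet value stored as variable-byte encoded gaps.

```java
import java.io.ByteArrayOutputStream;

public class FacetPostings {
    // Encode a sorted list of doc IDs for one facet value as variable-byte gaps.
    // Small gaps take a single byte, so dense facet values compress very well.
    public static byte[] encode(int[] sortedDocIds) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int prev = 0;
        for (int docId : sortedDocIds) {
            int delta = docId - prev;          // store the gap, not the absolute ID
            prev = docId;
            while ((delta & ~0x7F) != 0) {     // 7 bits per byte, high bit = "more follows"
                out.write((delta & 0x7F) | 0x80);
                delta >>>= 7;
            }
            out.write(delta);
        }
        return out.toByteArray();
    }
}
```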
Sent: Tuesday, July 16, 2013 6:35 PM
To: solr-user@lucene.apache.org
Subject: Re: Where to specify numShards when startup up a cloud setup
On 7/16/2013 3:36 PM, Robert Stewart wrote:
> I want to script the creation of N solr cloud instances (on ec2).
>
> But its not clear to me wh
I want to script the creation of N solr cloud instances (on ec2).
But it's not clear to me where I would specify the numShards setting.
From the documentation, I see you can specify it on the "first node" you start up, or
alternatively use the "collections" API to create a new collection - but in
that c
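For reference, these are the two places I know of (worth checking against the docs for your exact version; host, collection and config names below are made up):

```
# Option 1: system property on the first node you start (bootstrap style)
java -DzkRun -DnumShards=4 -Dbootstrap_confdir=./solr/collection1/conf \
     -Dcollection.configName=myconf -jar start.jar

# Option 2: Collections API (other params such as replicationFactor exist
# depending on version)
curl 'http://host1:8983/solr/admin/collections?action=CREATE&name=mycollection&numShards=4'
```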
I would like to be able to do it without consulting ZooKeeper. Is there some
variable or API I can call on a specific Solr cloud node to know if it is
currently a shard leader? The reason is that I want to perform an index
backup on the shard leader from a cron job *only* if that node is
I have a project where I am porting an existing application from direct
Lucene API usage to using SOLR and the SOLRJ client API.
The problem I have is that indexing is 2-5x slower using SOLRJ+SOLR
than using the direct Lucene API.
I am creating batches of between 200 and 500 documents per
call to
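For reference, the batching loop looks roughly like this (a SolrJ 3.6/4.x sketch; the URL and field names are made up, and ConcurrentUpdateSolrServer may also be worth trying for raw indexing throughput):

```java
import java.util.ArrayList;
import java.util.List;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class BatchIndexer {
    public static void main(String[] args) throws Exception {
        HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");
        List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
        for (int i = 0; i < 100000; i++) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", Integer.toString(i));
            doc.addField("title", "document " + i);
            batch.add(doc);
            if (batch.size() >= 500) {   // 200-500 docs per add() call
                server.add(batch);       // one HTTP request per batch
                batch.clear();
            }
        }
        if (!batch.isEmpty()) server.add(batch);
        server.commit();                 // commit once at the end, not per batch
        server.shutdown();
    }
}
```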
Hi,
I have a client who uses Lucene in a home-grown CMS system they
developed in Java. They have a lot of code that uses the Lucene API
directly and they can't change it now. But they also need to use SOLR
for some other apps which must use the same Lucene index data. So I
need to make a good w
We used to have one large index - then moved to 10 shards (7 million docs each)
- parallel search across all shards, and we get better performance that way.
We use a 40-core box with 128 GB RAM. We do a lot of faceting, so maybe that is
why, since facets can be built in parallel on different thre
Is your balance field multi-valued by chance? I don't have much
experience with the stats component, but it may be very inefficient for
larger indexes. How is memory/performance if you turn stats off?
On Thu, Mar 15, 2012 at 11:58 AM, harisundhar wrote:
> I am using apache solr 3.5.0
>
> I have a i
Split up the index into, say, 100 cores, and then route each search to a specific
core using a mod operation on the user id:
core_number = userid % num_cores
core_name = "core"+core_number
That way each index core is relatively small (maybe 100 million docs or less).
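In SolrJ terms the routing is just something like this sketch (SolrJ 3.6+/4.x; the base URL and core naming are whatever you configured):

```java
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class UserCoreRouter {
    private static final int NUM_CORES = 100;
    private static final String BASE_URL = "http://localhost:8983/solr";

    // core_number = userid % num_cores; core_name = "core" + core_number
    static String coreName(long userId) {
        return "core" + (userId % NUM_CORES);
    }

    public static void main(String[] args) throws Exception {
        long userId = 12345L;
        HttpSolrServer core = new HttpSolrServer(BASE_URL + "/" + coreName(userId));
        QueryResponse rsp = core.query(new SolrQuery("type:message"));
        System.out.println("hits: " + rsp.getResults().getNumFound());
    }
}
```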
On Mar 9, 2012, at 2:02 PM, Glen
It very much depends on your data and also what query features you will use.
How many fields, the size of each field, how many unique values per field, how
many fields are stored vs. only indexed, etc. I have a system with 3+ billion
docs, and each instance (each index core) has 120 million doc
mapping="mappings.txt"/>
>
>
>
>
>
>
> --- On Thu, 3/8/12, Robert Stewart wrote:
>
>> From: Robert Stewart
>> Subject: Re: wildcard queries with edismax and lucene q
Any help on this? I am really stuck on a client project. I need to
know how scoring works with wildcard queries under SOLR 3.2.
Thanks
Bob
On Mon, Mar 5, 2012 at 4:22 PM, Robert Stewart wrote:
> How is scoring affected by wildcard queries? Seems when I use a
> wildcard query I g
How is scoring affected by wildcard queries? It seems when I use a
wildcard query I get all constant scores in the response (all scores =
1.0). That occurs with both edismax and the lucene query parser.
I am trying to implement an auto-suggest feature, so I need to use a wild
card to return all results tha
Any segment files on SSD will be faster in cases where the file is not
in the OS cache. If you have enough RAM, a lot of index segment files will
end up in the OS cache, so it won't have to go to disk anyway. Since
most indexes are bigger than RAM, an SSD helps a lot. But if the index is
much larger than
2 at 5:31 AM, Robert Stewart wrote:
>
>> I implemented an index shrinker and it works. I reduced my test index
>> from 6.6 GB to 3.6 GB by removing a single shingled field I did not
>> need anymore. I'm actually using Lucene.Net for this project so code
>> is C# us
alues, it can also override AtomicReader.docValues(). Just
> return null for the fields you want to remove. Maybe it should
> traverse CompositeReader's getSequentialSubReaders() and wrap each
> AtomicReader.
>
> Other things like term vectors and norms are similar.
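Something like this wrapper, as a rough sketch against the Lucene 4.x API (method names differ in other versions; only a couple of the overrides are shown):

```java
import java.io.IOException;
import java.util.Set;
import org.apache.lucene.index.AtomicReader;
import org.apache.lucene.index.FilterAtomicReader;
import org.apache.lucene.index.NumericDocValues;

// Pretend the unwanted fields simply have no doc values / norms; the same idea
// applies to term vectors and stored fields with the corresponding overrides.
class FieldStrippingReader extends FilterAtomicReader {
    private final Set<String> removedFields;

    FieldStrippingReader(AtomicReader in, Set<String> removedFields) {
        super(in);
        this.removedFields = removedFields;
    }

    @Override
    public NumericDocValues getNumericDocValues(String field) throws IOException {
        return removedFields.contains(field) ? null : in.getNumericDocValues(field);
    }

    @Override
    public NumericDocValues getNormValues(String field) throws IOException {
        return removedFields.contains(field) ? null : in.getNormValues(field);
    }
}
```

The wrapped reader(s) can then be written out to a new index with IndexWriter.addIndexes(...).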
> On Wed
if you want to get
>> stored fields, and the useless fields are very long, then it will slow
>> down.
>> Also it's possible to hack it, but it needs more effort to
>> understand the index file format and traverse the fdt/fdx files.
>> http://lucene.apache
Let's say I have a large index (100M docs, 1 TB, split up between 10 indexes).
And a bunch of the "stored" and "indexed" fields are not used in search at all.
In order to save memory and disk, I'd like to rebuild that index *without*
those fields, but I don't have the original documents to rebuild e
I concur with this. As long as index segment files are cached in the OS file cache,
performance is about as good as it gets. Pulling segment files into RAM inside the
JVM process may actually be slower, given Lucene's existing data structures and
algorithms for reading segment file data. If you have
You are probably better off splitting each book into separate SOLR
documents, one document per paragraph (each document with the same book ID, ISBN,
etc.). Then you can use field-collapsing on the book ID to return a single
document per book. And you can use highlighting to show the paragraph
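A grouped query would look roughly like this (field names here are hypothetical; check the Result Grouping docs for your version):

```
/select?q=whale+hunting
        &group=true
        &group.field=book_id
        &group.limit=1
        &hl=true
        &hl.fl=paragraph_text
```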
I have a multi-core setup, and for each core I have a shared
data-config.xml which specifies a SQL query for data import. What I
want to do is have the same data-config.xml file shared between my
cores (linked to the same physical file). I'd like to specify core
properties in solr.xml such that each c
Is it possible to configure the schema to remove HTML tags from stored
field content? As far as I can tell, analyzers can only be applied to
indexed content; they don't affect stored content. I want to
remove HTML tags from text fields so that the returned text content from
the stored field has no HTML ta
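One option I believe works in recent versions (3.6+/4.x, worth verifying) is an update processor chain with HTMLStripFieldUpdateProcessorFactory, which strips the markup before the value is stored or indexed. The field and chain names below are made up:

```xml
<updateRequestProcessorChain name="strip-html">
  <processor class="solr.HTMLStripFieldUpdateProcessorFactory">
    <str name="fieldName">body_text</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>

<requestHandler name="/update" class="solr.UpdateRequestHandler">
  <lst name="defaults">
    <str name="update.chain">strip-html</str>
  </lst>
</requestHandler>
```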
I have a project where the client wants to store time series data
(maybe in SOLR if it can work). We want to store daily "prices" over the
last 20 years (about 6000 values with associated dates), for up to
500,000 entities.
This data currently exists in a SQL database. Access to SQL is too
slow for c
Any idea how many documents your 5 TB of data contains? Certain features such as
faceting depend more on the total number of documents than on the actual size of the data.
I have tested approx. 1 TB (100 million documents) running on a single machine
(40 cores, 128 GB RAM), using distributed search across 10 shard
I have a SOLR instance running as a proxy (no data of its own); it just uses a
multicore setup where each core has a shards parameter in the search handler.
So my setup looks like this:
solr_proxy/
multicore/
/public - solrconfig.xml has "shards" pointing to some other
SOL
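The relevant piece of the proxy core's solrconfig.xml looks roughly like this (host and core names are placeholders):

```xml
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="shards">solr1:8983/solr/core1,solr2:8983/solr/core1,solr3:8983/solr/core1</str>
  </lst>
</requestHandler>
```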
his sort of
> setup...
>
> Otis
>
> Performance Monitoring SaaS for Solr -
> http://sematext.com/spm/solr-performance-monitoring/index.html
>
>
>
> - Original Message -
>> From: Robert Stewart
>> To: solr-user@lucene.apache.org
>> Cc:
>> Sen
of heap size in worst case.
On Thu, Dec 15, 2011 at 2:14 PM, Robert Stewart wrote:
> It is true number of terms may be much more than N/10 (or even N for
> each core), but it is the number of docs per term that will really
> matter. So you can have N terms in each core but each term
It is true the number of terms may be much more than N/10 (or even N for
each core), but it is the number of docs per term that will really
matter. So you can have N terms in each core, but each term has 1/10
the number of docs on average.
2011/12/15 Yury Kats :
> On 12/15/2011 1:07 PM, Robert Stew
I don't have any measured data, but here are my thoughts.
I think overall memory usage would be close to the same.
Speed will be slower in general, because if search speed is approx
log(n) then 10 * log(n/10) > log(n) (for example, with n = 100M and base-2 logs,
log(n) is about 27 while 10 * log(n/10) is about 233), and also if merging results you
have overhead in the merge step and also if fetc
rchive is very simple. No "holes" in the index (which
> is often when deleting document by document).
> The index done against core [today-0].
> The query is done against cores [today-0],[today-1]...[today-99]. Quite a
> headache.
>
> Itamar
>
> -O
We have a large (100M doc) index where we add about 1M new docs per day.
We want to keep the index at a constant size, so the oldest docs are
removed and/or archived each day (so the index contains around 100 days of
data). What is the best way to do this? We still want to keep older
data in some archive inde
I am about to try the exact same thing, running SOLR on top of Lucene indexes
created by Lucene.Net 2.9.2. AFAIK, it should work. Not sure if the indexes
become non-backwards-compatible once any new documents are written to them by
SOLR, though. Probably good to make a backup first.
On Dec 13, 2011,
Has anyone implemented some social/collaboration features on top of SOLR? What
I am thinking of is the ability to add ratings and comments to documents in SOLR and
then be able to fetch the comments and ratings for each document in the results (and
have them as part of the response from SOLR), similar in fashion to ML
I have used two different ways:
1) Store the mapping from users to documents in some external database
such as MySQL. At search time, look up the mapping from the user to some unique
doc ID or group ID, and then build a query or doc set which you can
cache in the SOLR process for some period. Then use that as
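Roughly like this with SolrJ (a sketch only; the DB lookup, URL and the group_id field are placeholders):

```java
import java.util.List;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class AclSearch {
    // Placeholder: look up the group IDs this user may see (e.g. via JDBC).
    static List<String> lookupGroupIds(long userId) {
        throw new UnsupportedOperationException("external DB lookup goes here");
    }

    public static void main(String[] args) throws Exception {
        long userId = 42L;
        StringBuilder fq = new StringBuilder("group_id:(");
        for (String groupId : lookupGroupIds(userId)) {
            fq.append(groupId).append(' ');   // space-separated values act as OR with the default q.op
        }
        fq.append(')');

        SolrQuery q = new SolrQuery("some user query");
        q.addFilterQuery(fq.toString());      // fq results get cached in Solr's filterCache
        QueryResponse rsp = new HttpSolrServer("http://localhost:8983/solr/core0").query(q);
        System.out.println("hits: " + rsp.getResults().getNumFound());
    }
}
```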
If you request 1000 docs from each shard, then the aggregator is really
fetching 30,000 total documents, which it must then merge (re-sort the
results and take the top 1000 to return to the client). It's possible that
SOLR's merging implementation needs to be optimized, but it does not seem like
it could be that slow. H
Is there any way to give a name to a facet query, so you can pick
facet values from the results using some name as a key (rather than
looking for a match via the query itself)?
For example, in the request handler I have:
publish_date:[NOW-7DAY TO NOW]
publish_date:[NOW-1MONTH TO NOW]
I'd like the results to h
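I believe the {!key=...} local param does this for facet queries (worth verifying on your version). In the request handler that would be something like:

```xml
<str name="facet.query">{!key=last7days}publish_date:[NOW-7DAY TO NOW]</str>
<str name="facet.query">{!key=lastmonth}publish_date:[NOW-1MONTH TO NOW]</str>
```

The facet counts should then come back under "last7days" and "lastmonth" instead of the raw query strings.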
If I have 2 masters in a master-master setup, where one master is the
live master and the other acts as a backup in slave mode, and
then during failover the slave master accepts new documents, such that
the indexes become out of sync, how can the original live master index get
back into sync with the
I think you can address a lot of these concerns by running some proxy in front
of SOLR, such as HAProxy. You should be able to limit access to only certain URIs (so
you can prevent /select queries). HAProxy is a free software load-balancer,
and it is very configurable and fairly easy to set up.
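A minimal sketch of that kind of rule (HAProxy 1.5+ syntax; the allowed path is hypothetical, e.g. a restricted request handler, and everything else including raw /select and /update gets denied):

```
frontend solr_public
    bind *:8080
    # hypothetical rules: only let the restricted search handler through
    acl allowed_path path_beg /solr/core0/public-search
    http-request deny if !allowed_path
    default_backend solr_nodes

backend solr_nodes
    server solr1 127.0.0.1:8983 check
```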
On No
You would need to set up request handlers in solrconfig.xml to limit what types
of queries people can send to SOLR (and define things like max page size, etc.).
You need to restrict people from sending update/delete commands as well.
Then at the minimum, set up some proxy in front of SOLR that y
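For example, a restricted handler where invariants pin parameters the client is not allowed to change (note that invariants fix the value rather than cap it; field and handler names are illustrative):

```xml
<requestHandler name="/public-search" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <str name="qf">title body</str>
  </lst>
  <lst name="invariants">
    <!-- clients cannot override invariants, so the page size stays fixed -->
    <str name="rows">50</str>
  </lst>
</requestHandler>
```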
> Without optimizing the search speed stays the same, however the index size
> increases to 70+ GB.
>
> Perhaps there is a different way to restrict disk usage.
>
> Thanks,
> Jason
>
> Robert Stewart wrote:
>
>
> Optimization merges index to a single segment
One other potentially huge consideration is how "updatable" you need documents
to be. Lucene can only replace existing documents; it cannot modify existing
documents directly (so an update is essentially a delete followed by an insert
of a new document with the same primary key). There are per
It is not a horrible idea. Lucene has a pretty reliable index now (it should
not get corrupted), and you can do backups with replication.
If you need ranked results (sort by relevance) and lots of free-text queries,
then using it makes sense. If you just need boolean search and maybe some sor
Optimization merges the index to a single segment (one huge file), so the entire index
will be copied on replication. So you really do need 2x the disk in some cases
then.
Do you really need to optimize? We have a pretty big total index (about 200
million docs) and we never optimize. But we do have a sh
BTW, this would be a good standard feature for SOLR, as I've run into this
requirement more than once.
On Oct 27, 2011, at 9:49 AM, karsten-s...@gmx.de wrote:
> Hi Robert,
>
> take a look to
> http://lucene.472066.n3.nabble.com/How-to-cut-off-hits-with-score-below-threshold-td3219064.html#a32191
Sounds like a custom sorting collector would work - one that throws away docs
below some minimum score, so that it only collects/sorts documents
with at least that minimum score. AFAIK the score is calculated even if you sort by some
other field.
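A rough sketch of such a wrapping collector against the Lucene 4.x Collector API (3.x differs slightly, e.g. in setNextReader):

```java
import java.io.IOException;
import org.apache.lucene.index.AtomicReaderContext;
import org.apache.lucene.search.Collector;
import org.apache.lucene.search.Scorer;

// Delegate to the real collector only when the score clears a minimum threshold.
class MinScoreCollector extends Collector {
    private final Collector delegate;
    private final float minScore;
    private Scorer scorer;

    MinScoreCollector(Collector delegate, float minScore) {
        this.delegate = delegate;
        this.minScore = minScore;
    }

    @Override public void setScorer(Scorer scorer) throws IOException {
        this.scorer = scorer;
        delegate.setScorer(scorer);
    }

    @Override public void collect(int doc) throws IOException {
        if (scorer.score() >= minScore) delegate.collect(doc);
    }

    @Override public void setNextReader(AtomicReaderContext context) throws IOException {
        delegate.setNextReader(context);
    }

    @Override public boolean acceptsDocsOutOfOrder() {
        return delegate.acceptsDocsOutOfOrder();
    }
}
```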
On Oct 27, 2011, at 9:49 AM, karsten-s...@gmx.de wr
It is really not very difficult to build a decent web front-end to SOLR using
one of the available client libraries (such as solrpy for Python).
I recently built a pretty full-featured search front-end to SOLR in Python (using
the Tornado web server and templates) and it was not difficult at all to bu
If your "documents" are products, then 100,000 documents is a pretty small
index for solr. Do you know approximately how many accessories are related to
each product on average? If # if relatively small (around 100 or less), then
it should be ok to create product documents with all the related
SOLR stores all data in the directory you specify in the dataDir setting in
solrconfig.xml.
SOLR uses Lucene to store all the data in one or more proprietary binary files
called segment files. As a SOLR user you typically should not be too concerned
with the binary index structure. You can see detail
See below...
On Oct 17, 2011, at 11:15 AM, lorenlai wrote:
> 1) I would like to know if it is possible to import data (feeding) while
> Solr is still running ?
Yes. You can search and index new content at the same time. But typically in
production systems you may have one or more "master" SOL
: Replication with an HA master
>
> Hello,
> - Original Message -
>
>> From: Robert Stewart
>> To: solr-user@lucene.apache.org
>> Cc:
>> Sent: Tuesday, October 11, 2011 3:37 PM
>> Subject: Re: Replication with an HA master
>>
>> In t
org
>> Subject: Re: Replication with an HA master
>>
>> A few alternatives:
>> * Have the master keep the index on a shared disk (e.g. SAN)
>> * Use LB to easily switch between masters, potentially even automatically
>> if LB can detect the primary is dow
Your idea sounds like the correct path. Set up 2 masters, one running in
"slave" mode which pulls replicas from the live master. When/if the live master
goes down, you just reconfigure and restart the backup master to be the live
master. You'd also need to then start data import on the backup mast
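One way to make the flip a config/restart-only operation (I believe the "enable" switches below have been supported since around Solr 1.4/3.x, worth double-checking): declare both sections in the replication handler and toggle them with system properties, e.g. -Denable.master=true on the live node and -Denable.slave=true on the backup. URLs are placeholders.

```xml
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="master">
    <str name="enable">${enable.master:false}</str>
    <str name="replicateAfter">commit</str>
    <str name="confFiles">schema.xml,stopwords.txt</str>
  </lst>
  <lst name="slave">
    <str name="enable">${enable.slave:false}</str>
    <str name="masterUrl">http://live-master:8983/solr/core0/replication</str>
    <str name="pollInterval">00:00:20</str>
  </lst>
</requestHandler>
```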
I need some recommendations for a new SOLR project.
We currently have a large (200M docs) production system using Lucene.Net and
what I would call our own .NET implementation of SOLR (built early on when SOLR
was less mature and did not run as well on Windows).
Our current architecture works