Solr query to match document "templates" - sort of a reverse wildcard match

2015-03-06 Thread Robert Stewart
Is there a way it can be done with a plug-in using the lower-level Lucene SDK? Maybe some custom implementation of TermQuery where the value of "?" always matches any term in the query? Thanks! Robert Stewart
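
One way to approximate this without a custom TermQuery is to index the template's "?" as a literal token (e.g. with a whitespace tokenizer) and, at query time, let every position of the concrete query match either the real term or "?" via a MultiPhraseQuery. A minimal sketch, assuming recent Lucene class names and an analyzer that preserves "?":

import org.apache.lucene.index.Term;
import org.apache.lucene.search.MultiPhraseQuery;
import org.apache.lucene.search.Query;

public class TemplateQueryBuilder {
  // queryTokens is the user's concrete query, already analyzed into tokens.
  public static Query build(String field, String[] queryTokens) {
    MultiPhraseQuery.Builder b = new MultiPhraseQuery.Builder();
    for (String tok : queryTokens) {
      // At this position, accept the concrete term OR the indexed "?" token.
      b.add(new Term[] { new Term(field, tok), new Term(field, "?") });
    }
    return b.build();
  }
}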

RE: poor facet search performance

2013-08-07 Thread Robert Stewart
better. From: Toke Eskildsen [t...@statsbiblioteket.dk] Sent: Wednesday, August 07, 2013 7:45 AM To: solr-user@lucene.apache.org Subject: Re: poor facet search performance On Tue, 2013-07-30 at 21:48 +0200, Robert Stewart wrote: [Custom facet structure

RE: poor facet search performance

2013-08-07 Thread Robert Stewart
We have a lot of cores on our servers so it works well. From: Toke Eskildsen [t...@statsbiblioteket.dk] Sent: Wednesday, August 07, 2013 7:45 AM To: solr-user@lucene.apache.org Subject: Re: poor facet search performance On Tue, 2013-07-30 at 21:48 +0200, R

poor facet search performance

2013-07-30 Thread Robert Stewart
A little bit of history: We built a solr-like solution on Lucene.NET and C# about 5 years ago, which included faceted search. In order to get really good facet performance, what we did was pre-cache all the facet fields in RAM as efficient compressed data structures (either a variable byte en
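
The structure described is essentially an uninverted field cache: one ordinal per document pointing into a value dictionary. A toy sketch of the counting side of that idea, for a single-valued field and ignoring the variable-byte compression the post mentions:

public class CachedFacetField {
  private final String[] dictionary; // ordinal -> facet value
  private final int[] docToOrd;      // docId -> ordinal, -1 = no value

  public CachedFacetField(String[] dictionary, int[] docToOrd) {
    this.dictionary = dictionary;
    this.docToOrd = docToOrd;
  }

  // Counting facets over a hit list is one array read and one increment per hit.
  public int[] count(int[] hitDocIds) {
    int[] counts = new int[dictionary.length];
    for (int doc : hitDocIds) {
      int ord = docToOrd[doc];
      if (ord >= 0) counts[ord]++;
    }
    return counts;
  }
}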

RE: Where to specify numShards when starting up a cloud setup

2013-07-17 Thread Robert Stewart
.org] Sent: Tuesday, July 16, 2013 6:35 PM To: solr-user@lucene.apache.org Subject: Re: Where to specify numShards when starting up a cloud setup On 7/16/2013 3:36 PM, Robert Stewart wrote: > I want to script the creation of N solr cloud instances (on ec2). > > But it's not clear to me wh

Where to specify numShards when starting up a cloud setup

2013-07-16 Thread Robert Stewart
I want to script the creation of N solr cloud instances (on ec2). But it's not clear to me where I would specify the numShards setting. From the documentation, I see you can specify it on the "first node" you start up, or alternatively use the "collections" API to create a new collection - but in that c
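
With the Collections API route, numShards is an explicit parameter of the CREATE action, which makes it easy to script. A minimal sketch (host, collection name and replication factor are placeholders):

import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class CreateCollection {
  public static void main(String[] args) throws Exception {
    String url = "http://localhost:8983/solr/admin/collections"
        + "?action=CREATE&name=mycollection&numShards=4&replicationFactor=2";
    HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
    try (InputStream in = conn.getInputStream()) {
      System.out.println(new String(in.readAllBytes())); // raw status response
    }
  }
}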

Is there an easy way to know if a Solr cloud node is a shard leader?

2013-07-09 Thread Robert Stewart
I would like to be able to do it without consulting Zookeeper. Is there some variable or API I can call on a specific Solr cloud node to know if it is currently a shard leader? The reason I want to know is that I want to perform an index backup on the shard leader from a cron job *only* if that node is
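
Newer Solr releases (4.8+) added a CLUSTERSTATUS action to the Collections API, so a cron job can ask the node itself rather than ZooKeeper. A rough sketch; the string matching is deliberately naive (a real script would parse the JSON), and all names are placeholders:

import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class IsLeaderCheck {
  // True if the given core appears in the cluster status as a leader replica.
  public static boolean looksLikeLeader(String solrBase, String coreName) throws Exception {
    String url = solrBase + "/admin/collections?action=CLUSTERSTATUS&wt=json";
    HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
    try (InputStream in = conn.getInputStream()) {
      String body = new String(in.readAllBytes());
      int at = body.indexOf("\"core\":\"" + coreName + "\"");
      if (at < 0) return false;
      // Within that replica's JSON object the flag reads "leader":"true".
      int end = body.indexOf('}', at);
      return body.substring(at, end).contains("\"leader\":\"true\"");
    }
  }
}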

Indexing performance with solrj vs. direct lucene API

2012-11-28 Thread Robert Stewart
I have a project where I am porting an existing application from direct Lucene API usage to using SOLR and the SOLRJ client API. The problem I have is that indexing is 2-5x slower using SOLRJ+SOLR than using the direct Lucene API. I am creating batches of between 200 and 500 documents per call to
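
A common way to close much of that gap is SolrJ's streaming client, which overlaps network I/O with document production (modern class name shown; the SolrJ of that era called it StreamingUpdateSolrServer). A sketch with placeholder URL and fields:

import java.util.ArrayList;
import java.util.List;
import org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class BulkIndexer {
  public static void main(String[] args) throws Exception {
    try (ConcurrentUpdateSolrClient client =
             new ConcurrentUpdateSolrClient.Builder("http://localhost:8983/solr/core1")
                 .withQueueSize(10000)   // buffer documents client-side
                 .withThreadCount(4)     // flush with 4 background threads
                 .build()) {
      List<SolrInputDocument> batch = new ArrayList<>();
      for (int i = 0; i < 500; i++) {
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", Integer.toString(i));
        doc.addField("title_t", "document " + i);
        batch.add(doc);
      }
      client.add(batch);
      client.commit(); // commit once at the end, never per batch
    }
  }
}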

replication from lucene to solr

2012-08-07 Thread Robert Stewart
Hi, I have a client who uses Lucene in a home-grown CMS system they developed in Java. They have a lot of code that uses the Lucene API directly and they can't change it now. But they also need to use SOLR for some other apps which must use the same Lucene index data. So I need to make a good w

Re: Solr Shards multi core slower then single big core

2012-05-14 Thread Robert Stewart
We used to have one large index - then moved to 10 shards (7 million docs each) - parallel search across all shards, and we get better performance that way. We use a 40-core box with 128GB RAM. We do a lot of faceting, so maybe that is why, since facets can be built in parallel on different thre

Re: Solr loads entire Index into Memory

2012-03-15 Thread Robert Stewart
Is your balance field multi-valued by chance? I don't have much experience with the stats component but it may be very inefficient for larger indexes. How is memory/performance if you turn stats off? On Thu, Mar 15, 2012 at 11:58 AM, harisundhar wrote: > I am using apache solr 3.5.0 > > I have a i

Re: Lucene vs Solr design decision

2012-03-09 Thread Robert Stewart
Split up the index into, say, 100 cores, and then route each search to a specific core by some mod operator on the user id: core_number = userid % num_cores core_name = "core"+core_number That way each index core is relatively small (maybe 100 million docs or less). On Mar 9, 2012, at 2:02 PM, Glen
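
A minimal sketch of that routing scheme with SolrJ (core naming follows the hypothetical convention above):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class UserShardRouter {
  private static final int NUM_CORES = 100;

  // Send the user's search to the one core that holds that user's documents.
  public static QueryResponse search(long userId, String queryText) throws Exception {
    int coreNumber = (int) (userId % NUM_CORES);
    String coreUrl = "http://localhost:8983/solr/core" + coreNumber;
    try (HttpSolrClient client = new HttpSolrClient.Builder(coreUrl).build()) {
      return client.query(new SolrQuery(queryText));
    }
  }
}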

Re: indexing bigdata

2012-03-09 Thread Robert Stewart
It very much depends on your data and also what query features you will use. How many fields, the size of each field, how many unique values per field, how many fields are stored vs. only indexed, etc. I have a system with 3+ billion docs, and each instance (each index core) has 120 million doc

Re: wildcard queries with edismax and lucene query parsers

2012-03-08 Thread Robert Stewart
mapping="mappings.txt"/> [remainder of the quoted analyzer configuration lost in the archive rendering] --- On Thu, 3/8/12, Robert Stewart wrote: >> From: Robert Stewart >> Subject: Re: wildcard queries with edismax and lucene q

Re: wildcard queries with edismax and lucene query parsers

2012-03-08 Thread Robert Stewart
Any help on this? I am really stuck on a client project. I need to know how scoring works with wildcard queries under SOLR 3.2. Thanks Bob On Mon, Mar 5, 2012 at 4:22 PM, Robert Stewart wrote: > How is scoring affected by wildcard queries?  Seems when I use a > wildcard query I g

wildcard queries with edismax and lucene query parsers

2012-03-05 Thread Robert Stewart
How is scoring affected by wildcard queries? Seems when I use a wildcard query I get all constant scores in the response (all scores = 1.0). That occurs with both edismax as well as the lucene query parser. I am trying to implement an auto-suggest feature so I need to use a wildcard to return all results tha
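
The constant 1.0 scores come from the default multi-term rewrite, which expands the wildcard into a constant-score query. At the Lucene level the rewrite can be switched to a scoring Boolean expansion (settable on the query in Lucene 8 and earlier; the 3.x API of that era had the same constant). A sketch:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.MultiTermQuery;
import org.apache.lucene.search.PrefixQuery;

public class ScoredPrefix {
  public static PrefixQuery build(String field, String prefix) {
    PrefixQuery q = new PrefixQuery(new Term(field, prefix));
    // Expand into a scoring BooleanQuery instead of a constant-score wrapper.
    // Caveat: can throw TooManyClauses if the prefix matches many terms.
    q.setRewriteMethod(MultiTermQuery.SCORING_BOOLEAN_QUERY_REWRITE);
    return q;
  }
}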

Re: flashcache and solr/lucene

2012-03-01 Thread Robert Stewart
Any segment files on SSD will be faster in cases where the file is not in the OS cache. If you have enough RAM a lot of index segment files will end up in the OS system cache so it won't have to go to disk anyway. Since most indexes are bigger than RAM an SSD helps a lot. But if the index is much larger than

Re: Can I rebuild an index and remove some fields?

2012-02-16 Thread Robert Stewart
2 at 5:31 AM, Robert Stewart wrote: > >> I implemented an index shrinker and it works.  I reduced my test index >> from 6.6 GB to 3.6 GB by removing a single shingled field I did not >> need anymore.  I'm actually using Lucene.Net for this project so code >> is C# us

Re: Can I rebuild an index and remove some fields?

2012-02-15 Thread Robert Stewart
alues, it can also override AtomicReader.docValues(), just returning null for fields you want to remove. Maybe it should traverse CompositeReader's getSequentialSubReaders() and wrap each AtomicReader. Other things like term vectors and norms are similar. On Wed
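
A compressed sketch of that wrapping approach, written against modern Lucene names (FilterLeafReader; Lucene 4 of that era called it FilterAtomicReader), with the remaining overrides noted rather than spelled out:

import java.io.IOException;
import org.apache.lucene.index.FilterLeafReader;
import org.apache.lucene.index.LeafReader;
import org.apache.lucene.index.Terms;

// Hides one field's postings; feeding wrapped segment readers to
// IndexWriter.addIndexes() rewrites the index without that field.
class StripFieldReader extends FilterLeafReader {
  private final String fieldToDrop;

  StripFieldReader(LeafReader in, String fieldToDrop) {
    super(in);
    this.fieldToDrop = fieldToDrop;
  }

  @Override
  public Terms terms(String field) throws IOException {
    return field.equals(fieldToDrop) ? null : super.terms(field);
  }

  // A complete version would also override getFieldInfos(), stored-field
  // access, getNormValues() and the docValues accessors for the dropped field.

  @Override public CacheHelper getCoreCacheHelper() { return null; }
  @Override public CacheHelper getReaderCacheHelper() { return null; }
}

In recent Lucene, addIndexes() takes CodecReader, so each wrapped reader would go through SlowCodecReaderWrapper.wrap() before the rewrite.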

Re: Can I rebuild an index and remove some fields?

2012-02-14 Thread Robert Stewart
if you want to get stored fields, and the useless fields are very long, then it will slow down. Also it's possible to hack with it, but it needs more effort to understand the index file format and traverse the fdt/fdx files. http://lucene.apache

Can I rebuild an index and remove some fields?

2012-02-13 Thread Robert Stewart
Let's say I have a large index (100M docs, 1TB, split up between 10 indexes), and a bunch of the "stored" and "indexed" fields are not used in search at all. In order to save memory and disk, I'd like to rebuild that index *without* those fields, but I don't have the original documents to rebuild e

Re: is there any practice to load index into RAM to accelerate solr performance?

2012-02-08 Thread Robert Stewart
I concur with this. As long as index segment files are cached in the OS file cache, performance is about as good as it gets. Pulling segment files into RAM inside the JVM process may actually be slower, given Lucene's existing data structures and algorithms for reading segment file data. If you have

Re: Searching context within a book

2012-02-06 Thread Robert Stewart
You are probably better off splitting up each book into separate SOLR documents, one document per paragraph (each document with the same book ID, ISBN, etc.). Then you can use field-collapsing on the book ID to return a single document per book. And you can use highlighting to show the paragraph
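
A sketch of what that query could look like through SolrJ (field names are hypothetical):

import org.apache.solr.client.solrj.SolrQuery;

public class BookSearch {
  public static SolrQuery paragraphSearch(String userQuery) {
    SolrQuery q = new SolrQuery(userQuery);
    q.set("group", true);            // field-collapsing
    q.set("group.field", "book_id"); // one group per distinct book
    q.set("group.limit", 1);         // best-matching paragraph per book
    q.setHighlight(true);            // show why that paragraph matched
    q.addHighlightField("paragraph_text");
    return q;
  }
}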

using per-core properties in dih config

2012-01-24 Thread Robert Stewart
I have a multi-core setup, and for each core I have a shared data-config.xml which specifies a SQL query for data import. What I want to do is have the same data-config.xml file shared between my cores (linked to the same physical file). I'd like to specify core properties in solr.xml such that each c

analyzing stored fields (removing HTML tags)

2012-01-24 Thread Robert Stewart
Is it possible to configure the schema to remove HTML tags from stored field content? As far as I can tell analyzers can only be applied to indexed content, but they don't affect stored content. I want to remove HTML tags from text fields so that the returned text content from a stored field has no HTML ta
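
Since analyzers only touch the indexed terms, the usual workaround is to strip the markup before the field value reaches Solr, either client-side or in an update request processor. A naive client-side sketch (a regex is not a robust HTML parser; a real implementation might use a library such as jsoup):

import org.apache.solr.common.SolrInputDocument;

public class HtmlStripper {
  // Crude tag removal, enough to show the stored value being cleaned.
  static String stripTags(String html) {
    return html.replaceAll("<[^>]+>", " ").replaceAll("\\s+", " ").trim();
  }

  public static SolrInputDocument buildDoc(String id, String rawHtml) {
    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", id);
    doc.addField("body_text", stripTags(rawHtml)); // stored value has no tags
    return doc;
  }
}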

using solr for time series data

2012-01-19 Thread Robert Stewart
I have a project where the client wants to store time series data (maybe in SOLR if it can work). We want to store daily "prices" over the last 20 years (about 6000 values with associated dates), for up to 500,000 entities. This data currently exists in a SQL database. Access to SQL is too slow for c

Re: Can Apache Solr Handle TeraByte Large Data

2012-01-13 Thread Robert Stewart
Any idea how many documents your 5TB of data contains? Certain features such as faceting depend more on the # of total documents than on the actual size of the data. I have tested approx. 1 TB (100 million documents) running on a single machine (40 cores, 128 GB RAM), using distributed search across 10 shard

error when specifying shards parameter in multicore setup

2011-12-19 Thread Robert Stewart
I have a SOLR instance running as a proxy (no data of its own); it just uses a multicore setup where each core has a shards parameter in the search handler. So my setup looks like this: solr_proxy/ multicore/ /public - solrconfig.xml has "shards" pointing to some other SOL

Re: how to setup to archive expired documents?

2011-12-16 Thread Robert Stewart
this sort of setup... Otis - Performance Monitoring SaaS for Solr - http://sematext.com/spm/solr-performance-monitoring/index.html - Original Message - >> From: Robert Stewart >> To: solr-user@lucene.apache.org >> Cc: >> Sen

Re: Core overhead

2011-12-15 Thread Robert Stewart
of heap size in worst case. On Thu, Dec 15, 2011 at 2:14 PM, Robert Stewart wrote: > It is true the number of terms may be much more than N/10 (or even N for > each core), but it is the number of docs per term that will really > matter. So you can have N terms in each core but each term

Re: Core overhead

2011-12-15 Thread Robert Stewart
It is true the number of terms may be much more than N/10 (or even N for each core), but it is the number of docs per term that will really matter. So you can have N terms in each core but each term has 1/10 the number of docs on avg. 2011/12/15 Yury Kats: > On 12/15/2011 1:07 PM, Robert Stew

Re: Core overhead

2011-12-15 Thread Robert Stewart
I don't have any measured data, but here are my thoughts. I think overall memory usage would be close to the same. Speed will be slower in general, because if search speed is approx. log(n) then 10 * log(n/10) > log(n); there is also overhead in the merge step if merging results, and if fetc

Re: how to setup to archive expired documents?

2011-12-15 Thread Robert Stewart
archive is very simple. No "holes" in the index (which is often the case when deleting document by document). The indexing is done against core [today-0]. The query is done against cores [today-0],[today-1]...[today-99]. Quite a headache. Itamar -O

how to setup to archive expired documents?

2011-12-15 Thread Robert Stewart
We have a large (100M) index where we add about 1M new docs per day. We want to keep the index at a constant size so the oldest ones are removed and/or archived each day (so the index contains around 100 days of data). What is the best way to do this? We still want to keep older data in some archive inde
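
If the archiving step copies the raw data elsewhere first, the daily trim itself can be a delete-by-query on the date field. A sketch with SolrJ (URL and field name are placeholders); note that mass deletes leave "holes" until segments merge, which is exactly what the rolling-core reply above avoids:

import org.apache.solr.client.solrj.impl.HttpSolrClient;

public class TrimOldDocs {
  public static void main(String[] args) throws Exception {
    try (HttpSolrClient client =
             new HttpSolrClient.Builder("http://localhost:8983/solr/core1").build()) {
      // Drop everything older than 100 days from the current time.
      client.deleteByQuery("publish_date:[* TO NOW-100DAYS]");
      client.commit();
    }
  }
}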

Re: Migrate Lucene 2.9 To SOLR

2011-12-13 Thread Robert Stewart
I am about to try the exact same thing, running SOLR on top of Lucene indexes created by Lucene.Net 2.9.2. AFAIK, it should work. Not sure if the indexes become non-backwards-compatible once any new documents are written to them by SOLR, though. Probably good to make a backup first. On Dec 13, 2011,

social/collaboration features on top of solr

2011-12-13 Thread Robert Stewart
Has anyone implemented some social/collaboration features on top of SOLR? What I am thinking of is the ability to add ratings and comments to documents in SOLR and then be able to fetch comments and ratings for each document in the results (and have them as part of the response from SOLR), similar in fashion to ML

Re: Separate ACL and document index

2011-11-23 Thread Robert Stewart
I have used two different ways: 1) Store a mapping from users to documents in some external database such as MySQL. At search time, look up the mapping from the user to some unique doc ID or some group ID, and then build a query or doc set which you can cache in the SOLR process for some period. Then use that as
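
For approach 1, the cached mapping typically becomes a filter query attached to every search, so it can be reused from Solr's filterCache. A sketch (field name and ID scheme are hypothetical):

import org.apache.solr.client.solrj.SolrQuery;

public class AclFilter {
  // allowedGroups comes from the external MySQL lookup, cached per user.
  public static SolrQuery secureQuery(String userQuery, int[] allowedGroups) {
    SolrQuery q = new SolrQuery(userQuery);
    StringBuilder fq = new StringBuilder("group_id:(");
    for (int i = 0; i < allowedGroups.length; i++) {
      if (i > 0) fq.append(' ');
      fq.append(allowedGroups[i]);
    }
    fq.append(')');
    q.addFilterQuery(fq.toString()); // filter queries are cached independently
    return q;
  }
}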

Re: Huge Performance: Solr distributed search

2011-11-23 Thread Robert Stewart
If you request 1000 docs from each shard, then the aggregator is really fetching 30,000 total documents, which it must then merge (re-sort results, and take the top 1000 to return to the client). It's possible that SOLR's merging implementation needs optimizing, but it does not seem like it could be that slow. H

naming facet queries?

2011-11-15 Thread Robert Stewart
Is there any way to give a name to a facet query, so you can pick facet values from results using some name as a key (rather than looking for a match via the query itself)? For example, in the request handler I have: publish_date:[NOW-7DAY TO NOW] publish_date:[NOW-1MONTH TO NOW] I'd like results to h
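
Local params do this: a {!key=...} prefix on each facet.query renames it in the response. A SolrJ sketch of the two queries from the post:

import org.apache.solr.client.solrj.SolrQuery;

public class NamedFacets {
  public static SolrQuery build(String userQuery) {
    SolrQuery q = new SolrQuery(userQuery);
    q.setFacet(true);
    // The key local param becomes the name of the facet in the response.
    q.addFacetQuery("{!key=last_7_days}publish_date:[NOW-7DAY TO NOW]");
    q.addFacetQuery("{!key=last_month}publish_date:[NOW-1MONTH TO NOW]");
    return q;
  }
}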

keeping master server indexes in sync after failover recovery

2011-11-10 Thread Robert Stewart
If I have 2 masters in a master-master setup, where one master is the live master and the other master acts as backup in slave mode, and then during failover the slave master accepts new documents, such that indexes become out of sync, how can original live master index get back into sync with the

Re: Questions about Solr's security

2011-11-01 Thread Robert Stewart
I think you can address a lot of these concerns by running some proxy in front of SOLR, such as HAProxy. You should be able to limit only certain URIs (so you can prevent /select queries). HAProxy is a free software load-balancer, and it is very configurable and fairly easy to set up. On No

Re: Questions about Solr's security

2011-11-01 Thread Robert Stewart
You would need to set up request handlers in solrconfig.xml to limit what types of queries people can send to SOLR (and define things like max page size, etc). You need to restrict people from sending update/delete commands as well. Then at the minimum, set up some proxy in front of SOLR that y

Re: Replicating Large Indexes

2011-11-01 Thread Robert Stewart
> Without optimizing the search speed stays the same, however the index size > increases to 70+ GB. > > Perhaps there is a different way to restrict disk usage. > > Thanks, > Jason > > Robert Stewart wrote: > > > Optimization merges index to a single segment

Re: simple persistance layer on top of Solr

2011-11-01 Thread Robert Stewart
One other potentially huge consideration is how "updatable" you need documents to be. Lucene can only replace existing documents; it cannot modify existing documents directly (so an update is essentially a delete followed by an insert of a new document with the same primary key). There are per

Re: simple persistance layer on top of Solr

2011-11-01 Thread Robert Stewart
It is not a horrible idea. Lucene has a pretty reliable index now (it should not get corrupted). And you can do backups with replication. If you need ranked results (sort by relevance), and lots of free-text queries then using it makes sense. If you just need boolean search and maybe some sor

Re: Replicating Large Indexes

2011-11-01 Thread Robert Stewart
Optimization merges the index to a single segment (one huge file), so the entire index will be copied on replication. So you really do need 2x disk in some cases then. Do you really need to optimize? We have a pretty big total index (about 200 million docs) and we never optimize. But we do have a sh

Re: Limit by score? sort by other field

2011-10-27 Thread Robert Stewart
BTW, this would be good standard feature for SOLR, as I've run into this requirement more than once. On Oct 27, 2011, at 9:49 AM, karsten-s...@gmx.de wrote: > Hi Robert, > > take a look to > http://lucene.472066.n3.nabble.com/How-to-cut-off-hits-with-score-below-threshold-td3219064.html#a32191

Re: Limit by score? sort by other field

2011-10-27 Thread Robert Stewart
Sounds like a custom sorting collector would work - one that throws away docs with less than some minimum score, so that it only collects/sorts documents with at least some minimum score. AFAIK the score is calculated even if you sort by some other field. On Oct 27, 2011, at 9:49 AM, karsten-s...@gmx.de wr
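
A sketch of such a collector against the modern Lucene API (the 3.x-era Collector interface differs in names but not in shape); it wraps any downstream collector, e.g. one sorting by a field, and hides low-scoring hits from it:

import java.io.IOException;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.search.Collector;
import org.apache.lucene.search.FilterCollector;
import org.apache.lucene.search.FilterLeafCollector;
import org.apache.lucene.search.LeafCollector;
import org.apache.lucene.search.Scorable;
import org.apache.lucene.search.ScoreMode;

class MinScoreCollector extends FilterCollector {
  private final float minScore;

  MinScoreCollector(Collector in, float minScore) {
    super(in);
    this.minScore = minScore;
  }

  @Override
  public LeafCollector getLeafCollector(LeafReaderContext ctx) throws IOException {
    return new FilterLeafCollector(super.getLeafCollector(ctx)) {
      private Scorable scorer;
      @Override public void setScorer(Scorable s) throws IOException {
        scorer = s;
        super.setScorer(s);
      }
      @Override public void collect(int doc) throws IOException {
        if (scorer.score() >= minScore) in.collect(doc); // drop low scores
      }
    };
  }

  // Force score computation even when the wrapped collector sorts by a field.
  @Override public ScoreMode scoreMode() { return ScoreMode.COMPLETE; }
}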

Re: Is there a good web front end application / interface for solr

2011-10-25 Thread Robert Stewart
It is really not very difficult to build a decent web front-end to SOLR using one of the available client libraries (such as solrpy for python). I recently built a pretty full-featured search front-end to SOLR in python (using the tornado web server and templates) and it was not difficult at all to bu

Re: how to handle large relational data in Solr

2011-10-20 Thread Robert Stewart
If your "documents" are products, then 100,000 documents is a pretty small index for solr. Do you know approximately how many accessories are related to each product on average? If # if relatively small (around 100 or less), then it should be ok to create product documents with all the related

Re: solr/lucene and its database (a silly question)

2011-10-18 Thread Robert Stewart
SOLR stores all data in the directory you specify in the dataDir setting in solrconfig.xml. SOLR uses Lucene to store all the data in one or more proprietary binary files called segment files. As a SOLR user you typically should not be too concerned with the binary index structure. You can see detail

Re: feeding while solr is running ?

2011-10-17 Thread Robert Stewart
See below... On Oct 17, 2011, at 11:15 AM, lorenlai wrote: > 1) I would like to know if it is possible to import data (feeding) while > Solr is still running ? Yes. You can search and index new content at the same time. But typically in production systems you may have one or more "master" SOL

Re: Replication with an HA master

2011-10-13 Thread Robert Stewart
: Replication with an HA master > Hello, > - Original Message - >> From: Robert Stewart >> To: solr-user@lucene.apache.org >> Cc: >> Sent: Tuesday, October 11, 2011 3:37 PM >> Subject: Re: Replication with an HA master >> In t

Re: Replication with an HA master

2011-10-11 Thread Robert Stewart
org >> Subject: Re: Replication with an HA master >> A few alternatives: >> * Have the master keep the index on a shared disk (e.g. SAN) >> * Use LB to easily switch between masters, potentially even automatically if LB can detect the primary is dow

Re: Replication with an HA master

2011-10-07 Thread Robert Stewart
Your idea sounds like the correct path. Set up 2 masters, one running in "slave" mode which pulls replicas from the live master. When/if the live master goes down, you just reconfigure and restart the backup master to be the live master. You'd also need to then start data import on the backup mast

SOLR architecture recommendation

2011-09-27 Thread Robert Stewart
I need some recommendations for a new SOLR project. We currently have a large (200M docs) production system using Lucene.Net and what I would call our own .NET implementation of SOLR (built early on when SOLR was less mature and did not run as well on Windows). Our current architecture works