Re: Date for 4.4 solr release

2013-07-19 Thread Gmail
Hahahaha ... Good 1

On 20/07/2013, at 1:43 AM, "Jack Krupansky"  wrote:

> real_soon:[NOW+3DAYS TO NOW+10DAYS]
> 
> -- Jack Krupansky
> 
> -Original Message- From: Jabouille Jean Charles
> Sent: Friday, July 19, 2013 11:10 AM
> To: solr-user@lucene.apache.org
> Subject: Date for 4.4 solr release
> 
> Hi,
> 
> We are currently using Solr 4.2.1. There are a lot of fixes in 4.4
> that we need. Could we have an approximate date for the first stable
> release of Solr 4.4, please?
> 
> Regards,
> 
> jean charles
> 
> Kelkoo SAS
> Société par Actions Simplifiée
> Au capital de € 4.168.964,30
> Siège social : 8, rue du Sentier 75002 Paris
> 425 093 069 RCS Paris
> 
> This message and its attachments are confidential and intended exclusively 
> for their addressees. If you are not the intended recipient of this 
> message, please destroy it and notify the sender. 


Re: lots of inserts very fast, out of heap or file descs

2007-02-24 Thread gmail


Do you have a script/data that makes this happen?

I'm on a Windows dev box - it does not hit "too many open files", but 
I'll figure it out.


ryan


Re: logging off

2007-03-03 Thread gmail


Sweet.

The logging is java.util.logging... (not one I really know how to deal with.)

Can you try setting a system property like this:
http://www.exampledepot.com/egs/java.util.logging/Props.html
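
For what it's worth, a minimal sketch of that approach - the handler, level,
and file path below are illustrative, not the only options:

  # logging.properties
  .level = SEVERE
  handlers = java.util.logging.FileHandler
  java.util.logging.FileHandler.pattern = /var/log/solr.log

  # then point the JVM at it when starting Solr under Jetty:
  java -Djava.util.logging.config.file=logging.properties -jar start.jar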



Brian Whitman wrote:
I'm trying to disable all logging from Solr, or at least re-route it to 
a file.


I was finally able to disable Jetty logging through a custom 
org.mortbay.log.Logger class, but I am still seeing the Solr logs, which 
seem to come from java.util.logging.Logger.


Is there a thing I can do in solrconfig to do this?

-Brian






Re: Problem of facet on 170M documents

2013-11-05 Thread Fudong-gmail
One way to solve the issue may be to create another field that groups the values into 
ranges, so you have fewer facet values to query.

Sent from my iPhone

On Nov 5, 2013, at 4:31 AM, Erick Erickson  wrote:

> You're just going to have to accept it being slow. Think of it this way:
> you have
> 4M (say) buckets that have to be counted into. Then the top 500 have to be
> collected to return. That's just going to take some time unless you have
> very beefy machines.
> 
> I'd _really_ back up and consider whether this is a good thing or whether
> this is one of those ideas that doesn't have much use to the user. If your
> results rarely if ever show counts for a URL more than, say, 5, is it
> really giving your users useful info?
> 
> Best,
> Erick
> 
> 
> On Mon, Nov 4, 2013 at 6:54 PM, Mingfeng Yang  wrote:
> 
>> Erick,
>> 
>> It could have more than 4M distinct values.  The purpose of this facet is
>> to display the most frequent, say top 500, urls to users.
>> 
>> Sascha,
>> 
>> Thanks for the info. I will look into this thread thing.
>> 
>> Mingfeng
>> 
>> 
>> On Mon, Nov 4, 2013 at 4:47 AM, Erick Erickson >> wrote:
>> 
>>> How many unique URLs do you have in your 9M
>>> docs? If your 9M hits have 4M distinct URLs, then
>>> this is not very valuable to the user.
>>> 
>>> Sascha:
>>> Was that speedup on a single field or were you faceting over
>>> multiple fields? Because as I remember that code spins off
>>> threads on a per-field basis, and if I'm mis-remembering I need
>>> to look again!
>>> 
>>> Best,
>>> Erick
>>> 
>>> 
>>> On Sat, Nov 2, 2013 at 5:07 AM, Sascha SZOTT  wrote:
>>> 
 Hi Ming,
 
 which Solr version are you using? In case you use one of the latest
 versions (4.5 or above) try the new parameter facet.threads with a
 reasonable value (4 to 8 gave me a massive performance speedup when
 working with large facets, i.e. nTerms >> 10^7).
 
 -Sascha
 
 
 Mingfeng Yang wrote:
> I have an index with 170M documents, and two of the fields for each
> doc are "source" and "url". I want to know the top 500 most
> frequent urls from the Video source.
> 
> So I did a facet with
> "fq=source:Video&facet=true&facet.field=url&facet.limit=500", and
> the matching documents number about 9 million.
> 
> The Solr cluster is hosted on two EC2 instances, each with 4 CPUs and
> 32G of memory; 16G is allocated for the Java heap. 4 master shards are on
> one machine and 4 replicas on another, connected together via
> ZooKeeper.
> 
> Whenever I issue the query above, the response just takes too long
> and the client times out. Sometimes an impatient end user waits a few
> seconds for the results, kills the connection, and then issues the same
> query again and again. The server then has to deal with multiple such
> heavy queries simultaneously and gets so busy that we see a "no server
> hosting shard" error, probably due to lost communication between the Solr
> node and ZooKeeper.
> 
> Is there any way to deal with such problem?
> 
> Thanks, Ming
>> 
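
For reference, a sketch of the facet request under discussion with Sascha's 
facet.threads suggestion applied (host, collection name, and thread count are 
illustrative):

  http://localhost:8983/solr/collection1/select?q=*:*&fq=source:Video&facet=true&facet.field=url&facet.limit=500&facet.threads=4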


Re: Solr Hangs

2011-09-02 Thread Govind @ Gmail
Rohit - for debugging hangs you can trigger a platform-specific thread dump and 
analyze it.
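
For example, with a JDK you can usually capture a thread dump with jstack, or 
send SIGQUIT on Linux (the PID below is illustrative):

  jstack -l 12345 > solr-threads.txt
  kill -3 12345    # writes the thread dump to the JVM's stdout/log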




On Sep 3, 2011, at 9:39 AM, "Rohit"  wrote:

> Thanks Simon, I did get that part; it was happening because Solr was not able
> to reserve enough memory after it had hung once. The server has 24G of memory
> and I am trying to start Solr with the "-Xms2g -Xmx16g -XX:MaxPermSize=3072m -D64"
> option. But this is not my main concern: how do I find out why Solr hangs,
> and is there a way to automatically kill and restart it?
> 
> 
> Regards,
> Rohit
> Mobile: +91-9901768202
> About Me: http://about.me/rohitg
> 
> -Original Message-
> From: simon [mailto:mtnes...@gmail.com] 
> Sent: 02 September 2011 14:03
> To: solr-user@lucene.apache.org
> Subject: Re: Solr Hangs
> 
> That error has nothing to do with Solr - it looks as though you are trying
> to start the JVM with a heap size that is too big for the available physical
> memory.
> 
> -Simon
> 
> On Fri, Sep 2, 2011 at 2:15 AM, Rohit  wrote:
> 
>> Hi All,
>> 
>> 
>> 
>> I am using Solr 3.0 and have 4 cores build into it with the following
>> statistics,
>> 
>> 
>> 
>> Core1 : numDocs : 69402640maxDoc : 69404745
>> 
>> Core2 : numDocs : 5669231  maxDoc : 5808496
>> 
>> Core3 : numDocs : 6654951  maxDoc : 6654954
>> 
>> Core4: numDocs : 138872  maxDoc : 185723
>> 
>> 
>> 
>> The number of updates is very high in 2 of the cores. Solr is running in
>> Tomcat with the following JAVA_OPTS values: "-Xms2g -Xmx16g
>> -XX:MaxPermSize=3072m -D64".
>> 
>> 
>> 
>> When Solr hangs and I try to restart it, I get the following error,
>> which does indicate that it's a memory problem, but how can I overcome the
>> problem of hanging?
>> 
>> 
>> 
>> Error occurred during initialization of VM
>> Could not reserve enough space for object heap
>> 
>> 
>> 
>> 
>> 
>> Regards,
>> 
>> Rohit
>> 
>> Mobile: +91-9901768202
>> 
>> About Me:   http://about.me/rohitg
>> 
>> 
>> 
>> 
> 


Re: Performance issues

2011-11-20 Thread Govind @ Gmail
http://www.lucidimagination.com/content/scaling-lucene-and-solr

has good guidance.

Wrt 1: what is the issue - memory, CPU, query performance, or the indexing process?
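
As one quick baseline for query performance - assuming Apache Bench (ab) is 
available, and with an illustrative query:

  ab -n 1000 -c 10 "http://localhost:8983/solr/select?q=*:*&rows=10"

and compare avgTimePerRequest and the cache hit ratios on the Solr admin 
statistics page while the load runs.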


On Nov 20, 2011, at 11:39 AM, Lalit Kumar 4  wrote:

> Hello:
> We have recently seen performance issues with Solr (running on Jetty). 
> 
> We are looking for help with:
> 
> 1) How can I benchmark our current implementation?
> 2) We are weighing multiple cores vs. separate instances. What are the pros and cons?
> 3) Any pointers for validating that the current configuration is correct?
> 
> Sent on my BlackBerry® from Vodafone


How to import an index via CSV when a field has a dynamicField type

2013-01-10 Thread weiwei-gmail

Hi guys,

Using the DataImportHandler to build the index was fine at first.

But as the data grew bigger and bigger, I found the DataImportHandler very 
slow - it takes nearly 20 minutes.

So I am trying another way to build the index: using a script to put the data 
into a CSV file and then importing it into Solr.

But I found one problem: how does this handle dynamicField types? In the CSV, 
the first line lists all the field names, and I don't know the exact field 
names in advance.

Any ideas? Thanks a lot!
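
For reference, a minimal sketch of the CSV route (the attr_* dynamicField 
pattern, the field names, and the paths are illustrative; the exact update URL 
depends on your Solr version/config). The CSV header simply uses concrete 
names that match a dynamicField pattern declared in schema.xml:

  id,name,attr_color,attr_size
  1,widget1,red,large
  2,widget2,blue,small

  curl "http://localhost:8983/solr/update?commit=true" \
       --data-binary @data.csv -H 'Content-type: application/csv'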



Re[2]: Is complex query like this possible?

2012-02-01 Thread asv - gmail
Hello, Mikhail.

Each index record looks like:

DIR:true
PATH:/root/folder1/folder2/
NAME:folder3
SIZE:0
...

This record represents folder /root/folder1/folder2/folder3

DIR:false
PATH:/root/folder1/folder2/folder3/
NAME:image.jpg
SIZE:1234567
...

This is a file /root/folder1/folder2/folder3/image.jpg

I.e., PATH is the path to the parent directory, and NAME is the actual name of the 
file/folder. We do not store a list of children in the folder record (as in your 
solution). Also, in my previous example a file of the specified type may be deeper 
than one level: if there are /root/folder1, /root/folder2 and a file 
/root/folder1/aaa/bbb/ccc/image.jpg, and I query for "folder", only folder1 
must be returned.   

Thanks


2012/2/1, 21:33:41:

>
Hello Sergey,

if your docs look like:

PATH:'directory','tree','sements','test1'
FILES:'filename1','ext1','filename2','ext2','filename3','ext3','filename4','ext4'
you can search it: 
+PATH:test1 +FILES:jpg


2012/2/1 Sergei Ananko 

Hello,

We use Solr to search over a filesystem, so there are a lot of files and 
folders indexed, name and path of each file are stored in different fields. The 
task is to find folders by name AND containing at least one file of specific 
type somewhere inside. For example, we search by phrase "test" and for JPG 
files and have two folders:

1) "test1" - empty folder
2) "test2" - contains 1 file "abcd.jpg" inside.

The search result must contain only folder "test2", because "test1" does not 
meet the second criterion.

SQL equivalent of such search query looks like:

SELECT * FROM indexed_files t1 WHERE t1.name LIKE '%test%' AND (SELECT COUNT(*) 
FROM indexed_files t2 WHERE t2.path LIKE CONCAT(t1.path, '%') AND t2.name LIKE 
'%jpg') > 0;

The question is: is it possible to do such a search in Solr with a single query? 
A single query is important because we need to use Solr's paging ("start" and 
"rows" parameters), so we should avoid filtering out wrong results in our code. 
I've read the Solr wiki about nested queries but haven't found a way to do it. BTW, 
does Solr provide an equivalent of the "SELECT COUNT(*)" statement to access the 
count of found records directly in a Solr query? Or is such a complex query 
completely impossible?

--
Best regards,
 Asv  mailto:asvs...@gmail.com 




-- 
Sincerely yours
Mikhail Khludnev
Lucid Certified
Apache Lucene/Solr Developer
Grid Dynamics







-- 
Best regards,
 asv  mailto:asvs...@gmail.com

Sorting

2006-10-11 Thread Gmail Account
I need to sort a query two ways. Should I do the search one way: 
s.getDocListAndSet(query, restrictions, sort, req.getStart(), 
req.getLimit(), flags);
and then do the same search again with a different sort value, or is there a 
method available to just sort the DocSet (like sortDocSet, but it's 
protected)?


Or maybe it doesn't matter because caching will handle it anyway?

Thanks 



Re: Initial import problems

2006-12-05 Thread Gmail Account
I'm having slow performance with my solr index. I'm not sure what to do. I 
need some suggestions on what to try. I have updated all my records in the 
last couple of days. I'm not sure how much it degraded because of that, but 
it now takes about 3 seconds per search. My cache statistics don't look so 
good either.


Also... I'm not sure I was supposed to do a couple of things.
   - I did an optimize index through Luke with compound format and noticed 
in the solrconfig file that useCompoundFile is set to false.

   - I changed one of the fields in the schema from text_ws to string
   - I added a field (type="text" indexed="false" stored="true")

My schema and solrconfig are the same as the example except I have a few 
more fields. My pc is winXP and has 2gig of ram. Below are some stats from 
the solr admin stat page.


Thanks!


caching : true
numDocs : 1185814
maxDoc : 2070472
readerImpl : MultiReader

 name:  filterCache
 class:  org.apache.solr.search.LRUCache
 version:  1.0
 description:  LRU Cache(maxSize=512, initialSize=512, 
autowarmCount=256, 
[EMAIL PROTECTED])

 stats:  lookups : 658446
 hits : 30
 hitratio : 0.00
 inserts : 658420
 evictions : 657908
 size : 512
 cumulative_lookups : 658446
 cumulative_hits : 30
 cumulative_hitratio : 0.00
 cumulative_inserts : 658420
 cumulative_evictions : 657908


 name:  queryResultCache
 class:  org.apache.solr.search.LRUCache
 version:  1.0
 description:  LRU Cache(maxSize=512, initialSize=512, 
autowarmCount=256, 
[EMAIL PROTECTED])

 stats:  lookups : 88
 hits : 83
 hitratio : 0.94
 inserts : 6
 evictions : 0
 size : 5
 cumulative_lookups : 88
 cumulative_hits : 83
 cumulative_hitratio : 0.94
 cumulative_inserts : 6
 cumulative_evictions : 0


 name:  documentCache
 class:  org.apache.solr.search.LRUCache
 version:  1.0
 description:  LRU Cache(maxSize=512, initialSize=512)
 stats:  lookups : 780
 hits : 738
 hitratio : 0.94
 inserts : 42
 evictions : 0
 size : 42
 cumulative_lookups : 780
 cumulative_hits : 738
 cumulative_hitratio : 0.94
 cumulative_inserts : 42
 cumulative_evictions : 0





Performance issue.

2006-12-05 Thread Gmail Account
Sorry... I put the wrong subject on my message. I also wanted to mention that 
my CPU jumps to almost 100% on each query.


I'm having slow performance with my solr index. I'm not sure what to do. I 
need some suggestions on what to try. I have updated all my records in the 
last couple of days. I'm not sure how much it degraded because of that, 
but it now takes about 3 seconds per search. My cache statistics don't 
look so good either.


Also... I'm not sure I was supposed to do a couple of things.
   - I did an optimize index through Luke with compound format and noticed 
in the solrconfig file that useCompoundFile is set to false.

   - I changed one of the fields in the schema from text_ws to string
   - I added a field (type="text" indexed="false" stored="true")

My schema and solrconfig are the same as the example except I have a few 
more fields. My pc is winXP and has 2gig of ram. Below are some stats from 
the solr admin stat page.


Thanks!


caching : true
numDocs : 1185814
maxDoc : 2070472
readerImpl : MultiReader

 name:  filterCache
 class:  org.apache.solr.search.LRUCache
 version:  1.0
 description:  LRU Cache(maxSize=512, initialSize=512, 
autowarmCount=256, 
[EMAIL PROTECTED])

 stats:  lookups : 658446
 hits : 30
 hitratio : 0.00
 inserts : 658420
 evictions : 657908
 size : 512
 cumulative_lookups : 658446
 cumulative_hits : 30
 cumulative_hitratio : 0.00
 cumulative_inserts : 658420
 cumulative_evictions : 657908


 name:  queryResultCache
 class:  org.apache.solr.search.LRUCache
 version:  1.0
 description:  LRU Cache(maxSize=512, initialSize=512, 
autowarmCount=256, 
[EMAIL PROTECTED])

 stats:  lookups : 88
 hits : 83
 hitratio : 0.94
 inserts : 6
 evictions : 0
 size : 5
 cumulative_lookups : 88
 cumulative_hits : 83
 cumulative_hitratio : 0.94
 cumulative_inserts : 6
 cumulative_evictions : 0


 name:  documentCache
 class:  org.apache.solr.search.LRUCache
 version:  1.0
 description:  LRU Cache(maxSize=512, initialSize=512)
 stats:  lookups : 780
 hits : 738
 hitratio : 0.94
 inserts : 42
 evictions : 0
 size : 42
 cumulative_lookups : 780
 cumulative_hits : 738
 cumulative_hitratio : 0.94
 cumulative_inserts : 42
 cumulative_evictions : 0







Re: Performance issue.

2006-12-05 Thread Gmail Account



There's nothing wrong with CPU jumping to 100% each query, that just
means you aren't IO bound :-)

What do you mean not IO bound?

>- I did an optimize index through Luke with compound format and 
> noticed

> in the solrconfig file that useCompoundFile is set to false.


Don't do this unless you really know what you are doing... Luke is
probably using a different version of Lucene than Solr, and it could
be dangerous.

Do you think I should reindex everything?



- if you are using filters, any larger than 3000 will be double the
size (maxDoc bits)

What do you mean larger than 3000? 3000 what and how do I tell?


Can you give some examples of what your queries look like?

I will get this and send it.

Thanks,
Yonik 



Re: Performance issue.

2006-12-06 Thread Gmail Account
I reindexed and optimized and it helped. However, now each query averages 
about 1 second (down from 3-4 seconds). The bottleneck now is the 
getFacetTermEnumCounts function. If I take that call out, the query time is 
negligible and the filterCache is being used. With 
getFacetTermEnumCounts in, the filter cache after three queries is below, 
with the hit ratio at 0 and everything being evicted. This call is for 
the brand/manufacturer, so I'm sure it is going through many thousands of 
queries. I'm thinking about pre-processing the brand/manu to get a small set 
of top brands per category and just querying them no matter what the other 
facets are set to (with certain filters, no brands will be shown). If I 
still want to call getFacetTermEnumCounts for ALL brands, why is it not 
using the cache?



lookups : 32849
hits : 0
hitratio : 0.00
inserts : 32850
evictions : 32338
size : 512
cumulative_lookups : 32849
cumulative_hits : 0
cumulative_hitratio : 0.00
cumulative_inserts : 32850
cumulative_evictions : 32338


Thanks,
Mike
- Original Message - 
From: "Yonik Seeley" <[EMAIL PROTECTED]>

To: 
Sent: Tuesday, December 05, 2006 8:46 PM
Subject: Re: Performance issue.



On 12/5/06, Gmail Account <[EMAIL PROTECTED]> wrote:

> There's nothing wrong with CPU jumping to 100% each query, that just
> means you aren't IO bound :-)
What do you mean not IO bound?


There is always going to be a bottleneck somewhere.  In very large
indices, the bottleneck may be waiting for IO (waiting for data to be
read from the disk).  If you are on a single processor system and you
aren't waiting for data to be read from the disk or the network, then
the request will be using close to 100% CPU, which is actually a good
thing.

The bad thing is how long the query takes, not the fact that it's CPU 
bound.



>> >- I did an optimize index through Luke with compound format and
>> > noticed
>> > in the solrconfig file that useCompoundFile is set to false.
>
> Don't do this unless you really know what you are doing... Luke is
> probably using a different version of Lucene than Solr, and it could
> be dangerous.
Do you think I should reindex everything?


That would be the safest thing to do.


> - if you are using filters, any larger than 3000 will be double the
> size (maxDoc bits)
What do you mean larger than 3000? 3000 what and how do I tell?


From solrconfig.xml:
   <HashDocSet maxSize="3000" loadFactor="0.75"/>

The key is that the memory consumed by a HashDocSet is independent of
maxDoc (the maximum internal lucene docid), but a BitSet based set has
maxDoc bits in it.  Thus, an unoptimized index with more deleted
documents causes a higher maxDoc and higher memory usage for any
BitSet based filters.

-Yonik 




Re: Performance issue.

2006-12-06 Thread Gmail Account
It is currently a string type. Here is everything that has to do with manu 
in my schema... Should it have been multi-valued? Do you see anything wrong 
with this?





(field and copyField definitions stripped in the archive; only the trailing 
multiValued="true"/> of the declaration survives)




Thanks...

- Original Message - 
From: "Yonik Seeley" <[EMAIL PROTECTED]>

To: 
Sent: Wednesday, December 06, 2006 9:55 PM
Subject: Re: Performance issue.



It is using the cache, but the number of items is larger than the size
of the cache.

If you want to continue to use the filter method, then you need to
increase the size of the filter cache to something larger than the
number of unique values you are filtering on.  I don't know
if you will have enough memory to take this approach or not.

The second option is to make brand/manu a non-multi-valued string
type.  When you do that, Solr will use a different method to calculate
the facet counts (it will use the FieldCache rather than filters).
You would need to reindex to try this approach.

-Yonik

On 12/6/06, Gmail Account <[EMAIL PROTECTED]> wrote:

I reindexed and optimized and it helped. However, now each query averages
about 1 second (down from 3-4 seconds). The bottleneck now is the
getFacetTermEnumCounts function. If I take that call out, the query time is
negligible and the filterCache is being used. With
getFacetTermEnumCounts in, the filter cache after three queries is below,
with the hit ratio at 0 and everything being evicted. This call is for
the brand/manufacturer, so I'm sure it is going through many thousands of
queries. I'm thinking about pre-processing the brand/manu to get a small set
of top brands per category and just querying them no matter what the other
facets are set to (with certain filters, no brands will be shown). If I
still want to call getFacetTermEnumCounts for ALL brands, why is it not
using the cache?


lookups : 32849
hits : 0
hitratio : 0.00
inserts : 32850
evictions : 32338
size : 512
cumulative_lookups : 32849
cumulative_hits : 0
cumulative_hitratio : 0.00
cumulative_inserts : 32850
cumulative_evictions : 32338


Thanks,
Mike
- Original Message -
From: "Yonik Seeley" <[EMAIL PROTECTED]>
To: 
Sent: Tuesday, December 05, 2006 8:46 PM
Subject: Re: Performance issue.


> On 12/5/06, Gmail Account <[EMAIL PROTECTED]> wrote:
>> > There's nothing wrong with CPU jumping to 100% each query, that just
>> > means you aren't IO bound :-)
>> What do you mean not IO bound?
>
> There is always going to be a bottleneck somewhere.  In very large
> indices, the bottleneck may be waiting for IO (waiting for data to be
> read from the disk).  If you are on a single processor system and you
> aren't waiting for data to be read from the disk or the network, then
> the request will be using close to 100% CPU, which is actually a good
> thing.
>
> The bad thing is how long the query takes, not the fact that it's CPU
> bound.
>
>> >> >- I did an optimize index through Luke with compound format and noticed
>> >> > in the solrconfig file that useCompoundFile is set to false.
>> >
>> > Don't do this unless you really know what you are doing... Luke is
>> > probably using a different version of Lucene than Solr, and it could
>> > be dangerous.
>> Do you think I should reindex everything?
>
> That would be the safest thing to do.
>
>> > - if you are using filters, any larger than 3000 will be double the
>> > size (maxDoc bits)
>> What do you mean larger than 3000? 3000 what and how do I tell?
>
> From solrconfig.xml:
>    <HashDocSet maxSize="3000" loadFactor="0.75"/>
>
> The key is that the memory consumed by a HashDocSet is independent of
> maxDoc (the maximum internal lucene docid), but a BitSet based set has
> maxDoc bits in it.  Thus, an unoptimized index with more deleted
> documents causes a higher maxDoc and higher memory usage for any
> BitSet based filters.
>
> -Yonik 
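
As a sketch of the first option Yonik describes in this thread (enlarging the 
filterCache), the cache is sized in solrconfig.xml - the values below are 
illustrative and would need tuning to the number of unique brand/manu values 
and the available heap:

  <filterCache
    class="solr.LRUCache"
    size="65536"
    initialSize="4096"
    autowarmCount="1024"/>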




Tagging

2007-02-12 Thread Gmail Account
I know that I've seen this topic before... Is there a guideline on the best 
way to implement tagging in Solr? For example, keeping track of which user 
tagged which item in Solr, and faceting based on tags?


Thanks,
Mike 



Re: convert custom facets to Solr facets...

2007-02-12 Thread Gmail Account
This would be great!  I can't help with the solution but I am very 
interested in using it if one of you guys can figure it out.


I can't wait to see if this works out.

Mike

- Original Message - 
From: "Erik Hatcher" <[EMAIL PROTECTED]>

To: 
Sent: Tuesday, February 06, 2007 4:51 AM
Subject: Re: convert custom facets to Solr facets...


Yonik - this is great!   Thanks for codifying the use cases and  providing 
a possible implementation.  I'll tinker with this more when  I can.


Erik


On Feb 4, 2007, at 2:13 PM, Yonik Seeley wrote:

I was confusing myself too much without nailing down more concrete 
examples,

so I took a shot at coming up with user tagging usecases and
a way to implement them with a flat schema.

The usecases may be biased toward a flat schema since that's what I
had in mind... so feel free to add more, or change the usecase names
or descriptions to make more sense.

http://wiki.apache.org/solr/UserTagDesign

-Yonik






Re: Tagging

2007-02-22 Thread Gmail Account
I use Solr for searching and facets and love it... The performance is 
awesome.


However, I am about to add tagging to my application and I'm having a hard 
time deciding whether I should just keep my tags in a database for now until a 
better Solr solution is worked out... Does anyone know what technology some of 
the larger sites use for tagging? A database (MySQL, SQL Server) with 
denormalized cache tables everywhere, something similar to Solr/Lucene, or 
something else?


Thanks,
Mike

- Original Message - 
From: "Mekin Maheshwari" <[EMAIL PROTECTED]>

To: 
Sent: Thursday, February 22, 2007 7:39 AM
Subject: Re: Tagging



For a more general solution, I'm thinking a separate lucene index
might be ideal.

-Yonik



I don't know if this will work for others; below is what we do. Also, if
there are things I can improve, do let me know.

All tag inserts go to a small DB table.
I reindex the docs that these tags belong to in a backup index that I
keep, and swap the new index in from time to time. I don't do this on the
production index, as optimizing the index takes a long time.

A hack I need: when looking up tags, I also look in this
small table. For me exact matches suffice, hence a DB table works; it may not
work for others. I understand that searches on such a tag don't work until it
gets into the index.

The solution can obviously be made much smarter.

Basically use a queue, from which the indexUpdater can pick up documents to
reindex & update them when search volumes are low.

I am sure a small Lucene index can be used as the queue, and while searching
both the indices are looked at.

Btw, we are still using lucene for our search, hope to move to solr soon.

-mekin





Re: Solr Design question on spatial search

2012-03-01 Thread Venu Gmail Dev
I don't think Spatial search will fully fit into this. I have 2 approaches in 
mind but I am not satisfied with either one of them.

a) Have 2 separate indexes: the first to store information about all the cities, 
and the second to store the retail-store information. Whenever a user searches 
for a city, I return all the matching cities from the first index and then do a 
spatial search around each matched city in the second index. But this is too 
costly.

b) Index only the cities which have a nearby store, doing all the calculations 
before indexing the data so that the search is fast. The problem I see with this 
approach is that if a new retail store or city is added, I would have to re-index 
all the data again.


On Mar 1, 2012, at 7:59 AM, Dirceu Vieira wrote:

> I believe that what you need is spatial search...
> 
> Have a look at the documentation:  http://wiki.apache.org/solr/SpatialSearch
> 
> On Wed, Feb 29, 2012 at 10:54 PM, Venu Shankar 
> wrote:
> 
>> Hello,
>> 
>> I have a design question for Solr.
>> 
>> I work for an enterprise which has a lot of retail stores (approx. 20K).
>> These retail stores are spread across the world.  My search requirement is
>> to find all the cities which are within x miles of a retail store.
>> 
>> So let's say we have a retail store in San Francisco and I search for
>> "San"; then San Francisco, Santa Clara, San Jose, San Juan, etc. should be
>> returned, as they are within x miles of San Francisco. I also want to rank
>> the search results by their distance.
>> 
>> I can create an index with all the cities in it, but I am not sure how I
>> ensure that the cities returned in a search result have a nearby retail
>> store. Any suggestions?
>> 
>> Thanks,
>> Venu,
>> 
> 
> 
> 
> -- 
> Dirceu Vieira Júnior
> ---
> +47 9753 2473
> dirceuvjr.blogspot.com
> twitter.com/dirceuvjr



Re: Solr Design question on spatial search

2012-03-02 Thread Venu Gmail Dev
Sorry for not being clear enough.

I don't know the point of origin. All I know is that there are 20K retail 
stores. Only the cities within a 10-mile radius of these stores should be 
searchable. Any city outside these small 10-mile circles around the 20K stores 
should be ignored.

So when somebody searches for a city, I need to query the cities which are in 
these 20K 10-mile circles, but I don't know which 10-mile circle I should query.

So the approach that I was thinking were :-

>>>> a) Have 2 separate indexes. First one to store the information about all 
>>>> the cities and second one to store the retail stores information. Whenever 
>>>> user searches for a city then I return all the matching cities ( and hence 
>>>> the lat-long) from first index and then do a spatial search on each of the 
>>>> matched city in the second index. But this is too costly.
>>>> 
>>>> b) Index only the cities which have a nearby store. Do all the 
>>>> calculation(s) before indexing the data so that the search is fast. The 
>>>> problem that I see with this approach is that if a new retail store or a 
>>>> city is added then I would have to re-index all the data again.

Does this answers the problem that you posed ?

Thanks,
Venu.

On Mar 2, 2012, at 9:52 PM, Erick Erickson wrote:

> But again, that doesn't answer the problem I posed. Where is your
> point of origin?
> There's nothing in what you've written that indicates how you would know
> that 10 miles is relative to San Francisco. All you've said is that
> you're searching
> on "San". Which would presumably return San Francisco, San Mateo, San Jose.
> 
> Then, also presumably, you're looking for all the cities with stores
> within 10 miles
> of one of these cities. But nothing in your criteria so far says that
> that city is
> San Francisco.
> 
> If you already know that San Francisco is the locus, simple distance
> will work just
> fine. You can index both city and store info in the same index and
> restrict, say, facets
> (or, indeed search results) by fq clause (e.g. fq=type:city or fq=type:store).
> 
> Or I'm completely missing the boat here.
> 
> Best
> Erick
> 
> 
> On Fri, Mar 2, 2012 at 11:50 AM, Venu Dev  wrote:
>> So let's say x=10 miles. Now if I search for San then San Francisco, San 
>> Mateo should be returned because there is a retail store in San Francisco. 
>> But San Jose should not be returned because it is more than 10 miles away 
>> from San
>> Francisco. Had there been a retail store in San Jose then it should be also 
>> returned when you search for San. I can restrict the queries to a country.
>> 
>> Thanks,
>> ~Venu
>> 
>> On Mar 2, 2012, at 5:57 AM, Erick Erickson  wrote:
>> 
>>> I don't see how this works, since your search for San could also return
>>> San Marino, Italy. Would you then return all retail stores in
>>> X miles of that city? What about San Salvador de Jujuy, Argentina?
>>> 
>>> And even in your example, San would match San Mateo. But should
>>> the search then return any stores within X miles of San Mateo?
>>> You have to stop somewhere
>>> 
>>> Is there any other information you have that restricts how far to expand the
>>> search?
>>> 
>>> Best
>>> Erick
>>> 
>>> On Thu, Mar 1, 2012 at 4:57 PM, Venu Gmail Dev  
>>> wrote:
>>>> I don't think Spatial search will fully fit into this. I have 2 approaches 
>>>> in mind but I am not satisfied with either one of them.
>>>> 
>>>> a) Have 2 separate indexes. First one to store the information about all 
>>>> the cities and second one to store the retail stores information. Whenever 
>>>> user searches for a city then I return all the matching cities from first 
>>>> index and then do a spatial search on each of the matched city in the 
>>>> second index. But this is too costly.
>>>> 
>>>> b) Index only the cities which have a nearby store. Do all the 
>>>> calculation(s) before indexing the data so that the search is fast. The 
>>>> problem that I see with this approach is that if a new retail store or a 
>>>> city is added then I would have to re-index all the data again.
>>>> 
>>>> 
>>>> On Mar 1, 2012, at 7:59 AM, Dirceu Vieira wrote:
>>>> 
>>>>> I believe that what you need is spatial search...
>>>>> 
>>>>> Have a look a the documention:  h