Checking the size of the index using SolrJ APIs

2010-04-02 Thread Na_D

hi,


I need to monitor the index for the following information:

1. Size of the index
2. Last time the index was updated.

Although I did an extensive search of the APIs, I can't find anything that
provides this information.



Please help.




Re: Checking the size of the index using SolrJ APIs

2010-04-02 Thread Ahmet Arslan

> I need to monitor the index for the following information:
> 
> 1. Size of the index
> 2. Last time the index was updated.
> 
> Although I did an extensive search of the APIs I can't find
> anything that provides this information.

solr/admin/stats.jsp is actually XML converted to HTML with stats.xsl.

It includes information such as the last commit time:

  Fri Apr 02 17:07:03 EEST 2010


Also, the LukeRequestHandler shows the last modified time in UTC:
solr/admin/luke?wt=xml&numTerms=0

  lastModified: 2010-04-02T14:07:07Z
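
From SolrJ you can hit the same handler programmatically. A rough sketch
(untested; the host/port and core URL below are only assumptions):

  import org.apache.solr.client.solrj.SolrServer;
  import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
  import org.apache.solr.client.solrj.response.QueryResponse;
  import org.apache.solr.common.params.ModifiableSolrParams;
  import org.apache.solr.common.util.NamedList;

  public class IndexInfo {
      public static void main(String[] args) throws Exception {
          SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
          ModifiableSolrParams params = new ModifiableSolrParams();
          params.set("qt", "/admin/luke");  // route the request to the LukeRequestHandler
          params.set("numTerms", 0);        // skip the per-field top-terms section
          QueryResponse rsp = server.query(params);
          // the "index" section of the Luke response holds numDocs and lastModified
          NamedList<?> index = (NamedList<?>) rsp.getResponse().get("index");
          System.out.println("numDocs:      " + index.get("numDocs"));
          System.out.println("lastModified: " + index.get("lastModified"));
      }
  }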

I am not sure about the size. I can see it in stats.jsp because I have the
ReplicationHandler registered.
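
If it is not registered, a minimal entry in solrconfig.xml should be enough
to expose its stats (a sketch only; normally you would also configure it as
a master or slave):

  <requestHandler name="/replication" class="solr.ReplicationHandler" />

With that in place, stats.jsp reports, among other things: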

 
  indexSize: 226.86 MB


Experience with indexing billions of documents?

2010-04-02 Thread Burton-West, Tom
We are currently indexing 5 million books in Solr, scaling up over the next few 
years to 20 million.  However, we are using the entire book as a Solr document.  
We are evaluating the possibility of indexing individual pages as there are 
some use cases where users want the most relevant pages regardless of what book 
they occur in.  However, we estimate that we are talking about somewhere 
between 1 and 6 billion pages and have concerns over whether Solr will scale to 
this level.

Does anyone have experience using Solr with 1-6 billion Solr documents?

The Lucene file format document
(http://lucene.apache.org/java/3_0_1/fileformats.html#Limitations) mentions a
limit of about 2 billion document ids. I assume this is the Lucene internal
document id and would therefore be a per-index/per-shard limit. Is this
correct?


Tom Burton-West.





Re: Index db data

2010-04-02 Thread MitchK

Hello trueman,

here are some helpful pages from the wiki:
DataImportHandler:
http://wiki.apache.org/solr/DataImportHandler

And if there are some troubles, you may find an answer here:
http://wiki.apache.org/solr/DataImportHandlerFaq

You can find an example data-config.xml in the example directory of your
Solr download; look at example/example-DIH/solr/db/conf, which contains a
db-data-config.xml. After reading through the wiki, I think you will have no
problems setting up your own DB import for Solr.
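
For orientation, a minimal db-data-config.xml looks roughly like the sketch
below (the JDBC driver, connection URL, and table/column names are only
placeholders):

  <dataConfig>
    <dataSource driver="com.mysql.jdbc.Driver"
                url="jdbc:mysql://localhost/mydb"
                user="db_user" password="db_pass"/>
    <document>
      <!-- one Solr document per row returned by the query -->
      <entity name="item" query="SELECT id, name, description FROM item">
        <field column="id" name="id"/>
        <field column="name" name="name"/>
        <field column="description" name="description"/>
      </entity>
    </document>
  </dataConfig>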

Hope this helps
- Mitch


Re: Index db data

2010-04-02 Thread MitchK

In addition to my first post:
The wiki gives an HTTP request for a full-import.
I haven't worked with SolrJ yet, but I think you need to copy the parts of
the URL that reflect the directory structure of your Solr instance.

For the example I suggested having a look at, I think it will look like
this, if your DataImportHandler is called "yourDataImportHandler":
/example/example-DIH/solr/db/yourDataImportHandler?command=full-import

Searching for "SolrJ" you may find some examples of a SolrJ client
application.


Re: Index db data

2010-04-02 Thread MitchK

No HTTP call. That's a misunderstanding.

For an HTTP call you would need a URL like this:
http://<host>:<port>/solr/dataimport?command=full-import

For the SolrJ client I *think* your query only needs to look like this:
/solr/dataimport?command=full-import

However, I have never worked with the SolrJ client, so maybe I am wrong.
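
A rough, untested sketch of how that might look from SolrJ (the host/port
and the /dataimport handler name are assumptions):

  import org.apache.solr.client.solrj.SolrServer;
  import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
  import org.apache.solr.common.params.ModifiableSolrParams;

  public class FullImport {
      public static void main(String[] args) throws Exception {
          SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
          ModifiableSolrParams params = new ModifiableSolrParams();
          params.set("qt", "/dataimport");        // send the request to the DIH handler
          params.set("command", "full-import");   // same command as the HTTP call above
          System.out.println(server.query(params));
      }
  }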


Re: Experience with indexing billions of documents?

2010-04-02 Thread darren
My guess is that you will need to take advantage of Solr 1.5's upcoming
cloud/cluster renovations and use multiple indexes to comfortably achieve
those numbers. Hypothetically, in that case, you won't be limited by single
index docid limitations of Lucene.

> We are currently indexing 5 million books in Solr, scaling up over the
> next few years to 20 million.  However we are using the entire book as a
> Solr document.  We are evaluating the possibility of indexing individual
> pages as there are some use cases where users want the most relevant pages
> regardless of what book they occur in.  However, we estimate that we are
> talking about somewhere between 1 and 6 billion pages and have concerns
> over whether Solr will scale to this level.
>
> Does anyone have experience using Solr with 1-6 billion Solr documents?
>
> The lucene file format document
> (http://lucene.apache.org/java/3_0_1/fileformats.html#Limitations)
> mentions a limit of about 2 billion document ids.   I assume this is the
> lucene internal document id and would therefore be a per index/per shard
> limit.  Is this correct?
>
>
> Tom Burton-West.
>
>
>
>



Re: Experience with indexing billions of documents?

2010-04-02 Thread Peter Sturge
You can do this today with multiple indexes, replication and distributed
searching.
SolrCloud/clustering will certainly make life easier when it comes to
managing these,
but with distributed searches over multiple indexes, you're limited only by
how much hardware you can throw at it.
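
For example, a single request can be fanned out across shards with the
shards parameter (host names here are hypothetical):

  http://shard1:8983/solr/select?q=*:*&shards=shard1:8983/solr,shard2:8983/solr,shard3:8983/solr

Each shard is still subject to Lucene's per-index document id limit, but the
total across all shards is not.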


On Fri, Apr 2, 2010 at 6:17 PM,  wrote:

> My guess is that you will need to take advantage of Solr 1.5's upcoming
> cloud/cluster renovations and use multiple indexes to comfortably achieve
> those numbers. Hypothetically, in that case, you won't be limited by single
> index docid limitations of Lucene.
>
> > [...]
>


Re: Experience with indexing billions of documents?

2010-04-02 Thread Rich Cariens
A colleague of mine is using native Lucene + some home-grown
patches/optimizations to index over 13B small documents in a 32-shard
environment, which is around 406M docs per shard.

If there's a 2B doc id limitation in Lucene then I assume he's patched it
himself.

On Fri, Apr 2, 2010 at 1:17 PM,  wrote:

> My guess is that you will need to take advantage of Solr 1.5's upcoming
> cloud/cluster renovations and use multiple indexes to comfortably achieve
> those numbers. Hypothetically, in that case, you won't be limited by single
> index docid limitations of Lucene.
>
> > [...]
>


Re: MoreLikeThis function queries

2010-04-02 Thread Blargy

Bueller? Anyone? :)


Re: MoreLikeThis function queries

2010-04-02 Thread Darren Govoni
It's Friday, dude. Give it a couple days. ;)

On Fri, 2010-04-02 at 11:50 -0800, Blargy wrote:

> Bueller? Anyone? :)




highlighter issue

2010-04-02 Thread Joe Calderon
Hello *, I have a field that indexes the string "the ex-girlfriend" as
these tokens: [the, exgirlfriend, ex, girlfriend], which are then passed to
the edge n-gram filter. This lets me match different user spellings and
allows partial highlighting. However, a token like 'ex' gets generated
twice, which should be fine, except that the highlighter seems to highlight
that token twice even though it has the same offsets (4,6).

Is there a way to make the highlighter not highlight the same token twice,
or do I have to create a token filter that would dump tokens with equal
text and offsets?


Basically, what's happening now is that if I search for 'the e', I get:
'Seinfeld The EEx-Girlfriend'

and for 'the ex', I get:
'Seinfeld The ExEx-Girlfriend'

and so on.


Thanks much

--joe


Re: Search across more than one field (dismax) ignored

2010-04-02 Thread MitchK

Hoss,

thank you for responding. This behaviour was caused by unexpected behaviour
of the ResourceLoader when given a UTF-8 BOM encoded file.
I have mentioned this in another thread on the mailing list; sorry for
forgetting to say it here as well.

Kind regards
- Mitch


Re: highlighter issue

2010-04-02 Thread Erik Hatcher

Will adding the RemoveDuplicatesTokenFilter(Factory) do the trick here?

Erik

On Apr 2, 2010, at 4:13 PM, Joe Calderon wrote:


[...]




Re: highlighter issue

2010-04-02 Thread Joe Calderon
I had tried it earlier with no effect. When I looked at the source, it
doesn't look at offsets at all, just position increments, so short of
somebody finding a better way I am going to create a similar filter that
compares offsets...
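
A rough, untested sketch of the kind of filter I mean (against the Lucene
2.9/3.0 attribute API) would be:

  import java.io.IOException;
  import org.apache.lucene.analysis.TokenFilter;
  import org.apache.lucene.analysis.TokenStream;
  import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
  import org.apache.lucene.analysis.tokenattributes.TermAttribute;

  public final class DedupOffsetsFilter extends TokenFilter {
      private final TermAttribute termAtt = addAttribute(TermAttribute.class);
      private final OffsetAttribute offsetAtt = addAttribute(OffsetAttribute.class);
      private String lastTerm;
      private int lastStart = -1, lastEnd = -1;

      public DedupOffsetsFilter(TokenStream input) {
          super(input);
      }

      @Override
      public boolean incrementToken() throws IOException {
          while (input.incrementToken()) {
              String term = termAtt.term();
              int start = offsetAtt.startOffset();
              int end = offsetAtt.endOffset();
              // drop a token whose text and offsets match the previous token exactly
              if (term.equals(lastTerm) && start == lastStart && end == lastEnd) {
                  continue;
              }
              lastTerm = term;
              lastStart = start;
              lastEnd = end;
              return true;
          }
          return false;
      }

      @Override
      public void reset() throws IOException {
          super.reset();
          lastTerm = null;
          lastStart = lastEnd = -1;
      }
  }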

On Fri, Apr 2, 2010 at 2:07 PM, Erik Hatcher  wrote:
> Will adding the RemoveDuplicatesTokenFilter(Factory) do the trick here?
>
>        Erik


Unable to load MailEntityProcessor or org.apache.solr.handler.dataimport.MailEntityProcessor

2010-04-02 Thread Andrew McCombe
Hi

I am experimenting with Solr to index my gmail and am experiencing an error:

'Unable to load MailEntityProcessor or
org.apache.solr.handler.dataimport.MailEntityProcessor'

I downloaded a fresh 1.4 tgz, extracted it, and added the following to
example/solr/conf/solrconfig.xml:

  <requestHandler name="/dataimport"
      class="org.apache.solr.handler.dataimport.DataImportHandler">
    <lst name="defaults">
      <str name="config">/home/andrew/bin/apache-solr-1.5-dev/example/solr/conf/email-data-config.xml</str>
    </lst>
  </requestHandler>

email-data-config.xml contained the following:

  <dataConfig>
    <document>
      <entity processor="MailEntityProcessor" .../>
    </document>
  </dataConfig>



Whenever I try to import data using /dataimport?command=full-import I
am seeing the error below:

Apr 2, 2010 10:14:51 PM
org.apache.solr.handler.dataimport.DataImporter doFullImport
SEVERE: Full Import failed
org.apache.solr.handler.dataimport.DataImportHandlerException: Unable
to load EntityProcessor implementation for entity:11418758786959
Processing Document # 1
at 
org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:72)
at 
org.apache.solr.handler.dataimport.DocBuilder.getEntityProcessor(DocBuilder.java:805)
at 
org.apache.solr.handler.dataimport.DocBuilder.buildDocument(DocBuilder.java:536)
at 
org.apache.solr.handler.dataimport.DocBuilder.doFullDump(DocBuilder.java:261)
at 
org.apache.solr.handler.dataimport.DocBuilder.execute(DocBuilder.java:185)
at 
org.apache.solr.handler.dataimport.DataImporter.doFullImport(DataImporter.java:333)
at 
org.apache.solr.handler.dataimport.DataImporter.runCmd(DataImporter.java:391)
at 
org.apache.solr.handler.dataimport.DataImporter$1.run(DataImporter.java:372)
Caused by: java.lang.ClassNotFoundException: Unable to load
MailEntityProcessor or
org.apache.solr.handler.dataimport.MailEntityProcessor
at 
org.apache.solr.handler.dataimport.DocBuilder.loadClass(DocBuilder.java:966)
at 
org.apache.solr.handler.dataimport.DocBuilder.getEntityProcessor(DocBuilder.java:802)
... 6 more
Caused by: org.apache.solr.common.SolrException: Error loading class
'MailEntityProcessor'
at 
org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:373)
at 
org.apache.solr.handler.dataimport.DocBuilder.loadClass(DocBuilder.java:956)
... 7 more
Caused by: java.lang.ClassNotFoundException: MailEntityProcessor
at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
at java.lang.ClassLoader.loadClass(ClassLoader.java:307)
at java.net.FactoryURLClassLoader.loadClass(URLClassLoader.java:592)
at java.lang.ClassLoader.loadClass(ClassLoader.java:252)
at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:320)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:247)
at 
org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:357)
... 8 more
Apr 2, 2010 10:14:51 PM org.apache.solr.update.DirectUpdateHandler2 rollback
INFO: start rollback
Apr 2, 2010 10:14:51 PM org.apache.solr.update.DirectUpdateHandler2 rollback
INFO: end_rollback


Am I missing a step somewhere? I have tried this with the standard
Apache 1.4 release, a nightly of 1.5 and also the LucidWorks release, and
get the same issue with each.  The wiki isn't very detailed either. My
background isn't in Java so a lot of this is new to me.
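
One thing that may or may not be the cause: in the 1.4 distribution,
MailEntityProcessor is packaged in the dataimporthandler-extras jar rather
than in the core DIH jar, so that jar (plus the JavaMail and activation jars
it depends on) has to be on Solr's classpath. A sketch of the <lib> entries
in solrconfig.xml; the directories below are only guesses and depend on
where the jars actually sit:

  <lib dir="../../dist/" regex="apache-solr-dataimporthandler-.*\.jar" />
  <lib dir="../../contrib/dataimporthandler/lib/" regex=".*\.jar" />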


Regards
Andrew McCombe


Re: MoreLikeThis function queries

2010-04-02 Thread Blargy

Fair enough :)


Related terms/combined terms

2010-04-02 Thread Blargy

Not sure of the exact vocabulary I am looking for so I'll try to explain
myself.

Given a search term, is there any way to return a list of related/grouped
keywords (based on the current state of the index) for that term?

For example, say I have a sports catalog and I search for "Callaway". Is
there anything that could give me back:

"Callaway Driver"
"Callaway Golf Balls"
"Callaway Hat"
"Callaway Glove"

These terms are always grouped together/related. Not sure if something like
this is even possible.

Thanks



Solr caches and nearly static indexes

2010-04-02 Thread Shawn Heisey
My index has a number of shards that are nearly static, each with about 
7 million documents.  By nearly static, I mean that the only changes 
that normally happen to them are document deletions, done with the xml 
update handler.  The process that does these deletions runs once every 
two minutes, and does them with a query on a field other than the one 
that's used for uniqueKey.  Once a day, I will be adding data to these 
indexes with the DIH delta-import.  One of my shards gets all new data 
once every two minutes, but it is less than 5% the size of the others.
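
For concreteness, each deletion pass posts something like the following to
the update handler (the field name here is just a stand-in), followed by a
commit:

  <delete><query>expire_date:[* TO NOW]</query></delete>
  <commit/>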


The problem that I'm running into is that every time a delete is 
committed, my caches are suddenly invalid and I seem to have two 
options: Spend a lot of time and I/O rewarming them, or suffer with slow 
(3 seconds or longer) search times.  Is there any way to have the index 
keep its caches when the only thing that happens is deletions, then 
invalidate them when it's time to actually add data?  It would have to 
be something I can dynamically change when switching between deletions 
and the daily import.
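
The only related knobs I know of are the per-cache autowarm settings in
solrconfig.xml, which control how much of each cache is regenerated when a
new searcher opens; the classes and sizes below are only placeholders:

  <filterCache class="solr.FastLRUCache" size="16384"
               initialSize="4096" autowarmCount="512"/>
  <queryResultCache class="solr.LRUCache" size="16384"
                    initialSize="4096" autowarmCount="256"/>

Those trade rewarming time against cold-cache queries, though, rather than
keeping the caches across a delete-only commit.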


Thanks,
Shawn