Sort for Retrieved Data

2012-01-20 Thread Bing Li
Dear all,

I have a question about sorting data retrieved from Solr. As far as I know, Lucene
retrieves data according to the degree of keyword matching on a text field
(partial matching).

If I search by a string field (exact matching), how does Lucene sort the
retrieved data?

If I add some filters, such as a time range, how does that affect the sorting?

If I only need the top results, is it enough to just set the rows parameter?

If I want to add new sorting criteria, how can I do that?
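For reference, all three knobs are exposed through SolrJ as well as plain HTTP parameters. A minimal sketch, inside a method that declares the SolrJ exceptions; the core URL and the field names (title, publishedTime) are assumptions:

// Filter by a time range, sort explicitly, and return only the top rows.
SolrServer solr = new CommonsHttpSolrServer("http://localhost:8080/solr/core0");
SolrQuery query = new SolrQuery("title:keyword");
query.addFilterQuery("publishedTime:[NOW-7DAYS TO NOW]"); // filters restrict results but do not change scoring
query.addSortField("publishedTime", SolrQuery.ORDER.desc); // overrides the default sort by relevance score
query.setRows(10);                                         // only the top 10 documents are returned
QueryResponse rsp = solr.query(query);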

Thanks so much!
Bing


How to Sort By a PageRank-Like Complicated Strategy?

2012-01-21 Thread Bing Li
Dear all,

I am using SolrJ to implement a system that needs to provide users with
search services. I have some questions about Solr searching, as follows.

As far as I know, Lucene retrieves data according to the degree of keyword
matching on a text field (partial matching).

But if I search by a string field (exact matching), how does Lucene sort
the retrieved data?

If I want to add new sorting criteria, Solr's function queries seem to support
this feature.
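For reference, a function query usually enters the request as a boost function (or, in later 3.x releases, directly in the sort parameter). A hypothetical request using an assumed numeric field named popularity:

http://localhost:8080/solr/select?q=keyword&defType=dismax&qf=title&bf=log(popularity)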

However, for a complicated ranking strategy such as PageRank, can Solr
provide an interface for me to do that?

My ranking approach is more complicated than PageRank. At the moment I have to
load all of the matched data from Solr by keyword first and re-rank it in my
own way before showing it to users. Is that correct?

Thanks so much!
Bing


Re: How to Sort By a PageRank-Like Complicated Strategy?

2012-01-21 Thread Bing Li
Hi, Kai,

Thanks so much for your reply!

If the retrieval is done on a string field rather than a text field, exact
matching should be used, according to my understanding, right? If so, how
does Lucene rank the retrieved data?

Best regards,
Bing

On Sun, Jan 22, 2012 at 5:56 AM, Kai Lu  wrote:

> Solr handles the retrieval step; you can customize the scoring formula in
> Lucene. But it should not be too complicated; ideally it is something that
> can be factored. It also depends on the stored information, such as TF, DF,
> positions, etc. You can do a second-phase rerank of the top N results you
> have retrieved.


Re: How to Sort By a PageRank-Like Complicated Strategy?

2012-01-22 Thread Bing Li
Dear Shashi,

Thanks so much for your reply!

However, I think the PageRank value is not static; it must be updated on the
fly. As far as I know, a Lucene index is not suitable for very frequent
updates. If so, how should I deal with that?

Best regards,
Bing
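For reference, the index-time boost that Shashi suggests in the reply quoted below would look roughly like this with the Lucene 3.x API; the field names and the source of the pageRank value are assumptions:

// Sketch only: write a document whose score is scaled by a precomputed rank.
// Assumes an open IndexWriter named writer and a precomputed float pageRank.
Document doc = new Document();
doc.add(new Field("url", url, Field.Store.YES, Field.Index.NOT_ANALYZED));
doc.add(new Field("content", content, Field.Store.NO, Field.Index.ANALYZED));
doc.setBoost(pageRank); // document-level boost, folded into the norms at index time
writer.addDocument(doc);

Note that such a boost is baked in at index time, so changing the rank means re-adding the document, which is exactly the frequent-update concern raised above.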


On Sun, Jan 22, 2012 at 12:43 PM, Shashi Kant  wrote:

> Lucene has a mechanism to "boost" up/down documents using your custom
> ranking algorithm. So if you come up with something like Pagerank
> you might do something like doc.SetBoost(myboost), before writing to index.


Re: How to Sort By a PageRank-Like Complicated Strategy?

2012-01-28 Thread Bing Li
Dear Shashi,

As I understand it, big data such as a Lucene index is not suitable for
frequent updates. Frequent updating hurts performance and consistency when the
index has to be replicated across a large-scale cluster. Such a search engine
is expected to work in a write-once, read-many environment, right? That is
what HDFS (the Hadoop Distributed File System) provides. In my experience,
updating a Lucene index is really slow.

Why did you say I could update the Lucene index frequently?

Thanks so much!
Bing

On Mon, Jan 23, 2012 at 11:02 PM, Shashi Kant  wrote:

> You can update the document in the index quite frequently. IDNK what
> your requirement is, another option would be to boost query time.


How is Data Indexed in HBase?

2012-02-22 Thread Bing Li
Dear all,

I wonder how data in HBase is indexed. Solr is currently used in my system
because its data is managed in an inverted index, which is suitable for
retrieving huge amounts of unstructured data. How does HBase deal with this
issue? Could I replace Solr with HBase?
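For reference, HBase does not keep an inverted index over cell values; rows are stored sorted by row key and are retrieved by key lookups or key-range scans (secondary indexes were not available at the time, as the follow-up thread notes). A minimal lookup with the 0.90-era Java client; the table, row key, and column names are assumptions:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class RankLookup
{
    public static void main(String[] args) throws Exception
    {
        Configuration conf = HBaseConfiguration.create();
        // Rows in the "pages" table are sorted by row key (here, the page URL),
        // so single-row reads and range scans are fast; there is no keyword index.
        HTable table = new HTable(conf, "pages");
        Get get = new Get(Bytes.toBytes("http://example.com/some-page"));
        Result result = table.get(get);
        byte[] rank = result.getValue(Bytes.toBytes("meta"), Bytes.toBytes("rank"));
        System.out.println(rank == null ? "no rank" : String.valueOf(Bytes.toDouble(rank)));
        table.close();
    }
}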

Thanks so much!

Best regards,
Bing


Re: Solr & HBase - Re: How is Data Indexed in HBase?

2012-02-22 Thread Bing Li
Mr Gupta,

Thanks so much for your reply!

Retrieving data by keyword is one of my use cases, and I think Solr is a
proper choice for it.

However, Solr does not provide rich enough support for ranking, and frequent
updating is also not a good fit for Solr. So it is difficult to retrieve data
based on values other than keyword frequency in text. For that case, I am
attempting to use HBase.

But I don't know how HBase supports high performance when it needs to keep
consistency in a large-scale distributed system.

Now both of them are used in my system.

I will check out ElasticSearch.

Best regards,
Bing


On Thu, Feb 23, 2012 at 1:35 AM, T Vinod Gupta wrote:

> Bing,
> Its a classic battle on whether to use solr or hbase or a combination of
> both. both systems are very different but there is some overlap in the
> utility. they also differ vastly when it compares to computation power,
> storage needs, etc. so in the end, it all boils down to your use case. you
> need to pick the technology that it best suited to your needs.
> im still not clear on your use case though.
>
> btw, if you haven't started using solr yet - then you might want to
> checkout ElasticSearch. I spent over a week researching between solr and ES
> and eventually chose ES due to its cool merits.
>
> thanks
>
>
> On Wed, Feb 22, 2012 at 9:31 AM, Ted Yu  wrote:
>
>> There is no secondary index support in HBase at the moment.
>>
>> It's on our road map.
>>
>> FYI
>>
>> On Wed, Feb 22, 2012 at 9:28 AM, Bing Li  wrote:
>>
>> > Jacques,
>> >
>> > Yes. But I still have questions about that.
>> >
>> > In my system, when users search with an arbitrary keyword, the query is
>> > forwarded to Solr. There are no update operations on the Solr-managed
>> > data, only appends of new indexes.
>> >
>> > When I need to retrieve data based on ranking values, HBase is used, and
>> > the ranking values need to be updated all the time.
>> >
>> > Is that correct?
>> >
>> > My concern is that performance must be low if consistency has to be kept
>> > in a large-scale distributed environment. How does HBase handle this issue?
>> >
>> > Thanks so much!
>> >
>> > Bing
>> >
>> >
>> > On Thu, Feb 23, 2012 at 1:17 AM, Jacques  wrote:
>> >
>> > > It is highly unlikely that you could replace Solr with HBase.  They're
>> > > really apples and oranges.
>> > >


Re: Solr & HBase - Re: How is Data Indexed in HBase?

2012-02-23 Thread Bing Li
Dear Mr Gupta,

Your understanding of my solution is correct. Both HBase and Solr are now
used in my system, and I hope the combination will work.

Thanks so much for your reply!

Best regards,
Bing
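For reference, a rough sketch of the split Gupta describes in the reply below: Solr answers the keyword query, and HBase supplies the frequently updated rank used for the final ordering. The core URL, table, column, and field names are all assumptions, and the usual SolrJ and HBase imports and throws clauses are omitted for brevity:

// Keyword search in Solr, then a rank lookup per hit in HBase.
SolrServer solr = new CommonsHttpSolrServer("http://localhost:8080/solr/pages");
HTable ranks = new HTable(HBaseConfiguration.create(), "page_ranks");

SolrQuery query = new SolrQuery("content:keyword");
query.setRows(100);
SolrDocumentList docs = solr.query(query).getResults();

final Map<String, Double> rankByUrl = new HashMap<String, Double>();
for (SolrDocument doc : docs)
{
    String url = (String) doc.getFieldValue("url");
    Result row = ranks.get(new Get(Bytes.toBytes(url)));
    byte[] value = row.getValue(Bytes.toBytes("meta"), Bytes.toBytes("rank"));
    rankByUrl.put(url, value == null ? 0.0 : Bytes.toDouble(value));
}

// Re-sort the Solr hits by the rank fetched from HBase before showing them to users.
Collections.sort(docs, new Comparator<SolrDocument>()
{
    public int compare(SolrDocument a, SolrDocument b)
    {
        return rankByUrl.get((String) b.getFieldValue("url"))
                .compareTo(rankByUrl.get((String) a.getFieldValue("url")));
    }
});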

On Fri, Feb 24, 2012 at 3:30 AM, T Vinod Gupta wrote:

> regarding your question on hbase support for high performance and
> consistency - i would say hbase is highly scalable and performant. how it
> does what it does can be understood by reading relevant chapters around
> architecture and design in the hbase book.
>
> with regards to ranking, i see your problem. but if you split the problem
> into hbase specific solution and solr based solution, you can achieve the
> results probably. may be you do the ranking and store the rank in hbase and
> then use solr to get the results and then use hbase as a lookup to get the
> rank. or you can put the rank as part of the document schema and index the
> rank too for range queries and such. is my understanding of your scenario
> wrong?
>
> thanks


Re: pagerank??

2012-04-04 Thread Bing Li
To my knowledge, Solr cannot support this by itself.

In my case, I retrieve data from Solr by keyword matching and then rank the
data by PageRank afterwards.

Thanks,
Bing

On Wed, Apr 4, 2012 at 6:37 AM, Manuel Antonio Novoa Proenza <
mano...@estudiantes.uci.cu> wrote:

> Hello,
>
> I have in my Solr index , many indexed documents.
>
> Let me know any way or efficient function to calculate the page rank of
> websites indexed.
>
>


How to Transmit and Append Indexes

2010-11-19 Thread Bing Li
Hi, all,

I am working on a distributed search system. Right now I have only one server.
It has to crawl pages from the Web, generate indexes locally, and respond to
users' queries. I think this is too much load for it to work smoothly.

I plan to use at least two servers. The jobs of crawling pages and generating
indexes are done by one of them. After that, the newly available indexes
should be transmitted to the other one, which is responsible for responding to
users' queries. From the users' point of view, this system must be fast.
However, I don't know how to get the incremental indexes that I should
transmit. After transmission, how do I append them to the old indexes? Does
the appending block searching?

Thanks so much for your help!

Bing Li


Is it fine to transmit indexes in this way?

2010-11-19 Thread Bing Li
Hi, all,

Since I didn't find that Lucene presents updated indexes to us, may I
transmit indexes in the following way?

1) One indexing machine, A, is busy with generating indexes;

2) After a certain time, the indexing process is terminated;

3) Then, the new indexes are transmitted to machines which serve users'
queries;

4) It is possible that some index files have the same names. So the
conflicting files should be renamed;

5) After the transmission is done, the transmitted indexes are removed from
A.

6) After the removal, the indexing process is started again on A.

The reason I am trying to do this is to balance the search load: one machine
is responsible for generating indexes and the others are responsible for
responding to queries.

If the above approaches do not work, may I see the updates of indexes in
Lucene? May I transmit them? And, may I append them to existing indexes?
Does the appending affect the querying?

I am learning Solr, and it seems that Solr does this for me. However, I have
to set up Tomcat to use Solr, which I think is a little heavy.

Thanks!
Bing Li


Re: Is it fine to transmit indexes in this way?

2010-11-19 Thread Bing Li
Thanks so much, Gora!

What do you mean by appending? If you mean adding to an existing index
(on reindexing, this would normally mean an update for an existing Solr
document ID, and a create for a new Solr document ID), the best way
probably is not to delete the index on the master server (what you call
machine A). Once the indexing is completed, a commit ensures that new
documents show up for any subsequent queries.

When updates are replicated to slave servers, I assume the updates are merged
with the existing indexes and reads on them can continue concurrently. If so,
queries can be answered without interruption. That's what I mean by
"appending". Is that what happens in Solr?
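For reference, the replication Gora points to below is configured in solrconfig.xml on both sides. A minimal sketch, with the host name, port, and poll interval as assumptions:

<!-- On the master (the indexing machine): replicate after every commit. -->
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="master">
    <str name="replicateAfter">commit</str>
    <str name="confFiles">schema.xml,stopwords.txt</str>
  </lst>
</requestHandler>

<!-- On each slave (the query machines): poll the master periodically. -->
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="slave">
    <str name="masterUrl">http://master-host:8080/solr/replication</str>
    <str name="pollInterval">00:00:60</str>
  </lst>
</requestHandler>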

Best,
Bing

On Sat, Nov 20, 2010 at 1:58 AM, Gora Mohanty  wrote:

> On Fri, Nov 19, 2010 at 10:53 PM, Bing Li  wrote:
> > Hi, all,
> >
> > Since I didn't find that Lucene presents updated indexes to us, may I
> > transmit indexes in the following way?
> >
> > 1) One indexing machine, A, is busy with generating indexes;
> >
> > 2) After a certain time, the indexing process is terminated;
> >
> > 3) Then, the new indexes are transmitted to machines which serve users'
> > queries;
>
> Just replied to a similar question in another thread. The best way
> is probably to use Solr replication:
> http://wiki.apache.org/solr/SolrReplication
>
> You can set up replication to happen automatically upon commit on the
> master server (where the new index was made). As a commit should
> have been made when indexing is complete on the master server, this
> will then ensure that a new index is replicated on the slave server.
>
> > 4) It is possible that some index files have the same names. So the
> > conflicting files should be renamed;
>
> Replication will handle this for you.
>
> > 5) After the transmission is done, the transmitted indexes are removed
> from
> > A.
> >
> > 6) After the removal, the indexing process is started again on A.
> [...]
>
> These two items you have to do manually, i.e., delete all documents
> on A, and restart the indexing.
>
>
> > And, may I append them to
> existing indexes?
> > Does the appending affect the querying?
> [...]
>
> What do you mean by appending? If you mean adding to an existing index
> (on reindexing, this would normally mean an update for an existing Solr
> document ID, and a create for a new Solr document ID), the best way
> probably is not to delete the index on the master server (what you call
> machine A). Once the indexing is completed, a commit ensures that new
> documents show up for any subsequent queries.
>

> Regards,
> Gora
>


Re: How to Transmit and Append Indexes

2010-11-19 Thread Bing Li
Dear Erick,

Thanks so much for your help! I am new in Solr. So I have no idea about the
version.

But I wonder what are the differences between Solr and Hadoop? It seems that
Solr has done the same as what Hadoop promises.

Best,
Bing

On Sat, Nov 20, 2010 at 2:28 AM, Erick Erickson wrote:

> You haven't said what version of Solr you're using, but you're
> asking about replication, which is built-in.
> See: http://wiki.apache.org/solr/SolrReplication
>
> And no, your slave doesn't block while the update is happening,
> and it automatically switches to the updated index upon
> successful replication.
>
> Older versions of Solr used rsynch & etc.
>
> Best
> Erick


Re: How to Transmit and Append Indexes

2010-11-19 Thread Bing Li
Hi, Gora,

No, I was really wondering whether Solr is based on Hadoop.

Hadoop is efficient for search engines since it suits the write-once,
read-many model. After reading your emails, it looks like Solr's distributed
file system does the same thing. Both of them are good for searching large
indexes in a large-scale distributed environment, right?

Thanks!
Bing


On Sat, Nov 20, 2010 at 3:01 AM, Gora Mohanty  wrote:

> On Sat, Nov 20, 2010 at 12:05 AM, Bing Li  wrote:
> > Dear Erick,
> >
> > Thanks so much for your help! I am new in Solr. So I have no idea about
> the
> > version.
>
> The solr/admin/registry.jsp URL on your local Solr installation should show
> you the version at the top.
>
> > But I wonder what are the differences between Solr and Hadoop? It seems
> that
> > Solr has done the same as what Hadoop promises.
> [...]
>
> Er, what? Solr and Hadoop are entirely different applications. Did you
> mean Lucene or Nutch, instead of Hadoop?
>
> Regards,
> Gora
>


Import Data Into Solr

2010-12-02 Thread Bing Li
Hi, all,

I am a new user of Solr. Before using it, I indexed all of my data myself with
Lucene. According to Chapter 3 of the book Solr 1.4 Enterprise Search Server,
written by David Smiley and Eric Pugh, data in formats such as XML, CSV, and
even PDF can be imported into Solr.

If I wish to import my existing Lucene indexes into Solr, are there any other
approaches? I know that Solr is essentially Lucene wrapped in a server.

Thanks,
Bing Li


Solr Got Exceptions When "schema.xml" is Changed

2010-12-04 Thread Bing Li
Dear all,

I am a new user of Solr and am just trying some basic samples. Solr starts
correctly with Tomcat.

However, when I put a new schema.xml under SolrHome/conf and start Tomcat
again, I get the following two exceptions.

Solr cannot start correctly unless the initial schema.xml shipped with Solr is
used.

Why can't I change the schema.xml?
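For reference, the second exception below names the likely cause: the QueryElevationComponent configured in solrconfig.xml requires the schema's uniqueKey field to use a StrField type. A minimal schema.xml fragment that satisfies this, with the field name as an assumption:

<fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/>

<field name="id" type="string" indexed="true" stored="true" required="true"/>

<uniqueKey>id</uniqueKey>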

Thanks so much!
Bing

Dec 5, 2010 4:52:49 AM org.apache.solr.common.SolrException log
SEVERE: java.lang.NullPointerException
at
org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:173)
at
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
at
org.apache.solr.core.QuerySenderListener.newSearcher(QuerySenderListener.java:52)
at org.apache.solr.core.SolrCore$3.call(SolrCore.java:1146)
at
java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)

-

SEVERE: Could not start SOLR. Check solr/home property
org.apache.solr.common.SolrException: QueryElevationComponent requires the
schema to have a uniqueKeyField implemented using StrField
at
org.apache.solr.handler.component.QueryElevationComponent.inform(QueryElevationComponent.java:157)
at
org.apache.solr.core.SolrResourceLoader.inform(SolrResourceLoader.java:508)
at org.apache.solr.core.SolrCore.<init>(SolrCore.java:588)
at
org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:137)
at
org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:83)
at
org.apache.catalina.core.ApplicationFilterConfig.initFilter(ApplicationFilterConfig.java:273)

at
org.apache.catalina.core.ApplicationFilterConfig.getFilter(ApplicationFilterConfig.java:254)
at
org.apache.catalina.core.ApplicationFilterConfig.setFilterDef(ApplicationFilterConfig.java:372)
at
org.apache.catalina.core.ApplicationFilterConfig.<init>(ApplicationFilterConfig.java:98)
at
org.apache.catalina.core.StandardContext.filterStart(StandardContext.java:4405)
at
org.apache.catalina.core.StandardContext.startInternal(StandardContext.java:5037)
at
org.apache.catalina.util.LifecycleBase.start(LifecycleBase.java:140)
at
org.apache.catalina.core.ContainerBase.addChildInternal(ContainerBase.java:812)
at
org.apache.catalina.core.ContainerBase.addChild(ContainerBase.java:787)
at
org.apache.catalina.core.StandardHost.addChild(StandardHost.java:570)
at
org.apache.catalina.startup.HostConfig.deployWAR(HostConfig.java:891)
at
org.apache.catalina.startup.HostConfig.deployWARs(HostConfig.java:683)
at
org.apache.catalina.startup.HostConfig.deployApps(HostConfig.java:466)
at
org.apache.catalina.startup.HostConfig.start(HostConfig.java:1267)
at
org.apache.catalina.startup.HostConfig.lifecycleEvent(HostConfig.java:308)
at
org.apache.catalina.util.LifecycleSupport.fireLifecycleEvent(LifecycleSupport.java:119)
at
org.apache.catalina.util.LifecycleBase.fireLifecycleEvent(LifecycleBase.java:89)
at
org.apache.catalina.util.LifecycleBase.setState(LifecycleBase.java:328)
at
org.apache.catalina.util.LifecycleBase.setState(LifecycleBase.java:308)
at
org.apache.catalina.core.ContainerBase.startInternal(ContainerBase.java:1043)
at
org.apache.catalina.core.StandardHost.startInternal(StandardHost.java:738)
at
org.apache.catalina.util.LifecycleBase.start(LifecycleBase.java:140)
at
org.apache.catalina.core.ContainerBase.startInternal(ContainerBase.java:1035)
at
org.apache.catalina.core.StandardEngine.startInternal(StandardEngine.java:289)
at
org.apache.catalina.util.LifecycleBase.start(LifecycleBase.java:140)
at
org.apache.catalina.core.StandardService.startInternal(StandardService.java:442)
at
org.apache.catalina.util.LifecycleBase.start(LifecycleBase.java:140)
at
org.apache.catalina.core.StandardServer.startInternal(StandardServer.java:674)
at
org.apache.catalina.util.LifecycleBase.start(LifecycleBase.java:140)
at org.apache.catalina.startup.Catalina.start(Catalina.java:596)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apach

SolrHome and Solr Data Dir in solrconfig.xml

2010-12-09 Thread Bing Li
Dear all,

I am a new user of Solr.

When using Solr, SolrHome is set to /home/libing/Solr. When Tomcat is started,
it reads solrconfig.xml to find the Solr data dir, which holds the indexes.
However, I have no idea how to associate SolrHome with the Solr data dir, so a
mistake occurs: all the indexes are put under $TOMCAT_HOME/bin. This is NOT
what I expect. I want the indexes to be under SolrHome.
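For reference, the index location is controlled by the dataDir element in solrconfig.xml; the default is a path relative to the working directory Tomcat was started from, which is why the index lands under $TOMCAT_HOME/bin. Pointing it at an absolute path under the Solr home, roughly as below, avoids that (the path is an assumption):

<!-- In solrconfig.xml: an absolute path keeps the index out of Tomcat's working directory. -->
<dataDir>${solr.data.dir:/home/libing/Solr/data}</dataDir>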

Could you please give me a hand?

Best,
Bing Li


Indexing and Searching Chinese

2011-01-18 Thread Bing Li
Hi, all,

Currently I cannot get any results when querying the index with Chinese
keywords.

Before using Solr, I used Lucene for some time. Since I need to crawl
some Chinese sites, I used ChineseAnalyzer in my Lucene code.

I know Solr is a server for Lucene. However, I have no idea how to
configure the analyzer in Solr.

I appreciate your help so much!

Best,
LB


Indexing and Searching Chinese with SolrNet

2011-01-18 Thread Bing Li
Dear all,

After reading some pages on the Web, I created the index with the following
schema.

..
<fieldType name="..." class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.ChineseTokenizerFactory"/>
  </analyzer>
</fieldType>
..

It must be correct, right? However, when sending a query through SolrNet, no
results are returned. Could you tell me what the reason is?

Thanks,
LB


Re: Indexing and Searching Chinese with SolrNet

2011-01-18 Thread Bing Li
Dear Jelsma,

My servlet container is Tomcat 7. I think it should accept Chinese
characters, but I am not sure how to configure it. From the Tomcat console, I
saw that the Chinese characters in the query are not displayed properly.
However, they look fine in the Solr admin page.

I am also not sure whether SolrNet supports Chinese. If not, how can I
interact with Solr from .NET?

Thanks so much!
LB


On Wed, Jan 19, 2011 at 2:34 AM, Markus Jelsma
wrote:

> Why creating two threads for the same problem? Anyway, is your servlet
> container capable of accepting UTF-8 in the URL? Also, is SolrNet capable
> of
> handling those characters? To confirm, try a tool like curl.


Re: Indexing and Searching Chinese with SolrNet

2011-01-18 Thread Bing Li
Dear Jelsma,

After configuring the Tomcat URIEncoding, Chinese characters are processed
correctly. Thanks so much for your help!
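For reference, the setting in question lives on the HTTP connector in Tomcat's conf/server.xml (the wiki page Markus links below describes it); the port and the other attributes here are assumptions:

<Connector port="8080" protocol="HTTP/1.1"
           connectionTimeout="20000"
           redirectPort="8443"
           URIEncoding="UTF-8"/>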

Best,
LB

On Wed, Jan 19, 2011 at 3:02 AM, Markus Jelsma
wrote:

> Hi,
>
> Yes but Tomcat might need to be configured to accept, see the wiki for more
> information on this subject.
>
> http://wiki.apache.org/solr/SolrTomcat#URI_Charset_Config
>
> Cheers,
>


SolrJ Tutorial

2011-01-21 Thread Bing Li
Hi, all,

In the past, I always used SolrNet to interact with Solr, and it works great.
Now I need to use SolrJ. I think it should be easier than SolrNet, since Solr
and SolrJ are homogeneous. But I cannot find a tutorial that is easy to
follow: no tutorial explains SolrJ programming step by step, and no complete
samples are available. Could anybody point me to some online resources for
learning SolrJ?

I also noticed Solr Cell and the SolrJ POJO support. Do you have detailed
resources on them?

Thanks so much!
LB


Re: SolrJ Tutorial

2011-01-22 Thread Bing Li
I found the solution. Below is a complete code sample I wrote.

Thanks,
LB

package com.greatfree.Solr;

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.common.SolrDocumentList;

import java.net.MalformedURLException;

public class SolrJExample
{
    public static void main(String[] args) throws MalformedURLException, SolrServerException
    {
        // Query an existing core and print the total number of matching documents.
        SolrServer solr = new CommonsHttpSolrServer("http://192.168.210.195:8080/solr/CategorizedHub");

        SolrQuery query = new SolrQuery();
        query.setQuery("*:*");
        QueryResponse rsp = solr.query(query);
        SolrDocumentList docs = rsp.getResults();
        System.out.println(docs.getNumFound());

        try
        {
            // Index a bean (a POJO annotated with @Field) into another core and commit it.
            SolrServer solrScore = new CommonsHttpSolrServer("http://192.168.210.195:8080/solr/score");
            Score score = new Score();
            score.id = "4";
            score.type = "modern";
            score.name = "iphone";
            score.score = 97;
            solrScore.addBean(score);
            solrScore.commit();
        }
        catch (Exception e)
        {
            System.out.println(e.toString());
        }
    }
}
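The Score class is not shown in the message above; as a SolrJ bean it is presumably a plain POJO whose members carry @Field annotations matching the "score" core's schema. A sketch with the member names inferred from the assignments above:

package com.greatfree.Solr;

import org.apache.solr.client.solrj.beans.Field;

// Hypothetical bean for the "score" core; member names inferred from the sample above.
public class Score
{
    @Field
    public String id;

    @Field
    public String type;

    @Field
    public String name;

    @Field
    public int score;
}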


On Sat, Jan 22, 2011 at 3:58 PM, Lance Norskog  wrote:

> The unit tests are simple and show the steps.
>
> Lance
>
>
>
> --
> Lance Norskog
> goks...@gmail.com
>


SolrDocumentList Size vs NumFound

2011-01-26 Thread Bing Li
Dear all,

I ran into a weird problem. The number of matching documents is much more than
10. However, the size of the SolrDocumentList is 10, while getNumFound() is
the exact total count of results. When I iterate over the results as follows,
only 10 are displayed. How do I get the rest?

..
for (SolrDocument doc : docs)
{
    System.out.println(doc.getFieldValue(Fields.CATEGORIZED_HUB_TITLE_FIELD) + ": "
        + doc.getFieldValue(Fields.CATEGORIZED_HUB_URL_FIELD) + "; "
        + doc.getFieldValue(Fields.HUB_CATEGORY_NAME_FIELD) + "/"
        + doc.getFieldValue(Fields.HUB_PARENT_CATEGORY_NAME_FIELD));
}
..

Could you give me a hand?
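For reference, the list only holds one page of results (the rows parameter defaults to 10), while getNumFound() reports the total. Paging through the rest with SolrJ might look roughly like this, assuming query and solr are the SolrQuery and SolrServer used to obtain docs:

int pageSize = 100;
query.setRows(pageSize);
for (int start = 0; start < docs.getNumFound(); start += pageSize)
{
    query.setStart(start); // offset of this page within the full result set
    SolrDocumentList page = solr.query(query).getResults();
    for (SolrDocument doc : page)
    {
        // process each document of the current page here
    }
}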

Thanks,
LB


Open Too Many Files

2011-02-02 Thread Bing Li
Dear all,

I got an exception when querying the index in Solr. It says that too many
files are open. How can I handle this problem?

Thanks so much!
LB

[java] org.apache.solr.client.solrj.SolrServerException: java.net.SocketException: Too many open files
 [java] at
org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:483)
 [java] at
org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:244)
 [java] at
org.apache.solr.client.solrj.request.QueryRequest.process(QueryRequest.java:89)
 [java] at
org.apache.solr.client.solrj.SolrServer.query(SolrServer.java:118)
 [java] at com.greatfree.Solr.Broker.Search(Broker.java:145)
 [java] at
com.greatfree.Solr.SolrIndex.SelectHubPageHashByHubKey(SolrIndex.java:116)
 [java] at com.greatfree.Web.HubCrawler.Crawl(Unknown Source)
 [java] at com.greatfree.Web.Worker.run(Unknown Source)
 [java] at java.lang.Thread.run(Thread.java:662)
 [java] Caused by: java.net.SocketException: Too many open files
 [java] at java.net.Socket.createImpl(Socket.java:397)
 [java] at java.net.Socket.<init>(Socket.java:371)
 [java] at java.net.Socket.<init>(Socket.java:249)
 [java] at
org.apache.commons.httpclient.protocol.DefaultProtocolSocketFactory.createSocket(DefaultProtocolSocketFactory.java:80)
 [java] at
org.apache.commons.httpclient.protocol.DefaultProtocolSocketFactory.createSocket(DefaultProtocolSocketFactory.java:122)
 [java] at
org.apache.commons.httpclient.HttpConnection.open(HttpConnection.java:707)
 [java] at
org.apache.commons.httpclient.MultiThreadedHttpConnectionManager$HttpConnectionAdapter.open(MultiThreadedHttpConnectionManager.java:1361)
 [java] at
org.apache.commons.httpclient.HttpMethodDirector.executeWithRetry(HttpMethodDirector.java:387)
 [java] at
org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:171)
 [java] at
org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:397)
 [java] at
org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:323)
 [java] at
org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:427)
 [java] ... 8 more
 [java] Exception in thread "Thread-96" java.lang.NullPointerException
 [java] at
com.greatfree.Solr.SolrIndex.SelectHubPageHashByHubKey(SolrIndex.java:117)
 [java] at com.greatfree.Web.HubCrawler.Crawl(Unknown Source)
 [java] at com.greatfree.Web.Worker.run(Unknown Source)
 [java] at java.lang.Thread.run(Thread.java:662)
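For reference, this exception usually comes either from the process's file-descriptor limit (ulimit -n) or from constructing a new CommonsHttpSolrServer for every request, which leaks sockets; the SolrJ server object is thread-safe and meant to be shared. A sketch of the shared-instance approach, with the class name and core URL as assumptions:

import java.net.MalformedURLException;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

public final class SolrClientHolder
{
    // One CommonsHttpSolrServer per core for the whole JVM; crawler threads
    // should reuse it instead of creating a new connection per query.
    private static final SolrServer HUB_SERVER;

    static
    {
        try
        {
            HUB_SERVER = new CommonsHttpSolrServer("http://192.168.210.195:8080/solr/CategorizedHub");
        }
        catch (MalformedURLException e)
        {
            throw new ExceptionInInitializerError(e);
        }
    }

    public static SolrServer getHubServer()
    {
        return HUB_SERVER;
    }
}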


Re: Solr Out of Memory Error

2011-02-09 Thread Bing Li
Dear Adam,

I also got the OutOfMemory exception. I changed the JAVA_OPTS in catalina.sh
as follows.

    ...
    if [ -z "$LOGGING_MANAGER" ]; then
      JAVA_OPTS="$JAVA_OPTS -Djava.util.logging.manager=org.apache.juli.ClassLoaderLogManager"
    else
      JAVA_OPTS="$JAVA_OPTS -server -Xms8096m -Xmx8096m"
    fi
    ...

Is this change correct? After making it, I still get the same exception. The
index is updated and searched frequently, and I am trying to change my code to
avoid the frequent updates. I guess changing JAVA_OPTS alone does not work.
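For reference, as written above the -Xms/-Xmx flags are only appended in the else branch, that is, when LOGGING_MANAGER is already set, so they may never take effect. A simpler approach is usually to leave catalina.sh untouched and export JAVA_OPTS before startup (or put the line in bin/setenv.sh, which recent Tomcat versions read automatically); the heap size below just reuses the value from the snippet above:

export JAVA_OPTS="$JAVA_OPTS -server -Xms8096m -Xmx8096m"
$TOMCAT_HOME/bin/catalina.sh start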

Could you give me some help?

Thanks,
LB


On Wed, Jan 19, 2011 at 10:05 PM, Adam Estrada <
estrada.adam.gro...@gmail.com> wrote:

> Is anyone familiar with the environment variable, JAVA_OPTS? I set
> mine to a much larger heap size and never had any of these issues
> again.
>
> JAVA_OPTS = -server -Xms4048m -Xmx4048m
>
> Adam
>
> On Wed, Jan 19, 2011 at 3:29 AM, Isan Fulia 
> wrote:
> > Hi all,
> > By adding more servers do u mean sharding of index.And after sharding ,
> how
> > my query performance will be affected .
> > Will the query execution time increase.
> >
> > Thanks,
> > Isan Fulia.
> >
> > On 19 January 2011 12:52, Grijesh  wrote:
> >
> >>
> >> Hi Isan,
> >>
> >> It seems your index size 25GB si much more compared to you have total
> Ram
> >> size is 4GB.
> >> You have to do 2 things to avoid Out Of Memory Problem.
> >> 1-Buy more Ram ,add at least 12 GB of more ram.
> >> 2-Increase the Memory allocated to solr by setting XMX values.at least
> 12
> >> GB
> >> allocate to solr.
> >>
> >> But if your all index will fit into the Cache memory it will give you
> the
> >> better result.
> >>
> >> Also add more servers to load balance as your QPS is high.
> >> Your 7 Laks data makes 25 GB of index its looking quite high.Try to
> lower
> >> the index size
> >> What are you indexing in your 25GB of index?
> >>
> >> -
> >> Thanx:
> >> Grijesh
> >> --
> >> View this message in context:
> >>
> http://lucene.472066.n3.nabble.com/Solr-Out-of-Memory-Error-tp2280037p2285779.html
> >> Sent from the Solr - User mailing list archive at Nabble.com.
> >>
> >
> >
> >
> > --
> > Thanks & Regards,
> > Isan Fulia.
> >
>


Detailed Steps for Scaling Solr

2011-02-11 Thread Bing Li
Dear all,

I need to build a site that supports searching over a large index, so I think
scaling Solr is required. However, I haven't found a tutorial that walks me
through it step by step. I only have two resources as references, but neither
of them tells me the exact operations.

1)
http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Scaling-Lucene-and-Solr

2) David Smiley, Eric Pugh; Solr 1.4 Enterprise Search Server

If you have experience scaling Solr, could you share such tutorials?

Thanks so much!
LB


My Plan to Scale Solr

2011-02-17 Thread Bing Li
Dear all,

I started learning how to use Solr three months ago, so my experience is
still limited.

Now I crawl Web pages with my crawler and send the data to a single Solr
server. It runs fine.

Since the number of potential users is large, I have decided to scale Solr.
After configuring replication, a single index can be replicated to multiple
servers.

I think sharding is also required. I plan to split the index according to the
data categories and priorities. After that, I will use the replication
techniques above and get high performance. The remaining work should not be
too difficult; a distributed query is sketched below.
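For reference, once the index is split, a query can be spread across the pieces with the shards request parameter (each shard is itself a plain Solr core); the host names here are assumptions:

http://host1:8080/solr/select?q=keyword&shards=host1:8080/solr,host2:8080/solr,host3:8080/solr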

I have noticed some new terms, such as SolrCloud, Katta, and ZooKeeper.
According to my current understanding, it seems that I can ignore them. Am I
right? What benefits would I get from using them?

Thanks so much!
LB


Selection Between Solr and Relational Database

2011-03-03 Thread Bing Li
Dear all,

I have been learning Solr for two months. At least right now, my system
runs well on a Solr cluster.

I have a question about implementing one feature in my system. When
retrieving documents by keyword, I believe Solr is faster than a relational
database. However, for the following operations, I guess the performance would
be lower. Is that right?

What I am trying to do is listed as follows.

1) All of the documents in Solr have one field that is used to
differentiate them; different categories have different values in that
field, e.g. Group; the documents are classified as "news", "sports",
"entertainment" and so on.

2) Retrieve all of the documents by the field Group.

3) Besides the Group field, there is another field called CreatedTime. I will
filter the documents retrieved by Group according to the value of CreatedTime;
the filtered documents are the final results I need.

I guess the performance of these operations is lower than in a relational
database, right? Could you please give me an explanation of that?
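For reference, steps 2) and 3) above map directly onto Solr filter queries, which are cached independently of the main query. A hypothetical request with the field names above (URL-encoding of spaces and brackets omitted for readability):

http://localhost:8080/solr/select?q=*:*&fq=Group:news&fq=CreatedTime:[NOW-7DAYS TO NOW]&sort=CreatedTime desc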

Best regards,
Li Bing


Re: SolrJ Tutorial

2011-03-03 Thread Bing Li
Dear Lance,

Could you tell me where I can find the unit tests code?

I appreciate so much for your help!

Best regards,
LB

On Sat, Jan 22, 2011 at 3:58 PM, Lance Norskog  wrote:

> The unit tests are simple and show the steps.
>
> Lance
>
>
>
>
> --
> Lance Norskog
> goks...@gmail.com
>


When Index is Updated Frequently

2011-03-04 Thread Bing Li
Dear all,

In my experience, when a Lucene index is updated frequently, its
performance becomes low. Is that correct?

In my system, most data crawled from the Web is indexed, and the
corresponding index will NOT be updated any more.

However, some indexes need to be updated frequently, like records in
relational databases. These indexes are not as large as the crawled data, and
the frequently updated index will NOT be scaled out to many other nodes; most
of the time it lives on a very limited number of machines.

In this case, can I still use Lucene indexes? Or do I need to replace them
with relational databases?

Thanks so much!
LB


Re: When Index is Updated Frequently

2011-03-04 Thread Bing Li
Dear Michael,

Thanks so much for your answer!

I have a question. Even if Lucene is good at updating, frequent updates still
put more load on the Solr cluster. So in my system, I will leave the large
amount of crawled data unchanged forever and use a traditional database to
keep the mutable data.

Fortunately, in most Internet systems, the amount of mutable data is much
smaller than the amount of immutable data.

What do you think of my solution?

Best,
LB

On Sat, Mar 5, 2011 at 2:45 AM, Michael McCandless <
luc...@mikemccandless.com> wrote:

> On Fri, Mar 4, 2011 at 10:09 AM, Bing Li  wrote:
>
> > According to my experiences, when the Lucene index updated frequently,
> its
> > performance must become low. Is it correct?
>
> In fact Lucene can gracefully handle a high rate of updates with low
> latency turnaround on the readers, using the near-real-time (NRT) API
> -- IndexWriter.getReader() (or in soon-to-be 3.1,
> IndexReader.open(IndexWriter)).
>
> NRT is really something a hybrid of "eventual consistency" and
> "immediate consistency", because it lets your app have full control
> over how quickly changes must be visible by controlling when you
> pull a new NRT reader.
>
> That said, Lucene can't offer true immediate consistency at a high
> update rate -- the time to open a new NRT reader is usually too costly
> to do, eg, for every search.  But eg every 100 msec (say) is
> reasonable (depending on many variables...).
>
> So... for your app you should run some tests and see.  And please
> report back.
>
> (But, unfortunately, NRT hasn't been exposed in Solr yet...).
>
> --
> Mike
>
> http://blog.mikemccandless.com
>
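For reference, the NRT API Mike describes above, in rough Lucene 3.x terms; the existing writer, the document, and the field name are assumptions:

// Sketch of near-real-time (NRT) search: the reader is obtained from the writer,
// so newly added documents become searchable without a full commit/reopen cycle.
// Assumes an IndexWriter named writer is already open on the index.
writer.addDocument(doc);
IndexReader reader = writer.getReader(); // NRT reader; sees the document added above
IndexSearcher searcher = new IndexSearcher(reader);
TopDocs hits = searcher.search(new TermQuery(new Term("content", "keyword")), 10);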