Multilingual search in multicore Solr
Hi all,

I am going to implement multilingual search in a multicore Solr setup. Specifically, the Solr server is designed like this: I have several cores corresponding to different languages, where each core has its own configuration files and data. I have the following questions:

1. While indexing a document, I use the ExtractingRequestHandler with Tika 0.10 (embedded in Solr 3.5.0), and I get a field "language_s" after indexing. Is it possible to get the value of "language_s" before indexing happens, so that I can put the document into the corresponding core?

2. When searching with a query, is it possible to use language detection to determine the language code of the query, so that I can direct the query to the corresponding core?

Thanks for your suggestions.

Note: in this thread I would like to stick with multicore Solr and see whether these problems can be solved there. Meanwhile, I am aware that multilingual search does not necessarily need multicore Solr, which I learned in a previous thread:
http://lucene.472066.n3.nabble.com/Tika0-10-language-identifier-in-Solr3-5-0-tt3671712.html#none
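A minimal sketch of question 1 from the client side, assuming a SolrJ client, one core per language named core_<code>, and that Tika's LanguageIdentifier covers the languages involved (its profile set in 0.10 is limited). The class name, core layout, and field names are hypothetical:

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;
import org.apache.tika.language.LanguageIdentifier;

public class LanguageRouter {
    private static final String BASE_URL = "http://localhost:8983/solr/";

    // Detect the language of the extracted text first, then index the
    // document into the core that matches the detected language code.
    public void index(String id, String text) throws Exception {
        String lang = new LanguageIdentifier(text).getLanguage(); // e.g. "en"
        SolrServer server = new CommonsHttpSolrServer(BASE_URL + "core_" + lang);
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", id);
        doc.addField("text", text);
        server.add(doc);
        server.commit();
    }
}

The same detection call could be applied to the query string for question 2, with the caveat raised later in this thread: a few query words are often too short for reliable detection.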
Language-specific fields of "text"
Hi all,

In this thread I would like to ask some technical questions about how the schema is defined to achieve language-specific "text" fields. Currently I have the field "text" defined as follows:

<field name="text" type="text_general" indexed="true" stored="true" multiValued="true"/>

After indexing a document, I can see the field extracted correctly in the document. My first attempt is to add a field named "text_en", defined exactly the same way as "text":

<field name="text_en" type="text_general" indexed="true" stored="true" multiValued="true"/>

However, after indexing the same document, why can I not see that field extracted? Is it because "text" is a reserved field that cannot be changed dynamically?
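For what it's worth, a plausible explanation and fix (an assumption, not confirmed in this thread): the extraction handler only writes into the fields it is configured to map to, so a newly declared "text_en" stays empty unless it is populated explicitly, e.g. with a copyField. A sketch, assuming the example schema's "text_en" field type is available:

<field name="text"    type="text_general" indexed="true" stored="true" multiValued="true"/>
<field name="text_en" type="text_en"      indexed="true" stored="true" multiValued="true"/>
<!-- populate text_en from text at index time -->
<copyField source="text" dest="text_en"/>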
Re: Multilingual search in multicore Solr
Hi Erick,

Your suggestions are sound.

For (1), if I use SolrJ as the client to access Solr, then the Java coding becomes the most challenging part. Technically, I want to achieve the same effect with highlighting, faceted search, language detection, etc. Do you know some example source code that I can refer to?

For (2), I agree with you on the difficulty of detecting the language from just a few words. Alternatively, I can suggest a set of results and let the users decide. You also mentioned scores. Say I have not so many cores, so for every query I direct it to all the cores and get back a set of scores. Is it safe to conclude that the highest score gives the most confident results?

Thanks.

Best Regards,
Ni Bing
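A sketch of the fan-out idea, with hypothetical core names; note that raw scores are computed against each core's own index statistics, so scores from different cores are generally not directly comparable:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class FanOutQuery {
    public static void main(String[] args) throws Exception {
        String[] cores = {"core_en", "core_fr", "core_zh"}; // hypothetical
        for (String core : cores) {
            SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr/" + core);
            SolrQuery query = new SolrQuery(args[0]);
            query.setIncludeScore(true); // ask each core to return scores
            QueryResponse rsp = server.query(query);
            System.out.println(core + ": " + rsp.getResults().getNumFound()
                    + " hits, maxScore=" + rsp.getResults().getMaxScore());
        }
    }
}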
Re: Language-specific fields of "text"
Hi Paul,

I understand your point about "text_en" missing from the document. That is the case: "text" exists, but "text_en" does not. But then the question arises: isn't it possible to dynamically add language-specific suffixes to an existing field "text"? I am new here. As far as I know, for some field "title", people can create "title_en" and "title_fr" to incorporate different analyzers in the same schema. Even so, I am not seeing it happen. Thus, I am wondering whether I have overlooked some obvious point.

"Bing" is very common in Chinese names, as there are several Chinese characters corresponding to the same pronunciation.

Thanks for the reply.

Best Regards,
Bing
Source code of post.jar in the example package of Solr
Hi all,

I am using the following jar to index files in XML format, and I want to look into its source code. Where can I find it? Thanks.

\apache-solr-3.5.0\example\exampledocs>java -jar post.jar *.xml

Best Regards,
Bing
Indexing content in XML files
Hi all,

I am investigating indexing XML files. Currently I have two findings:

1. Use DataImportHandler. This requires creating one more configuration file for DIH, data-config.xml, which defines the fields specifically for my XML files.

2. Use the example package that comes with Solr. This only requires defining the fields in the schema; no additional configuration file is needed:

\apache-solr-3.5.0\example\exampledocs>java -jar post.jar *.xml

I don't know whether I understand the two methods correctly, but it seems to me that they are completely different. If I want to index XML files with many self-defined fields, possibly with embedded fields, which one makes more sense? Thanks.

Best,
Bing
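If the DataImportHandler route is taken, data-config.xml could look like the sketch below; the file path, XPaths, and field names are hypothetical. Note also that post.jar only accepts Solr's own <add><doc> update format, which is why arbitrary XML needs DIH or a client:

<dataConfig>
  <dataSource type="FileDataSource" encoding="UTF-8"/>
  <document>
    <entity name="record"
            processor="XPathEntityProcessor"
            url="/path/to/docs/sample.xml"
            forEach="/records/record">
      <field column="id"    xpath="/records/record/id"/>
      <field column="title" xpath="/records/record/title"/>
    </entity>
  </document>
</dataConfig>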
Re: Source code of post.jar in the example package of Solr
Hi iorixxx,

Thanks.
Re: Multilingual search in multicore Solr
Hi Erick,

Thanks for your comment. Though I have some experience with Solr, I am completely a newbie with SolrJ and haven't tried using SolrJ to access Solr. For now I have a source package of Solr 3.5.0, and some SolrJ source code downloaded from the web that I want to build against Solr and try out. How would I build and run it? Where should I put the source in the package? Is an IDE a must for that? I cannot find many beginner tutorials about this, so I would be grateful for any suggestions and hints.

Best,
Bing
Re: Indexing content in XML files
Hi all,

Thanks for the comments. Then I will abandon post.jar and try to learn SolrJ instead.

Best,
Bing
Closed -- Re: Multilingual search in multicore Solr
Hi Erick,

Thanks for commenting on this thread; I think my problem has been solved. I might start another thread raising technical questions about using SolrJ. Thank you again.

Best Regards,
Bing
Failed to compile Java code (trying to use SolrJ with Solr)
Hi all,

I am trying to write Java code that uses SolrJ to access Solr, but I failed on the first attempt. I have some experience with Solr, but I am a newbie with SolrJ. Below is a description of what I have, what I set, what I did, and what I got. I would be grateful if anyone could offer suggestions and point out my mistakes.

What I have (the necessary tools installed):
1. Java 1.6.0_26
2. apache-tomcat-6.0.32
3. apache-solr-3.5.0
4. apache-solr-3.5.0-src
5. apache-maven-2.2.1
6. apache-ant-1.8.2

What I set:
1. Classpath: c:\apache-solr-3.5.0\apache-solr-3.5.0\dist
   The following jars, which might be needed, are in the directory indicated by the classpath:
   apache-solr-solrj-3.5.0.jar
   solrj-lib/commons-httpclient-3.1.jar
   solrj-lib/commons-codec-1.5.jar
2. pom.xml in C:\apache-solr-3.5.0-src\apache-solr-3.5.0\, adding the following dependency:

   <dependency>
     <groupId>org.apache.solr</groupId>
     <artifactId>solr-solrj</artifactId>
     <version>3.5.0</version>
   </dependency>

What I did: tried to compile MySolrJTest.java. The source is simple enough:

import org.apache.solr.client.solrj.SolrServer;

class MySolrjTest {
    public void query(String q) {
        CommonsHttpSolrServer server = null;
        try {
            server = new CommonsHttpSolrServer("http://localhost:8983/solr/");
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    public static void main(String[] args) {
        MySolrjTest solrj = new MySolrjTest();
        solrj.query(args[0]);
    }
}

What I got: when I compile the code with the following command, errors arise:

C:\apache-solr-3.5.0-src>javac MySolrjTest.java
MySolrjTest.java:1: package org.apache.solr.client.solrj does not exist
import org.apache.solr.client.solrj.SolrServer;
^
MySolrjTest.java:7: cannot find symbol
symbol  : class CommonsHttpSolrServer
location: class MySolrjTest
        CommonsHttpSolrServer server = null;
^
MySolrjTest.java:11: cannot find symbol
symbol  : class CommonsHttpSolrServer
location: class MySolrjTest
        server = new CommonsHttpSolrServer("http://localhost:8983/solr/");
^
3 errors

Best,
Bing
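A likely cause, judging from the listing above (an assumption): a bare directory on the classpath only contributes .class files, so the jars under dist\ are never seen; each jar must be named individually, or pulled in with the * wildcard (Java 6+). The code also uses CommonsHttpSolrServer without importing org.apache.solr.client.solrj.impl.CommonsHttpSolrServer. With the import added, a compile command along these lines should work:

javac -cp "c:\apache-solr-3.5.0\apache-solr-3.5.0\dist\*;c:\apache-solr-3.5.0\apache-solr-3.5.0\dist\solrj-lib\*" MySolrjTest.java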
Re: Failed to compile Java code (trying to use SolrJ with Solr)
Hi all,

Following the previous message: if I abandon my own code and try to build a project from the original apache-solr-3.5.0-src package, I fail again. Below is a description of the technical details; I hope someone can help point out my mistakes.

What I have: besides the tools mentioned above, I installed the NetBeans 7 IDE.

What I set: pom.xml in C:\apache-solr-3.5.0-src\apache-solr-3.5.0\, adding the following dependency:

<dependency>
  <groupId>org.apache.solr</groupId>
  <artifactId>solr-solrj</artifactId>
  <version>3.5.0</version>
</dependency>

What I did: opened the project by loading the original apache-solr-3.5.0-src package and tried to build it in NetBeans.

What I got: the following is part of the output:

BUILD FAILURE
Total time: 5:39.460s
Finished at: Thu Feb 02 11:00:45 CST 2012
Final Memory: 28M/129M
Failed to execute goal org.apache.maven.plugins:maven-surefire-plugin:2.10:test (default-test) on project lucene-core: There are test failures.
Please refer to C:\apache-solr-3.5.0-src\apache-solr-3.5.0\lucene\build\surefire-reports for the individual test results. -> [Help 1]
To see the full stack trace of the errors, re-run Maven with the -e switch.
Re-run Maven using the -X switch to enable full debug logging.
For more information about the errors and possible solutions, please read the following articles:
[Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoFailureException
After correcting the problems, you can resume the build with the command mvn <goals> -rf :lucene-core
'cmd' is not recognized as an internal or external command, operable program or batch file.
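One hedged suggestion: the failure above is the surefire test phase on lucene-core, not compilation, so if the goal is only to produce the jars, the tests can be skipped:

mvn -DskipTests install

(-DskipTests still compiles the tests without running them; -Dmaven.test.skip=true would skip compiling them as well.)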
Development inside or outside of Solr?
Hi all,

I am deploying a multicore Solr server running on Tomcat, where I want to achieve language detection during index/query. Solr 3.5.0 wraps a Tika API that can do language detection. Currently, the default behavior of Solr 3.5.0 is that every time I index a document, Solr calls the Tika API to produce the language detection result, i.e., indexing and detection happen at the same time. However, I would like to have the language detection result first, and then decide which core to put the document into, i.e., detection happens before indexing. It seems that I need to do development in one of the following ways:

1. Revise Solr itself, changing its default behavior;
2. Write a Java client outside Solr, and call the client from the server (JSP maybe) during index/query.

Can anyone who has met similar conditions give some suggestions about the advantages and disadvantages of the two approaches? Any other alternatives? Thank you.

Best,
Bing
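For approach 1, the usual hook inside Solr is a custom update request processor, which sees every document before it is indexed. The skeleton below is a sketch only; actually re-routing a document to a different core is not shown, and the class name is hypothetical:

import java.io.IOException;

import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.response.SolrQueryResponse;
import org.apache.solr.update.AddUpdateCommand;
import org.apache.solr.update.processor.UpdateRequestProcessor;
import org.apache.solr.update.processor.UpdateRequestProcessorFactory;

public class DetectBeforeIndexFactory extends UpdateRequestProcessorFactory {
    @Override
    public UpdateRequestProcessor getInstance(SolrQueryRequest req,
            SolrQueryResponse rsp, UpdateRequestProcessor next) {
        return new UpdateRequestProcessor(next) {
            @Override
            public void processAdd(AddUpdateCommand cmd) throws IOException {
                SolrInputDocument doc = cmd.getSolrInputDocument();
                // Run language detection on the document's text here, then
                // decide whether to continue down this core's chain.
                super.processAdd(cmd);
            }
        };
    }
}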
Re: Development inside or outside of Solr?
I have looked into the TikaCLI with the -language option, and learned that Tika can output the language metadata only. It cannot help me solve my problem, though, as my main concern is whether to change Solr or not. Thank you all the same.
Re: Development inside or outside of Solr?
Hi François Schiettecatte,

Thank you for the reply all the same, but I choose to stick with Solr (wrapping the Tika language API) and make changes outside Solr.

Best Regards,
Bing
Re: Development inside or outside of Solr?
Hi Erick,

The example is impressive. Thank you.

On the first point, we decided not to do that, as Tika extraction is the time-consuming part of indexing large files, and a dual call makes the situation worse.

On the second, for now we choose DSpace to connect to the DB, and Discovery (Solr) for index/query. Thus we might make the revisions in DSpace.

Best Regards,
Bing
TikaLanguageIdentifierUpdateProcessorFactory (since Solr 3.5.0) to be used in Solr 3.3.0?
Hi all,

I am using org.apache.solr.update.processor.TikaLanguageIdentifierUpdateProcessorFactory (since Solr 3.5.0) to do language detection, and it's cool.

An issue: if I deploy Solr 3.3.0, is it possible to import that factory from Solr 3.5.0 and use it in Solr 3.3.0? The reason I stick with Solr 3.3.0 is that I am working on DSpace (Discovery) to call Solr, and for now the highest version Solr can be upgraded to there is 3.3.0. I would like to do this while keeping DSpace + Solr intact as much as possible. So: is it possible to import that factory into Solr 3.3.0? Does anyone happen to know a way to solve this?

Best Regards,
Bing
Re: How to increase size of document in Solr
Hi Suneel,

There is a configuration in solrconfig.xml that you might need to look at. Below is how I set the limit to 2 GB.

Best Regards,
Bing
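The configuration snippet did not survive the archive. A likely candidate, given the 2 GB figure (2048000 KB), is the multipart upload limit in solrconfig.xml; this is an assumption about what the original post showed:

<requestDispatcher handleSelect="true">
  <!-- multipartUploadLimitInKB is in kilobytes: 2048000 KB = 2 GB -->
  <requestParsers enableRemoteStreaming="true" multipartUploadLimitInKB="2048000"/>
</requestDispatcher>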
Re: Failed to compile Java code (trying to use SolrJ with Solr)
Hi Dmitry,

Thank you. That solved my problem.

Best Regards,
Bing
Re: upgrading to Tika 0.9 on Solr 1.4.1
Hi all,

I tried to upgrade Tika 0.8 to Tika 0.10 on Solr 3.3.0, following similar steps, but failed.

1. Replace the following jars in /contrib/extraction/: fontbox-1.6.0, jempbox-1.6.0, pdfbox-1.6.0, tika-core-0.10, tika-parsers-0.10;
2. Copy all the jars in /contrib/langid/* from Solr 3.5.0;
3. Copy /dist/apache-solr-langid-3.5.0 from Solr 3.5.0;
4. Configure solrconfig.xml in Solr 3.3.0, adding the lib directives for the langid jars and the following updateRequestProcessorChain definition:

<updateRequestProcessorChain name="langid">
  <processor class="org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessorFactory">
    <lst name="defaults">
      <str name="langid.fl">text,title,author</str>
      <str name="langid.langField">language_s</str>
      <str name="langid.fallback">en</str>
    </lst>
  </processor>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>

Errors (typical errors when a factory is not found):

org.apache.solr.common.SolrException: Error loading class 'org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessorFactory'
        at org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:389)
        at

Has anyone tried similar things before? Please advise. Thank you.

Best Regards,
Bing
Failed to upgrade Tika 0.8 to Tika 0.10 in Solr 3.3.0
Hi all,

I tried to upgrade Tika 0.8 to Tika 0.10 on Solr 3.3.0, but failed. Below are some technical details. Has anyone tried similar things before? Please advise. Thank you.

1. Replace the following jars in /contrib/extraction/: fontbox-1.6.0, jempbox-1.6.0, pdfbox-1.6.0, tika-core-0.10, tika-parsers-0.10;
2. Copy all the jars in /contrib/langid/* from Solr 3.5.0;
3. Copy /dist/apache-solr-langid-3.5.0 from Solr 3.5.0;
4. Configure solrconfig.xml in Solr 3.3.0, adding the lib directives for the langid jars and the following updateRequestProcessorChain definition:

<updateRequestProcessorChain name="langid">
  <processor class="org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessorFactory">
    <lst name="defaults">
      <str name="langid.fl">text,title,author</str>
      <str name="langid.langField">language_s</str>
      <str name="langid.fallback">en</str>
    </lst>
  </processor>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>

Errors (typical errors when a factory is not found):

org.apache.solr.common.SolrException: Error loading class 'org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessorFactory'
        at org.apache.solr.core.SolrResourceLoader.findClass(SolrResourceLoader.java:389)
        at

Best Regards,
Bing
Re: TikaLanguageIdentifierUpdateProcessorFactory (since Solr 3.5.0) to be used in Solr 3.3.0?
Hi Erick,

My idea is to use Tika 0.10 in DSpace 1.7.2, which rests on two steps:

1. Upgrade Solr 1.4.1 to Solr 3.3.0 in DSpace 1.7.2. In the following link, the upgrade to Solr & Lucene 3.3.0 has been resolved:
https://jira.duraspace.org/browse/DS-980

2. Upgrade to Tika 0.10 in Solr 3.3.0. In the following link, people have tried to upgrade Tika 0.8 to Tika 0.9:
http://lucene.472066.n3.nabble.com/upgrading-to-Tika-0-9-on-Solr-1-4-1-td2570526.html

I was thinking that if both of the above steps can be achieved, then maybe I can get it done. What is your suggestion? Thank you.

Best Regards,
Bing
How to define a multivalued string type "langid.langsField" in solrconfig.xml
Hi all,

I am using Tika language detection. It is said that if "langid.langsField" is set to a multivalued string field, then a list of languages can be stored for the fields specified in "langid.fl". Below is how I configure the processor in solrconfig.xml:

<processor class="org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessorFactory">
  <lst name="defaults">
    <str name="langid.fl">text,attr_stream_name</str>
    <str name="langid.langsField">language_s</str>
    <!-- one boolean option, set to true, was lost from the archived post -->
  </lst>
</processor>

When I tried using "text" only, the detected result is language_s="zh_tw"; for "attr_stream_name", the result is language_s="en". I was expecting that when adding both "text" and "attr_stream_name", the result would look like language_s="en,zh_tw". However, I failed to see that result.

I would be grateful if anyone could point out my mistake or give some hints on how to do this correctly. Thank you.

Best Regards,
Bing
Re: TikaLanguageIdentifierUpdateProcessorFactory (since Solr 3.5.0) to be used in Solr 3.3.0?
Hi Erick,

I can write a SolrJ client to call Tika, but I am not certain where to invoke the client. In my case, I work on DSpace calling Solr, and I suppose the client should be invoked in between DSpace and Solr. That is, DSpace invokes the SolrJ client when doing index/query, and the client calls Tika and Solr. Do you think this is reasonable?

Best Regards,
Bing
Re: TikaLanguageIdentifierUpdateProcessorFactory (since Solr 3.5.0) to be used in Solr 3.3.0?
Hi Erick,

I get your point. Thank you so much.

Best Regards,
Bing
Can solr-langid (Solr 3.5.0) detect multiple languages in one text?
Hi all,

I am using solr-langid (Solr 3.5.0) to do language detection, and I hope multiple languages within one text can be detected. The example text is:

咖哩起源於印度。印度民間傳說咖哩是佛祖釋迦牟尼所創，由於咖哩的辛辣與香味可以幫助遮掩羊肉的腥騷，此舉即為用以幫助不吃豬肉與牛肉的印度人。在泰米爾語中，「kari」是「醬」的意思。在馬來西亞，kari也稱dal(當在mamak檔)。早期印度被蒙古人所建立的莫臥兒帝國(Mughal Empire)所統治過，其間從波斯(現今的伊朗)帶來的飲食習慣，從而影響印度人的烹調風格直到現今。

Curry (plural, Curries) is a generic term primarily employed in Western culture to denote a wide variety of dishes originating in Indian, Pakistani, Bangladeshi, Sri Lankan, Thai or other Southeast Asian cuisines. Their common feature is the incorporation of more or less complex combinations of spices and herbs, usually (but not invariably) including fresh or dried hot capsicum peppers, commonly called "chili" or "cayenne" peppers.

I want the text to be separated into two parts, with the Chinese part going to "text_zh-tw" and the other to "text_en". Can I do something like that? Thank you.

Best Regards,
Bing
Re: Can solr-langid (Solr 3.5.0) detect multiple languages in one text?
Hi Jan Høydahl,

I forgot to mention: the identifier I use is an existing one wrapped in Solr 3.5.0, LangDetectLanguageIdentifier (http://wiki.apache.org/solr/LanguageDetection).

I looked into the source of the language identifier and found that the whole content of a text is parsed before detection, which is why the end result is a single language instead of multiple languages. I can then assume that if the content were processed section by section (or even line by line), the end result would consist of multiple languages. So the question is: could you plug this modification of the existing identifier into Solr?

Best Regards,
Bing
Re: Can solr-langid (Solr 3.5.0) detect multiple languages in one text?
Hi Tanguy,

> For the other implementation (http://code.google.com/p/language-detection/), it seems to be performing a first pass on the input, and tries to separate Latin characters from the others. If there are more non-Latin characters than Latin ones, then it will process the non-Latin characters only for language detection. Oddly, the other way around, non-Latin characters are not stripped from the input if there are more Latin characters than non-Latin ones...

The example case does simplify, but it simulates the normal conditions I need to handle: normally the task is to detect non-Latin languages, and mostly to separate Western and Eastern languages.

> Anyway, LangDetect's implementation ends up with a list of probabilities, and only the most accurate one is kept by Solr's langdetect processor, if the probability satisfies a certain threshold.

Yes, I agree with you on "a list of probabilities", and I think if those probabilities were all returned, then my problem would be partially solved.

> In this very particular case, something simple, based on Unicode ranges, could be used to provide hints on how to chunk the input, because we need to split Western and Eastern languages, both written in well-isolated Unicode character ranges. Using this, the language identifier could be fed with chunks that are (presumably) mostly made of one language only, and we could have different language identifications for each distinct chunk.

Intelligent chunk partitioning might be a different and comprehensive task. Is it possible for the text to be processed line by line (or in groups of lines)? If the detected language changes between two consecutive lines (or groups of lines), it indicates a different language range.

Thank you for the thoughtful comments.

Best Regards,
Bing
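A sketch of the idea discussed above: split the input into runs of CJK vs. non-CJK characters using Unicode blocks, then feed each run to a language identifier separately. This is an assumption-laden toy; digits, punctuation, and mixed scripts would need more care:

public class ScriptChunker {

    static boolean isCjk(int codePoint) {
        Character.UnicodeBlock block = Character.UnicodeBlock.of(codePoint);
        return block == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS
            || block == Character.UnicodeBlock.CJK_SYMBOLS_AND_PUNCTUATION
            || block == Character.UnicodeBlock.HALFWIDTH_AND_FULLWIDTH_FORMS;
    }

    public static void main(String[] args) {
        String text = args[0];
        StringBuilder run = new StringBuilder();
        boolean runIsCjk = false;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            boolean cjk = isCjk(cp);
            if (run.length() > 0 && cjk != runIsCjk) {
                // Hand the finished run to the language identifier of your choice.
                System.out.println((runIsCjk ? "CJK: " : "non-CJK: ") + run);
                run.setLength(0);
            }
            runIsCjk = cjk;
            run.appendCodePoint(cp);
            i += Character.charCount(cp);
        }
        if (run.length() > 0) {
            System.out.println((runIsCjk ? "CJK: " : "non-CJK: ") + run);
        }
    }
}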
Can I use Field Aliasing/Renaming on Solr 3.3?
Hi all,

I am working with Solr 3.3. Recently I found a new feature (Field Aliasing/Renaming) in Solr 3.6, and I want to use it in Solr 3.3. Can I do that, and how? Thank you.

Best Regards,
Bing
Solr/Lucene Faceted Search: Too Many Unique Values?
Hi,

I am going to evaluate some Lucene/Solr capabilities for handling faceted queries, in particular with a single facet field that contains a large number (say up to 1 million) of distinct values. Does anyone have experience with how Lucene performs in this scenario? E.g.:

Doc1 has tags A B C D
Doc2 has tags B C D E

and so on: millions of docs, and there can be millions of distinct tag values.

Thanks
Sort for Retrieved Data
Dear all,

I have a question about sorting data retrieved from Solr.

As I know, Lucene retrieves data according to the degree of keyword matching on a text field (partial matching). If I search data by a string field (complete matching), how does Lucene sort the retrieved data?

If I add some filters, such as time, what happens to the sorting then? If I only need the top results, is it proper to just limit rows?

If I want to add new sorting ways, how do I do that?

Thanks so much!
Bing
How to Sort By a PageRank-Like Complicated Strategy?
Dear all,

I am using SolrJ to implement a system that needs to provide users with search services. I have some questions about Solr searching, as follows.

As I know, Lucene retrieves data according to the degree of keyword matching on a text field (partial matching). But if I search data by a string field (complete matching), how does Lucene sort the retrieved data?

If I want to add new sorting ways, Solr's function queries seem to support this feature. However, for a complicated ranking strategy such as PageRank, can Solr provide an interface for me to do that?

My ranking methods are more complicated than PageRank. For now I have to load all of the matched data from Solr by keyword first and re-rank it my own way before showing it to users. Is that correct?

Thanks so much!
Bing
Re: How to Sort By a PageRank-Like Complicated Strategy?
Hi Kai,

Thanks so much for your reply!

If the retrieval is done on a string field, not a text field, a complete-matching approach should be used according to my understanding, right? If so, how does Lucene rank the retrieved data?

Best regards,
Bing

On Sun, Jan 22, 2012 at 5:56 AM, Kai Lu wrote:
> Solr is kind of the retrieval step; you can customize the score formula in Lucene. But it is supposed not to be too complicated, i.e., it is better if it can be factorized. It also relates to the stored information, like TF, DF, position, etc. You can do a second-phase rerank of the top N data you have got.
Re: How to Sort By a PageRank-Like Complicated Strategy?
Dear Shashi,

Thanks so much for your reply!

However, I think the PageRank value is not static; it must be updated on the fly. As I know, a Lucene index is not suitable for very frequent updates. If so, how do I deal with that?

Best regards,
Bing

On Sun, Jan 22, 2012 at 12:43 PM, Shashi Kant wrote:
> Lucene has a mechanism to "boost" documents up/down using your custom ranking algorithm. So if you come up with something like PageRank, you might do something like doc.setBoost(myboost) before writing to the index.
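For reference, doc.setBoost() as suggested is the Lucene 3.x index-time document boost; a minimal sketch, where myPageRank is a hypothetical externally computed score:

import java.io.File;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class BoostExample {
    public static void main(String[] args) throws Exception {
        IndexWriter writer = new IndexWriter(
                FSDirectory.open(new File("index")),
                new IndexWriterConfig(Version.LUCENE_35, new StandardAnalyzer(Version.LUCENE_35)));
        Document doc = new Document();
        doc.add(new Field("id", "1", Field.Store.YES, Field.Index.NOT_ANALYZED));
        doc.add(new Field("text", "some content", Field.Store.YES, Field.Index.ANALYZED));
        float myPageRank = 2.5f;  // hypothetical externally computed score
        doc.setBoost(myPageRank); // folded into the score of every match on this doc
        writer.addDocument(doc);
        writer.close();
    }
}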
SolrCell takes InputStream
Hi,

I am using:

ContentStreamUpdateRequest up = new ContentStreamUpdateRequest("/update/extract");

The two ways of adding a file are:

up.addFile(File)
up.addContentStream(ContentStream)

However, my raw files are stored on remote storage devices. I am able to get an InputStream object for the file to be indexed, and it seems awkward to store the file locally as a temporary copy. Is there a way of passing the InputStream in directly (e.g., constructing a ContentStream from the InputStream)?

Thanks.
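One way to avoid the temporary file (a sketch, not an official recipe): ContentStream is an interface, and SolrJ ships the abstract helper ContentStreamBase, so a thin wrapper around the existing InputStream can be passed to addContentStream():

import java.io.InputStream;

import org.apache.solr.common.util.ContentStreamBase;

public class InputStreamContentStream extends ContentStreamBase {
    private final InputStream in;

    public InputStreamContentStream(InputStream in, String contentType) {
        this.in = in;
        setContentType(contentType);
    }

    @Override
    public InputStream getStream() {
        return in; // consumed by the request; the caller should not reuse it
    }
}

Usage would then be along the lines of (remoteStream being the InputStream obtained from the storage device):

up.addContentStream(new InputStreamContentStream(remoteStream, "application/pdf"));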
Search match all tokens in Query Text
Hello,

I have a field "text" of type text_general here. When I query for text:a b, Solr returns results that contain only a but not b. That is, it uses the OR operator between the two tokens. Am I right? What should I do to force an AND operator between the two tokens?

Thanks
Re: Search match all tokens in Query Text
Thanks for the quick reply. You seem to be suggesting adding an explicit AND operator, which I don't think solves my problem. I found a setting somewhere, and it works.
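The setting referenced above did not survive the archive. The usual candidates (assumptions, since the original snippet is lost) are the schema-wide default operator or the q.op request parameter:

<!-- schema.xml: make all parsed queries use AND by default -->
<solrQueryParser defaultOperator="AND"/>

...or per request: q=text:(a b)&q.op=AND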
How to define a lowercase fieldtype without tokenizer
Hi,

I don't want the field to be tokenized, because Solr doesn't support sorting on a tokenized field. In order to do case-insensitive sorting, I need to copy a field to a lowercased but not tokenized field. How do I define this? I tried defining such a field type, but it says I need to specify a tokenizer or a class for the analyzer.

Thanks!
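The standard pattern for this (a sketch; the type and field names are arbitrary) is a TextField whose analyzer uses KeywordTokenizerFactory, which emits the entire input as a single token, followed by a lowercase filter; this is what the follow-up below refers to:

<fieldType name="lowercase_sort" class="solr.TextField" sortMissingLast="true" omitNorms="true">
  <analyzer>
    <!-- KeywordTokenizer does no tokenizing: the whole input stays one token -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.TrimFilterFactory"/>
  </analyzer>
</fieldType>

<field name="title_sort" type="lowercase_sort" indexed="true" stored="false"/>
<copyField source="title" dest="title_sort"/>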
Re: How to define a lowercase fieldtype without tokenizer
Works perfectly. Thank you. I didn't know before that this tokenizer does nothing :)
Re: How to Sort By a PageRank-Like Complicated Strategy?
Dear Shashi,

As I have learned, big data such as a Lucene index is not suitable for frequent updates. Frequent updating must affect performance and consistency when the Lucene index is replicated across a large-scale cluster. Such a search engine is expected to work in a write-once & read-many environment, right? That is what HDFS (Hadoop Distributed File System) provides. In my experience, updating a Lucene index is really slow. Why did you say I could update a Lucene index frequently?

Thanks so much!
Bing

On Mon, Jan 23, 2012 at 11:02 PM, Shashi Kant wrote:
> You can update documents in the index quite frequently. I don't know what your requirement is; another option would be boosting at query time.
How is Data Indexed in HBase?
Dear all,

I wonder how data in HBase is indexed? Solr is used in my system now because the data is managed in an inverted index. Such an index is suitable for retrieving unstructured and huge amounts of data. How does HBase deal with this issue? May I replace Solr with HBase?

Thanks so much!

Best regards,
Bing
Re: Solr & HBase - Re: How is Data Indexed in HBase?
Mr Gupta,

Thanks so much for your reply!

Retrieving data by keyword is one of my use cases, and I think Solr is a proper choice for that. However, Solr does not provide complex enough support for ranking, and frequent updating is not suitable in Solr either, so it is difficult to retrieve data based on values other than keyword frequency in text. For that case I attempt to use HBase. But I don't know how HBase supports high performance when it needs to keep consistency in a large-scale distributed system.

Now both of them are used in my system. I will check out ElasticSearch.

Best regards,
Bing

On Thu, Feb 23, 2012 at 1:35 AM, T Vinod Gupta wrote:
> Bing,
> It's a classic battle on whether to use Solr or HBase or a combination of both. The two systems are very different, but there is some overlap in utility; they also differ vastly in computation power, storage needs, etc. So in the end it all boils down to your use case: you need to pick the technology best suited to your needs. I'm still not clear on your use case, though.
>
> By the way, if you haven't started using Solr yet, you might want to check out ElasticSearch. I spent over a week researching Solr vs. ES and eventually chose ES due to its cool merits.
>
> On Wed, Feb 22, 2012 at 9:31 AM, Ted Yu wrote:
>> There is no secondary index support in HBase at the moment. It's on our road map.
>>
>> On Wed, Feb 22, 2012 at 9:28 AM, Bing Li wrote:
>>> Jacques,
>>>
>>> Yes. But I still have questions about that.
>>>
>>> In my system, when users search with a keyword arbitrarily, the query is forwarded to Solr. There are no updating operations, only appending of new indexes, in the Solr-managed data. When I need to retrieve data based on ranking values, HBase is used, and the ranking values need to be updated all the time. Is that correct?
>>>
>>> My question is that the performance must be low if consistency is kept in a large-scale distributed environment. How does HBase handle this issue?
>>>
>>> On Thu, Feb 23, 2012 at 1:17 AM, Jacques wrote:
>>>> It is highly unlikely that you could replace Solr with HBase. They're really apples and oranges.
Re: Solr & HBase - Re: How is Data Indexed in HBase?
Dear Mr Gupta,

Your understanding of my solution is correct. Both HBase and Solr are now used in my system, and I hope it will work.

Thanks so much for your reply!

Best regards,
Bing

On Fri, Feb 24, 2012 at 3:30 AM, T Vinod Gupta wrote:
> Regarding your question on HBase support for high performance and consistency: I would say HBase is highly scalable and performant. How it does what it does can be understood by reading the relevant chapters on architecture and design in the HBase book.
>
> With regards to ranking, I see your problem. But if you split the problem into an HBase-specific solution and a Solr-based solution, you can probably achieve the results. Maybe you do the ranking, store the rank in HBase, and then use Solr to get the results and HBase as a lookup to get the rank. Or you can put the rank into the document schema and index it too, for range queries and such. Is my understanding of your scenario wrong?
>
> Thanks
Re: pagerank??
To my knowledge, Solr cannot support this. In my case, I get data from Solr by keyword matching and then rank the data by PageRank afterwards.

Thanks,
Bing

On Wed, Apr 4, 2012 at 6:37 AM, Manuel Antonio Novoa Proenza <mano...@estudiantes.uci.cu> wrote:
> Hello,
>
> I have many indexed documents in my Solr index.
>
> Let me know any way or efficient function to calculate the PageRank of the indexed websites.
How to Transmit and Append Indexes
Hi all,

I am working on a distributed search system. Right now I have only one server. It has to crawl pages from the Web, generate indexes locally, and respond to users' queries. I think it is too busy to work smoothly.

I plan to use at least two servers. The jobs of crawling pages and generating indexes are done by one of them. After that, the newly available indexes should be transmitted to the other one, which is responsible for responding to users' queries. From the users' point of view, this system must be fast. However, I don't know how I can get the incremental indexes to transmit, and after transmission, how to append them to the old indexes. Does the appending block searching?

Thanks so much for your help!

Bing Li
Is it fine to transmit indexes in this way?
Hi all,

Since I haven't found that Lucene exposes index updates to us, may I transmit indexes in the following way?

1) One indexing machine, A, is busy generating indexes;
2) After a certain time, the indexing process is terminated;
3) Then the new indexes are transmitted to the machines that serve users' queries;
4) It is possible that some index files have the same names, so the conflicting files should be renamed;
5) After the transmission is done, the transmitted indexes are removed from A;
6) After the removal, the indexing process is started again on A.

The reason I am trying to do this is to balance the search load: one machine is responsible for generating indexes, and the others are responsible for responding to queries.

If the above approach does not work, can I see the updates of indexes in Lucene? May I transmit them? And may I append them to existing indexes? Does the appending affect querying?

I am learning Solr. It seems that Solr does this for me, but I would have to set up Tomcat to use Solr, which I think is a little bit heavy.

Thanks!
Bing Li
Re: Is it fine to transmit indexes in this way?
Thanks so much, Gora!

> What do you mean by appending? If you mean adding to an existing index (on reindexing, this would normally mean an update for an existing Solr document ID, and a create for a new Solr document ID), the best way probably is not to delete the index on the master server (what you call machine A). Once the indexing is completed, a commit ensures that new documents show up for any subsequent queries.

When updates are replicated to the slave servers, I suppose the updates are merged with the existing indexes, and reads on them can be done concurrently, so queries are responded to instantly. That is what I mean by "appending". Does that happen in Solr?

Best,
Bing

On Sat, Nov 20, 2010 at 1:58 AM, Gora Mohanty wrote:
> On Fri, Nov 19, 2010 at 10:53 PM, Bing Li wrote:
> > 3) Then, the new indexes are transmitted to machines which serve users' queries;
>
> Just replied to a similar question in another thread. The best way is probably to use Solr replication:
> http://wiki.apache.org/solr/SolrReplication
>
> You can set up replication to happen automatically upon commit on the master server (where the new index was made). As a commit should have been made when indexing is complete on the master server, this will then ensure that a new index is replicated on the slave server.
>
> > 4) It is possible that some index files have the same names. So the conflicting files should be renamed;
>
> Replication will handle this for you.
>
> > 5) After the transmission is done, the transmitted indexes are removed from A.
> > 6) After the removal, the indexing process is started again on A.
>
> These two items you have to do manually, i.e., delete all documents on A, and restart the indexing.
>
> Regards,
> Gora
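For reference, the replication Gora describes is configured in solrconfig.xml on both sides; a minimal sketch following the SolrReplication wiki pattern (the host name and poll interval are placeholders):

On the master:

<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="master">
    <str name="replicateAfter">commit</str>
    <str name="confFiles">schema.xml,stopwords.txt</str>
  </lst>
</requestHandler>

On each slave:

<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="slave">
    <str name="masterUrl">http://master-host:8983/solr/replication</str>
    <str name="pollInterval">00:00:60</str>
  </lst>
</requestHandler>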
Re: How to Transmit and Append Indexes
Dear Erick,

Thanks so much for your help! I am new to Solr, so I have no idea about the version. But I wonder: what are the differences between Solr and Hadoop? It seems that Solr does the same as what Hadoop promises.

Best,
Bing

On Sat, Nov 20, 2010 at 2:28 AM, Erick Erickson wrote:
> You haven't said what version of Solr you're using, but you're asking about replication, which is built in.
> See: http://wiki.apache.org/solr/SolrReplication
>
> And no, your slave doesn't block while the update is happening, and it automatically switches to the updated index upon successful replication.
>
> Older versions of Solr used rsync, etc.
>
> Best,
> Erick
Re: How to Transmit and Append Indexes
Hi Gora,

No, I really wonder whether Solr is based on Hadoop. Hadoop is efficient for search engines since it suits the write-once-read-many model. After reading your emails, it looks like Solr's distributed index handling does the same thing. Both of them are good for searching large indexes in a large-scale distributed environment, right?

Thanks!
Bing

On Sat, Nov 20, 2010 at 3:01 AM, Gora Mohanty wrote:
> The solr/admin/registry.jsp URL on your local Solr installation should show you the version at the top.
>
> Er, what? Solr and Hadoop are entirely different applications. Did you mean Lucene or Nutch, instead of Hadoop?
>
> Regards,
> Gora
Import Data Into Solr
Hi all,

I am a new user of Solr. Before using it, all of my data was indexed by myself with Lucene. According to Chapter 3 of the book Solr 1.4 Enterprise Search Server by David Smiley and Eric Pugh, data in formats such as XML, CSV, and even PDF can be imported into Solr. If I wish to import existing Lucene indexes into Solr, do I have any other approaches? I know that Solr is a serverized Lucene.

Thanks,
Bing Li
Solr Got Exceptions When "schema.xml" is Changed
Dear all,

I am a new user of Solr, and I am just trying some basic samples. Solr starts correctly with Tomcat. However, when I put a new schema.xml under SolrHome/conf and start Tomcat again, I get the following two exceptions. Solr cannot be started correctly unless I use the initial schema.xml from Solr. Why can't I change the schema.xml?

Thanks so much!
Bing

Dec 5, 2010 4:52:49 AM org.apache.solr.common.SolrException log
SEVERE: java.lang.NullPointerException
        at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:173)
        at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
        at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
        at org.apache.solr.core.QuerySenderListener.newSearcher(QuerySenderListener.java:52)
        at org.apache.solr.core.SolrCore$3.call(SolrCore.java:1146)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:662)
-
SEVERE: Could not start SOLR. Check solr/home property
org.apache.solr.common.SolrException: QueryElevationComponent requires the schema to have a uniqueKeyField implemented using StrField
        at org.apache.solr.handler.component.QueryElevationComponent.inform(QueryElevationComponent.java:157)
        at org.apache.solr.core.SolrResourceLoader.inform(SolrResourceLoader.java:508)
        at org.apache.solr.core.SolrCore.<init>(SolrCore.java:588)
        at org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:137)
        at org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:83)
        at org.apache.catalina.core.ApplicationFilterConfig.initFilter(ApplicationFilterConfig.java:273)
        at org.apache.catalina.core.ApplicationFilterConfig.getFilter(ApplicationFilterConfig.java:254)
        at org.apache.catalina.core.ApplicationFilterConfig.setFilterDef(ApplicationFilterConfig.java:372)
        at org.apache.catalina.core.ApplicationFilterConfig.<init>(ApplicationFilterConfig.java:98)
        at org.apache.catalina.core.StandardContext.filterStart(StandardContext.java:4405)
        at org.apache.catalina.core.StandardContext.startInternal(StandardContext.java:5037)
        at org.apache.catalina.util.LifecycleBase.start(LifecycleBase.java:140)
        at org.apache.catalina.core.ContainerBase.addChildInternal(ContainerBase.java:812)
        at org.apache.catalina.core.ContainerBase.addChild(ContainerBase.java:787)
        at org.apache.catalina.core.StandardHost.addChild(StandardHost.java:570)
        at org.apache.catalina.startup.HostConfig.deployWAR(HostConfig.java:891)
        at org.apache.catalina.startup.HostConfig.deployWARs(HostConfig.java:683)
        at org.apache.catalina.startup.HostConfig.deployApps(HostConfig.java:466)
        at org.apache.catalina.startup.HostConfig.start(HostConfig.java:1267)
        at org.apache.catalina.startup.HostConfig.lifecycleEvent(HostConfig.java:308)
        at org.apache.catalina.util.LifecycleSupport.fireLifecycleEvent(LifecycleSupport.java:119)
        at org.apache.catalina.util.LifecycleBase.fireLifecycleEvent(LifecycleBase.java:89)
        at org.apache.catalina.util.LifecycleBase.setState(LifecycleBase.java:328)
        at org.apache.catalina.util.LifecycleBase.setState(LifecycleBase.java:308)
        at org.apache.catalina.core.ContainerBase.startInternal(ContainerBase.java:1043)
        at org.apache.catalina.core.StandardHost.startInternal(StandardHost.java:738)
        at org.apache.catalina.util.LifecycleBase.start(LifecycleBase.java:140)
        at org.apache.catalina.core.ContainerBase.startInternal(ContainerBase.java:1035)
        at org.apache.catalina.core.StandardEngine.startInternal(StandardEngine.java:289)
        at org.apache.catalina.util.LifecycleBase.start(LifecycleBase.java:140)
        at org.apache.catalina.core.StandardService.startInternal(StandardService.java:442)
        at org.apache.catalina.util.LifecycleBase.start(LifecycleBase.java:140)
        at org.apache.catalina.core.StandardServer.startInternal(StandardServer.java:674)
        at org.apache.catalina.util.LifecycleBase.start(LifecycleBase.java:140)
        at org.apache.catalina.startup.Catalina.start(Catalina.java:596)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at
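The second exception states the fix directly: the uniqueKey field must use a StrField-based type. A sketch of the relevant schema.xml pieces:

<fieldType name="string" class="solr.StrField" sortMissingLast="true"/>

<field name="id" type="string" indexed="true" stored="true" required="true"/>

<uniqueKey>id</uniqueKey>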
SolrHome and Solr Data Dir in solrconfig.xml
Dear all, I am a new user of Solr. My SolrHome is set to /home/libing/Solr. When Tomcat starts, it reads solrconfig.xml to find the Solr data dir, which holds the indexes. However, I have no idea how to associate SolrHome with the Solr data dir, so a mistake occurs: all the indexes end up under $TOMCAT_HOME/bin, which is NOT what I expect. I want the indexes under SolrHome. Could you please give me a hand? Best, Bing Li
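A likely cause: when the data directory is left relative, Solr resolves it against the JVM's working directory, which for Tomcat is its bin/ directory. A sketch of the solrconfig.xml setting, assuming the SolrHome path from the post:

<!-- solrconfig.xml: point the data directory at SolrHome explicitly
     instead of resolving it relative to Tomcat's working directory -->
<dataDir>${solr.data.dir:/home/libing/Solr/data}</dataDir>

The ${solr.data.dir:...} form also lets the path be overridden with a system property at startup.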
Indexing and Searching Chinese
Hi, all, Right now I cannot search the index when querying with Chinese keywords. Before using Solr, I used Lucene for some time; since I need to crawl some Chinese sites, I used ChineseAnalyzer in the code that ran Lucene. I know Solr is a server built on Lucene, but I have no idea how to configure the analyzer in Solr. I appreciate your help! Best, LB
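In Solr the analyzer is declared per field type in schema.xml. A sketch along the lines of what a later message in this archive quotes (the type and field names are illustrative):

<!-- schema.xml: field type using the same tokenization as Lucene's ChineseAnalyzer -->
<fieldType name="text_zh" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.ChineseTokenizerFactory"/>
  </analyzer>
</fieldType>
<field name="content" type="text_zh" indexed="true" stored="true"/>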
Indexing and Searching Chinese with SolrNet
Dear all, After reading some pages on the Web, I created the index with the following schema. .. .. It should be correct, right? However, when sending a query through SolrNet, no results are returned. Could you tell me what the reason is? Thanks, LB
Re: Indexing and Searching Chinese with SolrNet
Dear Jelsma, My servlet container is Tomcat 7. I think it should accept Chinese characters, but I am not sure how to configure it. From the Tomcat console, I saw that the Chinese characters in the query are not displayed normally; however, they are fine in the Solr Admin page. I am also not sure whether SolrNet supports Chinese. If not, how can I interact with Solr from .NET? Thanks so much! LB

On Wed, Jan 19, 2011 at 2:34 AM, Markus Jelsma wrote:
> Why create two threads for the same problem? Anyway, is your servlet
> container capable of accepting UTF-8 in the URL? Also, is SolrNet capable of
> handling those characters? To confirm, try a tool like curl.
>
> > Dear all,
> >
> > After reading some pages on the Web, I created the index with the following
> > schema.
> >
> > ..
> > <fieldType name="..." class="solr.TextField" positionIncrementGap="100">
> >   <analyzer>
> >     <tokenizer class="solr.ChineseTokenizerFactory"/>
> >   </analyzer>
> > </fieldType>
> > ..
> >
> > It should be correct, right? However, when sending a query through SolrNet, no
> > results are returned. Could you tell me what the reason is?
> >
> > Thanks,
> > LB
Re: Indexing and Searching Chinese with SolrNet
Dear Jelsma, After configuring the Tomcat URIEncoding, Chinese characters are processed correctly. I appreciate your help! Best, LB

On Wed, Jan 19, 2011 at 3:02 AM, Markus Jelsma wrote:
> Hi,
>
> Yes, but Tomcat might need to be configured to accept it; see the wiki for more
> information on this subject.
>
> http://wiki.apache.org/solr/SolrTomcat#URI_Charset_Config
>
> Cheers,
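For the record, the setting that wiki page describes is the URIEncoding attribute on the HTTP connector in Tomcat's conf/server.xml; a sketch (the port and other attributes are illustrative):

<!-- conf/server.xml: decode query-string bytes as UTF-8 -->
<Connector port="8080" protocol="HTTP/1.1"
           connectionTimeout="20000"
           URIEncoding="UTF-8"/>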
SolrJ Tutorial
Hi, all, In the past I always used SolrNet to interact with Solr, and it works great. Now I need to use SolrJ. I assumed it would be easier than SolrNet, since Solr and SolrJ are both Java, but I cannot find a tutorial that is easy to follow: no tutorials explain SolrJ programming step by step, and no complete samples are available. Could anybody point me to some online resources for learning SolrJ? I also noticed Solr Cell and SolrJ POJOs. Do you have detailed resources about them? Thanks so much! LB
Re: SolrJ Tutorial
I got the solution. A complete sample I wrote is attached below. Thanks, LB

package com.greatfree.Solr;

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocumentList;

import java.net.MalformedURLException;

public class SolrJExample
{
    public static void main(String[] args) throws MalformedURLException, SolrServerException
    {
        // Query an existing core for all documents.
        SolrServer solr = new CommonsHttpSolrServer("http://192.168.210.195:8080/solr/CategorizedHub");
        SolrQuery query = new SolrQuery();
        query.setQuery("*:*");
        QueryResponse rsp = solr.query(query);
        SolrDocumentList docs = rsp.getResults();
        System.out.println(docs.getNumFound());

        try
        {
            // Index a POJO (a class whose members carry @Field annotations) into another core.
            SolrServer solrScore = new CommonsHttpSolrServer("http://192.168.210.195:8080/solr/score");
            Score score = new Score();
            score.id = "4";
            score.type = "modern";
            score.name = "iphone";
            score.score = 97;
            solrScore.addBean(score);
            solrScore.commit();
        }
        catch (Exception e)
        {
            System.out.println(e.toString());
        }
    }
}

On Sat, Jan 22, 2011 at 3:58 PM, Lance Norskog wrote:
> The unit tests are simple and show the steps.
SolrDocumentList Size vs NumFound
Dear all, I ran into a puzzling problem. The number of matching documents is much more than 10, and getNumFound() reports the exact count of results, yet the size of the SolrDocumentList is 10. When I iterate over the results as follows, only 10 are displayed. How do I get the rest?

..
for (SolrDocument doc : docs)
{
    System.out.println(doc.getFieldValue(Fields.CATEGORIZED_HUB_TITLE_FIELD) + ": "
        + doc.getFieldValue(Fields.CATEGORIZED_HUB_URL_FIELD) + "; "
        + doc.getFieldValue(Fields.HUB_CATEGORY_NAME_FIELD) + "/"
        + doc.getFieldValue(Fields.HUB_PARENT_CATEGORY_NAME_FIELD));
}
..

Could you give me a hand? Thanks, LB
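The list holds only one page of results; the rows parameter defaults to 10. A sketch of paging through the full result set with SolrJ, reusing a server instance like the one in the earlier sample (page size and field name are illustrative):

// Page through all results: 'rows' is the page size, 'start' the offset.
SolrQuery query = new SolrQuery("*:*");
int pageSize = 100;
query.setRows(pageSize);
long numFound = Long.MAX_VALUE;
for (int start = 0; start < numFound; start += pageSize)
{
    query.setStart(start);
    QueryResponse rsp = solr.query(query); // may throw SolrServerException
    SolrDocumentList docs = rsp.getResults();
    numFound = docs.getNumFound(); // total matches, not the page size
    for (SolrDocument doc : docs)
    {
        System.out.println(doc.getFieldValue("id"));
    }
}

Alternatively, a single query.setRows((int) numFound) works for small result sets, but paging is safer for large ones.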
Open Too Many Files
Dear all, I got an exception when querying the index in Solr. It tells me that too many files are open. How can I handle this problem? Thanks so much! LB [java] org.apache.solr.client.solrj.SolrServerException: java.net.SocketException: Too many open files [java] at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:483) [java] at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:244) [java] at org.apache.solr.client.solrj.request.QueryRequest.process(QueryRequest.java:89) [java] at org.apache.solr.client.solrj.SolrServer.query(SolrServer.java:118) [java] at com.greatfree.Solr.Broker.Search(Broker.java:145) [java] at com.greatfree.Solr.SolrIndex.SelectHubPageHashByHubKey(SolrIndex.java:116) [java] at com.greatfree.Web.HubCrawler.Crawl(Unknown Source) [java] at com.greatfree.Web.Worker.run(Unknown Source) [java] at java.lang.Thread.run(Thread.java:662) [java] Caused by: java.net.SocketException: Too many open files [java] at java.net.Socket.createImpl(Socket.java:397) [java] at java.net.Socket.<init>(Socket.java:371) [java] at java.net.Socket.<init>(Socket.java:249) [java] at org.apache.commons.httpclient.protocol.DefaultProtocolSocketFactory.createSocket(DefaultProtocolSocketFactory.java:80) [java] at org.apache.commons.httpclient.protocol.DefaultProtocolSocketFactory.createSocket(DefaultProtocolSocketFactory.java:122) [java] at org.apache.commons.httpclient.HttpConnection.open(HttpConnection.java:707) [java] at org.apache.commons.httpclient.MultiThreadedHttpConnectionManager$HttpConnectionAdapter.open(MultiThreadedHttpConnectionManager.java:1361) [java] at org.apache.commons.httpclient.HttpMethodDirector.executeWithRetry(HttpMethodDirector.java:387) [java] at org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:171) [java] at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:397) [java] at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:323) [java] at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:427) [java] ... 8 more [java] Exception in thread "Thread-96" java.lang.NullPointerException [java] at com.greatfree.Solr.SolrIndex.SelectHubPageHashByHubKey(SolrIndex.java:117) [java] at com.greatfree.Web.HubCrawler.Crawl(Unknown Source) [java] at com.greatfree.Web.Worker.run(Unknown Source) [java] at java.lang.Thread.run(Thread.java:662)
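A common culprit behind this exception is constructing a new CommonsHttpSolrServer per request, which leaks pooled HTTP connections and thus file descriptors. A sketch of the usual remedy, with a hypothetical holder class standing in for the post's Broker:

public final class SolrClientHolder
{
    // One shared instance per Solr endpoint: CommonsHttpSolrServer is
    // thread-safe and reuses its pooled HTTP connections.
    private static SolrServer server;

    public static synchronized SolrServer get() throws java.net.MalformedURLException
    {
        if (server == null)
        {
            server = new CommonsHttpSolrServer("http://192.168.210.195:8080/solr/CategorizedHub");
        }
        return server;
    }
}

Raising the OS file-descriptor limit (ulimit -n) only postpones the failure if new client instances keep being created per request.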
Re: Solr Out of Memory Error
Dear Adam, I also got the OutOfMemory exception. I changed JAVA_OPTS in catalina.sh as follows:

...
if [ -z "$LOGGING_MANAGER" ]; then
  JAVA_OPTS="$JAVA_OPTS -Djava.util.logging.manager=org.apache.juli.ClassLoaderLogManager"
else
  JAVA_OPTS="$JAVA_OPTS -server -Xms8096m -Xmx8096m"
fi
...

Is this change correct? After that, I still got the same exception. The index is updated and searched frequently; I am trying to change the code to avoid the frequent updates, since I guess changing JAVA_OPTS alone does not work. Could you give me some help? Thanks, LB

On Wed, Jan 19, 2011 at 10:05 PM, Adam Estrada <estrada.adam.gro...@gmail.com> wrote:
> Is anyone familiar with the environment variable JAVA_OPTS? I set
> mine to a much larger heap size and never had any of these issues
> again.
>
> JAVA_OPTS = -server -Xms4048m -Xmx4048m
>
> Adam
>
> On Wed, Jan 19, 2011 at 3:29 AM, Isan Fulia wrote:
> > Hi all,
> > By adding more servers, do you mean sharding of the index? And after sharding,
> > how will my query performance be affected? Will the query execution time increase?
> >
> > Thanks,
> > Isan Fulia.
> >
> > On 19 January 2011 12:52, Grijesh wrote:
> >>
> >> Hi Isan,
> >>
> >> It seems your index size of 25GB is much more than your total RAM of 4GB.
> >> You have to do 2 things to avoid the Out Of Memory problem:
> >> 1 - Buy more RAM; add at least 12GB more.
> >> 2 - Increase the memory allocated to Solr by setting the Xmx value; allocate
> >> at least 12GB to Solr.
> >>
> >> But if your whole index fits into the cache memory, it will give you the
> >> better result.
> >>
> >> Also add more servers to load balance, as your QPS is high. Your 7 lakh
> >> records making a 25GB index looks quite high; try to lower the index size.
> >> What are you indexing in your 25GB of index?
> >>
> >> -
> >> Thanx:
> >> Grijesh
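Note that in the catalina.sh fragment above, the -Xms/-Xmx flags sit in the else branch, so they only take effect when LOGGING_MANAGER is already set. A sketch of the more conventional place for heap settings, assuming Tomcat 6 or later (the sizes are illustrative):

# $CATALINA_HOME/bin/setenv.sh -- sourced by catalina.sh on every start,
# so catalina.sh itself stays unmodified
JAVA_OPTS="$JAVA_OPTS -server -Xms4096m -Xmx4096m"
export JAVA_OPTS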
Detailed Steps for Scaling Solr
Dear all, I need to build a site that supports searching over a large index, so I think scaling Solr is required. However, I have not found a tutorial that walks through this step by step. I only have two references, and neither spells out the exact operations. 1) http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Scaling-Lucene-and-Solr 2) David Smiley, Eric Pugh; Solr 1.4 Enterprise Search Server. If you have experience scaling Solr, could you share such tutorials? Thanks so much! LB
My Plan to Scale Solr
Dear all, I started learning Solr three months ago, so my experience is still limited. Currently I crawl Web pages with my own crawler and send the data to a single Solr server, and it runs fine. Since the potential user base is large, I have decided to scale Solr. After configuring replication, a single index can be replicated to multiple servers. I think shards are also required: I plan to split the index according to data categories and priorities, then apply the replication setup above to get high performance. The remaining work should not be too difficult. I noticed some new terms, such as SolrCloud, Katta and ZooKeeper. From my current understanding, it seems I can ignore them. Am I right? What benefits would I get from using them? Thanks so much! LB
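For the replication part of the plan, the stock ReplicationHandler is configured in solrconfig.xml on both master and slaves; a sketch (the host, core and interval are illustrative):

<!-- master solrconfig.xml -->
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="master">
    <str name="replicateAfter">commit</str>
    <str name="confFiles">schema.xml,stopwords.txt</str>
  </lst>
</requestHandler>

<!-- slave solrconfig.xml -->
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="slave">
    <str name="masterUrl">http://master-host:8080/solr/core0/replication</str>
    <str name="pollInterval">00:00:60</str>
  </lst>
</requestHandler>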
Selection Between Solr and Relational Database
Dear all, I have been learning Solr for two months. At least right now, my system runs well on a Solr cluster. I have a question about implementing one feature in my system. When retrieving documents by keyword, I believe Solr is faster than a relational database. However, if I do the following operations, I guess the performance must be lower. Is that right? What I am trying to do is listed as follows (a query sketch follows the list). 1) All of the documents in Solr have one field that is used to differentiate them; different categories have a different value in this field, e.g. Group; the documents are classified as "news", "sports", "entertainment" and so on. 2) Retrieve all of the documents by the field Group. 3) Besides the Group field, there is another field called CreatedTime; I will filter the documents retrieved by Group according to the value of CreatedTime, and the filtered documents are the final results I need. I guess the performance of this operation is lower than a relational database, right? Could you please give me an explanation? Best regards, Li Bing
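Steps 2) and 3) map naturally onto Solr filter queries, which are cached separately from the main query in the filterCache; a sketch in SolrJ (field names from the post, values illustrative):

SolrQuery query = new SolrQuery("*:*");
// Filter queries restrict the result set and are cached for reuse
// across searches, so repeated Group/CreatedTime filters stay cheap.
query.addFilterQuery("Group:news");
query.addFilterQuery("CreatedTime:[2011-01-01T00:00:00Z TO NOW]");
QueryResponse rsp = solr.query(query);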
Re: SolrJ Tutorial
Dear Lance, Could you tell me where I can find the unit test code? I appreciate your help! Best regards, LB

On Sat, Jan 22, 2011 at 3:58 PM, Lance Norskog wrote:
> The unit tests are simple and show the steps.
When Index is Updated Frequently
Dear all, In my experience, when a Lucene index is updated frequently, its performance drops. Is that correct? In my system, most data crawled from the Web is indexed once, and the corresponding index will NOT be updated any more. However, some indexes must be updated frequently, like records in a relational database. These indexes are not as large as the crawled data, and the updated index will NOT be scaled out to many other nodes; most of the time it lives on a very limited number of machines. In this case, may I still use Lucene indexes, or do I need to replace them with a relational database? Thanks so much! LB
Re: When Index is Updated Frequently
Dear Michael, Thanks so much for your answer! I have a question: even if Lucene is good at updating, frequent updates must still put more load on the Solr cluster. So in my system I will leave the large amount of crawled data unchanged forever, and use a traditional database to keep the mutable data. Fortunately, in most Internet systems the amount of mutable data is much smaller than the immutable data. What do you think of this solution? Best, LB

On Sat, Mar 5, 2011 at 2:45 AM, Michael McCandless <luc...@mikemccandless.com> wrote:
> On Fri, Mar 4, 2011 at 10:09 AM, Bing Li wrote:
>
> > In my experience, when a Lucene index is updated frequently, its
> > performance drops. Is that correct?
>
> In fact Lucene can gracefully handle a high rate of updates with low
> latency turnaround on the readers, using the near-real-time (NRT) API
> -- IndexWriter.getReader() (or in the soon-to-be 3.1,
> IndexReader.open(IndexWriter)).
>
> NRT is really something of a hybrid of "eventual consistency" and
> "immediate consistency", because it lets your app have full control
> over how quickly changes must be visible, by controlling when you
> pull a new NRT reader.
>
> That said, Lucene can't offer true immediate consistency at a high
> update rate -- the time to open a new NRT reader is usually too costly
> to do, e.g., for every search. But, e.g., every 100 msec (say) is
> reasonable (depending on many variables...).
>
> So... for your app you should run some tests and see. And please
> report back.
>
> (But, unfortunately, NRT hasn't been exposed in Solr yet...).
>
> --
> Mike
>
> http://blog.mikemccandless.com
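For reference, a minimal sketch of the NRT API Mike mentions, against the Lucene 3.x interface of that era (writer setup elided; IndexWriter.getReader() and IndexReader.reopen() were the current method names):

// Reuse the application's existing IndexWriter.
IndexReader reader = writer.getReader(); // NRT reader: sees not-yet-committed adds
// ... search with new IndexSearcher(reader) ...

// Periodically (e.g. every 100 msec), refresh the view:
IndexReader newReader = reader.reopen(); // cheap no-op if nothing changed
if (newReader != reader)
{
    reader.close();
    reader = newReader;
}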
how often do you boys restart your tomcat?
I find that if I do not restart the master's Tomcat for some days, the load average keeps rising to a high level and Solr becomes slow and unstable, so I added a crontab entry to restart Tomcat every day (sketched below). Do you boys restart your tomcat? And is there any way to avoid restarting Tomcat?
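A sketch of the kind of crontab entry described (the paths and time are illustrative, not from the post):

# /etc/crontab -- restart Tomcat at 04:30 every day
30 4 * * * root /usr/local/tomcat/bin/shutdown.sh && /usr/local/tomcat/bin/startup.sh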
Re: how often do you boys restart your tomcat?
I want to let the system do the job instead of the sysadmin, because I'm lazy ~ ^__^ But I just want a better way to fix the problem: restarting the server causes other problems, like having to rebuild the changes that happened during the restart.

2011/7/27 Dave Hall :
> On 27/07/11 11:42, Bing Yu wrote:
>>
>> do you boys restart your tomcat? and is there any way to avoid restart
>> tomcat?
>
> Our female sysadmin takes care of managing our server.
I can't pass the unit tests when compiling from apache-solr-3.3.0-src
I just go to apache-solr-3.3.0/solr and run 'ant test'. The JUnit tests always fail and report 'BUILD FAILED', but if I type 'ant dist', I get an apache-solr-3.3-SNAPSHOT.war with no warnings. Is this a problem only on my machine? My server: CentOS 5.6 64bit / apache-ant-1.8.2 / junit; both JRockit and Sun JDK 1.6 fail.
Multiple Embedded Servers Pointing to single solrhome/index
Hi, I'm trying to use two embedded Solr servers pointing to the same solrhome / index. That is, something like the following on both applications:

// Point the embedded server at the shared solrhome.
System.setProperty("solr.solr.home", "SomeSolrDir");
CoreContainer.Initializer initializer = new CoreContainer.Initializer();
CoreContainer coreContainer = initializer.initialize();
m_server = new EmbeddedSolrServer(coreContainer, "");

The problem is, after I have done one add+commit of a SolrInputDocument on one embedded server, the other server can never obtain the write lock. I'm thinking there must be a way of releasing the write lock so other servers may pick it up. Is there an API that does so? Any input is appreciated. Bing
Re: Multiple Embedded Servers Pointing to single solrhome/index
Thanks Lance. The use case is a cluster of nodes that run the same application with an EmbeddedSolrServer on each of them, all pointing to the same index on NFS. Every application is designed to be equal, meaning that any of them may index and/or search. That way, after every commit the writer needs to be closed so it becomes available to the other nodes. Do you see any issues with this use case? Is the EmbeddedSolrServer able to release its write lock without shutting down?
Does Solr support 'Value Search'?
Hi folks, Just wondering if there is a query handler that simply takes a query string and searches all or part of the fields for field values? e.g. q=*admin* The response might look like: author: [admin, system_admin, sub_admin] last_modifier: [admin, system_admin, sub_admin] doctitle: [AdminGuide, AdminManual]
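The closest stock feature may be the TermsComponent, which returns indexed terms grouped per field; a sketch of a request against the default /terms handler (field names taken from the example above):

http://localhost:8983/solr/terms?terms.fl=author&terms.fl=doctitle&terms.prefix=admin

Note that it matches indexed terms rather than stored values, so the results reflect whatever analysis the fields use, and matching is by prefix (or terms.regex) over the term text.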
Solr index storage strategy on FileSystem
Hi folks, With StandardDirectoryFactory, the index is stored under data/index in the form of frq, tim, tip and a few other files. As the index grows, more files are generated, and sometimes a few of them are merged, so there appear to be segmentation and merging strategies at work. My question is: are these strategies configurable? Basically I want to add a size limit for any individual file. Is that feasible without changing Solr core code? Thanks! Bing
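Merging is governed by the merge policy, which is configurable from solrconfig.xml; a sketch that caps the size of merged segments, assuming a Solr 4.x-style <indexConfig> section (the number is illustrative, and note this bounds segments, not each individual file extension):

<indexConfig>
  <mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
    <!-- stop producing merged segments larger than ~512MB -->
    <double name="maxMergedSegmentMB">512.0</double>
  </mergePolicy>
</indexConfig>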
Re: Does Solr support 'Value Search'?
Thanks for the response, but wait... is that related to my question about searching for field values? I was not asking how to use wildcards.
Re: Does Solr support 'Value Search'?
I don't quite understand, but let me explain the problem I had. The response would contain only fields and a list of field values that match the query; essentially it's querying for field values rather than documents. The underlying use case: when typing in a quick-search box, the drop-down menu may contain matches on authors, on doctitles, and potentially on other fields. Still, thanks for your response, and hopefully I'm making it clearer. Bing
Re: Multiple Embedded Servers Pointing to single solrhome/index
Makes sense. Thank you.
Re: Does Solr support 'Value Search'?
Thanks Kuli and Mikhail, Using either the TermsComponent or the Suggester I can get some suggested terms, but it's still not clear to me how to get the respective field names. To get those with the TermsComponent, I would need to run a terms query against every possible field, and similar applies to the SpellCheckComponent. A copyField won't help, since I want the original field name. Any suggestions? Bing
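For what it's worth, terms.fl can be repeated in a single TermsComponent request, and the response is keyed by field name, which may give the original field without a query per field. A sketch of the XML response shape for terms.fl=author&terms.fl=doctitle (the values and counts are illustrative):

<lst name="terms">
  <lst name="author">
    <int name="admin">42</int>
    <int name="sub_admin">7</int>
  </lst>
  <lst name="doctitle">
    <int name="adminguide">3</int>
  </lst>
</lst>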
Re: Multiple Embedded Servers Pointing to single solrhome/index
I agree. We chose embedded to minimize the maintenance cost of HTTP Solr servers. One more concern: even if only one node does the indexing, the other nodes need to reopen their index readers periodically to catch up with new changes, right? Is there a Solr request that does this? Thanks, Bing
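One candidate, assuming each reading node keeps hold of its CoreContainer as in the earlier snippet, is a core reload, which discards the old searcher and opens a new one over the current index files (this is heavier than a plain reader reopen, so whether it fits depends on the update rate):

// Hedged sketch of a periodic refresh on a read-only node; "" is the
// default core name used when constructing the EmbeddedSolrServer above.
coreContainer.reload("");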
Multiple SpellCheckComponents
Hello, The background is that I want to use both the Suggest and SpellCheck features in a single query, to get both kinds of alternatives returned at one time. Right now I can only select one of them at query time via spellcheck.dictionary ("default" .. "suggest"). Am I able to use two separate SpellCheckComponents, one for each, and add them to the same SearchHandler to achieve this? I tried, and it seems one overwrites the other.
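A commonly documented alternative keeps a single SpellCheckComponent holding both dictionaries, selected per request; a sketch of the solrconfig.xml (field and analysis details elided, names taken from the post):

<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
  <!-- conventional spellchecking dictionary -->
  <lst name="spellchecker">
    <str name="name">default</str>
    <str name="field">text</str>
  </lst>
  <!-- suggester dictionary -->
  <lst name="spellchecker">
    <str name="name">suggest</str>
    <str name="classname">org.apache.solr.spelling.suggest.Suggester</str>
    <str name="field">text</str>
  </lst>
</searchComponent>

Whether one request can draw on both dictionaries at once depends on the Solr version; historically spellcheck.dictionary selected a single dictionary per request.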
SpellCheckComponent Collation query
Hello, >From spell check component I'm able to get the collation query and its # of hits. Is it possible to have solr execute the collated query automatically and return doc search results without resending it on client side? Thanks, Bing -- View this message in context: http://lucene.472066.n3.nabble.com/SpellCheckComponent-Collation-query-tp4000273.html Sent from the Solr - User mailing list archive at Nabble.com.
Tlog vs. buffer + softcommit.
Hello, I'm a bit confused about the purpose of transaction logs (update logs) in Solr. My understanding is: an update request comes in, and the new item is first put in the RAM buffer as well as the t-log. After a soft commit happens, the new item becomes searchable but is not yet hard-committed to stable storage; configuring the soft commit interval to 1 sec achieves NRT. Then what exactly is the t-log doing in this scenario? Why is it there, and under what circumstances is it cleared? I searched for online documentation with no success, and am trying to get something from the source code. Any hints would be appreciated. Thanks, Bing
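For reference, the pieces involved look roughly like this in solrconfig.xml (the intervals are illustrative). The t-log exists for durability and recovery of updates that are only soft-committed, and old logs are dropped after hard commits:

<updateHandler class="solr.DirectUpdateHandler2">
  <updateLog>
    <str name="dir">${solr.ulog.dir:}</str>
  </updateLog>
  <autoCommit>
    <maxTime>15000</maxTime>            <!-- hard commit: flushes to disk, retires old tlogs -->
    <openSearcher>false</openSearcher>
  </autoCommit>
  <autoSoftCommit>
    <maxTime>1000</maxTime>             <!-- soft commit: visibility only, ~NRT -->
  </autoSoftCommit>
</updateHandler>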
Re: Tlog vs. buffer + softcommit.
Thanks for the information; it definitely helps a lot. There are numDeletesToKeep = 1000 and numRecordsToKeep = 100 in UpdateLog, so that is probably what you're referring to. However, when I was indexing, the total size of the tlogs kept increasing; it doesn't look like there's a cap on the number of documents? Also, is there an intro to peersync available online?
Re: Tlog vs. buffer + softcommit.
I remember I did set the 15 sec autocommit and still saw the tlogs growing without bound, but it sounds like in theory they should not if I index at a constant rate; I'll probably try it again sometime. As for peersync, I think SolrCloud now uses push replication over pull, and it makes sense to keep some amount of tlog around for peers to sync up from. Thanks, Bing
Solr4.0 Partially update document
Hi, Several days ago I came across some SolrJ test code on partially updating document field values; sadly, I forgot where that was. In Solr 4.0, "/update" can take a document id and fields as maps, like {"id":"doc1","field1":{"set":"new_value"}}. I'm just trying to figure out the SolrJ client code that does this. Thanks for any help on this, Bing
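A sketch of the SolrJ side of such an atomic update (the server URL and field names are illustrative):

import java.util.Collections;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr");
SolrInputDocument doc = new SolrInputDocument();
doc.addField("id", "doc1");
// A map value of {"set": ...} tells Solr to replace just this field
// on the existing document instead of reindexing the whole document.
doc.addField("field1", Collections.singletonMap("set", "new_value"));
server.add(doc);
server.commit();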
Re: Solr4.0 Partially update document
Got it at https://svn.apache.org/repos/asf/lucene/dev/trunk/solr/solrj/src/test/org/apache/solr/client/solrj/SolrExampleTests.java Problem solved.
Getting Suggestions without Search Results
Hi, I have a spell check component that does auto-complete suggestions. It is part of the "last-components" of my /select search handler, so apart from the normal search results I also get a list of suggestions. Now I want to split things up: is there a way to get only the suggestions for a query, without the normal search results? I may need to create a new handler for this. Can anyone give me some ideas on that? Thanks, Bing
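One pattern from the Suggester documentation is a dedicated handler whose component list contains only the suggest component, so the normal query component never runs; a sketch (the dictionary name is illustrative):

<requestHandler name="/suggest" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="spellcheck">true</str>
    <str name="spellcheck.dictionary">suggest</str>
    <str name="spellcheck.count">10</str>
  </lst>
  <!-- replace, rather than append to, the default component chain -->
  <arr name="components">
    <str>suggest</str>
  </arr>
</requestHandler>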
Re: Getting Suggestions without Search Results
Great comments. Thanks to you all. Bing
Re: Indexing thousands file on solr
You could write a client using SolrJ and loop through all the files in that folder. Something like:

// Push each file through the ExtractingRequestHandler for Tika extraction.
ContentStreamUpdateRequest up = new ContentStreamUpdateRequest("/update/extract");
up.addFile(new File(fileLocation), null);
ModifiableSolrParams p = new ModifiableSolrParams();
p.add("literal.id", str); // supply the unique key as a literal
...
up.setParams(p);
server.request(up);

Bing
Re: Are there any comparisons of Elastic Search specifically with SOLR 4?
Most existing comparisons were done against Solr 3.x or earlier. After Solr 4 added cloud concepts similar to ES's, there are really fewer differences. In my opinion, Solr is heavier and was not designed to maximize elasticity. It's not hard to decide which way to go, as long as you have a preference between better scalability and better stability and support. Bing
Send plain text file to solr for indexing
Hello, I used to use Solr Cell, which has built-in Tika support, to handle both extraction and indexing of raw documents. Now I have another text-extraction provider that converts raw documents to plain .txt files, so I want Solr to bypass the extraction phase. Is there a way I can send a plain txt file to Solr and simply index it as a full-text field, without running extraction on the file? Thanks, Bing
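A sketch of doing this from SolrJ by reading the file and posting its content as an ordinary field, so nothing parses or alters the text (the field names, URL and UTF-8 encoding are assumptions):

import java.io.*;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

// Read the already-extracted .txt file verbatim.
StringBuilder sb = new StringBuilder();
BufferedReader in = new BufferedReader(
    new InputStreamReader(new FileInputStream("doc1.txt"), "UTF-8"));
try {
    String line;
    while ((line = in.readLine()) != null) {
        sb.append(line).append('\n');
    }
} finally {
    in.close();
}

// Index the text into a regular full-text field; no Tika involved.
HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr");
SolrInputDocument doc = new SolrInputDocument();
doc.addField("id", "doc1");
doc.addField("fulltext", sb.toString());
server.add(doc);
server.commit();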
Re: Send plain text file to solr for indexing
So in order to use Solr Cell I would have to add a number of dependent libraries, which is one of the things I'm trying to avoid. The second thing is that Solr Cell still parses the plain text files, and I don't want it to make any changes to my exported files. Any ideas? Bing
Re: Send plain text file to solr for indexing
Thanks, Mr. Yagami, I'll look into that. Jack, as for the latter two options, they both require reading the entire text file into memory, right? Bing